(PSL) Naive Bayes¶

In [6]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

Spam Data¶

Let's demonstrate Naive Bayes using the GaussianNB class from sklearn, with the 'spam' dataset serving as our example.

The 'spam' dataset is a binary classification problem: determine whether an email is spam based on features such as word frequencies.

In [36]:
url = "https://liangfgithub.github.io/Data/spam.txt"
spam = pd.read_table(url, sep=r"\s+", header=None)
spam[1:5]
Out[36]:
0 1 2 3 4 5 6 7 8 9 ... 48 49 50 51 52 53 54 55 56 57
1 0.21 0.28 0.50 0.0 0.14 0.28 0.21 0.07 0.00 0.94 ... 0.00 0.132 0.0 0.372 0.180 0.048 5.114 101 1028 1
2 0.06 0.00 0.71 0.0 1.23 0.19 0.19 0.12 0.64 0.25 ... 0.01 0.143 0.0 0.276 0.184 0.010 9.821 485 2259 1
3 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.137 0.0 0.137 0.000 0.000 3.537 40 191 1
4 0.00 0.00 0.00 0.0 0.63 0.00 0.31 0.63 0.31 0.63 ... 0.00 0.135 0.0 0.135 0.000 0.000 3.537 40 191 1

4 rows × 58 columns
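Column 57 is the 0/1 spam label. As a quick check (not in the original notebook), we can look at the class balance; consistent with the column sums of the confusion matrices below, there are 2788 non-spam and 1813 spam emails.

In [ ]:
spam[57].value_counts()  # 0: 2788 (not spam), 1: 1813 (spam)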

Using GaussianNB¶

After loading the data, try the GaussianNB estimator. The default sets var_smoothing = 1e-09; we also try var_smoothing = 0.

In [5]:
X = spam.drop(57, axis=1)  # features: columns 0-56
Y = spam[57]               # label: 1 = spam, 0 = not spam

NBfit = GaussianNB().fit(X, Y)
Y_pred = NBfit.predict(X)

confusion_matrix(Y_pred, Y)
Out[5]:
array([[2047,   74],
       [ 741, 1739]])
In [7]:
NBfit = GaussianNB(var_smoothing=0).fit(X, Y)
Y_pred = NBfit.predict(X)

confusion_matrix(Y_pred, Y)
Out[7]:
array([[2034,   75],
       [ 754, 1738]])
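For reference, var_smoothing adds a small constant to every per-class variance for numerical stability: the constant equals var_smoothing times the largest variance among all features. A minimal sketch of that adjustment (our own illustration, not the library's internals):

In [ ]:
# Epsilon added to each per-class feature variance when var_smoothing=1e-9;
# with var_smoothing=0 (as above), no smoothing is applied.
eps = 1e-9 * np.var(X.to_numpy(), axis=0).max()
eps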

Our Custom Implementation¶

We've also developed a custom Naive Bayes implementation.

  1. First, estimate the Gaussian parameters for each feature and each class. The group means and standard deviations are stored in group_mean and group_sd, two 2-by-p matrices.
In [9]:
group_mean = spam.groupby(57).mean().to_numpy()  # 2 x p: per-class feature means
group_sd = spam.groupby(57).std().to_numpy()     # 2 x p: per-class feature SDs
w = (Y == 0).mean()                              # prior probability of class 0
  2. Compute the discriminant scores $\Delta_0(x)$ and $\Delta_1(x)$, where (up to an additive constant shared by both classes) $\Delta_k(x) = \log \pi_k - \sum_{j=1}^{p} \Big[ \log \sigma_{kj} + \frac{(x_j - \mu_{kj})^2}{2 \sigma_{kj}^2} \Big]$, then compute the posterior probability for class 1.
$$ P(Y = 1| x) = \frac{e^{\Delta_1(x)}}{e^{\Delta_0(x)} + e^{\Delta_1(x)}} = \frac{1}{1 + e^{\Delta_0(x) - \Delta_1(x)}} $$
In [38]:
# Per-feature log-density for class 0, dropping the -0.5 * log(2 * pi)
# constant that is common to both classes.
tmp0 = -np.log(group_sd[0]) - (X - group_mean[0]) ** 2.0 / (2.0 * group_sd[0] ** 2.0)
tmp0 = tmp0.sum(axis=1)
tmp0 += np.log(w)  # add the log prior for class 0

tmp1 = -np.log(group_sd[1]) - (X - group_mean[1]) ** 2.0 / (2.0 * group_sd[1] ** 2.0)
tmp1 = tmp1.sum(axis=1)
tmp1 += np.log(1 - w)  # add the log prior for class 1

diff = tmp0 - tmp1  # Delta_0(x) - Delta_1(x)
In [39]:
diff = np.clip(diff, -200, 200)      # clip to avoid overflow in np.exp
Y_prob_new = 1 / (1 + np.exp(diff))  # posterior P(Y = 1 | x)
Y_pred_new = Y_prob_new > 0.5
confusion_matrix(Y_pred_new, Y)
Out[39]:
array([[2034,   75],
       [ 754, 1738]])
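From this confusion matrix, the training accuracy is (2034 + 1738) / 4601 ≈ 82%. To compute it directly (a small addition using the variables above):

In [ ]:
# Fraction of training emails classified correctly by the custom model.
(Y_pred_new == Y).mean()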

Our predictions match those from GaussianNB with var_smoothing=0:

In [40]:
confusion_matrix(Y_pred_new, Y_pred)
Out[40]:
array([[2109,    0],
       [   0, 2492]])
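As a further check (not in the original notebook), we can compare the posterior probabilities themselves using predict_proba. The values may differ slightly from Y_prob_new because pandas' std uses the n - 1 denominator while scikit-learn estimates variances with denominator n:

In [ ]:
# Largest absolute gap between GaussianNB's P(Y=1|x) and our custom values.
prob_sklearn = NBfit.predict_proba(X)[:, 1]
np.abs(prob_sklearn - Y_prob_new).max()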