import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
Let's demonstrate Naive Bayes using the `GaussianNB` classifier from sklearn, with the 'spam' dataset serving as our example. The 'spam' dataset is designed for binary classification: the task is to determine whether an email is spam based on features such as word frequencies.
url = "https://liangfgithub.github.io/Data/spam.txt"
spam = pd.read_table(url, sep="\s+", header=None)
spam[1:5]
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.00 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.00 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.00 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
4 rows × 58 columns
After loading the data, try the `GaussianNB` classifier. The default option sets var_smoothing=1e-09; also try var_smoothing=0.
X = spam.drop(57, axis=1)
Y = spam[57]
NBfit = GaussianNB().fit(X, Y)
Y_pred = NBfit.predict(X)
confusion_matrix(Y_pred, Y)
array([[2047,   74],
       [ 741, 1739]])
NBfit = GaussianNB(var_smoothing=0).fit(X, Y)
Y_pred = NBfit.predict(X)
confusion_matrix(Y_pred, Y)
array([[2034,   75],
       [ 754, 1738]])
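To see what var_smoothing actually controls, the sketch below fits `GaussianNB` on a small synthetic dataset (the data here is made up for illustration). As we understand sklearn's behavior, the fitted `epsilon_` attribute is var_smoothing times the largest per-feature variance, and this amount is added to every class-conditional variance to stabilize the density calculations:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)
# three features with different scales, so the variances differ
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 2.0, 3.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = GaussianNB(var_smoothing=1e-9).fit(X_demo, y_demo)

# epsilon_ is the additive variance floor derived from var_smoothing
print(clf.epsilon_)
print(1e-9 * X_demo.var(axis=0).max())  # should match epsilon_
```

With var_smoothing=0 no floor is added, so features whose within-class variance is tiny can dominate the log-density, which is one reason the two confusion matrices above differ slightly.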
We've also developed a custom Naive Bayes implementation. First, compute group_mean and group_sd, two 2-by-p matrices holding the per-class feature means and standard deviations.

group_mean = spam.groupby(57).mean().to_numpy()
group_sd = spam.groupby(57).std().to_numpy()
w = (Y == 0).mean()  # prior probability of class 0

# log Gaussian density of each observation under each class, summed over
# features (the naive independence assumption), plus the log prior
tmp0 = -np.log(group_sd[0]) - (X - group_mean[0]) ** 2.0 / (2.0 * group_sd[0] ** 2.0)
tmp0 = tmp0.sum(axis=1)
tmp0 += np.log(w)
tmp1 = -np.log(group_sd[1]) - (X - group_mean[1]) ** 2.0 / (2.0 * group_sd[1] ** 2.0)
tmp1 = tmp1.sum(axis=1)
tmp1 += np.log(1 - w)

# clip the log-odds before exponentiating to avoid overflow in np.exp
diff = tmp0 - tmp1
diff = np.clip(diff, -200, 200)
Y_prob_new = 1 / (1 + np.exp(diff))  # posterior probability of class 1
Y_pred_new = Y_prob_new > 0.5
confusion_matrix(Y_pred_new, Y)
array([[2034,   75],
       [ 754, 1738]])
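The clipping above is one way to guard against overflow; a more standard alternative is to subtract the row-wise maximum of the log joint densities before exponentiating (the log-sum-exp trick). The sketch below applies that idea on synthetic two-class data (the data and variable names are made up for illustration, but the per-class mean/sd/prior computation mirrors the custom implementation above):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well-separated Gaussian classes, 3 features each
X_sim = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y_sim = np.array([0] * 50 + [1] * 50)

mean = np.array([X_sim[y_sim == k].mean(axis=0) for k in (0, 1)])
sd = np.array([X_sim[y_sim == k].std(axis=0, ddof=1) for k in (0, 1)])
prior = np.array([(y_sim == 0).mean(), (y_sim == 1).mean()])

# log joint: log P(x | k) + log P(k), one column per class
logjoint = np.stack([
    (-np.log(sd[k]) - (X_sim - mean[k]) ** 2 / (2 * sd[k] ** 2)).sum(axis=1)
    + np.log(prior[k])
    for k in (0, 1)
], axis=1)

# subtract the per-row max before exponentiating so np.exp cannot overflow;
# the shift cancels in the normalization
m = logjoint.max(axis=1, keepdims=True)
post = np.exp(logjoint - m)
post /= post.sum(axis=1, keepdims=True)  # rows now sum to 1
```

Each row of `post` is a proper posterior distribution over the two classes, so no ad hoc clipping constant is needed.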
Our predictions mirror those from `GaussianNB` (the var_smoothing=0 fit):
confusion_matrix(Y_pred_new, Y_pred)
array([[2109,    0],
       [   0, 2492]])