I’d like to emphasize a subtle issue that arises during the prediction stage of parametric Naive Bayes. When making predictions for a point x, we compute the following function for each k:

\[ \Delta_k(x) = \log \Big [ \pi_k \frac{1}{\sqrt{\sigma^2_{k1}}} e^{-\frac{(x_1 - \mu_{k1})^2}{2 \sigma^2_{k1}}} \cdots \frac{1}{\sqrt{\sigma^2_{kp}}} e^{-\frac{(x_p - \mu_{kp})^2}{2 \sigma^2_{kp}}} \Big ] = \log \pi_k + \sum_{j=1}^p \log \Big [ \frac{1}{\sqrt{\sigma^2_{kj}}} e^{-\frac{(x_j - \mu_{kj})^2}{2 \sigma^2_{kj}}} \Big ] \] Or alternatively, \[ \Delta_k(x) = \log \pi_k - \frac{1}{2} \sum_{j=1}^p \Big [ \log \sigma^2_{kj} + \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} \Big ] \] Here the constant factor \((2\pi)^{-p/2}\) has been dropped since it is the same for every class.

Some Naive Bayes packages follow the first expression, i.e., they evaluate the normal probability density function (pdf) for each feature and then take the logarithm.

When \(x_j\) is far from the center \(\mu_{kj}\), or when \(\sigma^2_{kj}\) is very small, the normal pdf can underflow to zero in double precision, so taking its logarithm produces \(-\infty\) and the computation breaks down.

To mitigate this problem, many Naive Bayes packages truncate the value of the normal pdf, for instance replacing densities that underflow to zero with a small pre-specified threshold. However, as illustrated below, this truncation can result in incorrect predictions.
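
To see the failure mode in isolation, here is a small illustration; the numbers are made up and unrelated to the spam data. Far out in the tail, dnorm underflows to zero in double precision, its logarithm becomes -Inf, and truncation caps the log-density at the log of the threshold.

x = 40                                   # an observation far from the center
dnorm(x, mean = 0, sd = 1)               # underflows to 0 in double precision
log(dnorm(x, mean = 0, sd = 1))          # -Inf: the numerical issue described above
dnorm(x, mean = 0, sd = 1, log = TRUE)   # about -800.9; staying in the log domain avoids underflow
log(0.001)                               # about -6.9; what a truncated density would contribute instead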

Spam Data

Let’s illustrate this issue using the NaiveBayes command from the klaR package in R, with the ‘spam’ dataset serving as our example.

The ‘spam’ dataset is a binary classification dataset: the task is to determine whether an email is spam or not based on features such as word frequencies.

library(klaR)
## Loading required package: MASS
spam = read.table("https://liangfgithub.github.io/Data/spam.txt")

# Read the feature names and attach them to all but the last column
spam.name = read.table("https://liangfgithub.github.io/Data/spam_name.txt")
for(i in 1:(ncol(spam)-1)) names(spam)[i] = as.character(spam.name[i,])

# The last column is the class label; convert it to a factor
names(spam)[ncol(spam)] = "Y"
spam$Y = as.factor(spam$Y)
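
As a quick sanity check (optional, and not part of the original workflow), you can confirm that the data loaded as expected:

dim(spam)        # 58 columns: 57 predictors plus the label Y
table(spam$Y)    # class counts for the two levels of Y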

Using NaiveBayes from klaR

After loading the data, we fit a model with the NaiveBayes command from the klaR package.

Without suppressWarnings, you’ll encounter warning messages such as “Numerical 0 probability for all classes with observation 1”, which indicate that the computed density has underflowed to zero for some observations.

suppressWarnings({
  NBfit = NaiveBayes( Y ~ ., data= spam)
  
  names(NBfit)   # check the output from NaiveBayes

  y.class = predict(NBfit, newdata = spam)$class
  y.prob = predict(NBfit, newdata = spam)$post
  y.prob = y.prob[, 1]  # posterior probability of class 1 (the first level of Y)
  table(spam$Y, y.prob > 0.5)
})
##    
##     FALSE TRUE
##   0  1224 1564
##   1  1718   94

Our Custom Implementation

We’ve also developed a custom Naive Bayes implementation.

  • First, compute the normal parameters (mean and standard deviation) for each dimension and each class. The group means and standard deviations are stored in mymean and mysd, two 2-by-p matrices with one row per class. These values should align with what you’d obtain from NBfit$tables returned by NaiveBayes.
X = spam[, -58]
Y = spam$Y

p = dim(X)[2]         # p: number of predictors
y.levels = levels(Y)
K = length(y.levels)  # K: number of groups

mymean = matrix(0, K, p)  # row k holds the feature means for class k
mysd = matrix(0, K, p)    # row k holds the feature standard deviations for class k

for(k in 1:K){
  mymean[k,] = apply(X[Y == y.levels[k],], 2, mean)
  mysd[k,] = apply(X[Y == y.levels[k],], 2, sd)
}
w = mean(Y == y.levels[1])  # estimated prior probability of class 1 (the first level of Y)
  • Compute \(\Delta_1(x)\) and \(\Delta_2(x)\), then compute the posterior probability for class 1:

\[ P(Y = 1| x) = \frac{e^{\Delta_1(x)}}{e^{\Delta_1(x)} + e^{\Delta_2(x)}} = \frac{1}{1 + e^{\Delta_2(x) - \Delta_1(x)}} \]

tmp = matrix(0, length(Y), 2)   # column k accumulates Delta_k(x) for every observation
tmp[, 1] = log(w)               # log prior for class 1
tmp[, 2] = log(1 - w)           # log prior for class 2
for(i in 1:p){
  tmp[, 1] = tmp[, 1] - log(mysd[1,i]) - (X[,i] - mymean[1,i])^2/(2*mysd[1,i]^2)
  tmp[, 2] = tmp[, 2] - log(mysd[2,i]) - (X[,i] - mymean[2,i])^2/(2*mysd[2,i]^2)
}

diff = tmp[, 2] - tmp[, 1]      # Delta_2(x) - Delta_1(x)
my.prob = 1/(1 + exp(diff))     # posterior probability of class 1
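
As a side note, the same computation can be written with dnorm(..., log = TRUE), which stays in the log domain throughout and needs no truncation. This is only a sketch for comparison (the names tmp.log and my.prob.log are mine, not from any package); the extra \((2\pi)^{-1/2}\) constants returned by dnorm cancel between the two classes, so the result should match my.prob.

tmp.log = matrix(0, length(Y), 2)
tmp.log[, 1] = log(w)
tmp.log[, 2] = log(1 - w)
for(i in 1:p){
  # accumulate log-densities directly; nothing underflows, nothing is truncated
  tmp.log[, 1] = tmp.log[, 1] + dnorm(X[, i], mymean[1, i], mysd[1, i], log = TRUE)
  tmp.log[, 2] = tmp.log[, 2] + dnorm(X[, i], mymean[2, i], mysd[2, i], log = TRUE)
}
my.prob.log = 1/(1 + exp(tmp.log[, 2] - tmp.log[, 1]))
max(abs(my.prob.log - my.prob))   # essentially zero, up to floating-point rounding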

Discrepancy

Our custom code is faster and produces no warning messages. You’ll also notice that our predictions diverge from those returned by klaR’s NaiveBayes:

id = which(round(my.prob, dig = 1) != round(y.prob, dig=1))
length(id)
## [1] 499
id[1:5]
## [1] 177 346 373 546 707

For your reference, you can inspect the raw code for NaiveBayes on [Github Link]. The computation within it aligns with the code provided below, and you’ll find that my.prob2 mirrors y.prob returned by NaiveBayes:

tmp = matrix(0, length(Y), 2)

for(i in 1:p){
  prob = dnorm(X[,i], mymean[1,i], mysd[1,i])
  prob[prob == 0] = 0.001   # truncation: densities that underflow to 0 are replaced by 0.001
  tmp[, 1] = tmp[, 1] + log(prob)
  prob = dnorm(X[,i], mymean[2,i], mysd[2,i])
  prob[prob == 0] = 0.001   # same truncation for the second class
  tmp[, 2] = tmp[, 2] + log(prob)
}
tmp[, 1] = tmp[, 1] + log(w)
tmp[, 2] = tmp[, 2] + log(1-w)
my.prob2 = 1/(1 + exp(tmp[, 2] - tmp[, 1]))  
cbind(y.prob, my.prob2, my.prob)[id[1:5], ]
##      y.prob my.prob2 my.prob
## [1,]      1        1       0
## [2,]      1        1       0
## [3,]      1        1       0
## [4,]      1        1       0
## [5,]      1        1       0

The primary source of the discrepancy is the exponent \((x_j - \mu_{kj})^2 / (2 \sigma^2_{kj})\), which can become remarkably large for specific observations in specific dimensions, so the corresponding pdf value underflows to zero. NaiveBayes truncates these contributions, discarding exactly the evidence that should drive the prediction, which leads to errors.

To illustrate, let’s examine the 177th sample.

spam[177, ]
x = as.numeric(X[177, ])                    # feature vector of the 177th email
p1 = (x - mymean[1, ])^2/(2*mysd[1, ]^2)    # exponents under class 1
p2 = (x - mymean[2, ])^2/(2*mysd[2, ]^2)    # exponents under class 2

It’s worth noting that for this sample, the ‘Wcredit’ feature (the 20th feature) yields a very small PDF value, which is subsequently truncated. Consequently, the final prediction from NaiveBayes incorrectly favors class 1.

p1
p2
which(exp(-p1) == 0)
which(exp(-p2) == 0)
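
To quantify the effect for this sample (assuming, per the discussion above, that the 20th feature is the one flagged by the which calls), here is a short check comparing the exact log-density contribution of ‘Wcredit’ for each class with what survives after truncation; constants shared by both classes are omitted since they cancel.

j = 20   # 'Wcredit', the feature identified above

# exact per-class log-density contributions of feature j (up to shared constants)
c(-log(mysd[1, j]) - p1[j], -log(mysd[2, j]) - p2[j])

# once a density underflows to 0, truncation replaces it with 0.001,
# so its log-contribution is capped at about -6.9
log(0.001)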