I’d like to emphasize a subtle issue that arises during the prediction stage of parametric Naive Bayes. When making predictions for a point \(x\), we compute the following function for each class \(k\):
\[ \Delta_k(x) = \log \Big [ \pi_k \prod_{j=1}^p \frac{1}{\sqrt{\sigma^2_{kj}}} e^{-\frac{(x_j - \mu_{kj})^2}{2 \sigma^2_{kj}}} \Big ] = \log \pi_k + \sum_{j=1}^p \log \Big [ \frac{1}{\sqrt{\sigma^2_{kj}}} e^{-\frac{(x_j - \mu_{kj})^2}{2 \sigma^2_{kj}}} \Big ] \] Or alternatively, \[ \Delta_k(x) = \log \pi_k - \frac{1}{2} \sum_{j=1}^p \Big [ \log \sigma^2_{kj} + \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} \Big ] \] (Constants common to all classes, such as the \((2\pi)^{-p/2}\) factor of the normal pdf, are dropped since they do not affect the comparison across classes.)
Some Naive Bayes packages use the first approach: they evaluate the normal probability density function (pdf) for each feature and then take the logarithm.
When \(x_j\) is far from the center \(\mu_{kj}\), or when \(\sigma^2_{kj}\) is very small, the normal pdf can underflow to zero, and the subsequent logarithm becomes minus infinity.
To mitigate this problem, many Naive Bayes packages truncate the pdf values, for instance replacing very small or zero values with a pre-specified threshold. However, this truncation can result in incorrect predictions.
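To see the underflow concretely, here is a minimal illustration (not part of the spam analysis below): the normal pdf of a point far out in the tail evaluates to exactly zero in double precision, so taking the log afterwards gives -Inf, whereas asking dnorm for the log-density directly stays finite.
dnorm(40, mean = 0, sd = 1)               # underflows to 0
log(dnorm(40, mean = 0, sd = 1))          # -Inf
dnorm(40, mean = 0, sd = 1, log = TRUE)   # about -800.92, finite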
Let’s illustrate this issue using the NaiveBayes command
from the klaR package in R, with the ‘spam’ dataset serving
as our example.
The ‘spam’ dataset is a binary classification dataset: the goal is to determine whether an email is spam based on features such as word frequencies.
library(klaR)
## Loading required package: MASS
spam = read.table("https://liangfgithub.github.io/Data/spam.txt")
spam.name=read.table("https://liangfgithub.github.io/Data/spam_name.txt")
for(i in 1:(ncol(spam)-1)) names(spam)[i]=as.character(spam.name[i,]);
names(spam)[ncol(spam)]="Y"
spam$Y = as.factor(spam$Y)

NaiveBayes from klaR

After loading the data, try utilizing the NaiveBayes command from the klaR package.
Without suppressWarnings, you’ll encounter warning messages, which are R’s way of alerting you to occurrences like “Numerical 0 probability for all classes with observation 1.”
suppressWarnings({
NBfit = NaiveBayes( Y ~ ., data= spam)
names(NBfit) # check the output from NaiveBayes
y.class = predict(NBfit, newdata = spam)$class
y.prob = predict(NBfit, newdata = spam)$post
y.prob = y.prob[,1] # posterior prob of the first class (first column of post)
table(spam$Y, y.prob>0.5)
})
##
## FALSE TRUE
## 0 1224 1564
## 1 1718 94
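As a quick follow-up (output not shown), the implied misclassification rate at the 0.5 cutoff can be computed directly; this assumes, as the comment in the code above indicates, that y.prob is the posterior probability of the first class, “0”.
mean((y.prob > 0.5) != (spam$Y == "0"))   # klaR's error rate at the 0.5 cutoff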
We’ve also developed a custom Naive Bayes implementation. First, compute
mymean and mysd, two 2-by-p matrices containing the group-wise means and
standard deviations. These values should align with what you’d obtain from
NBfit$tables returned by NaiveBayes.
X = spam[, -58]
Y = spam$Y
p=dim(X)[2]; # p: number of predictors
y.levels = levels(Y)
K= length(y.levels) # number of groups
mymean = matrix(0, K, p)
mysd = matrix(0, K, p)
for(k in 1:K){
mymean[k,] = apply(X[Y == y.levels[k],], 2, mean)
mysd[k,] = apply(X[Y == y.levels[k],], 2, sd)
}
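# Optional sanity check: klaR stores its per-feature estimates in NBfit$tables,
# a list with one entry per predictor; for a numeric predictor each entry is
# (as far as I can tell -- inspect the object yourself) a K-by-2 matrix of
# class-wise means and standard deviations, which should match mymean and mysd.
# For the first predictor:
NBfit$tables[[1]]
cbind(mean = mymean[, 1], sd = mysd[, 1])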
w = mean(Y == y.levels[1])   # prior estimate for the first class
With the estimates in place, the posterior probability of the first class is \[ P(Y = 1| x) = \frac{e^{\Delta_1(x)}}{e^{\Delta_1(x)} + e^{\Delta_2(x)}} = \frac{1}{1 + e^{\Delta_2(x) - \Delta_1(x)}} \] which we compute on the log scale:
tmp = matrix(0, length(Y), 2)   # one column of log-scores per class
tmp[, 1] = log(w)               # log prior for class 1
tmp[, 2] = log(1-w)             # log prior for class 2
for(i in 1:p){
tmp[, 1] = tmp[, 1] - log(mysd[1,i]) - (X[,i] - mymean[1,i])^2/(2*mysd[1,i]^2)
tmp[, 2] = tmp[, 2] - log(mysd[2,i]) - (X[,i] - mymean[2,i])^2/(2*mysd[2,i]^2)
}
diff = tmp[, 2] - tmp[, 1]
my.prob = 1/(1 + exp(diff))
Our custom code is faster and produces no warning messages. You’ll notice that our predictions diverge from those returned by klaR’s NaiveBayes.
id = which(round(my.prob, dig = 1) != round(y.prob, dig = 1))
length(id)
## [1] 499
id[1:5]
## [1] 177 346 373 546 707
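Before digging into the cause, you may want to check the true labels of these divergent observations (output not shown):
table(spam$Y[id])   # true labels of the observations where the two predictions differ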
For your reference, you can inspect the raw code for
NaiveBayes on [Github
Link]. The computation within it aligns with the code provided below; you’ll find that my.prob2 mirrors the y.prob returned by NaiveBayes.
tmp = matrix(0, length(Y), 2)
for(i in 1:p){
prob = dnorm(X[,i], mymean[1,i], mysd[1,i])
prob[prob==0] = 0.001   # truncate: replace zero pdf values with 0.001
tmp[, 1] = tmp[, 1] + log(prob)
prob = dnorm(X[,i], mymean[2,i], mysd[2,i])
prob[prob==0] = 0.001   # same truncation for class 2
tmp[, 2] = tmp[, 2] + log(prob)
}
tmp[, 1] = tmp[, 1] + log(w)
tmp[, 2] = tmp[, 2] + log(1-w)
my.prob2 = 1/(1 + exp(tmp[, 2] - tmp[, 1]))
cbind(y.prob, my.prob2, my.prob)[id[1:5], ]
## y.prob my.prob2 my.prob
## [1,] 1 1 0
## [2,] 1 1 0
## [3,] 1 1 0
## [4,] 1 1 0
## [5,] 1 1 0
The primary source of the discrepancy is the magnitude of the exponent, which can become extremely large for certain observations in certain dimensions; the corresponding per-feature pdf then underflows to zero. NaiveBayes opts to truncate these values at a small threshold, leading to the erroneous predictions above.
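As a side note, this truncation appears to be user-controllable: in the klaR versions I have seen, predict.NaiveBayes has a threshold argument whose default (0.001) matches the constant used in the code above. Treat the argument name and default as an assumption and check ?predict.NaiveBayes before relying on it.
# Assumption: 'threshold' controls the value used to replace zero densities, e.g.
# predict(NBfit, newdata = spam, threshold = 0.001)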
To illustrate, let’s examine the 177th sample.
spam[177, ]
x = X[177, ]
p1 = (x - mymean[1, ])^2/(2*mysd[1, ]^2)
p2 = (x - mymean[2, ])^2/(2*mysd[2, ]^2)
It’s worth noting that for this sample, the ‘Wcredit’ feature (the 20th feature) yields a pdf value so small that it underflows to zero and is subsequently truncated to the threshold. Because the truncated value is far larger than the true density, the final prediction from NaiveBayes incorrectly favors the first class.
p1
p2
which(exp(-p1) == 0)   # features whose exponential term underflows to 0 under class 1
which(exp(-p2) == 0)   # features whose exponential term underflows to 0 under class 2
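Finally, here is a sketch of how the pdf-based loop above could be made numerically stable without any truncation: accumulate log-densities directly via dnorm(..., log = TRUE). The resulting my.prob3 (a name introduced here) should agree with my.prob up to floating-point error, since the omitted constants cancel in the difference of the two log-scores.
tmp3 = matrix(0, length(Y), 2)
for(i in 1:p){
  # add the per-feature log-density for each class; no underflow, no truncation
  tmp3[, 1] = tmp3[, 1] + dnorm(X[, i], mymean[1, i], mysd[1, i], log = TRUE)
  tmp3[, 2] = tmp3[, 2] + dnorm(X[, i], mymean[2, i], mysd[2, i], log = TRUE)
}
tmp3[, 1] = tmp3[, 1] + log(w)
tmp3[, 2] = tmp3[, 2] + log(1 - w)
my.prob3 = 1/(1 + exp(tmp3[, 2] - tmp3[, 1]))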