I’d like to emphasize a subtle issue that arises during the prediction stage of parametric Naive Bayes. When making predictions for a point x, we compute the following function for each k:
\[ \Delta_k(x) = \log \Big [ \pi_k \frac{1}{\sqrt{\sigma^2_{k1}}} e^{-\frac{(x_1 - \mu_{k1})^2}{2 \sigma^2_{k1}}} \cdots \frac{1}{\sqrt{\sigma^2_{kp}}} e^{-\frac{(x_p - \mu_{kp})^2}{2 \sigma^2_{kp}}} \Big ] = \log \pi_k + \sum_{j=1}^p \log \Big [ \frac{1}{\sqrt{\sigma^2_{kj}}} e^{-\frac{(x_j - \mu_{kj})^2}{2 \sigma^2_{kj}}} \Big ] \] Or, equivalently, \[ \Delta_k(x) = \log \pi_k - \frac{1}{2} \sum_{j=1}^p \Big [ \log \sigma^2_{kj} + \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} \Big ] \] (The constant factor \((2\pi)^{-p/2}\) is the same for every class \(k\) and has been dropped.)
Some Naive Bayes packages use the first expression, i.e., evaluating the normal probability density function (pdf) for each feature and then taking the logarithm. When \(x_j\) is far from the center \(\mu_{kj}\), or when \(\sigma^2_{kj}\) is very small, the normal pdf can underflow to zero in floating-point arithmetic, so taking its logarithm produces \(-\infty\). To mitigate this problem, many Naive Bayes packages truncate the value of the normal pdf, for instance, replacing values below a pre-specified threshold with that threshold. However, this truncation can result in incorrect predictions.
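To see the underflow concretely, here is a minimal sketch in R (the numbers are illustrative and not taken from the spam data): once the density drops below the smallest representable double, evaluating the pdf and then taking the log gives -Inf, whereas asking dnorm for the log-density directly stays finite.

x = 50; mu = 0; sigma = 0.5       # illustrative values far in the tail
dnorm(x, mu, sigma)               # underflows to 0
log(dnorm(x, mu, sigma))          # -Inf: the numerical issue described above
dnorm(x, mu, sigma, log = TRUE)   # finite (about -5000), computed on the log scale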
Let’s illustrate this issue using the NaiveBayes command from the klaR package in R, with the ‘spam’ dataset serving as our example. The ‘spam’ dataset is a binary classification dataset: the task is to determine whether an email is spam or not based on features such as word frequencies.
library(klaR)
## Loading required package: MASS
spam = read.table("https://liangfgithub.github.io/Data/spam.txt")
spam.name = read.table("https://liangfgithub.github.io/Data/spam_name.txt")
for(i in 1:(ncol(spam)-1)) names(spam)[i] = as.character(spam.name[i,])
names(spam)[ncol(spam)] = "Y"
spam$Y = as.factor(spam$Y)
NaiveBayes from klaR

After loading the data, try the NaiveBayes command from the klaR package. Without suppressWarnings, you’ll encounter warning messages, which stem from R’s attempt to alert you about occurrences like “Numerical 0 probability for all classes with observation 1.”
suppressWarnings({
  NBfit = NaiveBayes(Y ~ ., data = spam)
  names(NBfit)  # check the output from NaiveBayes
  y.class = predict(NBfit, newdata = spam)$class
  y.prob = predict(NBfit, newdata = spam)$post
  y.prob = y.prob[, 1]  # prob of being in class 1
  table(spam$Y, y.prob > 0.5)
})
##
##     FALSE TRUE
##   0  1224 1564
##   1  1718   94
We’ve also developed a custom Naive Bayes implementation. We first compute mymean and mysd, two 2-by-p matrices holding the per-class feature means and standard deviations. These values should align with what you’d obtain from NBfit$tables returned by NaiveBayes.
X = spam[, -58]
Y = spam$Y

p = dim(X)[2]  # p: number of predictors
y.levels = levels(Y)
K = length(y.levels)  # number of groups

mymean = matrix(0, K, p)
mysd = matrix(0, K, p)
for(k in 1:K){
  mymean[k, ] = apply(X[Y == y.levels[k], ], 2, mean)
  mysd[k, ] = apply(X[Y == y.levels[k], ], 2, sd)
}
w = mean(Y == y.levels[1])
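As a quick sanity check, we can compare these estimates against the tables stored in the fitted klaR object. The snippet below is a sketch and assumes the usual klaR layout, in which NBfit$tables is a list with one entry per predictor, each a K-by-2 matrix of per-class means and standard deviations; adjust the indexing if your version differs.

# Sketch: compare our estimates for the first predictor with klaR's table.
# Assumes NBfit$tables[[1]] has column 1 = mean and column 2 = sd.
NBfit$tables[[1]]
cbind(mymean[, 1], mysd[, 1])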
Our implementation works entirely on the log scale, using the second expression for \(\Delta_k(x)\) above, and then converts to the posterior probability of the first class via \[ P(Y = 1| x) = \frac{e^{\Delta_1(x)}}{e^{\Delta_1(x)} + e^{\Delta_2(x)}} = \frac{1}{1 + e^{\Delta_2(x) - \Delta_1(x)}} \]
tmp = matrix(0, length(Y), 2)
tmp[, 1] = log(w)
tmp[, 2] = log(1 - w)
for(i in 1:p){
  tmp[, 1] = tmp[, 1] - log(mysd[1, i]) - (X[, i] - mymean[1, i])^2/(2*mysd[1, i]^2)
  tmp[, 2] = tmp[, 2] - log(mysd[2, i]) - (X[, i] - mymean[2, i])^2/(2*mysd[2, i]^2)
}
diff = tmp[, 2] - tmp[, 1]
my.prob = 1/(1 + exp(diff))
Our custom code is faster and encounters no warning messages. You’ll notice that our predictions diverge from those returned by klaR’s NaiveBayes.
id = which(round(my.prob, dig = 1) != round(y.prob, dig = 1))
length(id)
## [1] 499
id[1:5]
## [1] 177 346 373 546 707
For your reference, you can inspect the raw code for NaiveBayes on [Github Link]. The computation within it aligns with the code provided below. You’ll find that my.prob2 mirrors y.prob returned by NaiveBayes.
tmp = matrix(0, length(Y), 2)
for(i in 1:p){
  prob = dnorm(X[, i], mymean[1, i], mysd[1, i])
  prob[prob == 0] = 0.001
  tmp[, 1] = tmp[, 1] + log(prob)
  prob = dnorm(X[, i], mymean[2, i], mysd[2, i])
  prob[prob == 0] = 0.001
  tmp[, 2] = tmp[, 2] + log(prob)
}
tmp[, 1] = tmp[, 1] + log(w)
tmp[, 2] = tmp[, 2] + log(1 - w)
my.prob2 = 1/(1 + exp(tmp[, 2] - tmp[, 1]))
cbind(y.prob, my.prob2, my.prob)[id[1:5], ]
##      y.prob my.prob2 my.prob
## [1,]      1        1       0
## [2,]      1        1       0
## [3,]      1        1       0
## [4,]      1        1       0
## [5,]      1        1       0
The primary source of the discrepancy is the magnitude of the exponent, which can become extremely large for certain observations in certain dimensions; the corresponding pdf values underflow to zero, and NaiveBayes then truncates those contributions to 0.001, leading to incorrect predictions.
To illustrate, let’s examine the 177th sample.
spam[177, ]
x = X[177, ]
p1 = (x - mymean[1, ])^2/(2*mysd[1, ]^2)
p2 = (x - mymean[2, ])^2/(2*mysd[2, ]^2)
It’s worth noting that for this sample, the ‘Wcredit’ feature (the 20th feature) yields a very small pdf value, which is subsequently truncated. Consequently, the final prediction from NaiveBayes incorrectly favors class 1.
p1
p2
which(exp(-p1) == 0)
which(exp(-p2) == 0)
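To quantify how much the truncation distorts the result, we can compare the exact log-density contribution of this feature with the truncated value the package uses instead. The snippet below is a sketch that assumes the 20th feature is the one flagged by the which() calls above; dnorm(..., log = TRUE) computes the log-density stably, while the truncation replaces it with log(0.001), which can differ from the exact value by orders of magnitude on the log scale and thereby flip the comparison between \(\Delta_1(x)\) and \(\Delta_2(x)\).

# Sketch: exact log-density contribution of feature 20 ('Wcredit') for sample 177
# versus the truncated value log(0.001) used after the pdf is set to 0.001.
j = 20
exact1 = dnorm(X[177, j], mymean[1, j], mysd[1, j], log = TRUE)
exact2 = dnorm(X[177, j], mymean[2, j], mysd[2, j], log = TRUE)
c(exact1 = exact1, exact2 = exact2, truncated = log(0.001))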