9.6. Naive Bayes Classifiers

What is Naive Bayes, especially in the context of multi-class classification? The “Naive” part of Naive Bayes comes from the assumption of feature independence: the features are assumed to be conditionally independent of each other given the class label. This assumption greatly simplifies estimation and makes the method computationally efficient.

Recall the discriminant or decision function:

\[\arg\max_k P(Y=k | X=x) = \arg\max_k \pi_k f_k(x).\]

In this equation, \(f_k\) denotes the density function of the p-dimensional vector \(x = (x_1, \dots, x_p)\) within class k. In Naive Bayes, we factorize this joint density into a product of p one-dimensional densities, one per feature. Imposing independence in this way can be viewed as a form of regularization for high-dimensional problems.

\[f_k(x) = f_{k1}(x_1) \times f_{k2}(x_2) \times \cdots \times f_{kp}(x_p)\]

We proceed by estimating the one-dimensional density for each feature and each class. For discrete features, we can use histograms or count-based estimates; for numerical features, we can use kernel density estimates, which is straightforward since each estimation problem is one-dimensional. This approach is known as non-parametric Naive Bayes. Alternatively, for numerical features, we can fit a one-dimensional normal density for each \(f_{kj}\), which is termed parametric Naive Bayes.
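To make this concrete, here is a minimal sketch in Python (not from our code page; the data and the helper names `fit_parametric` and `fit_nonparametric` are made up for illustration) of how the one-dimensional densities could be estimated per class and per feature:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical training data: X is n x p, y holds the class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

def fit_parametric(X, y):
    """Parametric Naive Bayes: one univariate Gaussian per (class, feature).
    Each class k gets p means mu_{kj} and p variances sigma^2_{kj}."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), Xk.var(axis=0))
    return params

def fit_nonparametric(X, y):
    """Non-parametric Naive Bayes: one 1-D kernel density estimate per (class, feature)."""
    kdes = {}
    for k in np.unique(y):
        Xk = X[y == k]
        kdes[k] = [gaussian_kde(Xk[:, j]) for j in range(X.shape[1])]
    return kdes

print(fit_parametric(X, y)[0])   # (means, variances) for class 0
```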

To test your understanding: when employing parametric Naive Bayes, how many parameters must we estimate? For each class k we need p means (one per dimension) and p variances (one per dimension), i.e., 2p parameters per class, or 2pK density parameters across all K classes, in addition to the class priors \(\pi_k\).

I’d like to emphasize a subtle issue that arises during the prediction stage of parametric Naive Bayes. When making predictions for a new test point \(x^*\), we compute the following decision or discriminant function for each k:

\[\begin{split}d_k(x^*) &= - \log \Big [ \pi_k \frac{1}{\sqrt{\sigma^2_{k1}}} e^{-\frac{(x^*_1 - \mu_{k1})^2}{2 \sigma^2_{k1}}} \cdots \frac{1}{\sqrt{\sigma^2_{kp}}} e^{-\frac{(x^*_p - \mu_{kp})^2}{2 \sigma^2_{kp}}} \Big ] \\ &= - \log \pi_k + \frac{1}{2} \sum_{j=1}^p \Big [ \log \sigma^2_{kj} + \frac{(x^*_j - \mu_{kj})^2}{\sigma_{kj}^2} \Big ]\end{split}\]

Some Naive Bayes packages compute this term via the first line, i.e., they evaluate the normal probability density function (pdf) and then take the logarithm. When \(x^*\) is far from the class center, the normal pdf can underflow to an extremely small value (or to zero), causing numerical problems when the logarithm is applied. To avoid this, many Naive Bayes packages truncate the pdf value at some small floor before taking the logarithm. However, this truncation can result in incorrect predictions. Keep in mind that we only care about the relative magnitudes of the \(d_k\): since \(d_k\) is a negative log-posterior (up to constants), we predict the class with the smallest \(d_k\). Truncation erases the differences among the \(d_k\) values; for instance, the pdf values for all classes may be floored at the same constant, making every \(d_k\) identical, even though the untruncated \(d_1\) would have been the smallest and class 1 should have been predicted.
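To see why the order of operations matters, here is a small numerical sketch of my own (a single Gaussian feature with arbitrary numbers), contrasting pdf-then-log with the direct log-space evaluation given by the second line above:

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 0.0, 1.0
x_star = 40.0   # a test point far from the class center

# Route 1: evaluate the normal pdf first, then take the log.
p = norm.pdf(x_star, loc=mu, scale=sigma)
print(p)          # 0.0 -- the pdf underflows in double precision
print(np.log(p))  # -inf; a package that instead clamps p to a floor erases the differences among classes

# Route 2: work in log space from the start.
print(norm.logpdf(x_star, loc=mu, scale=sigma))                  # about -800.9, still informative
print(-0.5 * (np.log(sigma**2) + (x_star - mu)**2 / sigma**2))   # same up to the constant 0.5*log(2*pi)
```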

See our code page for an illustration of this issue using the ‘NaiveBayes’ function from the ‘klaR’ package in R on the ‘spam’ dataset. The GaussianNB classifier from the sklearn package in Python does not have this issue.
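For comparison, here is a minimal GaussianNB usage sketch on synthetic stand-in data (not the spam example from the code page):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data with 4 features.
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, size=(50, 4)), rng.normal(3, 1, size=(50, 4))])
y_train = np.repeat([0, 1], 50)

clf = GaussianNB().fit(X_train, y_train)

# Even for a test point far from both class means, the log-probabilities stay finite
# and the prediction remains sensible.
x_far = np.full((1, 4), 50.0)
print(clf.predict(x_far))
print(clf.predict_log_proba(x_far))
```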