Naive Bayes - Mathematical Theory

Sections

The Non-Naive Bayes Classifier
Lowest Possible Classification Error
Setting up Naive Bayes with Bayes’ Theorem
The Naive Assumption
Decision Rule and Model Training
Pseudo-count and Additive Smoothing
Surprising Real World Performance

The naive Bayes classifier is a simple method for classifying data points.

A very common example is when the data point is the contents of an email. A naive Bayes classifier can be used to decide whether to label the email as spam or not spam (also known as ham), based on its contents. This is actually a very popular use case that has been around since 1996!

Despite its simplicity, there is some interesting math behind its derivation, and it has surprisingly good performance in real-world applications.

The Non-Naive Bayes Classifier

Let us first start with the (non-naive) Bayes classifier and the fact that it is actually possible to derive a theoretical lowest possible error for any classification method!

Let \( Y \) be a random variable representing the class labels, and \( X \) be a random variable representing the data point. The Bayes classifier \( b: X \to Y \) assigns a data point \( x = (x_1,...,x_n) \) to class \( b(x) \) using the rule

\[ b(x) = \underset{y}{\text{argmax}} \,\, P(Y = y \mid X = x). \]

Note that this rule requires us to know the probability \( P(Y \mid X) \), which is usually impossible in practical settings. For the binary \( Y = 0 \text{ or } 1 \) case, this rule is equivalent to these two conditions:

\( b(x) = 0 \) if \( P(Y = 1 \mid X) < 0.5 \)
\( b(x) = 1 \) if \( P(Y = 1 \mid X) >= 0.5 \)

One way of writing the error or misclassification rate of the Bayes classifier is

\[ P(Y \neq b(X)). \]

This is known as the Bayes error rate. It is simply the probability that the Bayes classifier assigns a data point to the wrong class.

Lowest Possible Classification Error

We can actually show that the Bayes error rate is the lowest misclassification error any classification rule \( h: X \to Y \) can achieve!

The proof is actually not difficult, but there are a lot of details to work through. The clever tricks used to rewrite various quantities certainly do not help. Please refer to this set of lecture notes for the full derivation.

The key step is to rewrite the misclassification rate as a sum,

\[ P(Y \neq h(x)) = P(Y = 1, \, h(X) = 0) + P(Y = 0, \, h(X) = 1). \]

Then, using very clever tricks to rewrite this sum as,

\[ \mathbb{E}_X [ \,\, \mathbb{1}_{h(X)=0} \cdot P(Y = 1 \mid X) \,\, + \,\, \mathbb{1}_{h(X)=1} \cdot (1 - P(Y = 1 \mid X)) \,\, ]. \]

The \( \mathbb{1}_{h(X)=k} \) refers to the indicator function,

\[ \begin{align} \mathbb{1}_{h(X)=k} = \begin{cases} 0 \text{,} \,\,\,\, \text{if } h(X) \neq k \\[.5em] 1 \text{,} \,\,\,\, \text{if } h(X) = k. \end{cases} \end{align} \]

It should be clear from this formulation that the rule for the Bayes classifier in the previous section would minimize this quantity. For example, if \( P(Y = 1 \mid X) < 0.5 \), then having \( h(X) = 0 \) would minimize the quantity, which is exactly what the Bayes classifier recommends.

Setting up Naive Bayes with Bayes’ Theorem

I am going to derive the naive Bayes classifer using a similar approach as the Scikit-learn documentation.

Given a data point \( x = (x_1,...,x_n) \) containing \( n \) features, the probability that the class will be a particular value \( y \) given these features is

\[ P(Y = y \mid X = x). \]

We want to pick the label \( y \) that maximizes this quantity. First, we use Bayes’ Theorem to write this probability differently.

\[ P(Y = y \mid X = x) = \frac{ P(Y = y) \cdot P(X = x \mid Y = y) } { P(X = x) }. \]

For a fixed data point \( x = (x_1,...,x_n) \), the denominator is a constant over all the \( y \) classes. Therefore, it is not relevant for maximizing \( P(Y = y \mid X = x) \). Thus, we will focus on the \( P(Y = y) \cdot P(X = x \mid Y = y) \) term in the numerator.

The Naive Assumption

The word “naive” in the name of the algorithm refers to the key assumption that the \( n \) features \( (x_1,...,x_n) \) are conditionally independent given the class \( y \). Of course, this assumption is most probably not going to hold in the real world. Members of a class will almost always have dependencies among their features. For example, if we have a set of data on physical features (body fat, height, weight, foot size, etc), and two classes “male” and “female”. There will certainly be correlations between the features within each class.

However, this “naive” assumption makes the math a lot easier by allowing us to factor the numerator, using the chain rule for probability, into a product of marginals

\[ P(Y = y) \cdot P(X = x \mid Y = y) = P(Y = y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). \]

We can then focus on selecting the label \( Y = y \) that maximizes this quantity. Note that \( P(X_i = x_i) \) is the probability that the \( i \) feature has value \( x_i \).

Decision Rule and Model Training

Under the “naive” assumption, the decision rule for labeling a data point \( x \) with the label \( \hat{y} \) is

\[ \hat{y} = \underset{y}{\text{argmax}} \, P(y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). \]

We have to make yet another assumption, this time about \( P(X_i = x_i \mid Y = y) \), in order to fit the model to the data. There are a variety of ways to do so, depending on the kind of data we have.

In our spam email example, each data point \( x = ( x_1, x_2, \dots, x_n ) \) could contain binary variables indicating whether the \( i \)-th word occurs in the email. We can then assume that, given the label \( Y \), each \( X_i \) is generated by a Bernoulli distribution. That is,

\[ P(X = x_i \mid Y = y) = p_i x_i + (1-p_i)(1-x_i). \]

We then train the model, or fit the model to the data, by estimating the probability \( p_i \) that the \( i \)-th word appears. Note that multiplying many small numbers can be computationally unstable, but turning the product into a sum by taking the logarithm would make it stable.

If the \( X_i \) are categorical variables, then we can instead estimate the conditional probability using

\[ P(X = x_i \mid Y = y) = \frac{ \text{ number of times category } x_i \text{ appears in class } y }{ \text{ total number of data points with class label } y }. \]

Finally, if the \( X_i \) are continuous variables, a common choice is to assume that, given the label \( Y \), the variables are normally distributed.

Pseudo-count and Additive Smoothing

There is yet another problem that we have to overcome, but thankfully, it is also the last one. The problem is that the quantity \( \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) = 0 \) the moment a single \( P(X_i = x_j \mid Y = y) = 0 \).

This means discarding all the information provided by all other \( P(X_i = x_i \mid Y = y) > 0 \), simply because a single one is zero. For example, this can occur in practice if a particular word appears in non-spam emails but never occurs in spam emails.

In the case when features \( X_i \) are discrete, we can simply add a “pseudocount” to make sure that each possible feature value is counted as appearing at least once, for each class label. These kinds of methods are known as additive smoothing. In the normal distribution case, when the \( x_i \) are continuous, we just need to be careful not to use features with zero variance.

Surprising Real World Performance

We have seen that the Bayes classifier has the lowest possible error among classification algorithms. We also talked about the fact that the naive Bayes classifier assumes that the features are conditionally independent and thus allows us to factor \( P(X = x \mid Y = y) \) into the product \( \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) \).

Since features in real-world data points tend to be dependent, we might expect the naive assumption to lead to significant performance degradation. However, the naive Bayes classifier is known to perform surprisingly well in real-world applications!

An explanation for this is given by this paper titled “The Optimality of Naive Bayes”. It shows that the dependencies among feature variables could cancel each other out, or be distributed so evenly among the classes that they do not have a big impact.