Naive Bayes Classification of Amazon Employees

Part 1 - Mathematical Theory


The naive Bayes classifier is a simple method for labeling, or classifying, data points.

For example, the data point could be the contents of an email. A naive Bayes classifier could be used to decide whether to label the email as spam or not spam (also known as ham), based on its content. This is a very popular use case that has been around since the 1990s!

Despite its simplicity, there is some interesting math behind its derivation, and it has surprisingly good performance in real-world applications.

The Non-Naive Bayes Classifier


I am going to start with the (non-naive) Bayes classifier, and the fact that it is actually possible to derive a theoretical lowest possible error for any classification method!

Let \( Y \) be a random variable representing the class labels, and \( X \) be a random variable representing the data point. The Bayes classifier \( b: X \to Y\) assigns a data point \( x = (x_1,...,x_n) \) to class \( b(x) \) using the rule

\[ b(x) = \underset{y}{\text{argmax}} \,\, P(Y = y \mid X = x). \]
Note that this rule requires us to know the probability \( P(Y \mid X) \), which is rarely available in practical settings. For the binary \( Y = 0 \text{ or } 1 \) case, this rule is equivalent to the two conditions
\[ b(x) = \begin{cases} 1 \text{,} \,\,\,\, \text{if } P(Y = 1 \mid X = x) \geq 0.5 \\[.5em] 0 \text{,} \,\,\,\, \text{if } P(Y = 1 \mid X = x) < 0.5. \end{cases} \]


One way of writing the Bayes error rate is \[ P(Y \neq b(X)). \]
This is simply the probability that the Bayes classifier assigns a data point to the wrong class.
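
To make the rule and the error rate concrete, here is a minimal Python sketch on a made-up toy problem where the posterior \( P(Y = 1 \mid X = x) \) is assumed to be known exactly; all of the numbers are hypothetical.

```python
# Hypothetical toy problem: X takes the values 0, 1, 2 and we pretend the
# posterior P(Y = 1 | X = x) is known exactly (rarely true in practice).
posterior_y1 = {0: 0.9, 1: 0.3, 2: 0.6}
p_x = {0: 0.5, 1: 0.3, 2: 0.2}  # marginal P(X = x), also hypothetical

def bayes_classifier(x):
    """Assign the label with the highest posterior probability."""
    return 1 if posterior_y1[x] >= 0.5 else 0

print([bayes_classifier(x) for x in p_x])  # [1, 0, 1]

# The Bayes error rate P(Y != b(X)): for each x, the classifier is wrong
# with probability min(P(Y = 1 | X = x), P(Y = 0 | X = x)).
bayes_error = sum(p_x[x] * min(posterior_y1[x], 1 - posterior_y1[x]) for x in p_x)
print(round(bayes_error, 2))  # 0.22
```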

Lowest Possible Classification Error


We can show that the Bayes error rate is the lowest error achievable by any classification rule \( h: X \to Y \)! The proof is actually not difficult, but there are a lot of details to work through, and the clever tricks used to rewrite various quantities certainly do not help. Please refer to this set of lecture notes for the full derivation.

The key step is to rewrite the error rate as a sum,

\[ P(Y \neq h(X)) = P(Y = 1, h(X) = 0) + P(Y = 0, h(X) = 1). \]
Then, using some very clever tricks, this sum can be rewritten as

\[ \mathbb{E}_X [ \,\, \mathbb{1}_{h(X)=0} \cdot P(Y = 1 \mid X) \,\, + \,\, \mathbb{1}_{h(X)=1} \cdot (1 - P(Y = 1 \mid X)) \,\, ]. \]
The \( \mathbb{1} \) refers to the indicator function, \[ \mathbb{1}_{h(X)=k} = \begin{cases} 0 \text{,} \,\,\,\, \text{if } h(X) \neq k \\[.5em] 1 \text{,} \,\,\,\, \text{if } h(X) = k. \end{cases} \]
It should be clear from this formulation that the rule for the Bayes classifier in the previous section minimizes this quantity. For example, if \( P(Y = 1 \mid X) < 0.5 \), then having \( h(X) = 0 \) minimizes the quantity, which is exactly what the Bayes classifier recommends.
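
As a sanity check on this claim, the toy numbers from the earlier sketch can be reused to evaluate this expectation for every possible rule \( h \); the Bayes rule does indeed attain the minimum. Again, the numbers are made up purely for illustration.

```python
from itertools import product

# Same hypothetical toy problem as in the previous sketch.
p_x = {0: 0.5, 1: 0.3, 2: 0.2}           # P(X = x)
posterior_y1 = {0: 0.9, 1: 0.3, 2: 0.6}  # P(Y = 1 | X = x)

def expected_error(h):
    """E_X[ 1{h(X)=0} * P(Y=1|X) + 1{h(X)=1} * (1 - P(Y=1|X)) ]."""
    return sum(
        p_x[x] * (posterior_y1[x] if h(x) == 0 else 1 - posterior_y1[x])
        for x in p_x
    )

def bayes_rule(x):
    return 1 if posterior_y1[x] >= 0.5 else 0

# Brute-force over every possible labeling of the three x values.
all_errors = [
    expected_error(lambda x, labels=labels: labels[x])
    for labels in product([0, 1], repeat=3)
]

print(round(expected_error(bayes_rule), 2))  # 0.22
print(round(min(all_errors), 2))             # also 0.22 -- no rule does better
```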

Setting up Naive Bayes with Bayes Theorem


I am going to derive the naive Bayes classifier using a similar approach to the scikit-learn documentation.

Given a data point \( x = (x_1,...,x_n) \) containing \( n \) features, the probability that the class takes a particular value \( y \), given these features, is

\[ P(Y = y \mid X = x). \]
We want to pick the label \( y \) that maximizes this quantity. First, we use Bayes' Theorem to write this probability in a different way.

\[ P(Y = y \mid X = x) = \frac{ P(Y = y) \cdot P(X = x \mid Y = y) } { P(X = x) }. \]
For a fixed data point \( x = (x_1,...,x_n) \), the denominator is constant over all the classes \( y \). Therefore, it is not relevant for maximizing \( P(Y = y \mid X = x) \), and we will focus on the numerator \( { P(Y = y) \cdot P(X = x \mid Y = y) } \).
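
Here is a quick numerical illustration of why the denominator can be ignored, using made-up prior and likelihood values: dividing by \( P(X = x) \) rescales every class by the same constant, so the argmax does not change.

```python
# Made-up numbers for a single data point x and two classes.
prior = {"spam": 0.4, "ham": 0.6}            # P(Y = y), hypothetical
likelihood = {"spam": 0.002, "ham": 0.0005}  # P(X = x | Y = y), hypothetical

# Unnormalized posterior: P(Y = y) * P(X = x | Y = y).
scores = {y: prior[y] * likelihood[y] for y in prior}

# Normalizing by P(X = x) (the sum of the scores) rescales every class by
# the same constant, so the argmax is unchanged.
evidence = sum(scores.values())
posterior = {y: scores[y] / evidence for y in scores}

print(max(scores, key=scores.get))        # 'spam'
print(max(posterior, key=posterior.get))  # 'spam' as well
```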

The Naive Assumption


The word "naive" in the name of the algorithm refers to the key assumption that the \( n \) features \( x_1,...,x_n \) are conditionally independent, given the class \( y \).

Of course, this assumption will often not hold in the real world. Members of a class will almost always have dependencies among their features. For example, suppose we have a dataset of physical measurements (body fat, height, weight, foot size, etc.) divided into two classes, "male" and "female". There will certainly be correlations between the features within each class.

However, this "naive" assumption makes the math a lot easier by allowing us to factor the numerator, via the chain rule for probability and conditional independence, into a product of per-feature conditional probabilities, \[ P(Y = y) \cdot P(X = x \mid Y = y) = P(Y = y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). \] We can then focus on selecting the label \( Y = y \) that maximizes this quantity. Note that \( P(X_i = x_i \mid Y = y) \) is the probability that the \( i \)-th feature has value \( x_i \), given the class \( y \).
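
As a small illustration of the factored form, with hypothetical per-feature probabilities, the score for a class is just the prior times the product of the per-feature conditionals; in practice one usually sums logarithms instead of multiplying, to avoid numerical underflow.

```python
import math

# Hypothetical values of P(X_i = x_i | Y = y) for one class y and four features.
prior_y = 0.4                                 # P(Y = y)
feature_conditionals = [0.8, 0.1, 0.6, 0.9]   # P(X_i = x_i | Y = y), i = 1..4

# Score used by the classifier for this class.
score = prior_y * math.prod(feature_conditionals)

# Equivalent computation in log space, which avoids underflow when n is large.
log_score = math.log(prior_y) + sum(math.log(p) for p in feature_conditionals)

print(score)                # 0.01728
print(math.exp(log_score))  # same value, up to floating point error
```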

Decision Rule & Training the Model


Under the "naive" assumption, the decision rule for labeling a data point \( x \) with the label \( \hat{y} \) is

\[ \hat{y} = \underset{y}{\text{argmax}} \,\, P(Y = y) \cdot \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y). \]
We have to make yet another assumption, this time about \( P(X_i = x_i \mid Y = y) \), in order to fit the model to data. There are a variety of ways to do so, depending on the kind of data we have; a short code sketch for each case follows the list below.

  • In our spam email example, each data point \( x = ( x_1, x_2, \dots, x_n ) \) could contain binary variables indicating whether the \( i \)-th word occurs in the email.

    We then assume that, given the label \( Y = y \), each \( X_i \) is generated by a Bernoulli distribution. That is,

    \[ P(X_i = x_i \mid Y = y) = p_{iy} x_i + (1-p_{iy})(1-x_i). \]
    We then train the model, or fit the model to the data, by estimating \( p_{iy} \), the probability that the \( i \)-th word appears in an email of class \( y \).

  • If the \( X_i \) are categorical variables, then we can instead estimate the conditional probability using

    \[ P(X_i = x_i \mid Y = y) = \frac{ \text{number of times category } x_i \text{ appears in class } y } { \text{total number of data points with class label } y }. \]
  • Finally, if the \( X_i \) are continuous variables, a common choice is to assume that, given the label \( Y \), the variables are normally distributed.
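
The three cases in the list above roughly correspond to scikit-learn's BernoulliNB, CategoricalNB and GaussianNB classes. Here is a rough sketch of how each might be fit; the data below is random and purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 1, 1, 1])  # toy binary labels for six data points

# Binary features (e.g. word present / absent) -> Bernoulli assumption.
X_binary = rng.integers(0, 2, size=(6, 3))
print(BernoulliNB().fit(X_binary, y).predict(X_binary[:2]))

# Categorical features encoded as small integers -> per-category frequencies.
X_categorical = rng.integers(0, 4, size=(6, 3))
print(CategoricalNB().fit(X_categorical, y).predict(X_categorical[:2]))

# Continuous features -> a normal distribution is fit within each class.
X_continuous = rng.normal(size=(6, 3))
print(GaussianNB().fit(X_continuous, y).predict(X_continuous[:2]))
```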

Pseudocount and Additive Smoothing


There is yet another problem that we have to overcome, but thankfully it is also the last one.

The problem is that the quantity \( \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) \) becomes zero the moment a single \( P(X_i = x_i \mid Y = y) = 0 \). This means throwing out all the information given by all of the other \( P(X_i = x_i \mid Y = y) > 0 \), just because a single one is zero.

For example, this can happen in practice if, in a dataset of emails, a particular word occurs in non-spam emails but never occurs in spam emails. In the case when the features \( X_i \) are discrete, we can simply add a "pseudocount" to make sure that each possible feature value is counted as appearing at least once for each class label. These kinds of methods are known as additive smoothing.

In the normal distribution case, when the \( x_i \) are continuous, we just need to be careful not to use features with zero variance.
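
Here is a concrete sketch of additive smoothing with a pseudocount of \( \alpha = 1 \) (Laplace smoothing), using made-up counts for a single categorical feature within one class; scikit-learn's discrete naive Bayes classes expose the same idea through their alpha parameter.

```python
# Made-up counts of each category of one feature among data points of class y.
counts = {"red": 3, "green": 0, "blue": 5}  # "green" never appears in class y
alpha = 1.0                                  # pseudocount
n_categories = len(counts)
total = sum(counts.values())

# Additive (Laplace) smoothing: every category is counted at least alpha times.
smoothed = {
    c: (counts[c] + alpha) / (total + alpha * n_categories) for c in counts
}

print(smoothed["green"])  # 1/11 instead of 0, so the product is not zeroed out
```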

Surprising Real World Performance


So far, we have seen that the Bayes classifier has the lowest possible error among classification algorithms. We also saw that the naive Bayes classifier assumes the features are conditionally independent, which allows us to factor \( P(X = x \mid Y = y) \) into the product \( \prod_{i = 1}^{n} P(X_i = x_i \mid Y = y) \).

Since features in real-world data points tend not to be independent, we might expect the naive assumption to lead to significant performance degradation. However, the naive Bayes classifier is known to perform surprisingly well in real-world applications! An explanation is given by the paper "The Optimality of Naive Bayes", which showed that the dependencies among feature variables can cancel each other out, or be distributed so evenly among the classes that they do not have a big impact.


We will use the naive Bayes classifier to label a set of real-world data in part 2 of this article.