Naive Bayes Classification of Amazon Employees

Part 2 - Classifying Amazon Employees

Python code used: encode_fit.txt + confusion_roc_auc.txt + embedded snippets.
I went through the mathematics behind the naive Bayes classifier in part one of this article. Here, I will use the naive Bayes classifier to label Amazon employees. I will also touch on the topic of imbalanced data, and on using the ROC curve and AUC to judge model fit.

Data Exploration and Preprocessing


The data for this article comes from an Amazon Kaggle competition: https://www.kaggle.com/c/amazon-employee-access-challenge. The Kaggle page has a great description of the columns in the data. The target label that we want to assign to each employee is the "ACTION" column.




The page also provides a useful snapshot of what the values are like, and how they are distributed. The figure below shows a truncated example of this.
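For readers following along locally, a minimal sketch of this kind of exploration might look like the snippet below (assuming the competition's training file has been downloaded as train.csv):

import pandas as pd

# Load the training data downloaded from the Kaggle competition page
amazon = pd.read_csv('train.csv')

# Peek at the first few rows and the number of distinct values per column
print(amazon.head())
print(amazon.nunique())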




Imbalanced Data

Because this is a competition run by Amazon, the data is very tidy: there are no missing values or extreme outliers. However, the target labels in the training dataset are rather imbalanced.




Only around \( 5.79 \% \) of the rows have an "ACTION" label equal to \( 0 \). So, labeling every single data point with a \( 1 \) would be correct \( 94.21 \% \) of the time! This is a well-known problem, but as we shall see later, it does not have a big impact on the naive Bayes classifier.
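A quick way to verify both claims is to check for missing values and to look at the class distribution directly. Here is a minimal sketch, assuming the data frame is named amazon as in the exploration snippet above:

# Confirm there are no missing values in any column
print(amazon.isnull().sum())

# Proportion of each "ACTION" label: roughly 94.21% are 1 and 5.79% are 0
print(amazon['ACTION'].value_counts(normalize=True))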

Ordinal Encoder

Since the columns are categorical, I will be using sklearn's naive Bayes module for categorical features. The module requires each feature's categories to be represented by the whole numbers \( \{ 0, \dots, n-1 \} \), where \( n \) is the number of categories that the feature has. To preprocess the data into the required format, I will be using the ordinal encoder module, as suggested by the documentation. The Python code is simple.


import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Load the training data and separate the target label from the features
amazon = pd.read_csv('train.csv')

y = amazon['ACTION']
x = amazon.drop(columns=['ACTION'])

# Map each feature's categories to the integers 0, ..., n-1
enc = OrdinalEncoder()
x = enc.fit_transform(x)
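As a quick sanity check (not part of the original code file), the fitted encoder exposes the categories it found for each column, so we can confirm that every feature has been mapped onto a contiguous range starting at zero:

# Number of distinct categories the encoder found for each feature
print([len(c) for c in enc.categories_])

# The encoded matrix now holds values in {0, ..., n-1} for each feature
print(x.min(), x.max())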


Naive Bayes Classification


After converting the categorical data into the required format in the previous step, fitting the categorical naive Bayes classifier in sklearn is easy.

from sklearn.naive_bayes import CategoricalNB

# Fit the categorical naive Bayes classifier on the encoded features
clf = CategoricalNB()
clf.fit(x, y)


A "score" is also provided by the module. It shows the proportion of data that were labeled correctly.




That \( 91 \% \) score might look amazing, but remember that this is an imbalanced dataset. As mentioned above, labeling every data point with a \( 1 \) would give us a \( 94.21 \% \) score. The Python code for what we have done so far can be found here: encode_fit.txt

Confusion Matrix, ROC Curve, AUC


How else should I evaluate the model? One thing we could do is to look at the confusion matrix. Sklearn has a module that conveniently computes it for us.
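A minimal sketch of how this can be computed with sklearn, again using the training data and the fitted classifier from above:

from sklearn.metrics import confusion_matrix

# Rows are true labels, columns are predicted labels
pred = clf.predict(x)
print(confusion_matrix(y, pred))

# Normalizing by row shows the fraction of each true label classified correctly
print(confusion_matrix(y, pred, normalize='true'))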




In this zero-indexed matrix, the \( (i,j) \) entry contains the number of data points that have true label \( i \) but are classified as \( j \) by our model. In particular, only around \( 55 \% \) of the data points with label \( 0 \) have been labeled correctly. I would consider this performance rather weak. It can be improved by engineering features with more predictive power, but that will have to wait for another article.

The receiver operating characteristic (ROC) curve is another tool for evaluating the model. Again, sklearn provides an easy way to plot it.
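A minimal sketch of plotting the ROC curve and computing the AUC from the predicted probabilities (the exact code used for this article is in confusion_roc_auc.txt):

import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, roc_auc_score

# Predicted probability of the positive class (label 1)
prob = clf.predict_proba(x)[:, 1]

# Area under the ROC curve
print(roc_auc_score(y, prob))

# Plot the ROC curve
RocCurveDisplay.from_predictions(y, prob)
plt.show()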




An area under the curve (AUC) of \( 0.74 \) is OK but not fantastic; as mentioned above, we will improve on this in another article. Another thing to note is that the precision-recall curve is actually a better choice for imbalanced data, but I am sticking to the ROC curve, which is what the Kaggle competition uses for evaluation.
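For completeness, here is a minimal sketch of the precision-recall alternative mentioned above, reusing prob and plt from the ROC snippet (this is not part of the code files linked in this article):

from sklearn.metrics import PrecisionRecallDisplay, average_precision_score

# Average precision summarizes the precision-recall curve in a single number
print(average_precision_score(y, prob))

# Plot the precision-recall curve
PrecisionRecallDisplay.from_predictions(y, prob)
plt.show()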

Python code for this part of the article can be found here: confusion_roc_auc.txt