Labeling Recipes with Logistic Regression
Part 3 - Lunch / Dinner Labels, More Insights
Python code in this project: lunch_label_code.txt + dinner_label_code.txt + embedded snippets.
Having cleaned the data, dealt with possible multicollinearity issues, and looked at breakfast labels in part 2, let's move on to lunch and dinner labels.
Multiclass Logistic Regression
In part 1 we talked about logistic regression for just two classes \( y = 0 \) and \( y = 1 \). However, since I am trying to choose between three labels (breakfast, lunch, dinner) here, I would need to think about how to generalize to more than two classes. There are many ways to go about doing this. A few common methods are:
- One-vs-all: run a separate logistic regression for each label, in which that particular label is set to \( y = 1 \), while other labels are all set to \( y = 0 \). This is what we have done for breakfast labels in part 2. Then, pick the label with the highest probability. Also known as one-vs-rest.
- One-vs-one: run a logistic regression for each possible pair of labels. Then decide the label using majority voting. This needs \( K(K-1)/2 \) models for \( K \) labels, so it can get computationally expensive as the number of labels grows.
- Softmax: use the softmax function, which generalizes the logistic function to multiple dimensions. This is also known as multinomial logistic regression.
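To make the softmax option a bit more concrete, here is a minimal sketch in plain NumPy. The function names `softmax` and `logistic` and the input scores are my own illustrative choices, not something taken from the project code. It also checks the sense in which softmax generalizes the logistic function: with two classes and one score fixed at zero, the softmax probability of the other class equals the logistic function of its score.

```python
import numpy as np

def softmax(z):
    """Map a vector of real-valued scores to probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()          # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

def logistic(z):
    """The two-class logistic (sigmoid) function from part 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Three made-up scores, one per label (illustrative values only).
p = softmax([2.0, 1.0, 0.1])

# Two-class special case: softmax([z, 0])[0] equals logistic(z).
two_class = softmax([1.5, 0.0])
print(np.isclose(two_class[0], logistic(1.5)))
```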
Just like what I did for breakfast labels, I am going to continue using the one-vs-all method, and fit a separate logistic regression model for each of the lunch and dinner labels.
Note that the logistic regression module in sklearn offers both the one-vs-rest and multinomial schemes: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
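As a rough sketch of what the one-vs-all scheme looks like when done by hand, the snippet below fits one binary `LogisticRegression` per label and picks the label with the highest probability. It is not the project code: the data comes from `make_classification` as a synthetic stand-in for the recipe features, and the variable names (`models`, `probs`, `predicted`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the recipe data (hypothetical): classes 0, 1, 2
# play the role of breakfast, lunch and dinner.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
labels = ["breakfast", "lunch", "dinner"]

# One-vs-all: one binary model per label, with that label set to y = 1
# and every other label set to y = 0.
models = {}
for k, name in enumerate(labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, (y == k).astype(int))
    models[name] = clf

# Probability of y = 1 under each binary model, one column per label.
probs = np.column_stack(
    [models[name].predict_proba(X)[:, 1] for name in labels]
)

# Pick the label with the highest probability.
predicted = np.array(labels)[probs.argmax(axis=1)]
```

Note that the three probabilities in a row of `probs` come from separate models, so they do not have to sum to 1; only their ranking is used.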
Lunch Labels
The Python code for lunch labels can be found here: lunch_label_code.txt. Unlike for breakfast labels, this is not a perfect separation, but the fit is surprisingly close to perfect!
Just like for breakfast labels, we look at the top 30 and bottom 30 explanatory variables, ranked by the value of their fitted coefficients.
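The ranking step can be sketched as below. The helper `top_and_bottom` is a hypothetical name, and the variable names and coefficient values are made up purely to show the mechanics; with a fitted sklearn model one would pass something like `clf.coef_[0]` instead.

```python
import numpy as np

def top_and_bottom(feature_names, coefs, n=30):
    """Return the n largest and n smallest coefficients with their names."""
    coefs = np.asarray(coefs, dtype=float)
    order = np.argsort(coefs)  # indices sorted by ascending coefficient
    bottom = [(feature_names[i], float(coefs[i])) for i in order[:n]]
    top = [(feature_names[i], float(coefs[i])) for i in order[::-1][:n]]
    return top, bottom

# Made-up names and values, for illustration only.
names = ["var_a", "var_b", "var_c", "var_d", "var_e"]
coefs = np.array([1.8, -2.1, 0.9, -0.4, -3.0])

top, bottom = top_and_bottom(names, coefs, n=2)
print(top)
print(bottom)
```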
Dinner Labels
Python code for dinner labels: dinner_label_code.txt. Dinner labels performed the worst, but I was extremely surprised to see that the fit is still close to perfect! Again, we look at the top 30 and bottom 30 explanatory variables, ranked by the value of their fitted coefficients.
Using Unlabeled Recipes as Test Set
As I mentioned in part 1, the large majority of the recipes in this dataset are missing breakfast/lunch/dinner labels. We can actually use these as a kind of test set in cross validation. Cross validation involves training a machine learning model on different sets of training data, then validating the accuracy on a disjoint set of test data. The goal is to avoid overfitting to particular patterns unique to any one training data set, while capturing patterns that are generalizable.
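For reference, the standard k-fold version of cross validation can be sketched with sklearn's `cross_val_score`. This is a generic illustration on synthetic data from `make_classification`, not the unlabeled-recipe setup described above, and the choice of 5 folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data standing in for a single label (hypothetical).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 5-fold cross validation: train on 4 folds, score accuracy on the
# held-out fold, and repeat so that each fold is the test set once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across folds
```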
It is actually quite similar to the concept of bootstrapping in statistics, which is a technique for studying how a statistical estimator varies across different sample datasets. This classic paper has an excellent write-up about it: http://www.medicine.mcgill.ca/epidemiology/hanley/bios601/Mean-Quantile/EfronDiaconisBootstrap.pdf
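A minimal sketch of the bootstrap idea, assuming the estimator of interest is the sample median: resample the data with replacement many times, recompute the estimator each time, and look at how much it varies. The sample here is randomly generated, and the number of resamples (1000) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# A randomly generated sample standing in for real data (hypothetical).
sample = rng.normal(loc=5.0, scale=2.0, size=100)

# Bootstrap: draw resamples of the same size, with replacement,
# and recompute the estimator (here the median) on each one.
n_boot = 1000
medians = np.empty(n_boot)
for b in range(n_boot):
    resample = rng.choice(sample, size=sample.size, replace=True)
    medians[b] = np.median(resample)

# The spread of the bootstrap medians estimates the standard error
# of the sample median.
standard_error = medians.std(ddof=1)
print(standard_error)
```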
This section is a work in progress that will be finished soon!