Labeling Recipes with Logistic Regression

Part 2 - Data Cleaning, Multicollinearity, Breakfast Labels

Python code used: code_for_cleaning.txt + breakfast_label_code.txt + embedded snippets.
In part 1 of this article, I introduced the project and briefly went through some of the mathematics behind logistic regression. In this part, I will continue working on using logistic regression to label recipes from www.epicurious.com as breakfast, lunch, or dinner.

Cleaning the Dataset


We will be working with this set of data that was generously uploaded to Kaggle and made publicly available: https://www.kaggle.com/hugodarwood/epirecipes/.

The CSV file provided has 20052 rows and 679 columns. The rows are the recipes, while the columns are a mix of nutritional information (calories, sodium, fat, etc.) and binary variables indicating whether an ingredient or a label is present in a recipe.

Missing Data Entries

Many of the recipes are missing nutritional information, seemingly at random as far as I can tell. I chose to simply eliminate these rows. We will see from the results later that this ended up not having much of an impact. I might revisit the topic of missing data and imputation in a future article. After this, we are left with 15864 recipes.
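As a rough sketch of this step (I believe the file in the Kaggle upload is named epi_r.csv; the rest mirrors what the cleaning code does):

import pandas as pd

# load the raw dataset (filename as provided in the Kaggle upload)
data = pd.read_csv('epi_r.csv')

# drop every recipe that is missing any of the four nutrition columns
nutrition = ['calories', 'protein', 'fat', 'sodium']
data = data.dropna(subset = nutrition)

len(data) # 15864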

Extreme Outliers

Next, we plot the four nutrition columns and see that there are some extreme outliers.
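A minimal sketch of such plots (the article's actual figures may have been produced differently):

import matplotlib.pyplot as plt

# plot the four nutrition columns to eyeball the outliers
data[['calories', 'protein', 'fat', 'sodium']].plot(
    subplots = True, layout = (2, 2), figsize = (10, 6))
plt.show()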



We can plot the number of recipes with nutrition values above various thresholds, to get an idea of what the problematic data points are like. We can see from the protein and fat plots that part of the problem is the handful of recipes with values of over 1000 for protein and/or fat.
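The threshold counts can be produced along these lines (a sketch, not the article's exact plotting code):

import numpy as np
import matplotlib.pyplot as plt

# for each nutrition column, count how many recipes exceed each threshold
thresholds = np.arange(0, 10000, 100)
fig, axes = plt.subplots(2, 2, figsize = (10, 8))
for ax, col in zip(axes.flat, ['calories', 'protein', 'fat', 'sodium']):
    ax.plot(thresholds, [(data[col] > t).sum() for t in thresholds])
    ax.set_title(col)
    ax.set_xlabel('threshold')
    ax.set_ylabel('recipes above threshold')
plt.tight_layout()
plt.show()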



Just like before, these errors appeared to be random as far as I could tell. So, I used a quick-and-dirty solution: simply removing all recipes with calories or sodium values of 5000 or above. This also removed the recipes with extreme values for protein and fat. Again, we will see later that this did not have much of an impact.

# keep only recipes with calories and sodium below 5000
data = data[data['calories'] < 5000]
data = data[data['sodium'] < 5000]

# confirm that the extreme protein and fat values are gone as well
sum(data['protein'] > 600) # result = 0
sum(data['fat'] > 600) # result = 0

Breakfast, Lunch, Dinner Labels

Lastly, we trim the data down to recipes that have one and only one label. There can be valid reasons for a recipe to carry more than one label; some dishes, for example, can be considered both lunch and dinner. For this project, however, I am going to focus on recipes with exactly one label.

label_sum = (data['breakfast'] + data['lunch'] + data['dinner'])

# keep only recipes with exactly one of the three labels
data = data[label_sum == 1]

len(data) # 2460

# sanity check: the label counts add back up to the number of recipes
sum(data['breakfast']==1) + sum(data['lunch']==1) + sum(data['dinner']==1) # 2460

data.to_csv('bld.csv', index = False) # save a copy for the next step

After all this cleaning, 130 columns ended up containing nothing but zeroes, and so were dropped. We also drop the "brunch" column since it just repeats information from the breakfast and lunch labels.
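A sketch of that step (the column scan is my own formulation of what the cleaning code does):

# drop the ingredient columns that became all zeros after filtering
zero_cols = [c for c in data.columns if (data[c] == 0).all()]
data = data.drop(columns = zero_cols)

# drop the redundant brunch label
data = data.drop(columns = ['brunch'])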

The final dataset contains 2460 recipes and 549 columns. The Python code for everything we have done up to this point can be found here: code_for_cleaning

Multicollinearity


There is one last thing we need to do before we can finally run our logistic regression analysis: look at the correlations between our independent variables \( \{ x_1,...,x_n \} \).

The \(i\)-th regression coefficient \( \beta_i \) represents the change in the response \( y \) (in logistic regression, the log-odds) per unit change in \( x_i \), while keeping all other variables constant. \( \beta_i \) essentially isolates the effect of \( x_i \), which is difficult to do when \( x_i \) is highly correlated with another variable \( x_j \).

When two variables are highly correlated, they tend to move together: any change in \( x_i \) comes with a change in \( x_j \), which makes it impossible to vary \( x_i \) while keeping everything else constant. This problem is known as multicollinearity.
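To see the problem concretely, here is a small toy example (entirely made-up data, not from our dataset) where two nearly identical predictors make the individual coefficients unreliable:

import numpy as np
from sklearn.linear_model import LogisticRegression

# made-up data: x2 is almost an exact copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size = 500)
x2 = x1 + rng.normal(scale = 0.01, size = 500)
y = (x1 + rng.normal(size = 500) > 0).astype(int)

# only the sum of the two coefficients is well determined; how it is
# split between x1 and x2 is essentially arbitrary
# (penalty=None requires scikit-learn >= 1.2; use penalty='none' on older versions)
model = LogisticRegression(penalty = None, max_iter = 5000)
model.fit(np.column_stack([x1, x2]), y)
print(model.coef_)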

There is no official definition of "highly correlated". For this project, we will just go with an arbitrary threshold of \( 0.85 \) for the Pearson correlation coefficient.

import pandas as pd

# data generated by our cleaning process
data = pd.read_csv('bld.csv')

# correlation matrix of the independent variables
C = data.drop(columns = ['breakfast','lunch','dinner'])
C = C.corr(numeric_only = True) # skip non-numeric columns such as the recipe title
C[C == 1.0] = 0.0 # zero out the 1s on the diagonal (this would also mask any perfect off-diagonal correlations)

# see how many entries meet our criteria
# (the matrix is symmetric, so each pair is counted twice)
(C > 0.85).sum().sum() # result = 6
(C < -0.85).sum().sum() # result = 0

# rows for the variables involved in exactly one > 0.85 correlation
high_corr = C[(C > 0.85).sum() == 1]
row_names = list(high_corr.index)

# sub-matrix consisting of only the > 0.85 correlations
high_corr[row_names]


The pandas package has convenient tools for generating and working with the correlation matrix. Using them, we find that no variables have \( < -0.85 \) correlation, while six variables have \( > 0.85 \) correlation with some other variable (six entries correspond to three pairs, since the matrix is symmetric). These six variables are shown below.


It makes sense that recipes high in fat are also high in calories, and that Portland is a city in Oregon. However, the high correlation between being a drink and being non-alcoholic is interesting; it tells us that most of the beverage recipes in the dataset are non-alcoholic.

Out of each pair of highly correlated variables, I chose to drop the narrower one (fat, portland, and non-alcoholic) while keeping the more general one.
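The drop itself is a one-liner (the exact column names here are my assumption of how they appear in the csv):

# drop the narrower variable from each highly correlated pair
data = data.drop(columns = ['fat', 'portland', 'non-alcoholic'])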

Labeling Breakfast Recipes


Before I end this part of the article, let's take a quick look at the results for breakfast labels. The code for this part can be found here: breakfast_label_code.txt
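The fit itself looks roughly like this (a sketch, not the exact contents of breakfast_label_code.txt; I assume non-numeric columns such as the recipe title are set aside):

from sklearn.linear_model import LogisticRegression

# independent variables: everything except the three labels,
# keeping only the numeric columns
X = data.drop(columns = ['breakfast','lunch','dinner']).select_dtypes('number')
y = data['breakfast']

# penalty=None disables regularization (scikit-learn >= 1.2; use penalty='none' on older versions)
model = LogisticRegression(penalty = None, max_iter = 5000)
model.fit(X, y)

model.score(X, y) # training accuracy of 1.0 reflects the perfect separation discussed below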


Surprisingly, we were able to get perfect separation! This explains why I had some minor problems getting the solver to converge: under perfect separation there are infinitely many coefficient vectors that classify the training data perfectly, and the likelihood keeps improving as the coefficients are scaled up, so an iterative solver has no finite optimum to settle on.

Let's take a look at the top 30 and bottom 30 explanatory variables for predicting breakfast labels, ranked by the values of the fitted coefficients.

Most of these are exactly what we would expect to be in or not in breakfast recipes. However, the columns containing locations like "washington" and "maryland" should probably be removed.

There were a few that I thought were mistakes or inaccuracies but ended up being correct! For example, "hot pepper" turns out to be common in scrambled-egg and other egg-based breakfast recipes in the dataset. I also learned that semolina porridge and tofu scramble are breakfast items, and that sake salmon is a Japanese breakfast dish!

Keep in mind that having an ingredient in the "top 30" list does not automatically make a recipe breakfast. For example, out of 269 recipes containing "egg", 98 were labeled lunch or dinner. This is a multivariate model: the final output depends on many variables, not just one.

One thing to note is that p-values are not reported by sklearn's "linear_model" module, even when regularization is disabled. The statsmodels package is an alternative that does provide p-values.
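For example, something along these lines (a sketch, reusing X and y from the fit above; note that statsmodels will complain about our perfectly separated breakfast data, so this is more useful on problems without perfect separation):

import statsmodels.api as sm

# Logit reports standard errors and p-values alongside the coefficients
result = sm.Logit(y, sm.add_constant(X)).fit()
print(result.summary())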



We will look at these results in more detail, and also predict lunch and dinner labels, in part 3 of this article.