Labeling Recipes with Logistic Regression

Part 1 - Introduction and Mathematical Theory

Python code in this project: code.txt
I have a great deal of interest in food, both cooking it and eating it. I guess you could call me a foodie. I also happen to enjoy analyzing data.

In recent years, online recipe repositories have exploded in both size and popularity. This gave me a very nice opportunity to combine my two interests. In this project, I will use logistic regression to assign breakfast, lunch, or dinner labels to a set of recipes compiled from www.epicurious.com.

There are two main uses for this. First, by looking at important features, we can understand more about consumer food preferences. Second, out of the 15710 recipes in the dataset, only 3337 were labelled as breakfast, lunch or dinner. Our classifier could help label the rest.

Logistic Regression


Suppose we want to predict a label \( y_i \) for observation \( i \) based on input features \( \{ x_1,...,x_n \} \). If we use linear regression, the model that we fit to the data is,

\[ \hat{y}_i = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n. \]

Let's concentrate on just the breakfast labels for now. Suppose we set \( y_i = 1 \) when recipe \( i \) is a breakfast recipe, and \( y_i = 0 \) when it is not. The linear regression equation, however, will almost never produce values of \( \hat{y}_i \) that are exactly equal to \( 1 \) or \( 0 \). In fact, they might end up much greater than \( 1 \), or very far below \( 0 \).
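As a concrete illustration of this 0/1 encoding, here is a minimal sketch in pandas. The recipe titles, tags, and column names below are made up for illustration; the actual dataset is introduced in part 2.

import pandas as pd

# Hypothetical toy data; the real dataset and its columns appear in part 2.
recipes = pd.DataFrame({
    "title": ["Blueberry Pancakes", "Beef Stew", "Omelette"],
    "tags": [["breakfast", "fruit"], ["dinner", "beef"], ["breakfast", "egg"]],
})

# y_i = 1 if recipe i carries the breakfast tag, 0 otherwise.
y = recipes["tags"].apply(lambda tags: int("breakfast" in tags))
print(y.tolist())  # [1, 0, 1]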

To get around this problem, we could instead try to fit the logistic regression model,

\[ \hat{p}_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)}} = \frac{e^{\beta_0 + \beta_1 x_1 + ... + \beta_n x_n}}{1 + e^{\beta_0 + \beta_1 x_1 + ... + \beta_n x_n}}. \]

Here, \( \hat{p}_i \) is the model's estimate of the probability that \( y_i = 1 \). This estimate is guaranteed to lie between \( 0 \) and \( 1 \). The two expressions on the right-hand side are simply two different ways of writing the logistic function. Another way of looking at this formula is that the natural logarithm of the odds \( \frac{\hat{p}_i}{1-\hat{p}_i} \) is a linear combination of the inputs \( \{ x_1,...,x_n \} \),

\[ \ln\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n. \]

As far as I can tell, there is no mathematical reason why this model is the "correct" one. In fact, the logistic function is very similar in shape to the normal CDF that underlies probit regression, another popular choice for modelling a \( y \) that only takes the values \( 0 \) and \( 1 \).
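To sanity-check this algebra, here is a small NumPy sketch (my own illustration, not part of the project code). It verifies that the two ways of writing the logistic function agree, and that inverting the fitted probability recovers the linear predictor as the log-odds.

import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# The two ways of writing the logistic function agree.
z = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(z), np.exp(z) / (1.0 + np.exp(z)))

# Inverting the probability recovers the linear predictor: log-odds = z.
p = sigmoid(z)
log_odds = np.log(p / (1.0 - p))
assert np.allclose(log_odds, z)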

To my mind, this is a great example of the saying "all models are wrong, but some are useful".

Deriving the Logit Model


The main motivation for the logit model was the desire to apply linear regression methods to probabilities. The following train of thought might approximate what the creators of the model had in mind: a probability is confined to \( [0,1] \), so it cannot sensibly be modelled as an unbounded linear combination of the inputs. Converting the probability to odds, \( \frac{p}{1-p} \), stretches the range to \( (0,\infty) \), and taking the logarithm of the odds stretches it further to \( (-\infty,\infty) \). The log-odds can therefore be modelled with an ordinary linear equation, which is exactly the logit model written above.

This set of lecture notes contains a discussion of this, and goes through the derivation of logistic regression in much greater mathematical detail.

The "best" set of parameters \( \{ \beta_0, \beta_1, ..., \beta_n \} \) is defined to be the one that maximizes the probability of generating our data. This is known as maximum likelihood estimation. However, there is no way to solve for these parameters exactly. Numerical methods such as Newton-Raphson have to be used.

Why is logistic regression considered a linear model? Because the predicted label \( \hat{y}_i \) is set to \( 1 \) if and only if \( \hat{p}_i > 0.5 \). But \( \hat{p}_i > 0.5 \) if and only if \( \beta_0 + \beta_1 x_1 + ... + \beta_n x_n > 0 \). The decision boundary \( \beta_0 + \beta_1 x_1 + ... + \beta_n x_n = 0 \) is a linear equation in \( \{ x_1,...,x_n \} \), which is why the model is considered linear.
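Spelling out the middle step explicitly,

\[ \hat{p}_i > 0.5 \iff \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)}} > \frac{1}{2} \iff e^{-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)} < 1 \iff \beta_0 + \beta_1 x_1 + ... + \beta_n x_n > 0. \]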

Comparison with Linear Regression


Why not just use linear regression? Well, I gave some reasons in the previous sections. Another possible reason is the desire to obtain a closer fit to the data, as shown by the figure below.

[Figure: logistic regression provides a closer fit to binary (0/1) data than a straight linear regression line.]
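For readers who want to reproduce a comparison like this, here is a hedged sketch on synthetic data (not the recipe dataset) using scikit-learn. It simply shows that the linear fit can produce predictions outside \( [0,1] \), while the logistic fit always returns a valid probability.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic 0/1 outcome for illustration only (not the recipe data).
rng = np.random.default_rng(1)
x = rng.normal(size=(300, 1))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-3.0 * x[:, 0])))

linear = LinearRegression().fit(x, y)
logistic = LogisticRegression().fit(x, y)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(linear.predict(grid))                # can fall below 0 or above 1
print(logistic.predict_proba(grid)[:, 1])  # always strictly inside (0, 1)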
With the introduction and mathematical foundations out of the way, we can start working with the data in part 2 of this article.