Simple Linear Regression

Part 1 - Mathematical Theory


Linear regression is a statistical method for estimating how a variable \( Y \) can depend linearly on a set of variables \( \{ X_1, X_2, ..., X_n \} \). We will explain what "linear" means in the next section.

When there is only a single independent variable \( X \), we say that the linear regression is simple. This article will briefly discuss the mathematics of simple linear regression, before exploring two real world applications.

Mathematical Model


Suppose that we are interested in \( Y = \) annual income of a working adult in the USA, and how it depends on \( X = \) the years of education that they have. Also, suppose that we have a set of data \( \{ (x_i,y_i) \mid i = 1,2,...,n \} \), consisting of \( n \) such data points. Strictly speaking, the model assumes that each \( y_i \) in our data is generated according to this equation: \[ y_i = a + bx_i + e_i. \] The \( e_i \) are random error terms that represent noise, measurement errors, and other variables affecting the value of \( y_i \). Ideally, these error terms are independent and identically distributed, each drawn from the same normal distribution with a mean of zero.
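As a concrete illustration, the data-generating process above can be simulated in a few lines of Python. This is only a sketch: the function name, the parameter values, and the range of \( x \) values are arbitrary choices for illustration, not part of the model.

```python
import random

def simulate_data(a, b, n, noise_sd, seed=0):
    """Generate n points from y_i = a + b*x_i + e_i, with e_i ~ Normal(0, noise_sd)."""
    rng = random.Random(seed)
    xs = [rng.uniform(8, 20) for _ in range(n)]            # e.g. years of education
    ys = [a + b * x + rng.gauss(0, noise_sd) for x in xs]  # the model equation
    return xs, ys

# Illustrative parameters: intercept 20000, slope 2500, noise sd 5000
xs, ys = simulate_data(a=20000, b=2500, n=100, noise_sd=5000)
```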

These ideal conditions on the random error terms are not strictly necessary, but they are required for a powerful result known as the Gauss-Markov theorem, which we will briefly touch on later. In practice, these conditions are rarely satisfied completely, but linear regression is still useful as a rough approximation.

Fitting the Model to Data


The model that we are trying to fit to the data is the equation \( \hat{y}_i = \hat{a} + \hat{b}x_i \). Note that we are using a "hat" symbol to denote estimated quantities. We want to use our data to choose estimates \( \hat{a} \) and \( \hat{b} \), such that a certain error term (to be defined later) is minimized.

There is a bit of nuance in calling something "linear". For example, linear functions in calculus are not the same as linear maps in linear algebra. Strictly speaking, a linear function that does not pass through the origin is an affine map, not a linear map.

However, the simple regression model is linear because \( \hat{y}_i \) is a linear combination of parameters \( \hat{a} \) and \( \hat{b} \).

The Method of Least Squares


So far, we have assumed that each \( y_i \) in our data is generated according to \( y_i = a + bx_i + e_i \), and that the model we want to fit is \( \hat{y}_i = \hat{a} + \hat{b}x_i \). Now, let us define the "residual" error as \( \hat{e}_i = y_i - \hat{y}_i \). This is the difference between the value of \( y_i \) in the data, and the value estimated by our model. The residual errors also serve as an estimate of the actual error term \( e_i \).

Intuitively, we might want to construct a model that minimizes the absolute values of these errors, so that our model's estimate of each \( y_i \) is as close to the data as possible. However, absolute values are difficult to work with using calculus. For example, the function \( y = \lvert x\rvert \) has no derivative at \( x = 0 \). So, mathematicians minimize the sum of squared residual errors instead, which is \[ \Sigma_i (y_i - \hat{y}_i)^2. \] This method is known as "least squares" and was made famous by Carl Friedrich Gauss's calculation of celestial orbits. In the case of simple linear regression, we can find the estimates \( \hat{a} \) and \( \hat{b} \) that minimize this sum with basic calculus. The resulting solutions are called the normal equations,

\[ \hat{b} = \frac{n \, \Sigma x_i y_i - \Sigma x_i \Sigma y_i }{n \, \Sigma x_i^2 - (\Sigma x_i)^2}, \] \[ \hat{a} = \frac{1}{n} \Sigma y_i - \hat{b} \frac{1}{n} \Sigma x_i. \]
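The normal equations translate directly into code. Here is a minimal sketch in plain Python (the function name `least_squares` is our own choice):

```python
def least_squares(xs, ys):
    """Estimate a-hat and b-hat using the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a_hat = sy / n - b_hat * sx / n
    return a_hat, b_hat

# Noiseless data on the line y = 3 + 2x recovers the coefficients exactly
a_hat, b_hat = least_squares([1, 2, 3, 4], [5, 7, 9, 11])
```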

The Gauss-Markov Theorem


The famous Gauss-Markov theorem can be invoked if the random error terms \( \{ e_i \} \) satisfy these conditions:

1. They have a mean of zero: \( E[e_i] = 0 \).
2. They all share the same finite variance: \( Var(e_i) = \sigma^2 \) (homoscedasticity).
3. They are uncorrelated with each other: \( Cov(e_i, e_j) = 0 \) whenever \( i \neq j \).

Note that the error terms do not need to be normally distributed, or even independent and identically distributed.

Then, the Gauss-Markov theorem says that least squares produces estimators that are "BLUE". The acronym BLUE stands for "Best Linear Unbiased Estimator". "Best" means that the least squares estimators have the smallest variance among all linear unbiased estimators.
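The "unbiased" part of BLUE can be checked empirically with a small Monte Carlo sketch (all parameter values below are arbitrary choices for illustration): generate many datasets from the same true line, fit the slope on each, and average the estimates. The average should land very close to the true slope.

```python
import random

def fit_slope(xs, ys):
    """Least squares slope estimate b-hat from the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)

rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(50)]   # fixed design points
true_b = 2.0
estimates = []
for _ in range(2000):
    # errors are uncorrelated, with mean zero and equal variance
    ys = [1.0 + true_b * x + rng.gauss(0, 1) for x in xs]
    estimates.append(fit_slope(xs, ys))

mean_b = sum(estimates) / len(estimates)  # should hover near true_b
```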

Notice that we do not have the true \( a \) and \( b \) values. So, there is no way to tell for sure how well our error terms satisfy the Gauss-Markov assumptions. However, we can plot the residual errors \( \hat{e}_i \) generated by our model, and see if there are any patterns that would indicate a problem. This kind of residual analysis can also indicate whether a linear model is appropriate for our data.
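Computing the residuals is straightforward once the model is fitted. The sketch below (made-up data, our own function names) also illustrates a useful sanity property: when the model includes an intercept, the least squares residuals always sum to (numerically) zero, so any problem must show up as a pattern in the residuals, not in their average.

```python
def least_squares(xs, ys):
    """Estimate (a-hat, b-hat) via the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return sy / n - b_hat * sx / n, b_hat

def residuals(xs, ys):
    """Residuals e-hat_i = y_i - y-hat_i of a least squares fit."""
    a_hat, b_hat = least_squares(xs, ys)
    return [y - (a_hat + b_hat * x) for x, y in zip(xs, ys)]

# Made-up data; plotting res against xs would reveal any systematic pattern
res = residuals([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```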

Coefficient of Determination


The "goodness of fit" refers to how well our fitted model predicts the values of \( y_i \) in our data. The coefficient of determination, usually denoted \( r^2 \) or \( R^2 \), can be used as a rough gauge of this goodness of fit. The square in the notation refers to the fact that it is equivalent to the square of the correlation coefficient between \( X \) and \( Y \).

A few definitions are required to explain what exactly \( R^2 \) means. Let \( \overline{y} = \frac{1}{n} \Sigma y_i \) denote the mean of the observed values. Then:

1. The total sum of squares is \( TSS = \Sigma (y_i - \overline{y})^2 \).
2. The explained sum of squares is \( ESS = \Sigma (\hat{y}_i - \overline{y})^2 \).
3. The residual sum of squares is \( RSS = \Sigma (y_i - \hat{y}_i)^2 \).

It can be shown that \( TSS = ESS + RSS \). This means that the total squared variation can be broken down into two components: one that is explained by our model, and a residual error that our model fails to capture. With these, we can finally define \( R^2 \). \[ R^2 = \frac{\text{ESS}}{\text{TSS}}.\] This is a number between \( 0 \) and \( 1 \) that gives us a rough idea of how well our model fits the data. It is literally the proportion of squared variation that is explained by our model. If our model is a perfect fit with no errors, then \( RSS \) will be zero, which makes \( TSS = ESS \), and so \( R^2 = 1 \).
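Putting the pieces together, \( R^2 \) can be computed directly from the data. This sketch reuses the normal equations from earlier; the test data are illustrative.

```python
def r_squared(xs, ys):
    """Coefficient of determination R^2 = ESS / TSS for a simple linear fit."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a_hat = sy / n - b_hat * sx / n
    y_bar = sy / n
    y_hats = [a_hat + b_hat * x for x in xs]
    tss = sum((y - y_bar) ** 2 for y in ys)
    ess = sum((yh - y_bar) ** 2 for yh in y_hats)
    return ess / tss

r2_perfect = r_squared([1, 2, 3], [2, 4, 6])               # exact line: R^2 = 1
r2_noisy = r_squared([1, 2, 3, 4], [2.0, 4.5, 5.5, 8.0])   # 0 < R^2 < 1
```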

Note that if \( R^2 = 0 \), then \( ESS \) is zero, which means that all of the \( \hat{y}_i \) values must be equal to \( \overline{y} \).

We will see all of the concepts in this article applied to a practical, real-world example in Part 2 of this article.