Simple Linear Regression - Part 1

Mathematical Theory





Linear regression is a statistical model for how a variable \( Y \) can depend on a set of variables \( X_1, X_2, ..., X_n \). I will explain what “linear” means in the sections below.

When there is only a single independent variable \( X \), we say that the linear regression is simple. Part 1 of this series will briefly discuss the mathematics of simple linear regression. I will explore two real-world applications in Parts 2 & 3.




Mathematical Model

Suppose that we are interested in \( Y = \) “annual income of a working adult in the USA”, and how it depends on \( X = \) “the years of education that they have”. Also, suppose that we have a set of data \( \{ (x_1,y_1), \dots, (x_n,y_n) \} \) for this scenario.

Simple linear regression assumes that each \( y_i \) in our data is generated according to this equation:


\[ y_i = a + bx_i + e_i. \]

The \( e_i \) are random error terms that represent noise, measurement error, and other variables that affect the value of \( y_i \). If these random error terms satisfy certain assumptions, then we can invoke a powerful result known as the Gauss-Markov theorem. We will briefly touch on this theorem and its assumptions later.

In practice, these conditions are usually not completely satisfied, or cannot even be verified, but linear regression is still often used as a rough approximation of the relationship between \( Y \) and \( X \).
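As a concrete sketch of this data-generating process, the snippet below simulates observations from the model \( y_i = a + bx_i + e_i \) using the income-vs-education scenario above. The parameter values are entirely made up for illustration; no real study is being reproduced.

```python
import random

# Hypothetical values chosen purely for illustration; the true parameters
# in a real income-vs-education study are unknown.
a_true, b_true = 20000.0, 3500.0  # intercept a and slope b
sigma = 5000.0                    # standard deviation of the noise e_i

random.seed(0)
xs = [random.uniform(8, 20) for _ in range(50)]                    # years of education x_i
ys = [a_true + b_true * x + random.gauss(0.0, sigma) for x in xs]  # y_i = a + b*x_i + e_i
```

Here the errors happen to be drawn as Gaussian noise for convenience, but as noted below, the model itself does not require that.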




Why Is This Linear?

Note that there is a bit of nuance in calling something linear. For example, linear functions in calculus are not the same as linear maps in linear algebra. Strictly speaking, a linear function that does not pass through the origin is an affine map.

So, the \( y_i \) in our model is, strictly speaking, not linear in \( x_i \). Simple linear regression is still considered linear because the general regression model is,


\[ y_i = f(x_i,\beta) + e_i, \]

where \( \beta \) is a vector of parameters. For our simple linear regression model, \( \beta = (a,b) \). \( f(x_i,\beta) \) is known as the regression function. The simple regression model is linear because the regression function \( f \) is a linear combination of the parameters \( a \) and \( b \).
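For example, the following model is nonlinear in \( x_i \), yet it still counts as linear regression, because the regression function remains a linear combination of the parameters \( a \) and \( b \):

\[ y_i = a + b x_i^2 + e_i. \]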




Fitting the Model to the Data

Fitting the simple linear regression model to the data means estimating parameters \( a \) and \( b \) from the data. These estimates are denoted \( \hat{a} \) and \( \hat{b} \). The “hat” symbol is used to denote quantities that are estimated from the data.

We can use these estimates \( \hat{a} \) and \( \hat{b} \) to estimate or “predict” what \( y_i \) would be for any given value of \( x_i \). This is done using the equation,


\[ \hat{y}_i = \hat{a} + \hat{b} x_i. \]

There are infinitely many ways to estimate the parameters. However, the most famous is the method of least squares.




The Method of Least Squares

Given the estimate of \( y_i \) above, let us define the residual error as \( \hat{e}_i = y_i - \hat{y}_i \). This is the difference between the true value \( y_i \) and the value \( \hat{y}_i \) estimated by our model. When calculated from our data, the residual errors also serve as estimates of the actual error terms \( e_i \).

Intuitively, we might want to construct a model that minimizes the absolute values of these errors. That way, our model’s estimate of each \( y_i \) is as close to the data as possible. However, absolute values are awkward to work with using calculus: for example, the function \( f(x) = |x| \) does not have a derivative at \( x = 0 \).

So, mathematicians minimize the sum of squared residual errors instead, which is


\[ \Sigma (y_i - \hat{y}_i)^2. \]

This method is known as least squares and was made famous by Carl Friedrich Gauss’s calculation of celestial orbits. In the case of simple linear regression, we can find the estimates \( \hat{a} \) and \( \hat{b} \) that minimize the sum of squared errors \( \Sigma (y_i - \hat{y}_i)^2 \) with basic calculus. The resulting solutions are


\[ \hat{b} = \frac{n \, \Sigma x_i y_i - \Sigma x_i \Sigma y_i }{n \, \Sigma x_i^2 - (\Sigma x_i)^2}, \]

\[ \hat{a} = \frac{1}{n} \Sigma y_i - \hat{b} \frac{1}{n} \Sigma x_i. \]
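The closed-form solutions above translate directly into a few lines of Python. This is only a sketch; the toy data below is made up so that it lies exactly on the line \( y = 1 + 2x \), which lets us check the fit by hand.

```python
def least_squares_fit(xs, ys):
    """Closed-form simple linear regression estimates (the formulas above)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # b_hat = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
    b_hat = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    # a_hat = mean(y) - b_hat * mean(x)
    a_hat = sy / n - b_hat * sx / n
    return a_hat, b_hat

# Toy data lying exactly on y = 1 + 2x, so the fit should recover a=1, b=2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a_hat, b_hat = least_squares_fit(xs, ys)  # -> (1.0, 2.0)
```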



The Gauss-Markov Theorem

The famous Gauss-Markov theorem can be invoked if the random error terms \( e_i \) satisfy these conditions:

1. Zero mean: \( E[e_i] = 0 \) for all \( i \).
2. Homoscedasticity: \( \mathrm{Var}(e_i) = \sigma^2 \) is the same finite constant for all \( i \).
3. No correlation: \( \mathrm{Cov}(e_i, e_j) = 0 \) for all \( i \neq j \).

Note: the error terms do not need to be normally distributed, or even independent and identically distributed.

Then, the Gauss-Markov theorem says that least squares will produce estimators that are “BLUE”. The acronym BLUE stands for Best Linear Unbiased Estimator. “Best” means that least squares estimators have the smallest variance among all linear unbiased estimators.

Note that there is usually no way to tell for sure how well our error terms satisfy the Gauss-Markov assumptions. However, we can plot the residual errors \( \hat{e}_i \) estimated by our model to see if there are any patterns that would indicate a problem. This kind of residual analysis can also give a rough idea of whether a linear model is suitable for our data.




Coefficient of Determination

The “goodness of fit” refers to how well our fitted model predicts the values of \( y_i \) in the data. The coefficient of determination, usually denoted \( r^2 \) or \( R^2 \), can be used as a rough gauge of this goodness of fit.

The square in the notation refers to the fact that it is equivalent to the square of the correlation coefficient between \( X \) and \( Y \). When there is more than one explanatory variable, \( R^2 \) can instead be interpreted as the squared correlation coefficient between the observed and predicted values of \( y \).

A few definitions are required to explain what exactly \( R^2 \) means. Let \( \bar{y} = \frac{1}{n} \Sigma y_i \) denote the mean of the observed values. Then:

1. The total sum of squares is \( \text{TSS} = \Sigma (y_i - \bar{y})^2 \).
2. The explained sum of squares is \( \text{ESS} = \Sigma (\hat{y}_i - \bar{y})^2 \).
3. The residual sum of squares is \( \text{RSS} = \Sigma (y_i - \hat{y}_i)^2 \).

It can be shown that TSS = ESS + RSS. This means that the total squared variation can be broken down into two components: one that is explained by our model, and a residual error our model fails to capture. With these, we can finally define \( R^2 \) as


\[ R^2 = \frac{\text{ESS}}{\text{TSS}}. \]

This is a number between \( 0 \) and \( 1 \) that gives us a rough idea of how well our model fits the data. It is literally the proportion of squared variation that is explained by our model. If our model is a perfect fit with no errors, then RSS will be zero, which makes TSS \( = \) ESS, and so \( R^2 = 1 \).

Note that if \( R^2 = 0 \), then all the \( \hat{y}_i \) values must be equal to \( \bar{y} \), which is required for ESS to be zero.




We will see all of the concepts in this article applied to a practical, real-world example in Part 2.