Bias and Variance in Statistics

Attachments: article_code.txt





This is the first of two articles in which I explore the concepts of bias and variance in both statistics and machine learning. These terms are defined in similar ways in the two fields, but actually refer to different things. I will start with how they are used in statistics.

I will talk about the machine learning case in part two of this article, which will also touch on the tradeoff that is observed between bias and variance in machine learning algorithms. We will also see how this tradeoff can be absent from modern algorithms such as neural networks.




The Statistical Model

This article is based on a standard statistics scenario: we have a population \( P \) with some true parameter value \( y \) that we want to know (for example, the population mean), we draw an IID sample \( D \) of \( n \) data points from \( P \), and we calculate an estimator \( h_D \) of \( y \) from that sample.

Note that I am using the non-standard notation \( h_D \), which stands for “Hypothesis calculated from Data”. This is so that our notation is consistent with what we will be using in part 2 of this article.

An example of an estimator that is often calculated from samples would be the sample mean,


\[ h_D = \frac{1}{n} \sum_{i=1}^{n} X_i. \]

Example - Sampling Housing Price

To illustrate what this scenario looks like in real life, I will pretend that a set of data on public housing resale prices in Singapore is the population \( P \), from which I took an IID sample \( D \) of \( n = 30 \) data points.

Of course, this is a very artificial example, since we could simply calculate our estimator on the entire set of data without any need for sampling. Also, the data was downloaded from this Singapore government website by selecting the “Jan-2017 onwards” option at the time of writing. If you download it now, the data will almost certainly differ from what I had.



Because each data point in the sample \( D \) is randomly drawn from the population \( P \), the sample mean will vary each time we sample. For this particular draw, our sample mean was lower than the population mean. The blue histogram in the figure below is of the population \( P \).



We can see that, even with just \( n = 30 \), the sample mean is “visually close” to the population mean in the figure. The Python code and data source used to produce this figure can be found in the article_code.txt attachment at the top of this article.
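
For readers who prefer code, here is a minimal sketch of the sampling experiment, not the article's actual code (which is in article_code.txt). The file name resale-flat-prices.csv and the column name resale_price are placeholders standing in for whatever the downloaded data set actually uses.

```python
# Minimal sketch of the sampling experiment. The CSV file name and the
# 'resale_price' column are placeholder assumptions, not the article's
# exact data source.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Treat the full downloaded data set as the population P.
population = pd.read_csv("resale-flat-prices.csv")["resale_price"].to_numpy()

# Draw an IID sample D of n = 30 data points.
n = 30
sample = rng.choice(population, size=n, replace=True)

print("population mean y :", population.mean())
print("sample mean h_D   :", sample.mean())
```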




Bias of an Estimator

The bias of an estimator \( h_D \) is the difference between its expected value \( \mathbb{E}[h_{D}] \) and the true parameter value \( y \). Note that this expected value is taken over all possible \( D \), which are sets of IID sample data of size \( n \).


\[ \text{Bias}(h_D) = \mathbb{E}[h_D] - y. \]

It is known that the sample mean is an unbiased estimator of the population mean, which means this quantity is actually zero when \( h_{D} \) is the sample mean. However, the bias might not be zero for other estimators.
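
One quick way to convince yourself of this is a Monte Carlo sketch: recompute \( h_D \) over many independent samples \( D \) and average the results. The exponential population below is an arbitrary choice made for illustration; any distribution with a known mean would work.

```python
# Monte Carlo check that the sample mean is (empirically) unbiased:
# average h_D over many independent samples D and compare with y.
import numpy as np

rng = np.random.default_rng(seed=1)
y = 2.0            # true mean of an Exponential(scale=2.0) population
n = 30             # size of each sample D
num_trials = 100_000

# Each row is one sample D; each row mean is one realisation of h_D.
h_D = rng.exponential(scale=y, size=(num_trials, n)).mean(axis=1)

print("estimated E[h_D]:", h_D.mean())        # close to y = 2.0
print("estimated bias  :", h_D.mean() - y)    # close to 0
```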

Consistency

Note that the bias is defined here for a data set \( D \) with fixed finite size \( n \). There is a similar concept for the estimator’s behavior as the number of data points \( n \) goes to infinity.

Of course, we would like the estimator to get closer and closer to the true parameter value as \( n \) increases. Ideally, \( h_D \) should converge to \( y \) as \( n \) goes to infinity, that is, if we somehow managed to collect an unlimited amount of data. This property is known as “consistency”, and estimators with it are called consistent.

Note that an unbiased estimator can still be inconsistent, and a biased estimator can still be consistent. Examples of these can be found here, but it might be a fun exercise to try and think of some simple ones!
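
To keep things concrete, here is a small sketch of consistency in action, watching the sample mean settle towards the true mean as \( n \) grows. The uniform population is an arbitrary choice made for illustration.

```python
# The sample mean drifts towards the true mean y as n grows.
import numpy as np

rng = np.random.default_rng(seed=2)
y = 0.5  # true mean of a Uniform(0, 1) population

for n in [10, 100, 10_000, 1_000_000]:
    h_D = rng.uniform(0, 1, size=n).mean()
    print(f"n = {n:>9,}   h_D = {h_D:.5f}   |h_D - y| = {abs(h_D - y):.5f}")
```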




Variance of an Estimator

The variance of an estimator is its expected squared difference from its own expected value,


\[ \text{Var}(h_D) = \mathbb{E}\left[(h_D - \mathbb{E}[h_D])^2\right]. \]

The variance formalizes the notion of how \( h_D \) can vary from its expected value every time we calculate it from a fresh set of IID sample data \( D \).

The variance is defined in terms of the squared difference, rather than the absolute difference. This is a common paradigm in statistics and machine learning. Two possible reasons for the prevalence of this paradigm could be that the square function is differentiable everywhere (unlike the absolute value at zero), which makes calculations and optimization easier, and that squared differences lead to clean algebraic identities such as the bias-variance decomposition of the mean square error that we will see below.

One thing to note is that the square function \( f(x) = x^2 \) grows quadratically with \( x \). So, large deviations from the expected value \( \mathbb{E}[h_D] \) contribute disproportionately more to the variance than smaller deviations do.
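
The sketch below estimates this variance directly by recomputing the sample mean over many fresh samples. For the sample mean of \( n \) IID draws it should land near the textbook value \( \sigma^2 / n \) (a standard identity not derived in this article); the normal population is an assumption made for illustration.

```python
# Empirical variance of the sample mean across repeated samples, compared
# with the known value sigma^2 / n. The normal population is illustrative.
import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma, n = 10.0, 3.0, 30
num_trials = 100_000

# One sample mean h_D per row of fresh data.
h_D = rng.normal(mu, sigma, size=(num_trials, n)).mean(axis=1)

print("estimated Var(h_D):", h_D.var())
print("sigma^2 / n       :", sigma**2 / n)
```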

Standard Deviation

The standard deviation is the square root of the variance. It is often preferred over the variance as an indicator of how much values can vary, because it is in the same units as the data. A famous example of its use is the 68-95-99.7 rule, which estimates that 68% / 95% / 99.7% of an approximately normally distributed set of data lies within 1 / 2 / 3 standard deviations of the mean.
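
The rule is easy to check numerically on synthetic, exactly normal data; here is a quick sketch.

```python
# Numerical check of the 68-95-99.7 rule on synthetic normal data.
import numpy as np

rng = np.random.default_rng(seed=4)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)

for k in (1, 2, 3):
    frac = np.mean(np.abs(x - x.mean()) <= k * x.std())
    print(f"fraction within {k} standard deviation(s): {frac:.4f}")
```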




The Mean Square Error

The mean square error (MSE) is the expected value of the squared difference between the estimator and the parameter it estimates.


\[ \text{MSE}(h_D) = \mathbb{E}\left[(h_D - y)^2\right]. \]

We can decompose the MSE into two terms by using the “adding zero” trick to add and subtract an \( \mathbb{E}[h_D] \) term before expanding the square.


\[ \begin{align} \mathbb{E}\left[(h_D - y)^2\right] &= \mathbb{E}\left[(h_D - \mathbb{E}[h_D] + \mathbb{E}[h_D] - y)^2\right] \\[0.5em] &= \mathbb{E}\left[(h_D - \mathbb{E}[h_D])^2\right] + 2(\mathbb{E}[h_D] - y)\,\mathbb{E}\left[h_D - \mathbb{E}[h_D]\right] + (\mathbb{E}[h_D] - y)^2 \\[0.5em] &= \text{Var}(h_D) + \text{Bias}^2(h_D), \end{align} \]

where the cross term drops out in the last step because \( \mathbb{E}[h_D - \mathbb{E}[h_D]] = 0 \).

The derivation is not difficult, but can get messy. A full derivation can be found here.
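
The decomposition is also easy to verify numerically. The sketch below uses a deliberately biased estimator, a sample mean shrunk towards zero by a factor of 0.9; the shrinkage factor and the normal population are arbitrary choices made for illustration.

```python
# Numerical sanity check that MSE = Var + Bias^2 for a deliberately
# biased estimator (a shrunken sample mean).
import numpy as np

rng = np.random.default_rng(seed=5)
y, sigma, n = 5.0, 2.0, 30
num_trials = 200_000

samples = rng.normal(y, sigma, size=(num_trials, n))
h_D = 0.9 * samples.mean(axis=1)   # shrinking towards zero introduces bias

mse = np.mean((h_D - y) ** 2)
var = h_D.var()
bias = h_D.mean() - y

print("MSE          :", mse)
print("Var + Bias^2 :", var + bias**2)   # agrees up to Monte Carlo error
```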




Stein’s Paradox

After reading my example of estimating the population mean using the sample mean, you might be wondering if the sample mean is always the best estimator for the population mean, in terms of minimizing the mean square error.

Surprisingly, it turns out that the answer is “no”!

If the population is a multivariate normal distribution in three or more dimensions, the James-Stein estimator outperforms the sample mean in terms of mean square error. An even more shocking fact is that no one knows which estimator of the multivariate normal population mean minimizes the mean square error!
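
To make this less mysterious, here is a small simulation sketch comparing the total mean square error of the plain observation (the sample mean with a single draw per trial) against the James-Stein estimator, for a 10-dimensional normal population with identity covariance. The dimension, the true mean vector, and the single-observation setup are all assumptions made purely for illustration.

```python
# Simulation: James-Stein vs. the plain sample mean for a d-dimensional
# normal with identity covariance (one observation per trial, d >= 3).
import numpy as np

rng = np.random.default_rng(seed=6)
d = 10
theta = np.full(d, 0.5)           # true mean vector (the parameter y)
num_trials = 100_000

X = rng.normal(loc=theta, scale=1.0, size=(num_trials, d))

# James-Stein shrinks each observation towards the origin.
shrinkage = 1.0 - (d - 2) / np.sum(X ** 2, axis=1, keepdims=True)
X_js = shrinkage * X

mse_mean = np.mean(np.sum((X - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((X_js - theta) ** 2, axis=1))

print("MSE of sample mean :", mse_mean)   # about d = 10
print("MSE of James-Stein :", mse_js)     # strictly smaller on average
```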

This open problem is closely related to Stein’s Paradox. A great discussion about the history and the mystery surrounding the search for the best estimator of the population mean can be found in a classic 1977 statistics paper by Bradley Efron and Carl Morris.