Bias and Variance in Statistics and Machine Learning
Part 2 - Machine Learning
Python code in this project: figure1.txt + embedded snippets.
In part one of this article, I went through the concepts of bias, variance, and mean square error in a statistical model.
In this part of the article, I will explore these terms for a machine learning algorithm.
We will see that, while they share similarities in both fields, there are significant deviations.
I will also touch on the idea of a tradeoff between bias and variance in older machine learning algorithms. Interestingly, this tradeoff need not exist for more modern algorithms such as neural networks and random forests.
The Machine Learning Setting
I am using the approach and notation from this excellent set of lecture notes. I have also used the same notation in part one of this article, so as to make it easier to draw parallels between the two contexts.
Suppose we have a set of data \( D_n = \{ (x_1,y_1),...,(x_n,y_n) \} \) of size \( n \), where each point \( (x_i,y_i) \) is drawn independently from a distribution \( P(X,Y) \). Just like for part one, to reduce the clutter, I am suppressing the \( n \) in \( D_n \), and writing it as just \( D \). Let us call \( x_i \) the feature vector, and \( y_i \) the corresponding label.
Now, let \( h_D \) be a model generated by an algorithm when using input data \( D \). \( h_D \) stands for the "Hypothesis calculated from Data" in the lecture notes. A linear regression line could be an example of model \( h_D \). \[ \hat{y}_i = \alpha + \beta x_i. \] The coefficients \( \alpha \) and \( \beta \) are estimated from the data. And \( \hat{y}_i \) is the model's prediction of \( y_i \), given the input \( x_i \).
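As a rough illustration of how such a model \( h_D \) might be produced, here is a minimal sketch that fits a regression line with NumPy. The data-generating process (labels equal to \( 1 + 2x \) plus Gaussian noise) and the function name `h_D` are assumptions made for this example, not the setup used for the figures in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mock training set D of size n, drawn from an assumed P(X, Y):
# x is uniform on [0, 10] and y = 1 + 2x + Gaussian noise.
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Least-squares estimates of the coefficients.
# np.polyfit returns the highest-degree coefficient first.
beta, alpha = np.polyfit(x, y, deg=1)

def h_D(x_new):
    """Prediction of the fitted line for input x_new."""
    return alpha + beta * x_new

print(f"alpha = {alpha:.3f}, beta = {beta:.3f}, h_D(5) = {h_D(5.0):.3f}")
```

A different draw of \( D \) would give slightly different coefficients, which is why the model carries the subscript \( D \).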
Expected Label
This is where we start diverging from the statistics setting.
In statistics, we were trying to estimate a fixed parameter of the distribution \( P \). Here, given a fixed feature vector \( x \), we want to predict the corresponding label \( y \). However, the distribution \( P \) might allow different labels to be generated with the same \( x \). We define the expected label given \( x \) as \[ \overline{y}(x) = \int_y y \, P(y|x) \, dy. \] The figure below illustrates the concepts we have covered so far, using mock data. The code for this figure can be found here.
The blue dots represent our dataset \( D \). The black line is a linear regression line, which is an example of a model \( h_D \) that is estimated from the data. For input \( x = 5 \), there is a range of possible labels \( y \). The rough position of the expected label \( \overline{y}(x) \) for this input is given by the red circle.
Note that this setup is also consistent with the case where each feature vector \( x \) corresponds to exactly one label \( y \).
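The expected label can also be approximated numerically. The sketch below assumes a conditional distribution \( P(y|x) \) in which \( y = 1 + 2x \) plus standard Gaussian noise (an assumption for illustration only), and estimates \( \overline{y}(5) \) by averaging many labels drawn at \( x = 5 \).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_labels(x, size):
    """Draw labels from an assumed P(y | x): y = 1 + 2x + Gaussian noise."""
    return 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=size)

# Monte Carlo estimate of the integral defining the expected label at x = 5.
labels = sample_labels(5.0, size=100_000)
y_bar_estimate = labels.mean()

print(f"estimated expected label at x = 5: {y_bar_estimate:.3f}")
```

Under this assumed distribution the exact expected label at \( x = 5 \) is \( 1 + 2 \cdot 5 = 11 \), so the estimate should land close to that value.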
Training and Test Dataset, Expected Classifier
Another difference from statistics is that we are often working with two different sets of data. The training dataset \( D \) is first used to build the model \( h_D \). Then, we randomly draw values \( (x,y) \) from distribution \( P \) to test the accuracy of our model. The latter values form the test dataset.
This separation between two sets of data affects how we take expectations. For example, given a fixed training set \( D \), we might look at the expected prediction of \( h_D \) over points \( (x,y) \) drawn for the test set,
\[ \mathbb{E}_{(x,y)}[h_D(x)]. \]
We can also take expectations over the training set \( D \). For example, given a fixed input value \( x \), we can look at the output of the expected classifier,
\[ \overline{h}(x) = \mathbb{E}_{D}[h_D(x)] = \int_D h_D(x) \, P(D) \, dD. \]
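The integral over training sets is rarely available in closed form, but it can be approximated by simulation. Here is a minimal sketch, assuming the same kind of mock distribution as before (\( y = 1 + 2x \) plus Gaussian noise) and a linear regression model: it draws many training sets, fits a line to each, and averages the predictions at a fixed input. The helper names `draw_training_set` and `fit_and_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_training_set(n):
    """Draw a training set D of size n from an assumed P(X, Y)."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)
    return x, y

def fit_and_predict(x, y, x0):
    """Fit a regression line to (x, y) and return its prediction h_D(x0)."""
    beta, alpha = np.polyfit(x, y, deg=1)
    return alpha + beta * x0

# Approximate the expected classifier at x0 by averaging h_D(x0)
# over many independently drawn training sets D.
x0 = 5.0
predictions = np.array(
    [fit_and_predict(*draw_training_set(20), x0) for _ in range(2000)]
)
h_bar_estimate = predictions.mean()

print(f"estimated expected classifier output at x = 5: {h_bar_estimate:.3f}")
```

Each entry of `predictions` is one realization of \( h_D(x_0) \); the spread of these values around their mean is what the variance term of the decomposition measures.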
The Mean Square Error
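Unlike for the statistics setting, it is easier for me to first decompose the mean square error (MSE), before defining bias and variance. In this case, the mean square error, also known as the expected test error, is
\[ \mathbb{E}_{(x,y),D}\left[ (h_D(x) - y)^2 \right] = \int_D \int_x \int_y (h_D(x) - y)^2 \, P(x,y) \, P(D) \, dy \, dx \, dD. \]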
Important note: there is no unanimous agreement that the expected test error should be defined this way. We would not be able to perform the same bias-variance decomposition in this article, if we defined the test error with a 0-1 loss function, instead of the squared deviation. See this paper for an example.
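Like before, we can decompose the MSE into two terms by using the "adding zero" trick: we add and subtract a \( \overline{h}(x) \) term inside the expectation, and then expand the square,
\[ \mathbb{E}_{(x,y),D}\left[ (h_D(x) - y)^2 \right] = \mathbb{E}_{x,D}\left[ (h_D(x) - \overline{h}(x))^2 \right] + \mathbb{E}_{(x,y)}\left[ (\overline{h}(x) - y)^2 \right]. \]
The cross term \( 2 \, \mathbb{E}_{(x,y),D}\left[ (h_D(x) - \overline{h}(x)) (\overline{h}(x) - y) \right] \) vanishes, because \( \mathbb{E}_D[h_D(x) - \overline{h}(x)] = 0 \) for every fixed \( x \), and the factor \( \overline{h}(x) - y \) does not depend on \( D \). The first term is the variance.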
But unlike before, the second term is not the bias!
Getting the Bias Term
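To get the bias term, we apply the "adding zero" trick once again. But this time, we add and subtract the expected label \( \overline{y}(x) \) instead. Notice that neither the expected classifier nor the expected label depends on \( D \), and hence the expectation is just taken over elements \( (x,y) \) from the test set,
\[ \mathbb{E}_{(x,y)}\left[ (\overline{h}(x) - y)^2 \right] = \mathbb{E}_{x}\left[ (\overline{h}(x) - \overline{y}(x))^2 \right] + \mathbb{E}_{(x,y)}\left[ (\overline{y}(x) - y)^2 \right]. \]
The cross term vanishes again, this time because \( \mathbb{E}_{y|x}[\overline{y}(x) - y] = 0 \). The first term on the right is the squared bias, and the second term is the noise in the labels.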
There is one last thing to mention before we finish up. In the previous statistical setting, the bias part of the decomposition is the squared bias \( (\mathbb{E}(h_D) - y)^2 \). It is possible to get something similar if this derivation were applied for a fixed choice of input \( x = x_0 \). Then, the expectation over \( x \) goes away, and we get a similar squared bias expression,
\[ (\overline{h}(x_0) - \overline{y}(x_0))^2. \]
Putting It All Together
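Combining everything from before into one equation, we get the well-known additive bias-variance decomposition for the expected test error of a machine learning algorithm,
\[ \underbrace{\mathbb{E}_{(x,y),D}\left[ (h_D(x) - y)^2 \right]}_{\text{expected test error}} = \underbrace{\mathbb{E}_{x,D}\left[ (h_D(x) - \overline{h}(x))^2 \right]}_{\text{variance}} + \underbrace{\mathbb{E}_{x}\left[ (\overline{h}(x) - \overline{y}(x))^2 \right]}_{\text{bias}^2} + \underbrace{\mathbb{E}_{(x,y)}\left[ (\overline{y}(x) - y)^2 \right]}_{\text{noise}}. \]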
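The split of the expected test error into variance, squared bias, and noise can be checked numerically. The sketch below is one way to do so under stated assumptions: the expected label is \( \sin(x) \), labels carry Gaussian noise with standard deviation 0.3, and the model is a straight line, so the bias is deliberately non-zero. None of these choices come from the article's figures; they are picked only to make each term visible.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3  # assumed label noise (standard deviation)

def y_bar(x):
    """Assumed expected label for input x."""
    return np.sin(x)

def sample(n):
    """Draw n points (x, y) from the assumed distribution P(X, Y)."""
    x = rng.uniform(0.0, 2.0 * np.pi, size=n)
    y = y_bar(x) + rng.normal(0.0, sigma, size=n)
    return x, y

x_test, y_test = sample(5000)      # test set
n_datasets, n_train = 500, 30      # number and size of training sets D

# preds[d, j] is h_D(x_j) for the d-th training set and j-th test point.
preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x, y = sample(n_train)
    beta, alpha = np.polyfit(x, y, deg=1)
    preds[d] = alpha + beta * x_test

h_bar = preds.mean(axis=0)                        # expected classifier at each x_j
mse = np.mean((preds - y_test) ** 2)              # expected test error
variance = np.mean((preds - h_bar) ** 2)
bias_sq = np.mean((h_bar - y_bar(x_test)) ** 2)
noise = np.mean((y_bar(x_test) - y_test) ** 2)

print(f"mse = {mse:.4f}")
print(f"variance + bias^2 + noise = {variance + bias_sq + noise:.4f}")
```

The two printed numbers should agree up to a small Monte Carlo error, which comes from the second cross term being only approximately zero on a finite test set.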
Bias-Variance Tradeoff
For older machine learning models like linear regression or decision trees, there is a notion of a tradeoff between a model with high bias and a model with high variance.
- High variance: If you make a complicated classifier that fits the training set very well, the deviation from the expected classifier between datasets is likely to be high.
- High bias: If you make a simple classifier that only roughly fits the training set, the deviation from the expected classifier would now be low, but the distance between the predicted and expected labels is likely to be larger.
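The two cases above can be illustrated with polynomial regression, where the degree plays the role of model complexity. This is a minimal sketch under several assumptions: the expected label is \( \sin(\pi x) \), the noise standard deviation is 0.3, and, as a simplification of the setup in this article, the training inputs are kept fixed so that only the labels vary between training sets. The function name `bias_variance` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                              # assumed label noise
x_train = np.linspace(-1.0, 1.0, 30)     # fixed training inputs (simplification)
x_test = np.linspace(-1.0, 1.0, 200)     # inputs at which the model is evaluated
n_datasets = 500

def f(x):
    """Assumed expected label for input x."""
    return np.sin(np.pi * x)

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    # One column of noisy labels per training set D.
    noise = rng.normal(0.0, sigma, size=(x_train.size, n_datasets))
    Y = f(x_train)[:, None] + noise
    # np.polyfit fits every column at once; result has shape (degree + 1, n_datasets).
    coeffs = np.polyfit(x_train, Y, deg=degree)
    # np.vander uses decreasing powers, matching the coefficient order of polyfit.
    preds = np.vander(x_test, degree + 1) @ coeffs   # shape (n_test, n_datasets)
    h_bar = preds.mean(axis=1)                       # expected classifier
    variance = np.mean((preds - h_bar[:, None]) ** 2)
    bias_sq = np.mean((h_bar - f(x_test)) ** 2)
    return bias_sq, variance

bias_simple, var_simple = bias_variance(1)     # simple model: a straight line
bias_complex, var_complex = bias_variance(9)   # complicated model: degree 9

print(f"degree 1: bias^2 = {bias_simple:.4f}, variance = {var_simple:.4f}")
print(f"degree 9: bias^2 = {bias_complex:.4f}, variance = {var_complex:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-9 polynomial the larger variance, which is the pattern the two bullet points describe.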
Here is a list of recent research papers on how this tradeoff need not exist for more modern algorithms: link to papers.