Simple Linear Regression - Part 3
GDP and Life Expectancy
I applied simple linear regression to physical quantities in an engineering problem in part 2 of this series. Here, we will see how simple linear regression performs on a more complicated problem in social science.
Data Cleaning and Exploration
My main objective here is to construct a simple linear regression model that predicts \( Y = \) life expectancy in years.
I am using a dataset from this Kaggle webpage. According to that page, the data was collected from the WHO and the United Nations website with the help of Deeksha Russell and Duan Wang.
Before proceeding, I poked around the dataset to see what it looks like and whether it has any issues. One of the first things I had to fix was the inconsistent use of whitespace in the dataset’s header.
# read every line of the raw file into a list
with open('Life Expectancy Data.csv', 'r') as f:
    x = f.readlines()

# get rid of excess whitespace in the header (first line)
x[0] = x[0].replace(' ,', ',')   # space before a comma
x[0] = x[0].replace(', ', ',')   # space after a comma
x[0] = x[0].replace('  ', ' ')   # double space -> single space

# write the cleaned lines to a new file
with open('life.csv', 'w') as g:
    g.writelines(x)

The data has more issues, such as missing values. But let’s first load the CSV file with pandas, compute the correlations, and take a look.
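import pandas as pd

data = pd.read_csv('life.csv')
life = data['Life expectancy']
# correlations between numeric columns only (skips any text columns)
corr = data.corr(numeric_only=True)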
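To make "take a look" a little more concrete, the following sketch counts the missing values in each column and ranks the numeric columns by their correlation with life expectancy. It runs on a tiny made-up DataFrame (the `toy` name and all of its values are illustrative assumptions, not the Kaggle data); with the real file, `data` from above would take its place.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'Life expectancy': [55.0, 62.0, 70.0, 78.0, np.nan],
    'GDP': [400.0, np.nan, 5000.0, 40000.0, 900.0],
    'Schooling': [6.0, 9.0, 12.0, 16.0, 8.0],
})

# Number of missing entries per column.
missing = toy.isna().sum()
print(missing)

# Pairwise Pearson correlations of the numeric columns only;
# rows with a missing value are skipped pair by pair.
corr = toy.corr(numeric_only=True)

# Correlation of every other numeric column with life expectancy, largest first.
ranked = corr['Life expectancy'].drop('Life expectancy').sort_values(ascending=False)
print(ranked)
```

Sorting the single `Life expectancy` column of the correlation matrix is usually easier to read than scanning the full matrix by eye.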
We want our explanatory variable to be fairly highly correlated with \( Y \). A few of the variables meet this rough criterion. For example, schooling has a correlation coefficient of \( 0.751975 \). However, this might be due to the fact that the longer someone lives, the more years of schooling they can get. “Income composition of resources” is highly correlated as well, but there are too many unknowns with regard to how this number is calculated.
In the end, I went with the classic choice of \( X = \) GDP for the explanatory variable. However, there are \( 453 \) rows that are missing GDP values. The life expectancy values of these rows appear to be random, so removing them does not appear to introduce bias. These rows are therefore removed. Rows without the necessary \( Y = \) life expectancy values are also removed.
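One rough way to check that the rows with missing GDP look similar to the rest is to compare the life expectancy distribution of the two groups before dropping anything. The sketch below does this on made-up numbers (the `toy` frame and its values are assumptions for illustration only). Similar group summaries are consistent with values missing at random, though they do not prove it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    'Life expectancy': [52.0, 60.0, 68.0, 75.0, 81.0, 59.0, 70.0, 77.0],
    'GDP': [300.0, 1200.0, 4500.0, 15000.0, 48000.0, np.nan, np.nan, np.nan],
})

# Flag rows whose GDP is missing, then summarize life expectancy per group.
gdp_missing = toy['GDP'].isna()
summary = toy.groupby(gdp_missing)['Life expectancy'].agg(['count', 'mean', 'std'])
summary.index = ['GDP present', 'GDP missing']  # False sorts before True
print(summary)
```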
gdp_nan = data['GDP'].isna()
life_nan = data['Life expectancy'].isna()
nan = gdp_nan | life_nan     # element-wise OR: row is missing either value
not_nan = ~nan               # element-wise NOT
data = data[not_nan].copy()  # copy so that later in-place edits act on an independent frame

Fitting The Model
Let’s take a look at the scatterplots of GDP and life expectancy.
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 18})

# left panel: raw GDP
plt.subplot(1, 2, 1)
plt.scatter(data['GDP'], data['Life expectancy'])
plt.xlabel('GDP')
plt.ylabel('life expectancy')

# right panel: natural log of GDP
plt.subplot(1, 2, 2)
plt.scatter(np.log(data['GDP']), data['Life expectancy'])
plt.xlabel('log GDP')
plt.ylabel('life expectancy')

plt.show()
The scatterplot on the left suggests that these two variables have a logarithmic relationship. The scatterplot on the right, where GDP is log transformed, looks much better suited to a simple linear regression model.
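The visual impression can be backed by a number: if life expectancy is roughly linear in log GDP, its Pearson correlation with log GDP should be noticeably higher than with raw GDP. The sketch below shows the effect on synthetic data generated under exactly that assumption (the coefficients, noise level, and seed are made up and are not estimates from the dataset).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "GDP" spread over several orders of magnitude (log-uniform).
gdp = np.exp(rng.uniform(np.log(100.0), np.log(100000.0), size=500))

# Synthetic response that is linear in log(GDP) plus noise.
life = 45.0 + 3.0 * np.log(gdp) + rng.normal(0.0, 3.0, size=500)

r_raw = np.corrcoef(gdp, life)[0, 1]          # correlation with raw GDP
r_log = np.corrcoef(np.log(gdp), life)[0, 1]  # correlation with log GDP

print(f"raw GDP: r = {r_raw:.3f}")
print(f"log GDP: r = {r_log:.3f}")
```

The same two calls to `np.corrcoef` could be applied to the cleaned `data['GDP']` and `data['Life expectancy']` columns to compare the two candidate models on the real data.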
So, instead of fitting a regression model with variables \( X \) and \( Y \), we fit the model \( Y = a + bZ \), where \( Z = \log(X) \), and \( \log \) is the natural logarithm. Just as in part 2 of this series, it is easy to fit a simple linear regression model with sklearn in Python.
import numpy as np
from sklearn.linear_model import LinearRegression as lr  # assumed import behind the lr alias

# sort data by GDP for the residual analysis later
data.sort_values(by=['GDP'], inplace=True)

# sklearn expects 2-D arrays: one row per sample, one column per feature
x = np.log(data['GDP']).to_numpy().reshape(-1, 1)
y = data['Life expectancy'].to_numpy().reshape(-1, 1)

reg = lr()
reg.fit(x, y)

reg.score(x, y)   # coefficient of determination R^2
reg.coef_         # estimated slope b
reg.intercept_    # estimated intercept a

The coefficient of determination \( R^2 \), and the estimated values of \( a \) and \( b \), are printed out below as reg.score, reg.intercept_ and reg.coef_ respectively.
>>> reg.score(x,y)
0.35802838694707295

>>> reg.intercept_
array([46.43152011])

>>> reg.coef_
array([[3.07175769]])

Visualizing The Fit
Just like in part 2 of this series, we plot the regression line over the data points to visually gauge how good the fit is. It is probably not a surprise that higher GDP is associated with higher life expectancy. With the estimated slope \( b \approx 3.07 \), a doubling of GDP corresponds to about \( 3.07 \times \log 2 \approx 2.1 \) more years of predicted life expectancy.
While the regression line looks like a nice approximation of the general trend, the variance of the residual errors is clearly not constant (heteroscedasticity). Countries with high GDP tend to have consistently high life expectancy, while countries with lower GDP tend to have life expectancy that varies much more.
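The regression line can be drawn over the scatterplot along the following lines. This is a sketch on synthetic data, not the code behind the original figure: the data-generating numbers are made up, and `np.polyfit` stands in for the sklearn fit so that the block is self-contained.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-ins for log GDP (z) and life expectancy (y).
z = rng.uniform(np.log(100.0), np.log(100000.0), size=300)
y = 45.0 + 3.0 * z + rng.normal(0.0, 4.0, size=300)

# Least-squares line y = a + b*z; polyfit returns the highest degree first.
b, a = np.polyfit(z, y, deg=1)

# Two points are enough to draw the straight line.
z_line = np.array([z.min(), z.max()])
y_line = a + b * z_line

fig, ax = plt.subplots()
ax.scatter(z, y, s=10, alpha=0.5, label='data')
ax.plot(z_line, y_line, color='red', label='fitted line')
ax.set_xlabel('log GDP')
ax.set_ylabel('life expectancy')
ax.legend()
plt.show()
```

With the fitted sklearn model, `reg.predict(x)` plotted against `x` would give the same line without recomputing the coefficients.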
Residual Analysis
The residual errors are plotted below. Note that the data has already been sorted by GDP, from smallest to largest, in a previous step.
y_predicted = reg.predict(x)

e = y - y_predicted        # residuals, in order of increasing GDP
t = np.arange(0, len(e))   # rank of each observation by GDP
plt.scatter(t, e)
plt.show()
As expected, the spread of the residuals is not uniform. As mentioned before, the residual errors are larger for countries with low GDP, on the left side of the figure, and smaller for countries with high GDP, on the right side. This is not a big problem if we are only interested in using the regression line as a rough indication of how the two variables are related.
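The unequal spread can also be checked with a crude numerical comparison: since the residuals are ordered by GDP, the standard deviation of the first half can be compared with that of the second half. The sketch below illustrates the idea on synthetic residuals whose spread shrinks from left to right (all numbers are made up); with the real fit, `e` from above would replace the hypothetical `e_toy`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Synthetic residuals, ordered by the explanatory variable, whose
# standard deviation shrinks linearly from 8 to 2.
sigma = np.linspace(8.0, 2.0, n)
e_toy = rng.normal(0.0, 1.0, size=n) * sigma

# Split at the midpoint and compare the spread of each half.
low_half, high_half = e_toy[: n // 2], e_toy[n // 2 :]
sd_low = low_half.std(ddof=1)
sd_high = high_half.std(ddof=1)

print(f"residual sd, lower half: {sd_low:.2f}")
print(f"residual sd, upper half: {sd_high:.2f}")
print(f"ratio: {sd_low / sd_high:.2f}")
```

A ratio well above 1 points to larger residuals at the low end, matching what the residual plot shows.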