Simple Linear Regression - Part 2
Estimating Yacht Hydrodynamics
Having briefly gone through the mathematical theory in part 1, let’s demonstrate how simple linear regression can be used in practice.
Data Description
The dataset that I will be using comes from https://archive.ics.uci.edu/ml/datasets/yacht+hydrodynamics, and is contributed by Dr Roberto Lopez (Ship Hydromechanics Laboratory, Maritime and Transport Technology Department, Technical University of Delft ). If the link above becomes broken, please Google for “yacht hydrodynamics dataset”.
The link above has a description of the variables in the dataset:
- Longitudinal position of the center of buoyancy.
- Prismatic coefficient.
- Length-displacement ratio
- Beam-draught ratio
- Length-beam ratio
- Froude number
- Residuary resistance per unit weight of displacement
The variable I am trying to predict is
Cleaning and CSV Conversion
The data comes in a text file with a .data extension. It is space separated in an inconsistent way, and is also missing headers for each column. Let’s rename the file to “yacht.txt”, add a row of headers, clean up the excess whitespace, and turn it into csv format.
f = open('yacht.txt','r')
x = f.readlines()
f.close()
g = open('yacht.csv','w')
g.write('buoyancy,prismatic,length-displace,beam-draught,length-beam,froude,resist\n') # write header row
for i in range(0,len(x)):
y = x[i].strip() # remove leading/trailing whitespace
y = y.replace(' ',' '); # some rows had double spacing
y = y.replace(' ',','); # replace single spaces with comma
g.write(y + '\n') # added a line break after each row
g.close()
Preliminary Exploration
Let’s start by looking at the scatter plots of
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv('yacht.csv')
for i in range(0,6):
plt.subplot(2,3,i+1)
plt.scatter(data.iloc[:,i], data.iloc[:,6])
plt.show()
While some of these variables look like good candidates for splitting the data set into clusters, only the bottom right plot (
To do so, we transform
Choices made in this transformation process are not rigorous and does create problems mathematically. Here is a research paper on this topic. However, based on the results we will see later, I believe we should be fine here.
import numpy as np
plt.rcParams.update({'font.size': 14})
plt.scatter(data.iloc[:,5], np.log(data.iloc[:,6]+0.2))
plt.xlabel('Froude number')
plt.ylabel('log residuary resistance')
plt.show()
The dots in this plot of
Fitting the Model
Next, I fitted the simple linear regression model
import pandas as pd # data structures module
import numpy as np # n-dimensional array module
from sklearn.linear_model import LinearRegression as lr
data = pd.read_csv('yacht.csv')
x = data.iloc[:,5]
x = x.to_numpy().reshape(-1, 1)
y = np.log(data.iloc[:,6]+0.2)
y = y.to_numpy().reshape(-1, 1)
reg = lr()
reg.fit(x, y)
We can get the intercept
>>> reg.coef_ # slope coefficient
array([[15.5346844]])
>>> reg.intercept_ # intercept value
array([-3.17263269])
>>> reg.score(x,y) # coefficient of determination
0.9907102525754272
We can plot the regression line over a scatter plot of the data to get a rough idea of how good the fit is.
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})
plt.scatter(x,y,color='b')
plt.plot(x,reg.predict(x),color='k')
plt.legend(['regression line','original data'])
plt.xlabel('Froude number')
plt.ylabel('log residuary resistance')
plt.show()
The fit looks amazing! We can recover an equation containing just
This equation suggests that
Residual Analysis
Let’s also plot the residual errors to see if there are any obvious problems. We will need to sort the data by the x-axis values.
data.sort_values(by=['froude'],inplace=True)
Note that the numpy array
import matplotlib.pyplot as plt # for visualization
plt.rcParams.update({'font.size': 18})
e = y - reg.predict(x)
t = np.arange(0,len(e))
plt.subplot(1,2,1)
plt.scatter(t,e)
plt.title('residual scatter')
plt.subplot(1,2,2)
plt.hist(e)
plt.title('residual histogram')
The left subplot of the residual errors shows clear heterogeneity. Errors corresponding to values in the middle are skewed downwards. While errors corresponding to values away from the middle are skewed upwards. That being said, our regression line still gave an excellent fit, and should be a fine approximation of the true relationship.
The histogram on the right subplot is roughly symmetric and does not look too skewed. Note that there is no objective quantitative method for doing residual analysis. It is entirely subjective!
I found out later that the Froude number is an important Physics concept for determining resistance. This highlights a particular situation that linear regression excels at: estimating physical laws of nature, similar to how Carl Friedrich Gauss used the method of least squares to estimate celestial orbits.
We will see another application of simple linear regression to a more messy example in part 3 of this article.