Simple Linear Regression - Part 2

Estimating Yacht Hydrodynamics

Having briefly gone through the mathematical theory in part 1, let’s demonstrate how simple linear regression can be used in practice.

Data Description

The dataset that I will be using comes from https://archive.ics.uci.edu/ml/datasets/yacht+hydrodynamics, and is contributed by Dr Roberto Lopez (Ship Hydromechanics Laboratory, Maritime and Transport Technology Department, Technical University of Delft ). If the link above becomes broken, please Google for “yacht hydrodynamics dataset”.

The link above has a description of the variables in the dataset:

Longitudinal position of the center of buoyancy.
Prismatic coefficient.
Length-displacement ratio
Beam-draught ratio
Length-beam ratio
Froude number
Residuary resistance per unit weight of displacement

The variable I am trying to predict is $ Y = $ residuary resistance per unit weight of displacement.

Cleaning and CSV Conversion

The data comes in a text file with a .data extension. It is space separated in an inconsistent way, and is also missing headers for each column. Let’s rename the file to “yacht.txt”, add a row of headers, clean up the excess whitespace, and turn it into csv format.

f = open('yacht.txt','r')		
x = f.readlines()
f.close()

g = open('yacht.csv','w')
g.write('buoyancy,prismatic,length-displace,beam-draught,length-beam,froude,resist\n') # write header row

for i in range(0,len(x)):
  y = x[i].strip() # remove leading/trailing whitespace
  y = y.replace('  ',' '); # some rows had double spacing
  y = y.replace(' ',','); # replace single spaces with comma
  g.write(y + '\n') # added a line break after each row

g.close()

Preliminary Exploration

Let’s start by looking at the scatter plots of $ Y $ against the rest of the variables.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('yacht.csv')

for i in range(0,6):
    plt.subplot(2,3,i+1)
    plt.scatter(data.iloc[:,i], data.iloc[:,6])

plt.show()

While some of these variables look like good candidates for splitting the data set into clusters, only the bottom right plot ( $ Y $ vs Froude number) shows some potential for a simple linear regression model. The relationship shown in the scatterplot is clearly not linear, but we can apply a logarithmic transform to the $ Y $ variable to get a more linear looking scatterplot.

To do so, we transform $ Y $ into $ Z = \log(Y+0.2) $ , where $ \log $ is the natural logarithm. I added an arbitrary $ 0.2 $ because some values of $ Y $ were close to zero, which behaves badly under logarithmic transformation.

Choices made in this transformation process are not rigorous and does create problems mathematically. Here is a research paper on this topic. However, based on the results we will see later, I believe we should be fine here.

import numpy as np

plt.rcParams.update({'font.size': 14})
plt.scatter(data.iloc[:,5], np.log(data.iloc[:,6]+0.2))

plt.xlabel('Froude number')
plt.ylabel('log residuary resistance')
plt.show()

The dots in this plot of $ Z $ against the Froude number appear to lie on a line. This suggests that a simple linear regression model should be a good fit for the data.

Fitting the Model

Next, I fitted the simple linear regression model $ Z = a + bX $ to our data, using Python.

import pandas as pd # data structures module
import numpy as np # n-dimensional array module
from sklearn.linear_model import LinearRegression as lr

data = pd.read_csv('yacht.csv')

x = data.iloc[:,5]
x = x.to_numpy().reshape(-1, 1)

y = np.log(data.iloc[:,6]+0.2)
y = y.to_numpy().reshape(-1, 1)

reg = lr()
reg.fit(x, y)

We can get the intercept $ a $ , the slope coefficient $ b $ , and the coefficient of determination $ R^2 $ .

>>> reg.coef_ # slope coefficient
array([[15.5346844]])

>>> reg.intercept_ # intercept value
array([-3.17263269])

>>> reg.score(x,y) # coefficient of determination
0.9907102525754272

We can plot the regression line over a scatter plot of the data to get a rough idea of how good the fit is.

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})

plt.scatter(x,y,color='b')
plt.plot(x,reg.predict(x),color='k')

plt.legend(['regression line','original data'])
plt.xlabel('Froude number')
plt.ylabel('log residuary resistance')
plt.show()

The fit looks amazing! We can recover an equation containing just $ Y $ and $ X $ by substituting $ Z = \log(Y + 0.2) $ into the transformed equation $ Z = a + bX $ . This gives us $ \log(Y + 0.2) = a + bX $ . We can then take exponential on both sides to get

$$ Y = e^{a + bX} - 0.2 = e^{a} e^{bX} - 0.2. $$

This equation suggests that $ X $ has a multiplicative relationship with $ Y $ , and not an additive one.

Residual Analysis

Let’s also plot the residual errors to see if there are any obvious problems. We will need to sort the data by the x-axis values.

data.sort_values(by=['froude'],inplace=True)

Note that the numpy array $ y $ in the code below is from the sorted dataset.

import matplotlib.pyplot as plt # for visualization
plt.rcParams.update({'font.size': 18})

e = y - reg.predict(x)
t = np.arange(0,len(e))

plt.subplot(1,2,1)
plt.scatter(t,e)
plt.title('residual scatter')

plt.subplot(1,2,2)
plt.hist(e)
plt.title('residual histogram')

The left subplot of the residual errors shows clear heterogeneity. Errors corresponding to values in the middle are skewed downwards. While errors corresponding to values away from the middle are skewed upwards. That being said, our regression line still gave an excellent fit, and should be a fine approximation of the true relationship.

The histogram on the right subplot is roughly symmetric and does not look too skewed. Note that there is no objective quantitative method for doing residual analysis. It is entirely subjective!

I found out later that the Froude number is an important Physics concept for determining resistance. This highlights a particular situation that linear regression excels at: estimating physical laws of nature, similar to how Carl Friedrich Gauss used the method of least squares to estimate celestial orbits.

We will see another application of simple linear regression to a more messy example in part 3 of this article.