Simple Linear Regression - Part 2

Estimating Yacht Hydrodynamics

Sections

Data Description
Cleaning and CSV Conversion
Preliminary Exploration
Fitting the Model
Residual Analysis

Having briefly gone through the mathematical theory in part 1, let’s demonstrate how simple linear regression can be used in practice.

Data Description

The dataset that I will be using comes from this page of the UCI repository. If the link becomes broken, an internet search for “yacht hydrodynamics dataset” should find it.

The link above has a description of the variables in the dataset:

Longitudinal position of the center of buoyancy.
Prismatic coefficient.
Length-displacement ratio
Beam-draught ratio
Length-beam ratio
Froude number
Residuary resistance per unit weight of displacement

The variable I am trying to predict is $ Y = $ residuary resistance per unit weight of displacement.

Cleaning and CSV Conversion

The data comes in a text file with a “.data” extension. It is space separated in an inconsistent way, and is also missing headers for each column. Let’s rename the file to “yacht.txt”, add a row of headers, clean up the excess whitespace, and turn it into csv format.

 1f = open('yacht.txt','r')		
 2x = f.readlines()
 3f.close()
 4
 5g = open('yacht.csv','w')
 6
 7# write header row
 8g.write('
 9	buoyancy,
10	prismatic,
11	length-displace,
12	beam-draught,
13	length-beam,
14	froude,
15	resist \n
16')
17
18for i in range(0,len(x)):
19  y = x[i].strip() # remove leading/trailing whitespace
20  y = y.replace('  ',' '); # some rows had double spacing
21  y = y.replace(' ',','); # replace single spaces with comma
22  g.write(y + '\n') # added a line break after each row
23
24g.close()

Preliminary Exploration

Let’s start by looking at the scatter plots of $ Y $ against the rest of the variables.

 1import pandas as pd
 2import matplotlib.pyplot as plt
 3
 4data = pd.read_csv('yacht.csv')
 5
 6for i in range(0,6):
 7    plt.subplot(2,3,i+1)
 8    plt.scatter(data.iloc[:,i], data.iloc[:,6])
 9
10plt.show()

While some of these variables look like good candidates for splitting the data set into clusters, only the bottom right plot ($ Y $ vs Froude number) shows some potential for a simple linear regression model. The relationship shown in the scatterplot is clearly not linear, but we can apply a logarithmic transform to the $ Y $ variable to get a more linear looking scatterplot.

To do so, we transform $ Y $ into $ Z = \log(Y+0.2) $, where $ \log $ is the natural logarithm. I added an arbitrary $ 0.2 $ because some values of $ Y $ were close to zero, which behaves badly under logarithmic transformation.

Choices made in this transformation process are not rigorous and does create problems mathematically. Here is a research paper on this topic. However, based on the results we will see later, I believe we should be fine here.

1import numpy as np
2
3plt.rcParams.update({'font.size': 14})
4plt.scatter(data.iloc[:,5], np.log(data.iloc[:,6]+0.2))
5
6plt.xlabel('Froude number')
7plt.ylabel('log residuary resistance')
8plt.show()

The dots in this plot of $ Z $ against the Froude number appear to lie on a line. This suggests that a simple linear regression model should be a good fit for the data.

Fitting the Model

Next, I fitted the simple linear regression model $ Z = a + bX $ to our data in Python.

 1import pandas as pd # data structures module
 2import numpy as np # n-dimensional array module
 3from sklearn.linear_model import LinearRegression as lr
 4
 5data = pd.read_csv('yacht.csv')
 6
 7x = data.iloc[:,5]
 8x = x.to_numpy().reshape(-1, 1)
 9
10y = np.log(data.iloc[:,6]+0.2)
11y = y.to_numpy().reshape(-1, 1)
12
13reg = lr()
14reg.fit(x, y)

We can get the intercept $ a $, the slope coefficient $ b $, and the coefficient of determination $ R^2 $.

1>>> reg.coef_ # slope coefficient
2array([[15.5346844]])
3
4>>> reg.intercept_ # intercept value
5array([-3.17263269])
6
7>>> reg.score(x,y) # coefficient of determination
80.9907102525754272

We can plot the regression line over a scatter plot of the data to get a rough idea of how good the fit is.

 1import matplotlib.pyplot as plt
 2plt.rcParams.update({'font.size': 18})
 3
 4plt.scatter(x,y,color='b')
 5plt.plot(x,reg.predict(x),color='k')
 6
 7plt.legend(['regression line','original data'])
 8plt.xlabel('Froude number')
 9plt.ylabel('log residuary resistance')
10plt.show()

The fit looks amazing! We can recover an equation containing just $ Y $ and $ X $ by substituting $ Z = \log(Y + 0.2) $ into the transformed equation $ Z = a + bX $. This gives us $ \log(Y + 0.2) = a + bX $. We can then take exponential on both sides to get

\[ Y = e^{a + bX} - 0.2 = e^{a} e^{bX} - 0.2. \]

This equation suggests that $ X $ has a multiplicative relationship with $ Y $ , and not an additive one.

Residual Analysis

Let’s also plot the residual errors to see if there are any obvious problems. We will need to sort the data by the x-axis values.

1data.sort_values(by=['froude'],inplace=True)

Note that the numpy array $ y $ in the code below is from the sorted dataset.

 1import matplotlib.pyplot as plt # for visualization
 2plt.rcParams.update({'font.size': 18})
 3
 4e = y - reg.predict(x)
 5t = np.arange(0,len(e))
 6
 7plt.subplot(1,2,1)
 8plt.scatter(t,e)
 9plt.title('residual scatter')
10
11plt.subplot(1,2,2)
12plt.hist(e)
13plt.title('residual histogram')

The left subplot of the residual errors shows clear heterogeneity. Errors corresponding to values in the middle are skewed downwards. While errors corresponding to values away from the middle are skewed upwards. That being said, our regression line still gave an excellent fit and should be a fine approximation of the true relationship.

The histogram on the right subplot is roughly symmetric and does not look too skewed. Note that there is no objective quantitative method for doing residual analysis. It is entirely subjective!

I found out later that the Froude number is an important Physics concept for determining resistance. This highlights a particular situation that linear regression excels at: estimating physical laws of nature, similar to how Carl Friedrich Gauss used the method of least squares to estimate celestial orbits.

We will see another application of simple linear regression to a more messy example in part 3 of this article.