PSL: Linear Regression (II)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as lm

Load Data

In [2]:
myData = pd.read_table("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")
myData.drop(myData.columns[0], axis=1, inplace=True)  # drop the row-index column
myData.drop('train', axis=1, inplace=True)            # drop the train/test indicator
myData.columns.values[8] = 'Y'                        # rename the response (lpsa) to 'Y'
myData.head()
Out[2]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
0 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783
1 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519
2 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519
3 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519
4 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564

Training vs Test Errors

In linear regression, as we add more variables, the training error (measured by RSS or MSE) never increases; it typically decreases. This reduction in training error, however, does not guarantee a corresponding decrease in the test error, which measures prediction accuracy on independent test data.

To demonstrate this, we randomly split the dataset into 60% training and 40% test, then fit a sequence of models on the training set, adding one predictor at a time. Additional predictors keep lowering the training error, while the test error need not follow the same trend.

In [3]:
n, p = myData.shape
p_range = np.arange(1, p)   # fit models with 1, 2, ..., p-1 predictors
np.random.seed(20)

from sklearn.model_selection import train_test_split

mytrain_size = 0.6
train_MSE = []
test_MSE = []

xTrain, xTest, yTrain, yTest = train_test_split(myData.iloc[:, :-1],
                                                myData['Y'],
                                                train_size = mytrain_size)
In [4]:
for i in p_range:
    # fit a model on the training data using the first i predictors
    myfit = lm()
    myfit.fit(xTrain.iloc[:, :i], yTrain)
    train_Y_pred = myfit.predict(xTrain.iloc[:, :i])
    train_MSE.append(np.mean((yTrain - train_Y_pred)**2))

    test_Y_pred = myfit.predict(xTest.iloc[:, :i])
    test_MSE.append(np.mean((yTest - test_Y_pred)**2))
In [5]:
plt.scatter(p_range, train_MSE, c = "blue", label="Train")
plt.plot(p_range, train_MSE, c = "blue")
plt.scatter(p_range, test_MSE, c = "red", label="Test")
plt.plot(p_range, test_MSE, c = "red")
plt.xlabel('# of variables')
plt.ylabel('MSE')
plt.legend()
plt.show()

You can remove np.random.seed(20) and run the code above multiple times. You’ll observe the following patterns:

  • In most instances, the blue line (training error) lies below the red line (test error): the training error is generally smaller than the test error. Because of the randomness of the split, however, the red line may occasionally fall below the blue line, i.e., the test error can sometimes be smaller than the training error.

  • In each iteration, the blue line is monotonically decreasing: the training error can only go down as more predictors are added. The red line, however, need not decrease monotonically. Looking at the differences between adjacent terms (computed below), those for the blue line are always negative, while those for the red line may include positive values.

In [6]:
np.vstack((np.diff(test_MSE), np.diff(train_MSE))).transpose()
Out[6]:
array([[-0.08321623, -0.06892564],
       [-0.00997633, -0.00311116],
       [ 0.03980787, -0.03006297],
       [-0.03887046, -0.07546234],
       [ 0.01135871, -0.00128435],
       [-0.00269569, -0.00889526],
       [-0.00157113, -0.01073944]])
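
The single split above is just one random draw. The sketch below (reusing myData, p_range, lm, and train_test_split defined above) repeats the split for a few different seeds and checks whether each error curve decreases monotonically; the training curve should always be monotone, while the test curve may not be.

In [ ]:
def mse_curves(seed):
    # one random 60/40 split, then models with 1, 2, ..., 8 predictors
    np.random.seed(seed)
    xTr, xTe, yTr, yTe = train_test_split(myData.iloc[:, :-1], myData['Y'],
                                          train_size = 0.6)
    tr, te = [], []
    for i in p_range:
        fit = lm().fit(xTr.iloc[:, :i], yTr)
        tr.append(np.mean((yTr - fit.predict(xTr.iloc[:, :i]))**2))
        te.append(np.mean((yTe - fit.predict(xTe.iloc[:, :i]))**2))
    return np.array(tr), np.array(te)

for seed in range(5):
    tr, te = mse_curves(seed)
    # True means the curve is strictly decreasing at every step
    print(seed, (np.diff(tr) < 0).all(), (np.diff(te) < 0).all())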

Interpret LS Coefficients

How should we interpret LS coefficients? Each coefficient describes how the response changes with its predictor while all other predictors are held constant. For instance, the coefficient for age quantifies the average change in the response for each one-year increase in age, with all other predictors held fixed.

It is noteworthy that results from Simple Linear Regression (SLR), which involves a single non-intercept predictor, can differ from results in Multiple Linear Regression (MLR), which incorporates multiple predictors. This divergence arises from correlations among the predictors.

For instance, suppose age is positively correlated with some predictors that have a positive effect on the response. In the joint model (MLR), the coefficient for age can turn out negative, seemingly contrary to its positive effect in the isolated model (SLR). The negative sign compensates for the positive contribution already captured by the correlated predictors: once those predictors are held fixed, the remaining association between age and the response is negative. This is exactly what happens with age in the two fits below.
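
The sign flip is easy to reproduce with simulated data. Below is a minimal sketch on purely hypothetical data (not the prostate data): x2 is strongly positively correlated with x1 and has a positive marginal association with y, yet receives a negative coefficient once x1 is included.

In [ ]:
# hypothetical simulated data: x1 and x2 are strongly positively correlated;
# y depends positively on x1 and negatively on x2
rng = np.random.default_rng(42)
n_sim = 500
x1 = rng.normal(size = n_sim)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size = n_sim)
y_sim = 2 * x1 - x2 + rng.normal(size = n_sim)

# SLR: the marginal coefficient of x2 "borrows" x1's positive effect
print(lm().fit(x2.reshape(-1, 1), y_sim).coef_)

# MLR: holding x1 fixed, the coefficient of x2 is close to -1
print(lm().fit(np.column_stack([x1, x2]), y_sim).coef_)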

In [7]:
import statsmodels.formula.api as smf
tmpfit = smf.ols(formula = "Y ~ age", data = myData).fit()
tmpfit.summary()
Out[7]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.029
Model: OLS Adj. R-squared: 0.019
Method: Least Squares F-statistic: 2.813
Date: Sat, 19 Aug 2023 Prob (F-statistic): 0.0968
Time: 00:32:54 Log-Likelihood: -149.64
No. Observations: 97 AIC: 303.3
Df Residuals: 95 BIC: 308.4
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.7991 1.008 0.793 0.430 -1.202 2.800
age 0.0263 0.016 1.677 0.097 -0.005 0.057
Omnibus: 2.538 Durbin-Watson: 0.067
Prob(Omnibus): 0.281 Jarque-Bera (JB): 2.096
Skew: 0.152 Prob(JB): 0.351
Kurtosis: 3.653 Cond. No. 558.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [29]:
all_covariates = list(myData.columns)
all_covariates.remove('Y')
my_formula = "Y ~ " + "+".join(all_covariates)
lmfit2 = smf.ols(formula = my_formula, data = myData).fit()
lmfit2.summary()
Out[29]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.663
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 21.68
Date: Sat, 19 Aug 2023 Prob (F-statistic): 7.65e-18
Time: 01:10:15 Log-Likelihood: -98.248
No. Observations: 97 AIC: 214.5
Df Residuals: 88 BIC: 237.7
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1816 1.321 0.137 0.891 -2.443 2.806
lcavol 0.5643 0.088 6.425 0.000 0.390 0.739
lweight 0.6220 0.201 3.096 0.003 0.223 1.021
age -0.0212 0.011 -1.917 0.058 -0.043 0.001
lbph 0.0967 0.058 1.670 0.098 -0.018 0.212
svi 0.7617 0.241 3.158 0.002 0.282 1.241
lcp -0.1061 0.090 -1.180 0.241 -0.285 0.073
gleason 0.0492 0.155 0.317 0.752 -0.259 0.358
pgg45 0.0045 0.004 1.021 0.310 -0.004 0.013
Omnibus: 0.563 Durbin-Watson: 1.540
Prob(Omnibus): 0.755 Jarque-Bera (JB): 0.183
Skew: 0.017 Prob(JB): 0.913
Kurtosis: 3.210 Cond. No. 1.32e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [8]:
np.round(myData.corr(), decimals = 2)
Out[8]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
lcavol 1.00 0.28 0.22 0.03 0.54 0.68 0.43 0.43 0.73
lweight 0.28 1.00 0.35 0.44 0.16 0.16 0.06 0.11 0.43
age 0.22 0.35 1.00 0.35 0.12 0.13 0.27 0.28 0.17
lbph 0.03 0.44 0.35 1.00 -0.09 -0.01 0.08 0.08 0.18
svi 0.54 0.16 0.12 -0.09 1.00 0.67 0.32 0.46 0.57
lcp 0.68 0.16 0.13 -0.01 0.67 1.00 0.51 0.63 0.55
gleason 0.43 0.06 0.27 0.08 0.32 0.51 1.00 0.75 0.37
pgg45 0.43 0.11 0.28 0.08 0.46 0.63 0.75 1.00 0.42
Y 0.73 0.43 0.17 0.18 0.57 0.55 0.37 0.42 1.00

Partial Regression Coef

Check how to retrieve the LS coefficient for age using Algorithm 3.1: regress Y on all predictors except age and keep the residuals y_star; regress age on the same predictors and keep the residuals age_star; then regress y_star on age_star.

In [26]:
partial_lm = lm()
X = myData.drop(['Y', 'age'], axis = 1).copy()
y = myData['Y']

# Step 1: residuals of Y after regressing out all predictors except age
partial_lm.fit(X, y)
y_star = y - partial_lm.predict(X)
y_star = y_star.values.reshape(-1, 1)

# Step 2: residuals of age after regressing out the same predictors
partial_lm.fit(X, myData['age'])
age_star = myData['age'] - partial_lm.predict(X)
age_star = age_star.values.reshape(-1, 1)

# Step 3: simple regression of y_star on age_star recovers the MLR coefficient for age
partial_lm.fit(age_star, y_star)
print(partial_lm.coef_)
[[-0.02124819]]

The LS coefficient for age (from this partial_lm) is the same as the one from lmfit2. The residuals from the two LS models are also the same.

In [31]:
res1 = y - lmfit2.predict()                    # residuals from the full MLR fit
res2 = y_star - partial_lm.predict(age_star)   # residuals from the partial regression
print(np.sum(res1 ** 2))
print(np.sum(res2 ** 2))
43.058418737724914
43.05841873772491
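
Beyond comparing residual sums of squares, the two residual vectors can be checked elementwise; a quick sketch reusing res1 and res2 from above:

In [ ]:
# res2 has shape (n, 1); flatten it before the elementwise comparison
print(np.allclose(res1, res2.ravel()))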

F-test

Test a single predictor by comparing the full model with the model that drops age. When only one predictor is tested, the F-statistic equals the square of the corresponding t-statistic, so the F-test and t-test are equivalent.

In [10]:
X = myData.iloc[:, :-1]
Y = myData['Y']

# full model with all eight predictors
lmfit1 = lm()
lmfit1.fit(X, Y)

# reduced model without age
X0 = X.drop(['age'], axis=1)
lmfit0 = lm()
lmfit0.fit(X0, Y)

# note: .score returns R-squared, not RSS
R2_0 = lmfit0.score(X0, Y)
R2_1 = lmfit1.score(X, Y)
In [21]:
import scipy.stats as stats
n, p = myData.shape
p = p - 1  # p = 8, the number of non-intercept predictors

# F-statistic for dropping one predictor, written in terms of R-squared:
# F = (R2_full - R2_reduced) / ((1 - R2_full) / (n - p - 1))
f_value = (R2_1 - R2_0) / ((1 - R2_1) / (n - p - 1))
print(f_value, np.sqrt(f_value))  # F-stat and |t-stat| for age
3.6748878552009163 1.9169997014086664
In [22]:
1 - stats.f.cdf(f_value, 1, n - p - 1)  # p-value for the F-test for age
Out[22]:
0.05848267729890899
In [23]:
2 * (1 - stats.t.cdf(np.sqrt(f_value), n - p - 1))  # p-value for the two-sided t-test for age
Out[23]:
0.0584826772989091
In [14]:
lmfit2.summary2().tables[1].loc['age', :]  # compare with the t-stat and p-value for age from lmfit2
Out[14]:
Coef.      -0.021248
Std.Err.    0.011084
t          -1.917000
P>|t|       0.058483
[0.025     -0.043275
0.975]      0.000779
Name: age, dtype: float64
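
The same comparison can also be run directly in statsmodels by fitting the reduced model without age and calling compare_f_test on the full fit lmfit2; a minimal sketch (the name lmfit0_sm is just for illustration). The returned F statistic and p-value should agree with the values computed above.

In [ ]:
# reduced model: all prostate predictors except age
reduced = [c for c in myData.columns if c not in ('Y', 'age')]
lmfit0_sm = smf.ols(formula = "Y ~ " + "+".join(reduced), data = myData).fit()

# compare_f_test(restricted) returns (F statistic, p-value, df difference)
print(lmfit2.compare_f_test(lmfit0_sm))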

Collinearity

Check the Car Seat Position Data.

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know how different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers.

  • Age: age in years
  • Weight: weight in lbs
  • HtShoes: height with shoes in cm
  • Ht: height without shoes in cm
  • Seated: seated height in cm
  • Arm: lower arm length in cm
  • Thigh: thigh length in cm
  • Leg: lower leg length in cm
  • hipcenter: horizontal distance of the midpoint of the hips from a fixed location in the car in mm

The researchers were interested in determining whether a relationship exists between hipcenter and the other variables. Because of the high correlations among the predictors, the full model has a fairly high R-squared and a highly significant overall F-test, yet none of the individual coefficients is significant.

In [32]:
seatpos = pd.read_csv("https://liangfgithub.github.io/Data/SeatPos.csv")
In [33]:
seatpos.shape
Out[33]:
(38, 9)
In [34]:
seatpos.head()
Out[34]:
Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
0 46 180 187.2 184.9 95.2 36.1 45.3 41.3 -206.300
1 31 175 167.5 165.5 83.8 32.9 36.5 35.9 -178.210
2 23 100 153.6 152.2 82.9 26.0 36.6 31.0 -71.673
3 19 185 190.3 187.4 97.3 37.4 44.1 41.0 -257.720
4 23 159 178.0 174.1 93.9 29.5 40.1 36.9 -173.230
In [ ]:
sns.pairplot(seatpos)
In [35]:
all_covariates = list(seatpos.columns)
all_covariates.remove('hipcenter')
my_formula = "hipcenter ~ " + "+".join(all_covariates)
tmpfit = smf.ols(formula = my_formula, data = seatpos).fit()
tmpfit.summary()
Out[35]:
OLS Regression Results
Dep. Variable: hipcenter R-squared: 0.687
Model: OLS Adj. R-squared: 0.600
Method: Least Squares F-statistic: 7.940
Date: Sat, 19 Aug 2023 Prob (F-statistic): 1.31e-05
Time: 01:13:56 Log-Likelihood: -186.73
No. Observations: 38 AIC: 391.5
Df Residuals: 29 BIC: 406.2
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 436.4321 166.572 2.620 0.014 95.755 777.109
Age 0.7757 0.570 1.360 0.184 -0.391 1.942
Weight 0.0263 0.331 0.080 0.937 -0.651 0.703
HtShoes -2.6924 9.753 -0.276 0.784 -22.640 17.255
Ht 0.6013 10.130 0.059 0.953 -20.117 21.319
Seated 0.5338 3.762 0.142 0.888 -7.160 8.228
Arm -1.3281 3.900 -0.341 0.736 -9.305 6.649
Thigh -1.1431 2.660 -0.430 0.671 -6.583 4.297
Leg -6.4390 4.714 -1.366 0.182 -16.080 3.202
Omnibus: 0.543 Durbin-Watson: 1.769
Prob(Omnibus): 0.762 Jarque-Bera (JB): 0.664
Skew: 0.157 Prob(JB): 0.717
Kurtosis: 2.434 Cond. No. 8.44e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [36]:
np.round(seatpos.corr(), decimals = 2)
Out[36]:
Age Weight HtShoes Ht Seated Arm Thigh Leg hipcenter
Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04 0.21
Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78 -0.64
HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91 -0.80
Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91 -0.80
Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81 -0.73
Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75 -0.59
Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65 -0.59
Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00 -0.79
hipcenter 0.21 -0.64 -0.80 -0.80 -0.73 -0.59 -0.59 -0.79 1.00
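
The near-perfect correlations among the height-related variables can be summarized with variance inflation factors, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. A minimal sketch using statsmodels, assuming seatpos is loaded as above:

In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# design matrix with an explicit intercept column, as required by the VIF helper
X_vif = sm.add_constant(seatpos.drop('hipcenter', axis = 1))
for i, name in enumerate(X_vif.columns):
    if name != 'const':
        print(name, round(variance_inflation_factor(X_vif.values, i), 1))
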
In [37]:
tmpfit = smf.ols('hipcenter ~ Age + Weight + Ht + Seated', data = seatpos).fit()
tmpfit.summary()
Out[37]:
OLS Regression Results
Dep. Variable: hipcenter R-squared: 0.660
Model: OLS Adj. R-squared: 0.619
Method: Least Squares F-statistic: 16.01
Date: Sat, 19 Aug 2023 Prob (F-statistic): 2.22e-07
Time: 01:14:16 Log-Likelihood: -188.28
No. Observations: 38 AIC: 386.6
Df Residuals: 33 BIC: 394.8
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 478.6589 159.734 2.997 0.005 153.678 803.639
Age 0.5840 0.426 1.372 0.179 -0.282 1.450
Weight -0.0154 0.316 -0.049 0.962 -0.659 0.628
Ht -4.9902 1.644 -3.036 0.005 -8.335 -1.646
Seated 2.0463 3.413 0.600 0.553 -4.897 8.990
Omnibus: 0.512 Durbin-Watson: 1.672
Prob(Omnibus): 0.774 Jarque-Bera (JB): 0.435
Skew: -0.244 Prob(JB): 0.805
Kurtosis: 2.810 Cond. No. 6.70e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.7e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [40]:
tmpfit = smf.ols('hipcenter ~ Ht', data = seatpos).fit()
tmpfit.summary()
Out[40]:
OLS Regression Results
Dep. Variable: hipcenter R-squared: 0.638
Model: OLS Adj. R-squared: 0.628
Method: Least Squares F-statistic: 63.53
Date: Sat, 19 Aug 2023 Prob (F-statistic): 1.83e-09
Time: 01:14:27 Log-Likelihood: -189.45
No. Observations: 38 AIC: 382.9
Df Residuals: 36 BIC: 386.2
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 556.2553 90.670 6.135 0.000 372.367 740.144
Ht -4.2650 0.535 -7.970 0.000 -5.350 -3.180
Omnibus: 1.707 Durbin-Watson: 1.890
Prob(Omnibus): 0.426 Jarque-Bera (JB): 1.028
Skew: -0.395 Prob(JB): 0.598
Kurtosis: 3.154 Cond. No. 2.60e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.6e+03. This might indicate that there are
strong multicollinearity or other numerical problems.