PSL: Linear Regression (II)¶

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as lm

Load Data¶

myData = pd.read_table("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")
myData.drop(myData.columns[0], axis=1, inplace=True)
myData.drop('train', axis=1, inplace=True)
myData.columns.values[8] = 'Y'
myData.head()

Training vs Test Errors¶

In linear regression models, as we introduce more variables, training error (measured by RSS or MSE) consistently decreases. However, this reduction in training error does not guarantee a corresponding decrease in test error, which measures prediction accuracy on independent test data.

To demonstrate this, we randomly split our dataset into 60% training and 40% test portions, progressively adding predictors. This approach highlights how additional predictors can lower training error while test error may not follow the same trend.

n, p = myData.shape
p_range = np.arange(p)
p_range = p_range[1:]
np.random.seed(20)

from sklearn.model_selection import train_test_split

mytrain_size = 0.6
train_MSE = []
test_MSE =[]

xTrain, xTest, yTrain, yTest = train_test_split(myData.iloc[:, :-1], \
                                                    myData['Y'], \
                                                    train_size = mytrain_size)

for i in p_range:
    myfit = lm()
    myfit.fit(xTrain.iloc[:, :i], yTrain)
    train_Y_pred = myfit.predict(xTrain.iloc[:, :i])
    train_MSE.append(np.mean((yTrain - train_Y_pred)**2))
    
    test_Y_pred = myfit.predict(xTest.iloc[:, :i])
    test_MSE.append(np.mean((yTest - test_Y_pred)**2))

plt.scatter(p_range, train_MSE, c = "blue", label="Train")
plt.plot(p_range, train_MSE, c = "blue")
plt.scatter(p_range, test_MSE, c = "red", label="Test")
plt.plot(p_range, test_MSE, c = "red")
plt.xlabel('# of variables')
plt.ylabel('MSE')
plt.legend()
plt.show()

You can remove np.random.seed(2) and run the code above multiple time. You’ll observe the following patterns:

In most instances, the blue line (representing training error) tends to be positioned below the red line (representing test error), indicating that the training error is generally superior to the test error. However, due to the inherent randomness, there might be instances where the red line falls below the blue line, indicating that the test error outperforms the training error occasionally.
In each iteration, the blue line consistently exhibits a monotonically decreasing trend, signifying the reduction of training error as more predictors are incorporated. On the other hand, the red line’s trajectory may not uniformly decrease. Upon closer examination of the differences between adjacent terms, you’ll observe negative values consistently for the blue line, but for the red line, these differences might encompass some positive values.

np.vstack((np.diff(test_MSE), np.diff(train_MSE))).transpose()

array([[-0.08321623, -0.06892564],
       [-0.00997633, -0.00311116],
       [ 0.03980787, -0.03006297],
       [-0.03887046, -0.07546234],
       [ 0.01135871, -0.00128435],
       [-0.00269569, -0.00889526],
       [-0.00157113, -0.01073944]])

Interpret LS Coefficients¶

How to interpret LS coefficients? LS coefficients express how changes in predictors affect the response while holding all other predictors constant. For instance, the coefficient for age quantifies the average change in the response variable for each year increase in age, holding all other predictors constant.

It’s noteworthy that results from Simple Linear Regression (SLR), which involves only one non-intercept predictor, might differ from results in Multiple Linear Regression (MLR), which incorporates multiple predictors. This divergence can arise due to correlations among predictors.

For instance, consider the case where age is positively correlated with some predictors that have a positive effect on the response. In a joint model (MLR), the coefficient for age appears negative, seemingly contrary to its positive effect in an isolated model (SLR). This adjustment compensates for the positive contribution introduced by other predictors, ensuring that the overall model accurately represents the relationships between predictors and the response variable.

import statsmodels.formula.api as smf
tmpfit = smf.ols(formula = "Y ~ age", data = myData).fit()
tmpfit.summary()

all_covariates = list(myData.columns)
all_covariates.remove('Y')
my_formula = "Y ~ " + "+".join(all_covariates)
lmfit2 = smf.ols(formula = my_formula, data = myData).fit()
lmfit2.summary()

np.round(myData.corr(), decimals = 2)

Partial Regression Coef¶

Check how to retrieve the LS coefficient for age using Algorithm 3.1

partial_lm = lm()
X = myData.drop(['Y', 'age'], axis = 1).copy()
y = myData['Y']
partial_lm.fit(X, y)
y_star = y - partial_lm.predict(X)
y_star = y_star.values.reshape(-1, 1)

partial_lm.fit(X, myData['age'])
age_star = myData['age'] - partial_lm.predict(X)
age_star = age_star.values.reshape(-1, 1)

partial_lm.fit(age_star, y_star)
print(partial_lm.coef_)

[[-0.02124819]]

The LS coefficient for age (from this partial_lm) is the same as the one from lmfit2. The residuals from the two LS models are also the same.

res1 = y - lmfit2.predict()
res2 = y_star - partial_lm.predict(age_star)
print(np.sum(res1 ** 2))
print(np.sum(res2 ** 2))

43.058418737724914
43.05841873772491

F-test¶

Test a single predictor (in this case, F-test = t-test).

X = myData.iloc[:, :-1]
Y = myData['Y']
lmfit1 = lm()
lmfit1.fit(X, myData['Y'])

X0 = X.drop(['age'], axis=1)
lmfit0 = lm()
lmfit0.fit(X0, Y)

RSS0 = lmfit0.score(X0, Y)
RSS1 = lmfit1.score(X, Y)

import scipy.stats as stats
n, p = myData.shape
p = p - 1 # p = 8, the number of non-intercept predictors

f_value = (RSS1 - RSS0)/((1 - RSS1)/(n - p -1))
print(f_value, np.sqrt(f_value))  # F-stat and t-stat for age

3.6748878552009163 1.9169997014086664

1 - stats.f.cdf(f_value, 1, n - p -1) # p-value for F-test for age

0.05848267729890899

2*(1 - stats.t.cdf(np.sqrt(f_value), n-p-1))  # p-value for T-test for age

0.0584826772989091

lmfit2.summary2().tables[1].loc['age',:] # compare the T-stat and p-value for age from lmfit2

Coef.      -0.021248
Std.Err.    0.011084
t          -1.917000
P>|t|       0.058483
[0.025     -0.043275
0.975]      0.000779
Name: age, dtype: float64

Collinearity¶

Check the Car Seat Position Data.

Car drivers like to adjust the seat position for their own comfort. Car designers would find it helpful to know how different drivers will position the seat depending on their size and age. Researchers at the HuMoSim laboratory at the University of Michigan collected data on 38 drivers.

Age:
Weight:
HtShoes: height with shoes in cm
Ht: height without shoes in cm
Seated: seated height in cm
Arm: lower arm length in cm
Thigh: thigh length in cm
Leg: lower leg length in cm
hipcenter: horizontal distance of the midpoint of the hips from a fixed location in the car in mm

The researchers were interested in determining if a relationship exists between hipcenter and the other variables. Due to the high correlations among the predictors, we see high R-square, significant overall F-test, but no individual variables are significant.

seatpos = pd.read_csv("https://liangfgithub.github.io/Data/SeatPos.csv")

seatpos.shape

(38, 9)

seatpos.head()

sns.pairplot(seatpos)

all_covariates = list(seatpos.columns)
all_covariates.remove('hipcenter')
my_formula = "hipcenter ~ " + "+".join(all_covariates)
tmpfit = smf.ols(formula = my_formula, data = seatpos).fit()
tmpfit.summary()

np.round(seatpos.corr(), decimals = 2)

tmpfit = smf.ols('hipcenter ~ Age + Weight + Ht + Seated', data = seatpos).fit()
tmpfit.summary()

tmpfit = smf.ols('hipcenter ~ Ht', data = seatpos).fit()
tmpfit.summary()

	lcavol	lweight	age	lbph	lcp	gleason	pgg45	Y
0	-0.579818	2.769459	50	-1.386294	-1.386294	6	0	-0.430783
1	-0.994252	3.319626	58	-1.386294	-1.386294	6	0	-0.162519
2	-0.510826	2.691243	74	-1.386294	-1.386294	7	20	-0.162519
3	-1.203973	3.282789	58	-1.386294	-1.386294	6	0	-0.162519
4	0.751416	3.432373	62	-1.386294	-1.386294	6	0	0.371564

Dep. Variable:	Y	R-squared:	0.029
Model:	OLS	Adj. R-squared:	0.019
Method:	Least Squares	F-statistic:	2.813
Date:	Sat, 19 Aug 2023	Prob (F-statistic):	0.0968
Time:	00:32:54	Log-Likelihood:	-149.64
No. Observations:	97	AIC:	303.3
Df Residuals:	95	BIC:	308.4
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.7991	1.008	0.793	0.430	-1.202	2.800
age	0.0263	0.016	1.677	0.097	-0.005	0.057

Omnibus:	2.538	Durbin-Watson:	0.067
Prob(Omnibus):	0.281	Jarque-Bera (JB):	2.096
Skew:	0.152	Prob(JB):	0.351
Kurtosis:	3.653	Cond. No.	558.

Dep. Variable:	Y	R-squared:	0.663
Model:	OLS	Adj. R-squared:	0.633
Method:	Least Squares	F-statistic:	21.68
Date:	Sat, 19 Aug 2023	Prob (F-statistic):	7.65e-18
Time:	01:10:15	Log-Likelihood:	-98.248
No. Observations:	97	AIC:	214.5
Df Residuals:	88	BIC:	237.7
Df Model:	8
Covariance Type:	nonrobust

Omnibus:	0.563	Durbin-Watson:	1.540
Prob(Omnibus):	0.755	Jarque-Bera (JB):	0.183
Skew:	0.017	Prob(JB):	0.913
Kurtosis:	3.210	Cond. No.	1.32e+03

	lcavol	lweight	age	lbph	svi	lcp	gleason	pgg45	Y
lcavol	1.00	0.28	0.22	0.03	0.54	0.68	0.43	0.43	0.73
lweight	0.28	1.00	0.35	0.44	0.16	0.16	0.06	0.11	0.43
age	0.22	0.35	1.00	0.35	0.12	0.13	0.27	0.28	0.17
lbph	0.03	0.44	0.35	1.00	-0.09	-0.01	0.08	0.08	0.18
svi	0.54	0.16	0.12	-0.09	1.00	0.67	0.32	0.46	0.57
lcp	0.68	0.16	0.13	-0.01	0.67	1.00	0.51	0.63	0.55
gleason	0.43	0.06	0.27	0.08	0.32	0.51	1.00	0.75	0.37
pgg45	0.43	0.11	0.28	0.08	0.46	0.63	0.75	1.00	0.42
Y	0.73	0.43	0.17	0.18	0.57	0.55	0.37	0.42	1.00

	Age	Weight	HtShoes	Ht	Seated	Arm	Thigh	Leg	hipcenter
0	46	180	187.2	184.9	95.2	36.1	45.3	41.3	-206.300
1	31	175	167.5	165.5	83.8	32.9	36.5	35.9	-178.210
2	23	100	153.6	152.2	82.9	26.0	36.6	31.0	-71.673
3	19	185	190.3	187.4	97.3	37.4	44.1	41.0	-257.720
4	23	159	178.0	174.1	93.9	29.5	40.1	36.9	-173.230

Dep. Variable:	hipcenter	R-squared:	0.687
Model:	OLS	Adj. R-squared:	0.600
Method:	Least Squares	F-statistic:	7.940
Date:	Sat, 19 Aug 2023	Prob (F-statistic):	1.31e-05
Time:	01:13:56	Log-Likelihood:	-186.73
No. Observations:	38	AIC:	391.5
Df Residuals:	29	BIC:	406.2
Df Model:	8
Covariance Type:	nonrobust

Omnibus:	0.543	Durbin-Watson:	1.769
Prob(Omnibus):	0.762	Jarque-Bera (JB):	0.664
Skew:	0.157	Prob(JB):	0.717
Kurtosis:	2.434	Cond. No.	8.44e+03

	Age	Weight	HtShoes	Ht	Seated	Arm	Thigh	Leg	hipcenter
Age	1.00	0.08	-0.08	-0.09	-0.17	0.36	0.09	-0.04	0.21
Weight	0.08	1.00	0.83	0.83	0.78	0.70	0.57	0.78	-0.64
HtShoes	-0.08	0.83	1.00	1.00	0.93	0.75	0.72	0.91	-0.80
Ht	-0.09	0.83	1.00	1.00	0.93	0.75	0.73	0.91	-0.80
Seated	-0.17	0.78	0.93	0.93	1.00	0.63	0.61	0.81	-0.73
Arm	0.36	0.70	0.75	0.75	0.63	1.00	0.67	0.75	-0.59
Thigh	0.09	0.57	0.72	0.73	0.61	0.67	1.00	0.65	-0.59
Leg	-0.04	0.78	0.91	0.91	0.81	0.75	0.65	1.00	-0.79
hipcenter	0.21	-0.64	-0.80	-0.80	-0.73	-0.59	-0.59	-0.79	1.00

Omnibus:	0.512	Durbin-Watson:	1.672
Prob(Omnibus):	0.774	Jarque-Bera (JB):	0.435
Skew:	-0.244	Prob(JB):	0.805
Kurtosis:	2.810	Cond. No.	6.70e+03

Omnibus:	1.707	Durbin-Watson:	1.890
Prob(Omnibus):	0.426	Jarque-Bera (JB):	1.028
Skew:	-0.395	Prob(JB):	0.598
Kurtosis:	3.154	Cond. No.	2.60e+03

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	0.1816	1.321	0.137	0.891	-2.443	2.806
lcavol	0.5643	0.088	6.425	0.000	0.390	0.739
lweight	0.6220	0.201	3.096	0.003	0.223	1.021
age	-0.0212	0.011	-1.917	0.058	-0.043	0.001
lbph	0.0967	0.058	1.670	0.098	-0.018	0.212
svi	0.7617	0.241	3.158	0.002	0.282	1.241
lcp	-0.1061	0.090	-1.180	0.241	-0.285	0.073
gleason	0.0492	0.155	0.317	0.752	-0.259	0.358
pgg45	0.0045	0.004	1.021	0.310	-0.004	0.013

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	436.4321	166.572	2.620	0.014	95.755	777.109
Age	0.7757	0.570	1.360	0.184	-0.391	1.942
Weight	0.0263	0.331	0.080	0.937	-0.651	0.703
HtShoes	-2.6924	9.753	-0.276	0.784	-22.640	17.255
Ht	0.6013	10.130	0.059	0.953	-20.117	21.319
Seated	0.5338	3.762	0.142	0.888	-7.160	8.228
Arm	-1.3281	3.900	-0.341	0.736	-9.305	6.649
Thigh	-1.1431	2.660	-0.430	0.671	-6.583	4.297
Leg	-6.4390	4.714	-1.366	0.182	-16.080	3.202

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	478.6589	159.734	2.997	0.005	153.678	803.639
Age	0.5840	0.426	1.372	0.179	-0.282	1.450
Weight	-0.0154	0.316	-0.049	0.962	-0.659	0.628
Ht	-4.9902	1.644	-3.036	0.005	-8.335	-1.646
Seated	2.0463	3.413	0.600	0.553	-4.897	8.990

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	556.2553	90.670	6.135	0.000	372.367	740.144
Ht	-4.2650	0.535	-7.970	0.000	-5.350	-3.180