PSL: Linear Regression (I)

In this Notebook, we'll describe three different ways to fit a linear regression model in Python:

  • Scikit-Learn
  • NumPy
  • statsmodels
In [35]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load Data

The Prostate data can be downloaded from https://hastie.su.domains/ElemStatLearn/datasets/prostate.data (the ESL website). The data were collected to examine the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy.

  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: age in years
  • lbph: log of the amount of benign prostatic hyperplasia
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason scores 4 or 5
  • lpsa: log of prostate-specific antigen (the response)

The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the "training set" and which 30 as the test set, as described on page 48 in the ESL book.

In [36]:
myData = pd.read_table("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")

# Preview the first 5 lines of the loaded data 
myData.head()
Out[36]:
Unnamed: 0 lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
0 1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
1 2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
2 3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
3 4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T
4 5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T
In [37]:
myData.shape
Out[37]:
(97, 11)

Remove the index column (column 0) and the train indicator column (column 10). For convenience, we rename the response column lpsa to the generic name Y.

In [38]:
myData.drop(myData.columns[0], axis=1, inplace=True)   # drop the index column
myData.drop('train', axis=1, inplace=True)             # drop the train indicator
myData = myData.rename(columns={'lpsa': 'Y'})          # rename the response column
myData.head()
Out[38]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
0 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783
1 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519
2 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519
3 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519
4 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564

Check missing values

In [ ]:
myData.isnull().sum()

A quick summary of each column of myData.

In [39]:
myData.describe(include='all')
Out[39]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
count 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000
mean 1.350010 3.628943 63.865979 0.100356 0.216495 -0.179366 6.752577 24.381443 2.478387
std 1.178625 0.428411 7.445117 1.450807 0.413995 1.398250 0.722134 28.204035 1.154329
min -1.347074 2.374906 41.000000 -1.386294 0.000000 -1.386294 6.000000 0.000000 -0.430783
25% 0.512824 3.375880 60.000000 -1.386294 0.000000 -1.386294 6.000000 0.000000 1.731656
50% 1.446919 3.623007 65.000000 0.300105 0.000000 -0.798508 7.000000 15.000000 2.591516
75% 2.127041 3.876396 68.000000 1.558145 0.000000 1.178655 7.000000 40.000000 3.056357
max 3.821004 4.780383 79.000000 2.326302 1.000000 2.904165 9.000000 100.000000 5.582932

Produce a pair-wise scatter plot. Caution: a big figure.

In [ ]:
sns.pairplot(myData)

Fit a Linear Model

Next we describe how to fit a linear regression model in Python and how to obtain the LS prediction on new samples.

Our new samples are two male patients with median features (i.e., all covariates except age take the median value of the 97 samples); one is 50 years old and the other is 70 years old.

In [40]:
# Two "median patients": every covariate at its median over the 97 samples, ages 50 and 70
Xnew = pd.concat([myData.median()] * 2, axis=1).T
Xnew['age'] = [50, 70]
Xnew = Xnew.iloc[:, :-1]   # drop the response column Y
Xnew
Out[40]:
lcavol lweight age lbph svi lcp gleason pgg45
0 1.446919 3.623007 50 0.300105 0.0 -0.798508 7.0 15.0
1 1.446919 3.623007 70 0.300105 0.0 -0.798508 7.0 15.0

1. Use sklearn.linear_model

In [41]:
from sklearn.linear_model import LinearRegression as lm
X = myData.iloc[:, :-1]
y = myData['Y']

lmfit1 = lm()
lmfit1.fit(X, y)
coefs = lmfit1.coef_
intercept = lmfit1.intercept_

print(coefs)
print(intercept)
[ 0.56434128  0.62201979 -0.02124819  0.09671252  0.7616734  -0.10605094
  0.04922793  0.00445751]
0.18156084546895723

Predictions for new observations can be obtained in two ways:

  • Method 1: organize the new observations into a matrix (or vector) and compute intercept + Xnew @ coefs directly.

  • Method 2: structure the new observations as a data frame with the same column names as myData, then call the predict method.

In [42]:
print(intercept + Xnew @ coefs) 
print(lmfit1.predict(Xnew))
0    2.714454
1    2.289491
dtype: float64
[2.71445433 2.28949063]
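
As a quick sanity check (a small addition, not in the original notebook), scikit-learn also reports the R-square of the fit through the estimator's score method; it should agree with the R-squared reported by statsmodels in the next section (about 0.663).

In [ ]:
# R^2 of the LS fit on the training data
print(lmfit1.score(X, y))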

2. Use statsmodels.formula.api

You can use R-style formulas (such as Y ~ X1 + X2) to specify models.

In [43]:
import statsmodels.formula.api as smf
all_covariates = list(myData.columns)
all_covariates.remove('Y')
my_formula = "Y ~ " + "+".join(all_covariates)

lmfit2 = smf.ols(formula = my_formula, data = myData).fit()
lmfit2.summary()
Out[43]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.663
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 21.68
Date: Fri, 18 Aug 2023 Prob (F-statistic): 7.65e-18
Time: 23:28:19 Log-Likelihood: -98.248
No. Observations: 97 AIC: 214.5
Df Residuals: 88 BIC: 237.7
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1816 1.321 0.137 0.891 -2.443 2.806
lcavol 0.5643 0.088 6.425 0.000 0.390 0.739
lweight 0.6220 0.201 3.096 0.003 0.223 1.021
age -0.0212 0.011 -1.917 0.058 -0.043 0.001
lbph 0.0967 0.058 1.670 0.098 -0.018 0.212
svi 0.7617 0.241 3.158 0.002 0.282 1.241
lcp -0.1061 0.090 -1.180 0.241 -0.285 0.073
gleason 0.0492 0.155 0.317 0.752 -0.259 0.358
pgg45 0.0045 0.004 1.021 0.310 -0.004 0.013
Omnibus: 0.563 Durbin-Watson: 1.540
Prob(Omnibus): 0.755 Jarque-Bera (JB): 0.183
Skew: 0.017 Prob(JB): 0.913
Kurtosis: 3.210 Cond. No. 1.32e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [44]:
coefs = lmfit2.params.iloc[1:]
intercept = lmfit2.params.iloc[0]
print(coefs)
print(intercept)
lcavol     0.564341
lweight    0.622020
age       -0.021248
lbph       0.096713
svi        0.761673
lcp       -0.106051
gleason    0.049228
pgg45      0.004458
dtype: float64
0.18156084546896145

As before, predictions for new observations can be obtained either by direct matrix multiplication or with the predict method.

In [45]:
print(intercept + Xnew @ coefs)
print(lmfit2.predict(Xnew))
0    2.714454
1    2.289491
dtype: float64
0    2.714454
1    2.289491
dtype: float64

statsmodels.api vs statsmodels.formula.api

  • In statsmodels.formula.api, like in R, the intercept is automatically added to the model.
  • In statsmodels.api, we have to add a constant vector to X and Xnew.
In [29]:
import statsmodels.api as sm
X = myData.iloc[:, :-1]
X_const = sm.add_constant(X) # need to add the constant in the design matrix
newlmfit2 = sm.OLS(myData['Y'], X_const).fit()
#newlmfit2.summary()
In [32]:
Xnew_const = sm.add_constant(Xnew, has_constant='add')
print(newlmfit2.predict(Xnew_const))
print(Xnew_const @ newlmfit2.params)
0    2.714454
1    2.289491
dtype: float64
0    2.714454
1    2.289491
dtype: float64

3. Use NumPy

In [60]:
X = myData.iloc[:, :-1]
# Add a column of ones for the intercept
X_const = np.concatenate([np.ones((len(X), 1)), X], axis=1)

# Calculate coefficients using the normal equation
coefficients = np.linalg.inv(X_const.T @ X_const) @ X_const.T @ y
In [64]:
Xnew_const = np.concatenate([np.ones((len(Xnew), 1)), Xnew], axis=1)
print(Xnew_const @ coefficients)
[2.71445433 2.28949063]
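
The normal-equation approach works here because the design matrix has full column rank, but explicitly inverting X'X is numerically fragile in general. Here is a sketch (not in the original notebook) of the same fit using NumPy's built-in least-squares solver, which also handles rank-deficient designs:

In [ ]:
# SVD-based least-squares solve; beta should match `coefficients` above
beta, _, rank, _ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)
print(rank)   # 9 = intercept + 8 predictors, i.e., full column rank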

Compute Regression Outputs

Next, let's check how residual standard error, log-likelihood, and R-square are computed using the output from the second approach lmfit2.
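
For reference, with RSS $= \sum_i (y_i - \hat{y}_i)^2$, $n = 97$ observations, and $p = 8$ predictors, the quantities computed below are

  • residual standard error: $\hat{\sigma} = \sqrt{\mathrm{RSS}/(n - p - 1)}$
  • log-likelihood (Gaussian, evaluated at the MLE $\hat{\sigma}^2_{\mathrm{MLE}} = \mathrm{RSS}/n$): $-\frac{n}{2}\log(2\pi \, \mathrm{RSS}/n) - \frac{n}{2}$
  • R-square: $1 - \mathrm{RSS}/\sum_i (y_i - \bar{y})^2$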

In [66]:
residuals = y - lmfit2.predict()
n, p = len(residuals), 8
print(n, p)
97 8
In [67]:
sigma_hat = np.sqrt(np.sum(residuals**2)/(n - p -1)) # Residual Standard Error
sigma_hat
Out[67]:
0.6994999735015665
In [68]:
-(n/2)*np.log(2*np.pi*(np.sum(residuals**2)/n)) - n/2 # Log-Likelihood
Out[68]:
-98.24760705805136
In [69]:
1 - np.sum(residuals**2)/(myData.loc[:, 'Y'].var()*(n-1))  # R-square
Out[69]:
0.6633895654989244
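
The same three quantities are also stored on the fitted statsmodels results object; a quick cross-check (the attributes scale, llf, and rsquared are part of the statsmodels OLS results API):

In [ ]:
# Cross-check the hand computations against statsmodels
print(np.sqrt(lmfit2.scale))   # residual standard error; scale = RSS / (n - p - 1)
print(lmfit2.llf)              # log-likelihood
print(lmfit2.rsquared)         # R-square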

Rank Deficiency

When the design matrix (including the intercept column) does not have full rank, as we have learned in class, the LS coefficient vector is not unique.

In the following example, we add a redundant column, junk, to myData, and then fit a linear model using statsmodels.formula.api and sklearn.linear_model. Python returns just one of many equivalent LS solutions, along with a warning that the design matrix is singular. You can still use the fitted model for prediction, and the predictions are the same as before.
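
Before fitting, we can confirm the rank deficiency directly with np.linalg.matrix_rank; a small sketch (building the augmented design matrix by hand, since junk has not yet been added to myData):

In [ ]:
# Intercept + 8 predictors + a redundant column (lcavol + lweight):
# 10 columns, but rank 9
X_aug = np.column_stack([np.ones(len(myData)),
                         myData.iloc[:, :8],
                         myData.iloc[:, 0] + myData.iloc[:, 1]])
print(X_aug.shape, np.linalg.matrix_rank(X_aug))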

In [76]:
myData['junk'] = myData.iloc[:, 0] + myData.iloc[:, 1]
my_formula = my_formula + '+ junk'
tmpfit2 = smf.ols(formula = my_formula, data = myData).fit()
tmpfit2.summary()
Out[76]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.663
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 21.68
Date: Sat, 19 Aug 2023 Prob (F-statistic): 7.65e-18
Time: 00:04:42 Log-Likelihood: -98.248
No. Observations: 97 AIC: 214.5
Df Residuals: 88 BIC: 237.7
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1816 1.321 0.137 0.891 -2.443 2.806
lcavol 0.1689 0.098 1.728 0.088 -0.025 0.363
lweight 0.2266 0.143 1.585 0.117 -0.058 0.511
age -0.0212 0.011 -1.917 0.058 -0.043 0.001
lbph 0.0967 0.058 1.670 0.098 -0.018 0.212
svi 0.7617 0.241 3.158 0.002 0.282 1.241
lcp -0.1061 0.090 -1.180 0.241 -0.285 0.073
gleason 0.0492 0.155 0.317 0.752 -0.259 0.358
pgg45 0.0045 0.004 1.021 0.310 -0.004 0.013
junk 0.3955 0.067 5.882 0.000 0.262 0.529
Omnibus: 0.563 Durbin-Watson: 1.540
Prob(Omnibus): 0.755 Jarque-Bera (JB): 0.183
Skew: 0.017 Prob(JB): 0.913
Kurtosis: 3.210 Cond. No. 1.89e+17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.34e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [77]:
Xnew['junk'] = Xnew.iloc[:, 0] + Xnew.iloc[:, 1]
tmpfit2.predict(Xnew)
Out[77]:
0    2.714454
1    2.289491
dtype: float64
In [89]:
X = myData.drop(['Y'], axis=1)
y = myData['Y']
tmpfit1 = lm()
tmpfit1.fit(X, y)
print(tmpfit1.coef_)
print(tmpfit1.intercept_)
[ 0.16888759  0.2265661  -0.02124819  0.09671252  0.7616734  -0.10605094
  0.04922793  0.00445751  0.39545369]
0.18156084546894036
In [79]:
tmpfit1.predict(Xnew)
Out[79]:
array([2.71445433, 2.28949063])

How does Python compute the LS coefficients when the design matrix does not have full rank? Check the code below.

In [93]:
# Reconstruct the solution via the SVD of the (rank-deficient) design matrix
tmp = np.concatenate([np.ones((len(X), 1)), X], axis=1)
U, S, V = np.linalg.svd(tmp, full_matrices=False)
num_components = 9   # rank of the design matrix: 10 columns, but junk = lcavol + lweight
F = U[:, :num_components] @ np.diag(S[:num_components])    # reduced full-rank design
coefficients = np.linalg.inv(F.T @ F) @ F.T @ y            # LS fit in the reduced space

# Map back to the original coordinates: (intercept, other coefs)
V[:num_components, :].T @ coefficients
Out[93]:
array([ 0.18156085,  0.16888759,  0.2265661 , -0.02124819,  0.09671252,
        0.7616734 , -0.10605094,  0.04922793,  0.00445751,  0.39545369])