PSL: Linear Regression (I)

In this Notebook, we'll describe three different ways to fit a linear regression model in Python:

  • Scikit-Learn
  • NumPy
  • statsmodels
In [35]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Load Data

The Prostate data can be downloaded from https://hastie.su.domains/ElemStatLearn/datasets/prostate.data (the ESL website). The data were collected to examine the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy.

  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: age in years
  • lbph: log of the amount of benign prostatic hyperplasia
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason scores 4 or 5
  • lpsa: log of prostate-specific antigen (the response)

The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the "training set" and which 30 as the test set, as described on page 48 in the ESL book.

In [36]:
myData = pd.read_table("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")

# Preview the first 5 lines of the loaded data 
myData.head()
Out[36]:
Unnamed: 0 lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
0 1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
1 2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
2 3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
3 4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T
4 5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T
In [37]:
myData.shape
Out[37]:
(97, 11)

Remove the index column (column 0) and the train indicator column (column 10). For convenience, we rename the response column lpsa to the generic name Y.

In [38]:
myData.drop(myData.columns[0], axis=1, inplace=True)   # drop the index column
myData.drop('train', axis=1, inplace=True)             # drop the train indicator
myData = myData.rename(columns={'lpsa': 'Y'})          # rename the response column
myData.head()
Out[38]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
0 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783
1 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519
2 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519
3 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519
4 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564

Check missing values

In [ ]:
myData.isnull().sum()

A quick summary of each column of myData.

In [39]:
myData.describe(include='all')
Out[39]:
lcavol lweight age lbph svi lcp gleason pgg45 Y
count 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000 97.000000
mean 1.350010 3.628943 63.865979 0.100356 0.216495 -0.179366 6.752577 24.381443 2.478387
std 1.178625 0.428411 7.445117 1.450807 0.413995 1.398250 0.722134 28.204035 1.154329
min -1.347074 2.374906 41.000000 -1.386294 0.000000 -1.386294 6.000000 0.000000 -0.430783
25% 0.512824 3.375880 60.000000 -1.386294 0.000000 -1.386294 6.000000 0.000000 1.731656
50% 1.446919 3.623007 65.000000 0.300105 0.000000 -0.798508 7.000000 15.000000 2.591516
75% 2.127041 3.876396 68.000000 1.558145 0.000000 1.178655 7.000000 40.000000 3.056357
max 3.821004 4.780383 79.000000 2.326302 1.000000 2.904165 9.000000 100.000000 5.582932

Produce a pair-wise scatter plot. Caution: a big figure.

In [ ]:
sns.pairplot(myData)

Fit a Linear Model

Next we describe how to fit a linear regression model in Python and how to obtain the LS prediction on new samples.

Our new samples are two male patients with median features (i.e., all covariates except age take the median value of the 97 samples); one is 50 years old and the other is 70 years old.

In [40]:
# Two "median patients": every covariate at its median over the 97 samples, ages 50 and 70
Xnew = pd.concat([myData.median()] * 2, axis=1).T
Xnew['age'] = [50, 70]
Xnew = Xnew.iloc[:, :-1]   # drop the response column Y
Xnew
Out[40]:
lcavol lweight age lbph svi lcp gleason pgg45
0 1.446919 3.623007 50 0.300105 0.0 -0.798508 7.0 15.0
1 1.446919 3.623007 70 0.300105 0.0 -0.798508 7.0 15.0

1. Use sklearn.linear_model

In [41]:
from sklearn.linear_model import LinearRegression as lm
X = myData.iloc[:, :-1]
y = myData['Y']

lmfit1 = lm()
lmfit1.fit(X, y)
coefs = lmfit1.coef_
intercept = lmfit1.intercept_

print(coefs)
print(intercept)
[ 0.56434128  0.62201979 -0.02124819  0.09671252  0.7616734  -0.10605094
  0.04922793  0.00445751]
0.18156084546895723

Predictions for new observations can be obtained in two ways:

  • Method 1: organize the new observations into a matrix (or vector) and compute intercept + Xnew @ coefs directly.

  • Method 2: structure the new observations as a data frame with the same column names as myData, then call the predict method.

In [42]:
print(intercept + Xnew @ coefs) 
print(lmfit1.predict(Xnew))
0    2.714454
1    2.289491
dtype: float64
[2.71445433 2.28949063]
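
As a quick sanity check (a small addition, not in the original notebook), scikit-learn also reports the R-square of the fit through the estimator's score method; it should agree with the R-squared reported by statsmodels in the next section (about 0.663).

In [ ]:
# R^2 of the LS fit on the training data
print(lmfit1.score(X, y))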

2. Use statsmodels.formula.api

You can use R-style formulas (such as Y ~ X1 + X2) to specify models.

In [43]:
import statsmodels.formula.api as smf
all_covariates = list(myData.columns)
all_covariates.remove('Y')
my_formula = "Y ~ " + "+".join(all_covariates)

lmfit2 = smf.ols(formula = my_formula, data = myData).fit()
lmfit2.summary()
Out[43]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.663
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 21.68
Date: Fri, 18 Aug 2023 Prob (F-statistic): 7.65e-18
Time: 23:28:19 Log-Likelihood: -98.248
No. Observations: 97 AIC: 214.5
Df Residuals: 88 BIC: 237.7
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1816 1.321 0.137 0.891 -2.443 2.806
lcavol 0.5643 0.088 6.425 0.000 0.390 0.739
lweight 0.6220 0.201 3.096 0.003 0.223 1.021
age -0.0212 0.011 -1.917 0.058 -0.043 0.001
lbph 0.0967 0.058 1.670 0.098 -0.018 0.212
svi 0.7617 0.241 3.158 0.002 0.282 1.241
lcp -0.1061 0.090 -1.180 0.241 -0.285 0.073
gleason 0.0492 0.155 0.317 0.752 -0.259 0.358
pgg45 0.0045 0.004 1.021 0.310 -0.004 0.013
Omnibus: 0.563 Durbin-Watson: 1.540
Prob(Omnibus): 0.755 Jarque-Bera (JB): 0.183
Skew: 0.017 Prob(JB): 0.913
Kurtosis: 3.210 Cond. No. 1.32e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [44]:
coefs = lmfit2.params.iloc[1:]
intercept = lmfit2.params.iloc[0]
print(coefs)
print(intercept)
lcavol     0.564341
lweight    0.622020
age       -0.021248
lbph       0.096713
svi        0.761673
lcp       -0.106051
gleason    0.049228
pgg45      0.004458
dtype: float64
0.18156084546896145

As before, predictions for new observations can be obtained either by direct matrix multiplication or with the predict method.

In [45]:
print(intercept + Xnew @ coefs)
print(lmfit2.predict(Xnew))
0    2.714454
1    2.289491
dtype: float64
0    2.714454
1    2.289491
dtype: float64

statsmodels.api vs statsmodels.formula.api

  • In statsmodels.formula.api, like in R, the intercept is automatically added to the model.
  • In statsmodels.api, we have to add a constant vector to X and Xnew.
In [29]:
import statsmodels.api as sm
X = myData.iloc[:, :-1]
X_const = sm.add_constant(X) # need to add the constant in the design matrix
newlmfit2 = sm.OLS(myData['Y'], X_const).fit()
#newlmfit2.summary()
In [32]:
Xnew_const = sm.add_constant(Xnew, has_constant='add')
print(newlmfit2.predict(Xnew_const))
print(Xnew_const @ newlmfit2.params)
0    2.714454
1    2.289491
dtype: float64
0    2.714454
1    2.289491
dtype: float64

3. Use NumPy

In [60]:
X = myData.iloc[:, :-1]
# Add a column of ones for the intercept
X_const = np.concatenate([np.ones((len(X), 1)), X], axis=1)

# Calculate coefficients using the normal equation
coefficients = np.linalg.inv(X_const.T @ X_const) @ X_const.T @ y
In [64]:
Xnew_const = np.concatenate([np.ones((len(Xnew), 1)), Xnew], axis=1)
print(Xnew_const @ coefficients)
[2.71445433 2.28949063]
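
The normal-equation approach works here because the design matrix has full column rank, but explicitly inverting X'X is numerically fragile in general. Here is a sketch (not in the original notebook) of the same fit using NumPy's built-in least-squares solver, which also handles rank-deficient designs:

In [ ]:
# SVD-based least-squares solve; beta should match `coefficients` above
beta, _, rank, _ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)
print(rank)   # 9 = intercept + 8 predictors, i.e., full column rank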

Compute Regression Outputs

Next, let's check how residual standard error, log-likelihood, and R-square are computed using the output from the second approach lmfit2.
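
For reference, with RSS $= \sum_i (y_i - \hat{y}_i)^2$, $n = 97$ observations, and $p = 8$ predictors, the quantities computed below are

  • residual standard error: $\hat{\sigma} = \sqrt{\mathrm{RSS}/(n - p - 1)}$
  • log-likelihood (Gaussian, evaluated at the MLE $\hat{\sigma}^2_{\mathrm{MLE}} = \mathrm{RSS}/n$): $-\frac{n}{2}\log(2\pi \, \mathrm{RSS}/n) - \frac{n}{2}$
  • R-square: $1 - \mathrm{RSS}/\sum_i (y_i - \bar{y})^2$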

In [66]:
residuals = y - lmfit2.predict()
n, p = len(residuals), 8
print(n, p)
97 8
In [67]:
sigma_hat = np.sqrt(np.sum(residuals**2)/(n - p -1)) # Residual Standard Error
sigma_hat
Out[67]:
0.6994999735015665
In [68]:
-(n/2)*np.log(2*np.pi*(np.sum(residuals**2)/n)) - n/2 # Log-Likelihood
Out[68]:
-98.24760705805136
In [69]:
1 - np.sum(residuals**2)/(myData.loc[:, 'Y'].var()*(n-1))  # R-square
Out[69]:
0.6633895654989244
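
The same three quantities are also stored on the fitted statsmodels results object; a quick cross-check (the attributes scale, llf, and rsquared are part of the statsmodels OLS results API):

In [ ]:
# Cross-check the hand computations against statsmodels
print(np.sqrt(lmfit2.scale))   # residual standard error; scale = RSS / (n - p - 1)
print(lmfit2.llf)              # log-likelihood
print(lmfit2.rsquared)         # R-square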

Rank Deficiency

When the design matrix (including the intercept column) does not have full rank, as we have learned in class, the LS coefficient vector is not unique.

In the following example, we add a redundant column, junk, to myData, and then fit a linear model using statsmodels.formula.api and sklearn.linear_model. Python returns just one of many equivalent LS solutions, along with a warning that the design matrix is singular. You can still use the fitted model for prediction, and the predictions are the same as before.
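
Before fitting, we can confirm the rank deficiency directly with np.linalg.matrix_rank; a small sketch (building the augmented design matrix by hand, since junk has not yet been added to myData):

In [ ]:
# Intercept + 8 predictors + a redundant column (lcavol + lweight):
# 10 columns, but rank 9
X_aug = np.column_stack([np.ones(len(myData)),
                         myData.iloc[:, :8],
                         myData.iloc[:, 0] + myData.iloc[:, 1]])
print(X_aug.shape, np.linalg.matrix_rank(X_aug))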

In [76]:
myData['junk'] = myData.iloc[:, 0] + myData.iloc[:, 1]
my_formula = my_formula + '+ junk'
tmpfit2 = smf.ols(formula = my_formula, data = myData).fit()
tmpfit2.summary()
Out[76]:
OLS Regression Results
Dep. Variable: Y R-squared: 0.663
Model: OLS Adj. R-squared: 0.633
Method: Least Squares F-statistic: 21.68
Date: Sat, 19 Aug 2023 Prob (F-statistic): 7.65e-18
Time: 00:04:42 Log-Likelihood: -98.248
No. Observations: 97 AIC: 214.5
Df Residuals: 88 BIC: 237.7
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1816 1.321 0.137 0.891 -2.443 2.806
lcavol 0.1689 0.098 1.728 0.088 -0.025 0.363
lweight 0.2266 0.143 1.585 0.117 -0.058 0.511
age -0.0212 0.011 -1.917 0.058 -0.043 0.001
lbph 0.0967 0.058 1.670 0.098 -0.018 0.212
svi 0.7617 0.241 3.158 0.002 0.282 1.241
lcp -0.1061 0.090 -1.180 0.241 -0.285 0.073
gleason 0.0492 0.155 0.317 0.752 -0.259 0.358
pgg45 0.0045 0.004 1.021 0.310 -0.004 0.013
junk 0.3955 0.067 5.882 0.000 0.262 0.529
Omnibus: 0.563 Durbin-Watson: 1.540
Prob(Omnibus): 0.755 Jarque-Bera (JB): 0.183
Skew: 0.017 Prob(JB): 0.913
Kurtosis: 3.210 Cond. No. 1.89e+17


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.34e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [77]:
Xnew['junk'] = Xnew.iloc[:, 0] + Xnew.iloc[:, 1]
tmpfit2.predict(Xnew)
Out[77]:
0    2.714454
1    2.289491
dtype: float64
In [89]:
X = myData.drop(['Y'], axis=1)
y = myData['Y']
tmpfit1 = lm()
tmpfit1.fit(X, y)
print(tmpfit1.coef_)
print(tmpfit1.intercept_)
[ 0.16888759  0.2265661  -0.02124819  0.09671252  0.7616734  -0.10605094
  0.04922793  0.00445751  0.39545369]
0.18156084546894036
In [79]:
tmpfit1.predict(Xnew)
Out[79]:
array([2.71445433, 2.28949063])

How does Python compute the LS coefficients when the design matrix does not have full rank? Check the code below.

In [93]:
# Reconstruct the solution via the SVD of the (rank-deficient) design matrix
tmp = np.concatenate([np.ones((len(X), 1)), X], axis=1)
U, S, V = np.linalg.svd(tmp, full_matrices=False)
num_components = 9   # rank of the design matrix: 10 columns, but junk = lcavol + lweight
F = U[:, :num_components] @ np.diag(S[:num_components])    # reduced full-rank design
coefficients = np.linalg.inv(F.T @ F) @ F.T @ y            # LS fit in the reduced space

# Map back to the original coordinates: (intercept, other coefs)
V[:num_components, :].T @ coefficients
Out[93]:
array([ 0.18156085,  0.16888759,  0.2265661 , -0.02124819,  0.09671252,
        0.7616734 , -0.10605094,  0.04922793,  0.00445751,  0.39545369])