In this notebook, we describe three different ways to fit a linear regression model in Python: with sklearn.linear_model, with statsmodels, and with NumPy directly.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
The Prostate data can be found at https://hastie.su.domains/ElemStatLearn/datasets/prostate.data. The data come from a study of the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.
lcavol
: log cancer volume
lweight
: log prostate weight
age
: in years
lbph
: log of the amount of benign prostatic hyperplasia
svi
: seminal vesicle invasion
lcp
: log of capsular penetration
gleason
: a numeric vector
pgg45
: percent of Gleason score 4 or 5
lpsa
: response
The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the "training set" and which 30 as the test set, as described on page 48 of the ESL book.
myData = pd.read_table("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data")
# Preview the first 5 lines of the loaded data
myData.head()
myData.shape
Remove the index column (col 0) and the train indicator column (col 10), then run a quick summary of each column of the data. For convenience, we rename the response column with the generic name Y.
myData.drop(myData.columns[0], axis=1, inplace=True)
myData.drop('train', axis=1, inplace=True)
myData.rename(columns={'lpsa': 'Y'}, inplace=True)
myData.head()
Check for missing values.
myData.isnull().sum()
A quick summary of each column of myData.
myData.describe(include='all')
Produce a pair-wise scatter plot. Caution: a big figure.
sns.pairplot(myData)
Next we describe how to fit a linear regression model in Python and how to obtain the LS prediction on new samples.
Our new samples are two male patients with median features (i.e., all covariates except age
take the median value of the 97 samples); one is 50 years old and the other is 70 years old.
# Duplicate the column of medians (one copy per new patient) and transpose to two rows
Xnew = pd.concat([myData.median()] * 2, axis = 1)
Xnew = pd.DataFrame(Xnew).T
# Set the two ages, then drop the last column (the response Y)
Xnew['age'] = [50, 70]
Xnew = Xnew.iloc[:, :-1]
Xnew
sklearn.linear_model
from sklearn.linear_model import LinearRegression as lm
X = myData.iloc[:, :-1]
y = myData['Y']
lmfit1 = lm()
lmfit1.fit(X, y)
coefs = lmfit1.coef_
intercept = lmfit1.intercept_
print(coefs)
print(intercept)
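The coefficients are printed as a bare array in the same order as the columns of X. If you prefer them labeled by predictor name, here is a minimal sketch reusing coefs and X from above:
# Label each estimated coefficient with its predictor name
pd.Series(coefs, index=X.columns)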
Predicting responses for new observations can be accomplished in two ways:
Method 1: Organize the new observations into a matrix (or vector) newX and compute intercept + newX @ coefs.
Method 2: Alternatively, structure the new observations as a data frame with the same column names as myData, then call the predict method.
print(intercept + Xnew @ coefs)
print(lmfit1.predict(Xnew))
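As a quick sanity check (a small sketch reusing intercept, coefs, Xnew, and lmfit1 from above), the two methods should agree up to floating-point error:
np.allclose(intercept + Xnew @ coefs, lmfit1.predict(Xnew))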
statsmodels.formula.api
You can use R-style formulas (such as Y ~ X1 + X2) to specify models.
import statsmodels.formula.api as smf
all_covariates = list(myData.columns)
all_covariates.remove('Y')
my_formula = "Y ~ " + "+".join(all_covariates)
lmfit2 = smf.ols(formula = my_formula, data = myData).fit()
lmfit2.summary()
coefs = lmfit2.params.iloc[1:]
intercept = lmfit2.params.iloc[0]
print(coefs)
print(intercept)
Predicting responses for new observations can be accomplished through two methods.
print(intercept + Xnew @ coefs)
print(lmfit2.predict(Xnew))
statsmodels.api vs statsmodels.formula.api
With statsmodels.formula.api, as in R, the intercept is automatically added to the model. With statsmodels.api, we have to add a constant column to X and Xnew.
import statsmodels.api as sm
X = myData.iloc[:, :-1]
X_const = sm.add_constant(X) # need to add the constant in the design matrix
newlmfit2 = sm.OLS(myData['Y'], X_const).fit()
#newlmfit2.summary()
Xnew_const = sm.add_constant(Xnew, has_constant='add')
print(newlmfit2.predict(Xnew_const))
print(Xnew_const @ newlmfit2.params)
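Since both statsmodels interfaces fit the same model, their estimated parameters should match; a quick check assuming lmfit2 and newlmfit2 from above (both store the intercept first, followed by the covariates in the same order):
# Compare the two parameter vectors
np.allclose(lmfit2.params.values, newlmfit2.params.values)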
NumPy
X = myData.iloc[:, :-1]
# Add a column of ones for the intercept
X_const = np.concatenate([np.ones((len(X), 1)), X], axis=1)
# Calculate coefficients using the normal equation
coefficients = np.linalg.inv(X_const.T @ X_const) @ X_const.T @ y
Xnew_const = np.concatenate([np.ones((len(Xnew), 1)), Xnew], axis=1)
print(Xnew_const @ coefficients)
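Explicitly inverting X_const.T @ X_const works for this small, full-rank design, but a least-squares solver is generally more numerically stable. Here is a minimal sketch of the same computation with np.linalg.lstsq, reusing X_const, Xnew_const, and y from above:
# Solve the LS problem without forming the normal equations explicitly
coef_lstsq, _, _, _ = np.linalg.lstsq(X_const, y, rcond=None)
print(np.allclose(coefficients, coef_lstsq))
print(Xnew_const @ coef_lstsq)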
Next, let's check how residual standard error, log-likelihood, and R-square are computed using the output from the second approach, lmfit2.
residuals = y - lmfit2.predict()
n, p = len(residuals), 8
print(n, p)
sigma_hat = np.sqrt(np.sum(residuals**2)/(n - p -1)) # Residual Standard Error
sigma_hat
-(n/2)*np.log(2*np.pi*(np.sum(residuals**2)/n)) - n/2 # Log-Likelihood
1 - np.sum(residuals**2)/(myData.loc[:, 'Y'].var()*(n-1)) # R-square
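These hand-computed quantities can be checked against the values statsmodels reports; a quick sketch using attributes of lmfit2:
print(np.sqrt(lmfit2.mse_resid))   # residual standard error
print(lmfit2.llf)                  # log-likelihood
print(lmfit2.rsquared)             # R-square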
In cases where the design matrix, including the intercept, does not have full rank, the LS coefficient vector is not unique, as we have learned in class.
In the following example, we add a redundant column variable junk to myData and then fit a linear model using statsmodels.formula.api and sklearn.linear_model. Python returns just one of the many equivalent LS solutions, along with a warning that the design matrix is singular (or nearly so). You can still use the fitted model for prediction, and the prediction result is the same as before, as verified below.
myData['junk'] = myData.iloc[:, 0] + myData.iloc[:, 1]
my_formula = my_formula + '+ junk'
tmpfit2 = smf.ols(formula = my_formula, data = myData).fit()
tmpfit2.summary()
Xnew['junk'] = Xnew.iloc[:, 0] + Xnew.iloc[:, 1]
tmpfit2.predict(Xnew)
X = myData.drop(['Y'], axis=1)
y = myData['Y']
tmpfit1 = lm()
tmpfit1.fit(X, y)
print(tmpfit1.coef_)
print(tmpfit1.intercept_)
tmpfit1.predict(Xnew)
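Although the two rank-deficient fits may report different coefficient vectors, their predictions agree, since the LS fitted values are unique even when the coefficients are not. A quick check, reusing tmpfit1, tmpfit2, and the augmented Xnew:
np.allclose(tmpfit1.predict(Xnew), tmpfit2.predict(Xnew))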
How does Python compute the LS coefficients when the design matrix does not have full rank? Check the code below.
tmp = np.concatenate([np.ones((len(X), 1)), X], axis=1)
# Note: np.linalg.svd returns V transposed, i.e., the rows of V are the right singular vectors
U, S, V = np.linalg.svd(tmp, full_matrices=False)
num_components = 9   # rank of the design matrix: 10 columns, but 'junk' is a linear combination of two of them
F = U[:, :num_components] @ np.diag(S[:num_components])
coefficients = np.linalg.inv(F.T @ F) @ F.T @ y
# Map back to the original coordinates: (intercept, other coefs)
V[:num_components, :].T @ coefficients
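The same minimum-norm solution can also be obtained directly from np.linalg.lstsq, which likewise works through an SVD and discards (near-)zero singular values. A minimal sketch reusing tmp and y from above:
beta_min_norm, _, rank, sing_vals = np.linalg.lstsq(tmp, y, rcond=None)
print(rank)            # effective rank of the design matrix (should be 9 here)
print(beta_min_norm)   # should match V[:num_components, :].T @ coefficients above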