PSL: Ridge and Lasso

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as lm
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV
from sklearn.metrics import mean_squared_error

Load Data

The Prostate data can be found at https://hastie.su.domains/ElemStatLearn/datasets/prostate.data. The data were collected to examine the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy.

  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: in years
  • lbph: log of the amount of benign prostatic hyperplasia
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason score 4 or 5
  • lpsa: log of prostate-specific antigen (the response)

The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the “training set” and which 30 as the test set, as described on page 48 of the ESL book. For convenience, we store the response in a variable named Y.

In [2]:
myData = pd.read_csv("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data", sep='\t')
myData = myData.drop(columns=[myData.columns[0], myData.columns[10]])
myData.head()
Out[2]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa
0 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783
1 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519
2 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519
3 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519
4 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564
In [25]:
myData.shape
Out[25]:
(97, 9)
In [3]:
X = myData.iloc[:, :-1]
Y = myData.iloc[:, -1]
X.shape, len(Y)
Out[3]:
((97, 8), 97)

Train/Test Split

Split the data into two parts: 80% for training and 20% for testing. Throughout, we use the training data to select variables and estimate coefficients, and we report the average MSE on the test data.

In [4]:
X_train, X_test , Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)

Full Model

Fit an ordinary linear regression model using all variables.

In [5]:
full_model = lm()
full_model.fit(X_train, Y_train)
# averaged MSE on the test set
mean_squared_error(Y_test, full_model.predict(X_test))
Out[5]:
0.40319945539779695

Ridge Regression

Plot the ridge coefficients against a grid of alpha values. It is important to note that the alpha parameter in scikit-learn's Ridge plays the role of lambda in R's glmnet; in glmnet, alpha denotes the elastic-net mixing parameter, which is a different quantity.
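
To make the correspondence concrete, here is a minimal sketch on simulated data (assuming fit_intercept=False and no normalization, so the closed-form ridge solution applies) verifying that alpha multiplies the squared L2 penalty, i.e. Ridge solves min ||y - Xb||^2 + alpha * ||b||^2:

In [ ]:
# Sketch: Ridge's alpha is the lambda of the penalized least-squares objective.
# With no intercept and no normalization, b = (X'X + alpha*I)^{-1} X'y.
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(50, 3))
y_sim = X_sim @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

lam = 2.0
b_closed = np.linalg.solve(X_sim.T @ X_sim + lam * np.eye(3), X_sim.T @ y_sim)
b_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X_sim, y_sim).coef_
np.allclose(b_closed, b_ridge)   # expected: True, up to solver tolerance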

In [6]:
alphas = np.logspace(-3, 3, 100)
ridge = Ridge(normalize = True)
coefs = []

for a in alphas:
    ridge.set_params(alpha = a)
    ridge.fit(X_train, Y_train)
    coefs.append(ridge.coef_)
    
np.shape(coefs)
Out[6]:
(100, 8)
In [7]:
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Ridge coefficients as a function of the regularization')
plt.axis('tight')
plt.show()
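
A practical caveat: the normalize argument used in these cells was deprecated and later removed in newer scikit-learn releases, so the code above may error on a recent installation. A rough substitute (a sketch, not an exact reproduction, since normalize=True scaled each centered column by its L2 norm rather than its standard deviation) is to standardize explicitly in a pipeline:

In [ ]:
# Approximate replacement for Ridge(normalize=True) on newer scikit-learn:
# standardize the predictors, then fit Ridge on the scaled design.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge_pipe.fit(X_train, Y_train)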

Tune alpha by 10-fold CV with MSE

In [8]:
ridgecv = RidgeCV(alphas = alphas, cv = 10, 
                  scoring = 'neg_mean_squared_error', 
                  normalize = True)
ridgecv.fit(X_train, Y_train)
ridgecv.alpha_
Out[8]:
0.26560877829466867

Evaluate prediction performance

In [9]:
ridge_model = Ridge(alpha = ridgecv.alpha_, normalize = True)
ridge_model.fit(X_train, Y_train)
mean_squared_error(Y_test, ridge_model.predict(X_test))
Out[9]:
0.5057873761573645

Note that none of the ridge coefficient estimates is exactly zero.

In [10]:
pd.Series(ridge_model.coef_, index = X.columns)
Out[10]:
lcavol     0.328243
lweight    0.491542
age       -0.005151
lbph       0.068130
svi        0.695180
lcp        0.061868
gleason    0.164950
pgg45      0.002653
dtype: float64

CV Path Plot

Generate a cross-validation path plot similar to the one produced by R's glmnet. RidgeCV does not expose the per-fold errors for each alpha (unlike LassoCV's mse_path_, used below), so we run the 10-fold cross-validation for Ridge regression ourselves.

In [11]:
from sklearn.model_selection import cross_val_score
In [12]:
mean_mse = []
std_mse = []

# Loop through each alpha and perform cross-validation
for alpha in alphas:
    ridge = Ridge(alpha=alpha, normalize = True)
    mse_scores = -cross_val_score(ridge, X_train, Y_train, cv=10, scoring='neg_mean_squared_error')  # 10-fold CV
    mean_mse.append(np.mean(mse_scores))
    std_mse.append(np.std(mse_scores) / np.sqrt(10))
In [13]:
# Identify best alpha and one standard error rule alpha
min_idx = np.argmin(mean_mse)
alpha_min = alphas[min_idx]

# Compute the threshold: the minimum mean CV MSE plus one standard error
threshold = mean_mse[min_idx] + std_mse[min_idx]

# The 1-SE alpha is the largest alpha whose mean CV MSE is within one SE of the minimum
one_se_rule_idx = np.where(np.asarray(mean_mse) <= threshold)[0][-1]
alpha_1se = alphas[one_se_rule_idx]

alpha_min, alpha_1se
Out[13]:
(0.26560877829466867, 1.629750834620645)
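
The one-standard-error rule above can also be packaged in a small helper (a sketch; the name one_se_alpha is illustrative), which reproduces the pair above and can be reused for the lasso below:

In [ ]:
def one_se_alpha(alpha_grid, mean_mse_arr, se_mse_arr):
    """Return (alpha_min, alpha_1se): the alpha minimizing the mean CV MSE, and
    the largest alpha whose mean CV MSE is within one SE of that minimum."""
    alpha_grid = np.asarray(alpha_grid)
    mean_mse_arr = np.asarray(mean_mse_arr)
    se_mse_arr = np.asarray(se_mse_arr)
    i = np.argmin(mean_mse_arr)
    threshold = mean_mse_arr[i] + se_mse_arr[i]
    return alpha_grid[i], alpha_grid[mean_mse_arr <= threshold].max()

one_se_alpha(alphas, mean_mse, std_mse)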
In [14]:
mean_mse = np.array(mean_mse)
std_mse = np.array(std_mse)

plt.figure(figsize=(10, 6))
plt.semilogx(alphas, mean_mse, label='Mean MSE', linewidth=2)
plt.fill_between(alphas, mean_mse - std_mse, mean_mse + std_mse, alpha=0.2)
plt.axvline(alpha_min, linestyle='--', color='r', label=f'Best alpha: {alpha_min:.2e}')
plt.axvline(alpha_1se, linestyle='--', color='g', label=f'alpha (1 SE): {alpha_1se:.2e}')
plt.xlabel('Alpha (Log Scale)')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.title('Ridge Cross-Validation Path')
plt.show()
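
For comparison with the lasso workflow below, the 1-SE ridge model can also be evaluated on the test set; a minimal sketch (the name ridge_1se_model is illustrative):

In [ ]:
# Refit ridge at alpha_1se and compute the test MSE
ridge_1se_model = Ridge(alpha = alpha_1se, normalize = True)
ridge_1se_model.fit(X_train, Y_train)
mean_squared_error(Y_test, ridge_1se_model.predict(X_test))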

Lasso Regression

In [15]:
alphas = np.logspace(-4, -1, 100)
lasso = Lasso(normalize = True)
coefs = []

for a in alphas:
    lasso.set_params(alpha = a)
    lasso.fit(X_train, Y_train)
    coefs.append(lasso.coef_)
    
np.shape(coefs)
Out[15]:
(100, 8)
In [16]:
ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale('log')
plt.xlabel('Alpha')
plt.ylabel('Coefficients')
plt.title('Lasso coefficients as a function of the regularization')
plt.axis('tight')
plt.show()

Tune alpha by 10-fold CV with MSE

In [17]:
lassocv = LassoCV(alphas = alphas, cv = 10, 
                  normalize = True)
lassocv.fit(X_train, Y_train)
lassocv.alpha_
Out[17]:
0.00572236765935022

Evaluate prediction performance

In [18]:
lasso_model = Lasso(alpha = lassocv.alpha_, normalize = True)
lasso_model.fit(X_train, Y_train)
mean_squared_error(Y_test, lasso_model.predict(X_test))
Out[18]:
0.43426555165941305

Note that some of the lasso coefficient estimates are exactly zero.

In [19]:
pd.Series(lasso_model.coef_, index = X.columns)
Out[19]:
lcavol     0.428929
lweight    0.469259
age       -0.000000
lbph       0.052529
svi        0.775122
lcp        0.000000
gleason    0.133524
pgg45      0.001010
dtype: float64

Variables that are not selected.

In [21]:
X.columns[np.where(lasso_model.coef_ == 0)]
Out[21]:
Index(['age', 'lcp'], dtype='object')

In some runs, the optimal alpha lands on the boundary of the grid, for example at alphas[0]. In that case, we should expand the range of alphas and re-run LassoCV.
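
A minimal sketch of such a check (the widened grid endpoints are illustrative only):

In [ ]:
# If the selected alpha sits at either end of the grid, widen the grid and refit
if lassocv.alpha_ in (alphas.min(), alphas.max()):
    wide_alphas = np.logspace(-6, 1, 200)
    lassocv = LassoCV(alphas = wide_alphas, cv = 10, normalize = True)
    lassocv.fit(X_train, Y_train)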

CV Path Plot

Generate a cross-validation path plot similar to the one from R's glmnet.

In [22]:
lassocv = LassoCV(alphas=alphas, cv=10, normalize = True)
lassocv.fit(X_train, Y_train)

lassocv.alpha_
Out[22]:
0.00572236765935022
In [23]:
# Extract the mean and standard error of the CV MSE across the 10 folds
mean_mse = np.mean(lassocv.mse_path_, axis=1)
std_mse = np.std(lassocv.mse_path_, axis=1) / np.sqrt(10) 
In [28]:
# Extract alphas from the CV output
# Note: they may not be in the same order as the alpha sequence you provide to LassoCV
cv_alphas = lassocv.alphas_
# Identify best alpha and one standard error rule alpha
min_idx = np.argmin(mean_mse)
alpha_min = cv_alphas[min_idx]

# Compute the threshold: the minimum mean CV MSE plus one standard error
threshold = mean_mse[min_idx] + std_mse[min_idx]

# The 1-SE alpha is the largest alpha whose mean CV MSE is within one SE of the minimum
alpha_1se = max(cv_alphas[np.where(mean_mse <= threshold)])

alpha_min, alpha_1se  #alpha_min = lassocv.alpha_
Out[28]:
(0.00572236765935022, 0.02848035868435802)
In [29]:
mean_mse = np.array(mean_mse)
std_mse = np.array(std_mse)

plt.figure(figsize=(10, 6))
plt.semilogx(cv_alphas, mean_mse, label='Mean MSE', linewidth=2)
plt.fill_between(cv_alphas, mean_mse - std_mse, mean_mse + std_mse, alpha=0.2)
plt.axvline(alpha_min, linestyle='--', color='r', label=f'Best alpha: {alpha_min:.2e}')
plt.axvline(alpha_1se, linestyle='--', color='g', label=f'alpha (1 SE): {alpha_1se:.2e}')
plt.xlabel('Alpha (Log Scale)')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.title('Lasso Cross-Validation Path')
plt.show()

Evaluate prediction performance with alpha_1se

In [31]:
lasso_1se_model = Lasso(alpha = alpha_1se, normalize = True)
lasso_1se_model.fit(X_train, Y_train)
mean_squared_error(Y_test, lasso_1se_model.predict(X_test))
Out[31]:
0.44811214582804604

Refit to Reduce Bias

Because the lasso shrinks the retained coefficients toward zero, its estimates are biased. To reduce this bias, we refit an ordinary linear regression model using only the variables selected at alpha_1se and then use the refitted model for prediction.

In [32]:
lasso_1se_model.coef_
Out[32]:
array([0.39323023, 0.1665207 , 0.        , 0.        , 0.49792202,
       0.        , 0.        , 0.        ])
In [41]:
nonzero_indices = np.where(lasso_1se_model.coef_ != 0)[0]
lm_refit = lm()
lm_refit.fit(X_train.iloc[:, nonzero_indices], Y_train)
mean_squared_error(Y_test, lm_refit.predict(X_test.iloc[:, nonzero_indices]))
Out[41]:
0.3811396432678013