PSL: Variable Subset Selection

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression as lm
from sklearn.metrics import mean_squared_error, r2_score
import itertools

Load Data

The Prostate data can be found at https://hastie.su.domains/ElemStatLearn/datasets/prostate.data. The data come from a study examining the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in men who were about to receive a radical prostatectomy.

  • lcavol: log cancer volume
  • lweight: log prostate weight
  • age: in years
  • lbph: log of the amount of benign prostatic hyperplasia
  • svi: seminal vesicle invasion
  • lcp: log of capsular penetration
  • gleason: Gleason score
  • pgg45: percent of Gleason score 4 or 5
  • lpsa: response

The first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the “training set” and which 30 as the test set, as described on page 48 of the ESL book. For convenience, we refer to the response column by the generic name Y.

In [6]:
myData = pd.read_csv("https://hastie.su.domains/ElemStatLearn/datasets/prostate.data", sep='\t')
myData.head()
Out[6]:
Unnamed: 0 lcavol lweight age lbph svi lcp gleason pgg45 lpsa train
0 1 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783 T
1 2 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519 T
2 3 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519 T
3 4 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519 T
4 5 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564 T
In [7]:
# Drop the first column (row IDs) and the last column (the 'train' indicator)
myData = myData.drop(columns=[myData.columns[0], myData.columns[10]])
myData.head()
Out[7]:
lcavol lweight age lbph svi lcp gleason pgg45 lpsa
0 -0.579818 2.769459 50 -1.386294 0 -1.386294 6 0 -0.430783
1 -0.994252 3.319626 58 -1.386294 0 -1.386294 6 0 -0.162519
2 -0.510826 2.691243 74 -1.386294 0 -1.386294 7 20 -0.162519
3 -1.203973 3.282789 58 -1.386294 0 -1.386294 6 0 -0.162519
4 0.751416 3.432373 62 -1.386294 0 -1.386294 6 0 0.371564
In [9]:
myData.shape
Out[9]:
(97, 9)
In [10]:
X = myData.iloc[:, :-1]
Y = myData.iloc[:, -1]
X.shape, len(Y)
Out[10]:
((97, 8), 97)

First, for each model size (i.e., the number of non-intercept features), use the function bestsubset below to find the subset of variables with the smallest RSS among all models of that size. For this dataset with 8 non-intercept features, this gives 8 candidate models, one per size.
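
As a quick sanity check on the cost of this exhaustive search: with 8 candidate features there are 2^8 - 1 = 255 non-empty subsets, and bestsubset fits a linear model to each of them. The snippet below simply confirms that count.

In [ ]:
from math import comb

p = len(X.columns)                                   # 8 candidate features
n_models = sum(comb(p, m) for m in range(1, p + 1))  # sum of C(8, m) over m = 1..8
print(n_models)                                      # 255 = 2**8 - 1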

In [11]:
def bestsubset(X, Y):
    RSS_list, numb_features, feature_list = [], [], []

    for m in range(1,len(X.columns) + 1):
        best_RSS = np.inf
        
        # Loop over all (n choose m) combinations of m columns, where n is the number of columns
        for combo in itertools.combinations(X.columns, m):
            tmp_X = X[list(combo)]
            tmp_model = lm()
            tmp_model.fit(tmp_X, Y)
            tmp_RSS = mean_squared_error(Y, tmp_model.predict(tmp_X)) * len(Y)
            
            if tmp_RSS < best_RSS:
                best_RSS = tmp_RSS
                best_varset = combo
        
        RSS_list.append(best_RSS)                  
        feature_list.append(best_varset)          
        numb_features.append(len(best_varset))   
            
    return pd.DataFrame({'msize': numb_features,
                         'RSS': RSS_list,
                         'features': feature_list})
In [27]:
rs = bestsubset(X, Y)
rs
Out[27]:
msize RSS features
0 1 58.914784 (lcavol,)
1 2 51.742176 (lcavol, lweight)
2 3 46.568436 (lcavol, lweight, svi)
3 4 45.595472 (lcavol, lweight, lbph, svi)
4 5 44.436682 (lcavol, lweight, age, lbph, svi)
5 6 43.775974 (lcavol, lweight, age, lbph, svi, pgg45)
6 7 43.107558 (lcavol, lweight, age, lbph, svi, lcp, pgg45)
7 8 43.058419 (lcavol, lweight, age, lbph, svi, lcp, gleason...

Second, compute the model selection score (AIC/BIC) for these 8 models and find the one that achieves the smallest score. Following the code below, AIC = n log(RSS/n) + 2 * msize and BIC = n log(RSS/n) + msize * log(n), where msize is the number of non-intercept features; both omit additive constants that are the same for every model and so do not affect the ranking.

In [28]:
n = len(Y)
Aic = n*np.log(rs.RSS/n) + 2*rs.msize
Bic = n*np.log(rs.RSS/n) + rs.msize*np.log(n)

fig = plt.figure(figsize = (16,6))
ax = fig.add_subplot(1, 2, 1)

ax.scatter(rs.msize, Aic, alpha = 1, color = 'blue' )
ax.set_xlabel('No. of Features')
ax.set_ylabel('AIC')
ax.set_title('AIC - Best subset selection')


ax = fig.add_subplot(1, 2, 2)
ax.scatter(rs.msize, Bic, alpha = .5, color = 'red' )
ax.set_xlabel('No. of Features')
ax.set_ylabel('BIC')
ax.set_title('BIC - Best subset selection')

plt.show()

For this particular data set, AIC and BIC end up selecting different models. The model selected by AIC is larger than the one selected by BIC, which is common. (AIC favors larger models while BIC favors smaller models.) Although the model selected by BIC does not have the smallest AIC score, its AIC score is very close to the smallest one.
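
To read off exactly which models the two criteria select, we can look up the rows of rs that minimize each score (Aic and Bic are pandas Series indexed like rs, so idxmin gives the matching row label). This should return the size-5 model for AIC and the size-3 model for BIC, consistent with the tables computed later in this notebook.

In [ ]:
# Rows of rs with the smallest AIC and BIC, respectively
print(rs.loc[Aic.idxmin(), ['msize', 'features']])
print(rs.loc[Bic.idxmin(), ['msize', 'features']])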

What are the 2nd and 3rd best models in terms of AIC/BIC?

This cannot be answered by looking at the AIC/BIC plots shown above. Instead, we need to modify the bestsubset function to return the top 3 models for each size.

In [40]:
def bestsubset_3(X, Y):
    RSS_dict, numb_features_dict, feature_list_dict = {}, {}, {}
    
    for m in range(1, len(X.columns) + 1):
        best_RSS = []
        
        # Loop over all (n choose m) combinations of m columns, where n is the number of columns
        for combo in itertools.combinations(X.columns, m):
            tmp_X = X[list(combo)]
            tmp_model = lm()
            tmp_model.fit(tmp_X, Y)
            tmp_RSS = mean_squared_error(Y, tmp_model.predict(tmp_X)) * len(Y)
            
            # Keep track of the top 3 models
            best_RSS.append((tmp_RSS, combo))
            best_RSS = sorted(best_RSS)[:3]
            
        # Store the best RSS and corresponding features for each m
        RSS_dict[m] = [x[0] for x in best_RSS]
        feature_list_dict[m] = [x[1] for x in best_RSS]
        numb_features_dict[m] = [len(x[1]) for x in best_RSS]
        
    return pd.DataFrame({'msize': numb_features_dict,
                         'RSS': RSS_dict,
                         'features': feature_list_dict})
In [42]:
rs_3 = bestsubset_3(X, Y)
n = len(Y)


# Unpack the lists into individual data points
msize_flat = []
Aic_flat = []
Bic_flat = []
features_flat = []

for index, row in rs_3.iterrows():
    msize_list = row['msize']
    RSS_list = row['RSS']
    features_list = row['features']  
    
    for msize, RSS, feat in zip(msize_list, RSS_list, features_list):
        Aic_value = n * np.log(RSS / n) + 2 * msize
        Bic_value = n * np.log(RSS / n) + msize * np.log(n)
        
        msize_flat.append(msize)
        Aic_flat.append(Aic_value)
        Bic_flat.append(Bic_value)
        features_flat.append(feat)
In [43]:
fig = plt.figure(figsize=(12, 6))

# Plot for AIC
ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(msize_flat, Aic_flat, alpha=1, facecolors='none', edgecolors='black')
ax1.set_xlabel('No. of Features')
ax1.set_ylabel('AIC')
ax1.set_title('AIC - Best subset selection')


# Plot for BIC
ax2 = fig.add_subplot(1, 2, 2)
ax2.scatter(msize_flat, Bic_flat, alpha=1, facecolors='none', edgecolors='black')
ax2.set_xlabel('No. of Features')
ax2.set_ylabel('BIC')
ax2.set_title('BIC - Best subset selection')

plt.show()
In [57]:
# Zoom in on models with more than two features (and restrict the y-axis) to compare the top-3 scores per size more easily
msize_filtered = [m for i, m in enumerate(msize_flat) if m > 2]
Aic_filtered = [a for i, a in enumerate(Aic_flat) if msize_flat[i] > 2]
Bic_filtered = [b for i, b in enumerate(Bic_flat) if msize_flat[i] > 2]

fig = plt.figure(figsize=(12, 6))

# Plot for AIC
ax1 = fig.add_subplot(1, 2, 1)
ax1.scatter(msize_filtered, Aic_filtered, alpha=1, facecolors='none', edgecolors='black')
ax1.set_xlabel('No. of Features')
ax1.set_ylabel('AIC')
ax1.set_title('AIC - Best subset selection')
ax1.set_ylim(-67, -62)


# Plot for BIC
ax2 = fig.add_subplot(1, 2, 2)
ax2.scatter(msize_filtered, Bic_filtered, alpha=1, facecolors='none', edgecolors='black')
ax2.set_xlabel('No. of Features')
ax2.set_ylabel('BIC')
ax2.set_title('BIC - Best subset selection')
ax2.set_ylim(-60, -50)

plt.show()

Create a DataFrame to easily manage and sort the data. Then find the top three models by AIC or BIC.

In [50]:
df = pd.DataFrame({
    'msize': msize_flat,
    'Aic': Aic_flat,
    'Bic': Bic_flat,
    'features': features_flat
})

# Sort the DataFrame by Aic to get the top 3 models
top3_Aic = df.sort_values(by='Aic').head(3)
print(top3_Aic[['msize','features']].to_string(index=False))
 msize                           features
     5  (lcavol, lweight, age, lbph, svi)
     4       (lcavol, lweight, lbph, svi)
     3             (lcavol, lweight, svi)
In [51]:
top3_Bic = df.sort_values(by='Bic').head(3)
print(top3_Bic[['msize','features']].to_string(index=False))
 msize                      features
     3        (lcavol, lweight, svi)
     4  (lcavol, lweight, lbph, svi)
     4   (lcavol, lweight, age, svi)

Stepwise Selection

The function computeAIC computes AIC or BIC for a linear regression model with design matrix X and response vector Y; we can provide different "k" values to select between AIC (k=2, the default value) and BIC (k = np.log(n)).

The function stepAIC runs stepwise subset selection based on AIC or BIC. The algorithm starts with the variable set specified by features (the default is the full model, i.e., all variables); at each step, it checks whether dropping a variable from features or adding a new variable to features would improve the AIC/BIC. If no such move helps (i.e., the current model already has the smallest AIC/BIC among its neighbors), the algorithm halts.

The stepAIC function can easily be modified to carry out forward-only or backward-only variable selection; a sketch of a backward-only variant is given right after the function definition below.

Note that stepAIC is written with the argument X being a pandas DataFrame and the argument features being a pandas Index.

In [52]:
type(X), type(X.columns)
Out[52]:
(pandas.core.frame.DataFrame, pandas.core.indexes.base.Index)
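
In particular, a pandas Index supports the drop method that stepAIC relies on when removing a variable; drop returns a new Index without that label and leaves the original unchanged. A minimal illustration:

In [ ]:
X.columns.drop('age')   # a new Index containing the other 7 column names; X.columns itself is unchanged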
In [53]:
def computeAIC(X, Y, k=2):
    n = len(Y)
    model = lm()
    model.fit(X, Y)
    RSS = mean_squared_error(Y, model.predict(X)) * len(Y)
    return n*np.log(RSS/n) + k*X.shape[1]

def stepAIC(X, Y, features = X.columns, AIC = True):
    AIC_list, action_list, feature_list = [], [], []
    best_AIC = np.inf
    best_action = ' '
                             
    n = len(Y)
    if AIC:
        k = 2
    else:
        k = np.log(n)
    AIC = computeAIC(X[features], Y, k)
    
    while(AIC < best_AIC):                     
        AIC_list.append(AIC)
        feature_list.append(list(features))
        action_list.append(best_action)
        best_AIC = AIC                     
                             
        tmp_AIC_list, tmp_action_list, tmp_feature_list = [], [], []
        
        for p in features:
            tmp_features = features.drop(p)
            tmp_AIC = computeAIC(X[tmp_features], Y, k)
            tmp_AIC_list.append(tmp_AIC)
            tmp_feature_list.append(tmp_features)
            tmp_action_list.append('- ' + p)
        
        remaining_features = [p for p in X.columns if p not in features]
        for p in remaining_features:
            tmp_features = pd.Index(list(features) + [p])  # keep features a pandas Index so .drop still works on later passes
            tmp_AIC = computeAIC(X[tmp_features], Y, k)
            tmp_AIC_list.append(tmp_AIC)
            tmp_feature_list.append(tmp_features) 
            tmp_action_list.append('+ ' + p)
        
        best_model = np.array(tmp_AIC_list).argmin()
        AIC = tmp_AIC_list[best_model]
        features = tmp_feature_list[best_model]
        best_action = tmp_action_list[best_model]
    
    return pd.DataFrame({'AIC': AIC_list,
                         'action': action_list,
                         'features': feature_list})
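
As mentioned above, a backward-only version simply omits the "add a variable" loop. Below is a minimal sketch of such a variant (the helper name stepAIC_backward is introduced here for illustration); it reuses computeAIC and should trace the same path as the stepwise AIC run below, since every move chosen there is a removal.

In [ ]:
def stepAIC_backward(X, Y, AIC=True):
    # Backward elimination only: start from the full model and, at each step,
    # drop the variable whose removal most improves AIC/BIC; stop as soon as
    # no removal improves the score.
    n = len(Y)
    k = 2 if AIC else np.log(n)
    features = X.columns
    score = computeAIC(X[features], Y, k)
    path = [(score, ' ', list(features))]

    while len(features) > 1:
        candidates = [(computeAIC(X[features.drop(p)], Y, k), p) for p in features]
        best_score, best_p = min(candidates)
        if best_score >= score:
            break
        score, features = best_score, features.drop(best_p)
        path.append((score, '- ' + best_p, list(features)))

    return pd.DataFrame(path, columns=['AIC', 'action', 'features'])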

Stepwise AIC

In [54]:
myout = stepAIC(X, Y)
In [55]:
myout
Out[55]:
AIC action features
0 -62.778861 [lcavol, lweight, age, lbph, svi, lcp, gleason...
1 -64.668226 - gleason [lcavol, lweight, age, lbph, svi, lcp, pgg45]
2 -65.175708 - lcp [lcavol, lweight, age, lbph, svi, pgg45]
3 -65.722631 - pgg45 [lcavol, lweight, age, lbph, svi]

The output from stepwise AIC can be summarized as follows, starting with the full model:

  • At step 1, choose to remove "gleason"
  • At step 2, choose to remove "lcp"
  • At step 3, choose to remove "pgg45".
  • Then the algorithm halts.
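
The final model, (lcavol, lweight, age, lbph, svi), is the same size-5 model that minimized AIC in the best-subset search. A quick check, assuming myout and top3_Aic from the cells above are still in memory:

In [ ]:
stepwise_final = set(myout.features.iloc[-1])     # last model on the stepwise AIC path
bestsubset_best = set(top3_Aic.features.iloc[0])  # best model by AIC from the exhaustive search
print(stepwise_final == bestsubset_best)          # expected: True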

Stepwise BIC

In [56]:
myout = stepAIC(X, Y, AIC=False)
myout
Out[56]:
AIC action features
0 -42.181173 [lcavol, lweight, age, lbph, svi, lcp, gleason...
1 -46.645249 - gleason [lcavol, lweight, age, lbph, svi, lcp, pgg45]
2 -49.727442 - lcp [lcavol, lweight, age, lbph, svi, pgg45]
3 -52.849076 - pgg45 [lcavol, lweight, age, lbph, svi]
4 -54.926705 - age [lcavol, lweight, lbph, svi]
5 -57.453303 - lbph [lcavol, lweight, svi]

The output from stepwise BIC can be summarized as follows, starting with the full model:

  • At step 1, choose to remove "gleason"
  • At step 2, choose to remove "lcp"
  • At step 3, choose to remove "pgg45"
  • At step 4, choose to remove "age"
  • At step 5, choose to remove "lbph"
  • Then the algorithm halts, leaving the model (lcavol, lweight, svi), which coincides with the model selected by BIC in the best-subset search above.