Stat542: Variable Selection

mypackages = c("leaps", "glmnet")   
tmp = setdiff(mypackages, rownames(installed.packages())) 
if (length(tmp) > 0) install.packages(tmp)

library(leaps)  # regsubsets
library(glmnet)  # glmnet for lasso

This example is from simulation example 1 of Zhao and Yu (2006), which shows that LASSO could fail to pick the correct variable set when the so-called Irrepresentable Condition is violated.

Consider a simple linear regression model with three predictors \(X_1, X_2,\) and \(X_3\), and the response variable \(Y\) is modeled as \[ Y = 2 \cdot X_1 + 3 \cdot X_2 + N(0, 1),\] that is, the last predictor \(X_3\) is irrelevant. The three predictors are generated as follows: \(X_1 \sim N(0,1)\), \(X_2 \sim N(0, 1)\), and \[ X_3 = \frac{2}{3} X_1 + \frac{2}{3} X_2 + N \left ( 0, \frac{1}{3^2} \right ). \] Generate \(n=1000\) samples from this model.

# set random seed in case you want to reproduce the result
set.seed(542)  
n = 1000
x1 = rnorm(n)
x1 = rnorm(n)
x2 = rnorm(n)
e = rnorm(n)
x3 = 2/3 * x1 + 2/3 * x2 + 1/3 * e
epsilon = rnorm(n)
beta = c(2, 3)
y = beta[1] * x1 + beta[2] * x2 + epsilon
myData = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)

Use AIC/BIC to select the best sub-model

IC = regsubsets(y ~ ., myData, method = "exhaustive")
sumIC = summary(IC)
sumIC$bic

## [1] -1488.766 -2581.920 -2575.242

sumIC

## Subset selection object
## Call: regsubsets.formula(y ~ ., myData, method = "exhaustive")
## 3 Variables  (and intercept)
##    Forced in Forced out
## x1     FALSE      FALSE
## x2     FALSE      FALSE
## x3     FALSE      FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
##          x1  x2  x3 
## 1  ( 1 ) " " " " "*"
## 2  ( 1 ) "*" "*" " "
## 3  ( 1 ) "*" "*" "*"

msize = apply(sumIC$which, 1, sum)
AIC = n * log(sumIC$rss/n) + 2 * msize
BIC = n * log(sumIC$rss/n) + log(n) * msize
AIC; BIC

##          1          2          3 
## 1113.79020   15.72812   17.49869

##          1          2          3 
## 1123.60571   30.45139   37.12971

The lowest AIC/BIC is given by \(Y \sim X_1 + X_2\), that is, these two procedures choose the correct sub-model.

Use LASSO

Next, we use LASSO with lambda.min and lambda.1se to check if it can select the correct model.

mylasso.cv = cv.glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))
plot(mylasso.cv)

coef(mylasso.cv, s = 'lambda.1se')

## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -0.05015789
## x1           1.60443649
## x2           2.69940979
## x3           0.36282687

coef(mylasso.cv, s = 'lambda.min')

## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -0.04624025
## x1           1.87702903
## x2           2.97065701
## x3           0.09587310

Neither approach selects the correct model. The following code provides two path plots of LASSO: the x-coordinate is the L1-norm of the coefficients in the “norm” plot and the x-coorindate is log-lambda in the “lambda” plot. You can see that \(X_3\) won’t be dropped out of the model unless log-lambda is going to \(-\infty\) (or equivalently \(\lambda \to 0\)). Thus it is impossible for LASSO to select the correct model.

mylasso = glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))
par(mfrow = c(2, 1))
plot(mylasso, label=TRUE, xvar = "norm")
plot(mylasso, label=TRUE, xvar = "lambda")

Compare prediction performance

Next we can compare the corresponding prediction error on a test set.

N = 1000
mytestData = matrix(0, N, 4)
colnames(mytestData) = c("x1", "x2", "x3", "y")
mytestData = as.data.frame(mytestData)

mytestData$x1 = rnorm(N)
mytestData$x2 = rnorm(N)
mytestData$x3 = 2/3 * x1 + 2/3 * x2 + 1/3 * rnorm(N)
mytestData$y = beta[1] * mytestData$x1 + 
  beta[2] * mytestData$x2 + rnorm(N)

For Lasso, you can form prediction using one of the following approaches:

Use the coefficients from lambda.1se;
Use the coefficients from lambda.min;
Refit a LS model using variables selected by lambda.1se;
Refit a LS model using variables selected by lambda.min.

In this example, lambda.1se and lambda.min select the same model, the full model.

There are two commands for Lasso prediction: one takes the output from cv.glmnet as input (where you can use lambda.min and lambda.1se), and the other one takes the output from glmnet as input.

The results below indicate that for this example, the test error from the Refit procedure is not bad even though the model is wrong.

?predict.cv.glmnet
?predict.glmnet

tmp = predict(mylasso.cv, s="lambda.min", 
              newx=data.matrix(mytestData)[, -4])
mean((mytestData$y - tmp)^2)

## [1] 1.065004

tmp = predict(mylasso.cv, s="lambda.1se", 
              newx=data.matrix(mytestData)[, -4])
mean((mytestData$y - tmp)^2)

## [1] 1.367122

myfit.full = lm(y ~ ., myData)
tmp = predict(myfit.full, newdata=mytestData)
mean((mytestData$y - tmp)^2)

## [1] 1.057659

myfit.AIC = lm(y ~ x1 + x2, myData)
tmp = predict(myfit.AIC, newdata=mytestData)
mean((mytestData$y - tmp)^2)

## [1] 1.060607

Stat542: Variable Selection – A Toy Example

Use AIC/BIC to select the best sub-model

Use LASSO

Compare prediction performance