mypackages = c("leaps", "glmnet")
tmp = setdiff(mypackages, rownames(installed.packages()))
if (length(tmp) > 0) install.packages(tmp)
library(leaps) # regsubsets
library(glmnet) # glmnet for lasso
This example reproduces Simulation Example 1 of Zhao and Yu (2006), which shows that LASSO can fail to pick the correct variable set when the so-called Irrepresentable Condition is violated.
Consider a linear regression model with three predictors \(X_1, X_2,\) and \(X_3\), where the response variable \(Y\) is modeled as \[ Y = 2 \cdot X_1 + 3 \cdot X_2 + N(0, 1),\] that is, the last predictor \(X_3\) is irrelevant. The three predictors are generated as follows: \(X_1 \sim N(0,1)\), \(X_2 \sim N(0, 1)\), and \[ X_3 = \frac{2}{3} X_1 + \frac{2}{3} X_2 + N \left ( 0, \frac{1}{3^2} \right ). \] Generate \(n=1000\) samples from this model.
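Before generating the data, we can check that this design indeed violates the condition. In the notation of Zhao and Yu (2006), with \(C_{11}\) the covariance matrix of the relevant predictors \((X_1, X_2)\) and \(C_{21}\) the covariance of \(X_3\) with them, the (strong) Irrepresentable Condition requires \(\left| C_{21} C_{11}^{-1} \, \mathrm{sign}(\beta) \right| < 1\). Here is a minimal sketch using the population covariances above (this check is not part of the original code):
C11 = diag(2)          # Cov of (x1, x2): independent standard normals
C21 = c(2/3, 2/3)      # Cov of x3 with (x1, x2)
abs(C21 %*% solve(C11) %*% sign(c(2, 3)))
##          [,1]
## [1,] 1.333333
Since \(4/3 \geq 1\), the condition fails, so LASSO is not sign-consistent for this design.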
# set random seed in case you want to reproduce the result
set.seed(542)
n = 1000
x1 = rnorm(n)
x2 = rnorm(n)
e = rnorm(n)                       # noise used to construct x3
x3 = 2/3 * x1 + 2/3 * x2 + 1/3 * e # noise sd 1/3, i.e., variance 1/9
epsilon = rnorm(n)                 # model error for y
beta = c(2, 3)                     # true coefficients for x1 and x2
y = beta[1] * x1 + beta[2] * x2 + epsilon
myData = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3)
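As a quick sanity check (not in the original code), the sample covariance of the simulated predictors should be close to the population covariance: unit variances on the diagonal (note \(\mathrm{Var}(X_3) = (2/3)^2 + (2/3)^2 + 1/3^2 = 1\) as well), \(\mathrm{Cov}(X_1, X_3) = \mathrm{Cov}(X_2, X_3) = 2/3\), and \(\mathrm{Cov}(X_1, X_2) = 0\):
round(cov(cbind(x1, x2, x3)), 2)   # should approximate the population values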
IC = regsubsets(y ~ ., myData, method = "exhaustive")  # exhaustive best-subset search
sumIC = summary(IC)
sumIC$bic
## [1] -1488.766 -2581.920 -2575.242
sumIC
## Subset selection object
## Call: regsubsets.formula(y ~ ., myData, method = "exhaustive")
## 3 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## 1 subsets of each size up to 3
## Selection Algorithm: exhaustive
## x1 x2 x3
## 1 ( 1 ) " " " " "*"
## 2 ( 1 ) "*" "*" " "
## 3 ( 1 ) "*" "*" "*"
msize = apply(sumIC$which, 1, sum)           # model size, counting the intercept
AIC = n * log(sumIC$rss/n) + 2 * msize       # AIC based on the Gaussian log-likelihood
BIC = n * log(sumIC$rss/n) + log(n) * msize  # differs from sumIC$bic by a constant, so the rankings agree
AIC; BIC
## 1 2 3
## 1113.79020 15.72812 17.49869
## 1 2 3
## 1123.60571 30.45139 37.12971
Both AIC and BIC are minimized by \(Y \sim X_1 + X_2\); that is, both procedures choose the correct sub-model.
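In case you want the chosen model as a fitted object, here is a small sketch (not in the original code) that extracts the BIC-best subset from sumIC and refits it with lm:
best = which.min(BIC)                         # the two-variable model
vars = names(which(sumIC$which[best, ]))[-1]  # predictor names, dropping the intercept
lm(reformulate(vars, response = "y"), data = myData)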
Next, we fit LASSO with cross-validation and examine the coefficients at both lambda.1se and lambda.min to check whether either choice selects the correct model.
mylasso.cv = cv.glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))  # 10-fold CV by default
plot(mylasso.cv)
coef(mylasso.cv, s = 'lambda.1se')
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.05015789
## x1 1.60443649
## x2 2.69940979
## x3 0.36282687
coef(mylasso.cv, s = 'lambda.min')
## 4 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.04624025
## x1 1.87702903
## x2 2.97065701
## x3 0.09587310
Neither choice selects the correct model: \(X_3\) retains a nonzero coefficient in both fits. The following code produces two solution-path plots for LASSO: the x-coordinate is the L1 norm of the coefficients in the “norm” plot and \(\log \lambda\) in the “lambda” plot. You can see that \(X_3\) is not dropped from the model unless \(\log \lambda\) goes to \(-\infty\) (equivalently, \(\lambda \to 0\)). Thus it is impossible for LASSO to select the correct model at any value of \(\lambda\).
mylasso = glmnet(as.matrix(myData[, c('x1', 'x2', 'x3')]), as.vector(myData$y))
par(mfrow = c(2, 1))
plot(mylasso, label = TRUE, xvar = "norm")    # coefficient paths against the L1 norm
plot(mylasso, label = TRUE, xvar = "lambda")  # coefficient paths against log-lambda
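As a numerical companion to the plots (a sketch based on the mylasso path object above, not part of the original code), we can scan every \(\lambda\) on the fitted path and ask whether the support ever equals exactly \(\{X_1, X_2\}\):
supp = as.matrix(coef(mylasso))[-1, ] != 0        # nonzero pattern at each lambda
any(supp["x1", ] & supp["x2", ] & !supp["x3", ])  # expected FALSE: x3 never leaves once it enters
This matches the picture from the path plots: \(X_3\) enters the model first and stays in it along the entire path.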