Data
Brief Description of the Diabetes Data (Efron et al,
2004).
Ten baseline variables: age
, sex
,
body mass index
, average blood pressure
, and
six blood serum measurements were obtained for each of
n = 442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.
data = read.table("https://liangfgithub.github.io/Data/diabetes.data", header=TRUE)
dim(data)
head(data)
Lasso
The purpose of this analysis is to demonstrate the following:
- Model size along the CV path plot is not monotonic;
- Lasso does not explore all possible sub-models, i.e., some variable
sets are not listed on the path plot.
library(glmnet)
set.seed(1234)
X = data.matrix(data[, -11])
Y = data$Y
mylasso = glmnet(X, Y, alpha = 1)
plot(mylasso, xvar="lambda",label=TRUE)

Note 1
Check how the model sizes vary along the path.
mylasso$df
When lambda (i.e., the regularization) decreases, the model size
first increases from 0 to 10 and then decreases to 9 and then back to
10. So it is not true that the # of variables always
increases when lambda decreases.
You can see how the coefficient of “S3” changes from non-zero to
zero, and then back to non-zero, which corresponds to that white block
in the right part of the matrix (see the image of the matrix).
lasso.coef = as.matrix(coef(mylasso))
# round(lasso.coef[, -c(1:60)], dig=2)
image(coef(mylasso))

lasso.varset
is an 11-by-88 matrix denoting variable
sets over 88 different values of lambda (denoted by s0, s1, …, s87).
lasso.coef = as.matrix(coef(mylasso))
lasso.varset = ifelse(lasso.coef==0, 0, 1)
lasso.varset
From the output above, we obtain the following summary of models on
the path of Lasso:
- s0: intercept only (size = 0)
- s1: (BMI, S5) enters (size = 2)
- s8: BP enters (size = 3)
- s12: S3 enters (size = 4)
- s22: SEX enters (size = 5)
- s26: S6 enters (size = 6)
- s29: S1 enters (size = 7)
- s42: S4 enters (size = 8)
- s56: S2 enters (size = 9)
- s57: AGE enters (size = 10)
- s67: S3 leaves (size = 9)
- s73: S3 re-enters (size = 10)
S3 is an interesting variable; it seems to be related to other S
variables. On Lasso path, S3 enters to size 4 model (4 non-intercept
predictors), and stays for a long time and only leaves briefly leading
to a size 9 model without S3.
Note 2
Next, you’ll see that the optimal model selected by AIC is not on the
path plot.
library(leaps)
RSSleaps = regsubsets(X, Y, nbest=1, nvmax=10)
sumleaps = summary(RSSleaps, matrix=T)
sumleaps
Subset selection object
10 Variables (and intercept)
Forced in Forced out
AGE FALSE FALSE
SEX FALSE FALSE
BMI FALSE FALSE
BP FALSE FALSE
S1 FALSE FALSE
S2 FALSE FALSE
S3 FALSE FALSE
S4 FALSE FALSE
S5 FALSE FALSE
S6 FALSE FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
AGE SEX BMI BP S1 S2 S3 S4 S5 S6
1 ( 1 ) " " " " "*" " " " " " " " " " " " " " "
2 ( 1 ) " " " " "*" " " " " " " " " " " "*" " "
3 ( 1 ) " " " " "*" "*" " " " " " " " " "*" " "
4 ( 1 ) " " " " "*" "*" "*" " " " " " " "*" " "
5 ( 1 ) " " "*" "*" "*" " " " " "*" " " "*" " "
6 ( 1 ) " " "*" "*" "*" "*" "*" " " " " "*" " "
7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
9 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
If you look at optimal size (without intercept) 4, 6, 7, 8 models
from regsubsets
, none of them has S3. That means that if
AIC/BIC ends up picking a model with size 4, 6, 7, or 8, then the
AIC/BIC optimal model is not on the path of Lasso.
You can check which model returned by AIC/BIC. Note that
msize0
denotes the model size including the intercept and
msize = msize0 -1
denotes the model size without the
intercept.
n = length(Y)
msize0 = apply(sumleaps$which, 1, sum);
msize = msize0 - 1
BIC = sumleaps$bic; BIC = BIC - min(BIC); BIC = BIC/max(BIC);
AIC = n*log(sumleaps$rss/n) + 2*msize0;
AIC = AIC - min(AIC); AIC = AIC/max(AIC);
cbind(msize, AIC, BIC)
msize AIC BIC
1 1 1.000000000 1.000000000
2 2 0.322396838 0.230742097
3 3 0.199478409 0.123584530
4 4 0.132534927 0.083241551
5 5 0.029652507 0.000000000
6 6 0.000000000 0.004169454
7 7 0.005806613 0.050664745
8 8 0.013262202 0.099128337
9 9 0.027634213 0.155847729
10 10 0.043603930 0.214474226
plot(range(msize), c(0, 0.4), type="n",
xlab="Model Size (with Intercept)", ylab="Model Selection Criteria")
points(msize, AIC, col="blue", pch= 1)
points(msize, BIC, col="black", pch = 2)
legend("topright", lty=rep(1,3), col=c("blue", "black"), legend=c("AIC", "BIC"))

LS0tCnRpdGxlOiAiU3RhdDU0MjogVmFyaWFibGUgU2VsZWN0aW9uIGZvciBEaWFiZXRlcyBEYXRhIgpkYXRlOiAiRmFsbCAyMDIyIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIHRoZW1lOiByZWFkYWJsZQogICAgdG9jOiBUUlVFCiAgICB0b2NfZmxvYXQ6IFRSVUUKLS0tCgoKIyMgRGF0YQoKKipCcmllZiBEZXNjcmlwdGlvbiBvZiB0aGUgRGlhYmV0ZXMgRGF0YSoqIChFZnJvbiBldCBhbCwgMjAwNCkuIAoKVGVuIGJhc2VsaW5lIHZhcmlhYmxlczogYGFnZWAsIGBzZXhgLCBgYm9keSBtYXNzIGluZGV4YCwgYGF2ZXJhZ2UgYmxvb2QgcHJlc3N1cmVgLCBhbmQgKipzaXgqKiBibG9vZCBzZXJ1bSBtZWFzdXJlbWVudHMgd2VyZSBvYnRhaW5lZCBmb3IgZWFjaCBvZiBuID0gNDQyIGRpYWJldGVzIHBhdGllbnRzLCBhcyB3ZWxsIGFzIHRoZSByZXNwb25zZSBvZiBpbnRlcmVzdCwgYSBxdWFudGl0YXRpdmUgbWVhc3VyZSBvZiBkaXNlYXNlIHByb2dyZXNzaW9uIG9uZSB5ZWFyIGFmdGVyIGJhc2VsaW5lLgoKCmBgYHtyfQpkYXRhID0gcmVhZC50YWJsZSgiaHR0cHM6Ly9saWFuZ2ZnaXRodWIuZ2l0aHViLmlvL0RhdGEvZGlhYmV0ZXMuZGF0YSIsIGhlYWRlcj1UUlVFKQpkaW0oZGF0YSkKaGVhZChkYXRhKQpgYGAKCiMjIExhc3NvIAoKVGhlIHB1cnBvc2Ugb2YgdGhpcyBhbmFseXNpcyBpcyB0byBkZW1vbnN0cmF0ZSB0aGUgZm9sbG93aW5nOiAKCi0gTW9kZWwgc2l6ZSBhbG9uZyB0aGUgQ1YgcGF0aCBwbG90IGlzIG5vdCBtb25vdG9uaWM7Ci0gTGFzc28gZG9lcyBub3QgZXhwbG9yZSBhbGwgcG9zc2libGUgc3ViLW1vZGVscywgaS5lLiwgc29tZSB2YXJpYWJsZSBzZXRzIGFyZSBub3QgbGlzdGVkIG9uIHRoZSBwYXRoIHBsb3QuIAoKYGBge3J9CmxpYnJhcnkoZ2xtbmV0KQpzZXQuc2VlZCgxMjM0KQpYID0gZGF0YS5tYXRyaXgoZGF0YVssIC0xMV0pClkgPSBkYXRhJFkKYGBgCgpgYGB7cn0KbXlsYXNzbyA9IGdsbW5ldChYLCBZLCBhbHBoYSA9IDEpCnBsb3QobXlsYXNzbywgeHZhcj0ibGFtYmRhIixsYWJlbD1UUlVFKQpgYGAKCiMjIE5vdGUgMQoKQ2hlY2sgaG93IHRoZSBtb2RlbCBzaXplcyB2YXJ5IGFsb25nIHRoZSBwYXRoLgoKYGBge3J9Cm15bGFzc28kZGYgIApgYGAKCldoZW4gbGFtYmRhIChpLmUuLCB0aGUgcmVndWxhcml6YXRpb24pIGRlY3JlYXNlcywgdGhlIG1vZGVsIHNpemUgZmlyc3QgaW5jcmVhc2VzIGZyb20gMCB0byAxMCBhbmQgdGhlbiBkZWNyZWFzZXMgdG8gOSBhbmQgdGhlbiBiYWNrIHRvIDEwLiBTbyBpdCBpcyAqKm5vdCB0cnVlKiogdGhhdCB0aGUgIyBvZiB2YXJpYWJsZXMgYWx3YXlzIGluY3JlYXNlcyB3aGVuIGxhbWJkYSBkZWNyZWFzZXMuIAoKCgpZb3UgY2FuIHNlZSBob3cgdGhlIGNvZWZmaWNpZW50IG9mIOKAnFMz4oCdIGNoYW5nZXMgZnJvbSBub24temVybyB0byB6ZXJvLCBhbmQgdGhlbiBiYWNrIHRvIG5vbi16ZXJvLCB3aGljaCBjb3JyZXNwb25kcyB0byB0aGF0IHdoaXRlIGJsb2NrIGluIHRoZSByaWdodCBwYXJ0IG9mIHRoZSBtYXRyaXggKHNlZSB0aGUgaW1hZ2Ugb2YgdGhlIG1hdHJpeCkuCgpgYGB7cn0KbGFzc28uY29lZiA9IGFzLm1hdHJpeChjb2VmKG15bGFzc28pKQojIHJvdW5kKGxhc3NvLmNvZWZbLCAtYygxOjYwKV0sIGRpZz0yKQppbWFnZShjb2VmKG15bGFzc28pKQpgYGAKCmBsYXNzby52YXJzZXRgIGlzIGFuIDExLWJ5LTg4IG1hdHJpeCBkZW5vdGluZyB2YXJpYWJsZSBzZXRzIG92ZXIgODggZGlmZmVyZW50IHZhbHVlcyBvZiBsYW1iZGEgKGRlbm90ZWQgYnkgczAsIHMxLCDigKYsIHM4NykuIAoKYGBge3J9Cmxhc3NvLmNvZWYgPSBhcy5tYXRyaXgoY29lZihteWxhc3NvKSkKbGFzc28udmFyc2V0ID0gaWZlbHNlKGxhc3NvLmNvZWY9PTAsIDAsIDEpCmxhc3NvLnZhcnNldApgYGAKCkZyb20gdGhlIG91dHB1dCBhYm92ZSwgd2Ugb2J0YWluIHRoZSBmb2xsb3dpbmcgc3VtbWFyeSBvZiBtb2RlbHMgb24gdGhlIHBhdGggb2YgTGFzc286CgotIHMwOiBpbnRlcmNlcHQgb25seSAoc2l6ZSA9IDApCi0gczE6IChCTUksIFM1KSBlbnRlcnMgKHNpemUgPSAyKQotIHM4OiBCUCBlbnRlcnMgKHNpemUgPSAzKQotIHMxMjogUzMgZW50ZXJzIChzaXplID0gNCkKLSBzMjI6IFNFWCBlbnRlcnMgKHNpemUgPSA1KQotIHMyNjogUzYgZW50ZXJzIChzaXplID0gNikKLSBzMjk6IFMxIGVudGVycyAoc2l6ZSA9IDcpCi0gczQyOiBTNCBlbnRlcnMgKHNpemUgPSA4KQotIHM1NjogUzIgZW50ZXJzIChzaXplID0gOSkKLSBzNTc6IEFHRSBlbnRlcnMgKHNpemUgPSAxMCkKLSBzNjc6IFMzIGxlYXZlcyAoc2l6ZSA9IDkpCi0gczczOiBTMyByZS1lbnRlcnMgKHNpemUgPSAxMCkKClMzIGlzIGFuIGludGVyZXN0aW5nIHZhcmlhYmxlOyBpdCBzZWVtcyB0byBiZSByZWxhdGVkIHRvIG90aGVyIFMgdmFyaWFibGVzLiBPbiBMYXNzbyBwYXRoLCBTMyBlbnRlcnMgdG8gc2l6ZSA0IG1vZGVsICg0IG5vbi1pbnRlcmNlcHQgcHJlZGljdG9ycyksIGFuZCBzdGF5cyBmb3IgYSBsb25nIHRpbWUgYW5kIG9ubHkgbGVhdmVzIGJyaWVmbHkgbGVhZGluZyB0byBhIHNpemUgOSBtb2RlbCB3aXRob3V0IFMzLgoKIyMgTm90ZSAyCgpOZXh0LCB5b3UnbGwgc2VlIHRoYXQgdGhlIG9wdGltYWwgbW9kZWwgc2VsZWN0ZWQgYnkgQUlDIGlzIG5vdCBvbiB0aGUgcGF0aCBwbG90LiAKCmBgYHtyfQpsaWJyYXJ5KGxlYXBzKQpSU1NsZWFwcyA9IHJlZ3N1YnNldHMoWCwgWSwgbmJlc3Q9MSwgbnZtYXg9MTApCnN1bWxlYXBzID0gc3VtbWFyeShSU1NsZWFwcywgbWF0cml4PVQpCnN1bWxlYXBzCmBgYAoKCklmIHlvdSBsb29rIGF0IG9wdGltYWwgc2l6ZSAod2l0aG91dCBpbnRlcmNlcHQpIDQsIDYsIDcsIDggbW9kZWxzIGZyb20gYHJlZ3N1YnNldHNgLCBub25lIG9mIHRoZW0gaGFzIFMzLiBUaGF0IG1lYW5zIHRoYXQgaWYgQUlDL0JJQyBlbmRzIHVwIHBpY2tpbmcgYSBtb2RlbCB3aXRoIHNpemUgNCwgNiwgNywgb3IgOCwgdGhlbiB0aGUgQUlDL0JJQyBvcHRpbWFsIG1vZGVsIGlzIG5vdCBvbiB0aGUgcGF0aCBvZiBMYXNzby4gCgpZb3UgY2FuIGNoZWNrIHdoaWNoIG1vZGVsIHJldHVybmVkIGJ5IEFJQy9CSUMuIE5vdGUgdGhhdCBgbXNpemUwYCBkZW5vdGVzIHRoZSBtb2RlbCBzaXplIGluY2x1ZGluZyB0aGUgaW50ZXJjZXB0IGFuZCBgbXNpemUgPSBtc2l6ZTAgLTFgIGRlbm90ZXMgdGhlIG1vZGVsIHNpemUgd2l0aG91dCB0aGUgaW50ZXJjZXB0LiAKCmBgYHtyfQpuID0gbGVuZ3RoKFkpCm1zaXplMCA9IGFwcGx5KHN1bWxlYXBzJHdoaWNoLCAxLCBzdW0pOwptc2l6ZSA9IG1zaXplMCAtIDEKQklDID0gc3VtbGVhcHMkYmljOyBCSUMgPSBCSUMgLSBtaW4oQklDKTsgQklDID0gQklDL21heChCSUMpOwpBSUMgPSBuKmxvZyhzdW1sZWFwcyRyc3MvbikgKyAyKm1zaXplMDsKQUlDID0gQUlDIC0gbWluKEFJQyk7IEFJQyA9IEFJQy9tYXgoQUlDKTsKY2JpbmQobXNpemUsIEFJQywgQklDKQpgYGAKCmBgYHtyfQpwbG90KHJhbmdlKG1zaXplKSwgYygwLCAwLjQpLCB0eXBlPSJuIiwgCiAgICAgeGxhYj0iTW9kZWwgU2l6ZSAod2l0aCBJbnRlcmNlcHQpIiwgeWxhYj0iTW9kZWwgU2VsZWN0aW9uIENyaXRlcmlhIikKcG9pbnRzKG1zaXplLCBBSUMsIGNvbD0iYmx1ZSIsIHBjaD0gMSkKcG9pbnRzKG1zaXplLCBCSUMsIGNvbD0iYmxhY2siLCBwY2ggPSAyKQpsZWdlbmQoInRvcHJpZ2h0IiwgbHR5PXJlcCgxLDMpLCBjb2w9YygiYmx1ZSIsICJibGFjayIpLCBsZWdlbmQ9YygiQUlDIiwgIkJJQyIpKQpgYGAK