Data

Brief Description of the Diabetes Data (Efron et al, 2004).

Ten baseline variables: age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

data = read.table("https://liangfgithub.github.io/Data/diabetes.data", header=TRUE)
dim(data)
head(data)

Lasso

The purpose of this analysis is to demonstrate the following:

  • Model size along the CV path plot is not monotonic;
  • Lasso does not explore all possible sub-models, i.e., some variable sets are not listed on the path plot.
library(glmnet)
set.seed(1234)
X = data.matrix(data[, -11])
Y = data$Y
mylasso = glmnet(X, Y, alpha = 1)
plot(mylasso, xvar="lambda",label=TRUE)

Note 1

Check how the model sizes vary along the path.

mylasso$df  

When lambda (i.e., the regularization) decreases, the model size first increases from 0 to 10 and then decreases to 9 and then back to 10. So it is not true that the # of variables always increases when lambda decreases.

You can see how the coefficient of “S3” changes from non-zero to zero, and then back to non-zero, which corresponds to that white block in the right part of the matrix (see the image of the matrix).

lasso.coef = as.matrix(coef(mylasso))
# round(lasso.coef[, -c(1:60)], dig=2)
image(coef(mylasso))

lasso.varset is an 11-by-88 matrix denoting variable sets over 88 different values of lambda (denoted by s0, s1, …, s87).

lasso.coef = as.matrix(coef(mylasso))
lasso.varset = ifelse(lasso.coef==0, 0, 1)
lasso.varset

From the output above, we obtain the following summary of models on the path of Lasso:

  • s0: intercept only (size = 0)
  • s1: (BMI, S5) enters (size = 2)
  • s8: BP enters (size = 3)
  • s12: S3 enters (size = 4)
  • s22: SEX enters (size = 5)
  • s26: S6 enters (size = 6)
  • s29: S1 enters (size = 7)
  • s42: S4 enters (size = 8)
  • s56: S2 enters (size = 9)
  • s57: AGE enters (size = 10)
  • s67: S3 leaves (size = 9)
  • s73: S3 re-enters (size = 10)

S3 is an interesting variable; it seems to be related to other S variables. On Lasso path, S3 enters to size 4 model (4 non-intercept predictors), and stays for a long time and only leaves briefly leading to a size 9 model without S3.

Note 2

Next, you’ll see that the optimal model selected by AIC is not on the path plot.

library(leaps)
RSSleaps = regsubsets(X, Y, nbest=1, nvmax=10)
sumleaps = summary(RSSleaps, matrix=T)
sumleaps
Subset selection object
10 Variables  (and intercept)
    Forced in Forced out
AGE     FALSE      FALSE
SEX     FALSE      FALSE
BMI     FALSE      FALSE
BP      FALSE      FALSE
S1      FALSE      FALSE
S2      FALSE      FALSE
S3      FALSE      FALSE
S4      FALSE      FALSE
S5      FALSE      FALSE
S6      FALSE      FALSE
1 subsets of each size up to 10
Selection Algorithm: exhaustive
          AGE SEX BMI BP  S1  S2  S3  S4  S5  S6 
1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
2  ( 1 )  " " " " "*" " " " " " " " " " " "*" " "
3  ( 1 )  " " " " "*" "*" " " " " " " " " "*" " "
4  ( 1 )  " " " " "*" "*" "*" " " " " " " "*" " "
5  ( 1 )  " " "*" "*" "*" " " " " "*" " " "*" " "
6  ( 1 )  " " "*" "*" "*" "*" "*" " " " " "*" " "
7  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" " "
8  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" "*"
9  ( 1 )  " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

If you look at optimal size (without intercept) 4, 6, 7, 8 models from regsubsets, none of them has S3. That means that if AIC/BIC ends up picking a model with size 4, 6, 7, or 8, then the AIC/BIC optimal model is not on the path of Lasso.

You can check which model returned by AIC/BIC. Note that msize0 denotes the model size including the intercept and msize = msize0 -1 denotes the model size without the intercept.

n = length(Y)
msize0 = apply(sumleaps$which, 1, sum);
msize = msize0 - 1
BIC = sumleaps$bic; BIC = BIC - min(BIC); BIC = BIC/max(BIC);
AIC = n*log(sumleaps$rss/n) + 2*msize0;
AIC = AIC - min(AIC); AIC = AIC/max(AIC);
cbind(msize, AIC, BIC)
   msize         AIC         BIC
1      1 1.000000000 1.000000000
2      2 0.322396838 0.230742097
3      3 0.199478409 0.123584530
4      4 0.132534927 0.083241551
5      5 0.029652507 0.000000000
6      6 0.000000000 0.004169454
7      7 0.005806613 0.050664745
8      8 0.013262202 0.099128337
9      9 0.027634213 0.155847729
10    10 0.043603930 0.214474226
plot(range(msize), c(0, 0.4), type="n", 
     xlab="Model Size (with Intercept)", ylab="Model Selection Criteria")
points(msize, AIC, col="blue", pch= 1)
points(msize, BIC, col="black", pch = 2)
legend("topright", lty=rep(1,3), col=c("blue", "black"), legend=c("AIC", "BIC"))

LS0tCnRpdGxlOiAiU3RhdDU0MjogVmFyaWFibGUgU2VsZWN0aW9uIGZvciBEaWFiZXRlcyBEYXRhIgpkYXRlOiAiRmFsbCAyMDIyIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIHRoZW1lOiByZWFkYWJsZQogICAgdG9jOiBUUlVFCiAgICB0b2NfZmxvYXQ6IFRSVUUKLS0tCgoKIyMgRGF0YQoKKipCcmllZiBEZXNjcmlwdGlvbiBvZiB0aGUgRGlhYmV0ZXMgRGF0YSoqIChFZnJvbiBldCBhbCwgMjAwNCkuIAoKVGVuIGJhc2VsaW5lIHZhcmlhYmxlczogYGFnZWAsIGBzZXhgLCBgYm9keSBtYXNzIGluZGV4YCwgYGF2ZXJhZ2UgYmxvb2QgcHJlc3N1cmVgLCBhbmQgKipzaXgqKiBibG9vZCBzZXJ1bSBtZWFzdXJlbWVudHMgd2VyZSBvYnRhaW5lZCBmb3IgZWFjaCBvZiBuID0gNDQyIGRpYWJldGVzIHBhdGllbnRzLCBhcyB3ZWxsIGFzIHRoZSByZXNwb25zZSBvZiBpbnRlcmVzdCwgYSBxdWFudGl0YXRpdmUgbWVhc3VyZSBvZiBkaXNlYXNlIHByb2dyZXNzaW9uIG9uZSB5ZWFyIGFmdGVyIGJhc2VsaW5lLgoKCmBgYHtyfQpkYXRhID0gcmVhZC50YWJsZSgiaHR0cHM6Ly9saWFuZ2ZnaXRodWIuZ2l0aHViLmlvL0RhdGEvZGlhYmV0ZXMuZGF0YSIsIGhlYWRlcj1UUlVFKQpkaW0oZGF0YSkKaGVhZChkYXRhKQpgYGAKCiMjIExhc3NvIAoKVGhlIHB1cnBvc2Ugb2YgdGhpcyBhbmFseXNpcyBpcyB0byBkZW1vbnN0cmF0ZSB0aGUgZm9sbG93aW5nOiAKCi0gTW9kZWwgc2l6ZSBhbG9uZyB0aGUgQ1YgcGF0aCBwbG90IGlzIG5vdCBtb25vdG9uaWM7Ci0gTGFzc28gZG9lcyBub3QgZXhwbG9yZSBhbGwgcG9zc2libGUgc3ViLW1vZGVscywgaS5lLiwgc29tZSB2YXJpYWJsZSBzZXRzIGFyZSBub3QgbGlzdGVkIG9uIHRoZSBwYXRoIHBsb3QuIAoKYGBge3J9CmxpYnJhcnkoZ2xtbmV0KQpzZXQuc2VlZCgxMjM0KQpYID0gZGF0YS5tYXRyaXgoZGF0YVssIC0xMV0pClkgPSBkYXRhJFkKYGBgCgpgYGB7cn0KbXlsYXNzbyA9IGdsbW5ldChYLCBZLCBhbHBoYSA9IDEpCnBsb3QobXlsYXNzbywgeHZhcj0ibGFtYmRhIixsYWJlbD1UUlVFKQpgYGAKCiMjIE5vdGUgMQoKQ2hlY2sgaG93IHRoZSBtb2RlbCBzaXplcyB2YXJ5IGFsb25nIHRoZSBwYXRoLgoKYGBge3J9Cm15bGFzc28kZGYgIApgYGAKCldoZW4gbGFtYmRhIChpLmUuLCB0aGUgcmVndWxhcml6YXRpb24pIGRlY3JlYXNlcywgdGhlIG1vZGVsIHNpemUgZmlyc3QgaW5jcmVhc2VzIGZyb20gMCB0byAxMCBhbmQgdGhlbiBkZWNyZWFzZXMgdG8gOSBhbmQgdGhlbiBiYWNrIHRvIDEwLiBTbyBpdCBpcyAqKm5vdCB0cnVlKiogdGhhdCB0aGUgIyBvZiB2YXJpYWJsZXMgYWx3YXlzIGluY3JlYXNlcyB3aGVuIGxhbWJkYSBkZWNyZWFzZXMuIAoKCgpZb3UgY2FuIHNlZSBob3cgdGhlIGNvZWZmaWNpZW50IG9mIOKAnFMz4oCdIGNoYW5nZXMgZnJvbSBub24temVybyB0byB6ZXJvLCBhbmQgdGhlbiBiYWNrIHRvIG5vbi16ZXJvLCB3aGljaCBjb3JyZXNwb25kcyB0byB0aGF0IHdoaXRlIGJsb2NrIGluIHRoZSByaWdodCBwYXJ0IG9mIHRoZSBtYXRyaXggKHNlZSB0aGUgaW1hZ2Ugb2YgdGhlIG1hdHJpeCkuCgpgYGB7cn0KbGFzc28uY29lZiA9IGFzLm1hdHJpeChjb2VmKG15bGFzc28pKQojIHJvdW5kKGxhc3NvLmNvZWZbLCAtYygxOjYwKV0sIGRpZz0yKQppbWFnZShjb2VmKG15bGFzc28pKQpgYGAKCmBsYXNzby52YXJzZXRgIGlzIGFuIDExLWJ5LTg4IG1hdHJpeCBkZW5vdGluZyB2YXJpYWJsZSBzZXRzIG92ZXIgODggZGlmZmVyZW50IHZhbHVlcyBvZiBsYW1iZGEgKGRlbm90ZWQgYnkgczAsIHMxLCDigKYsIHM4NykuIAoKYGBge3J9Cmxhc3NvLmNvZWYgPSBhcy5tYXRyaXgoY29lZihteWxhc3NvKSkKbGFzc28udmFyc2V0ID0gaWZlbHNlKGxhc3NvLmNvZWY9PTAsIDAsIDEpCmxhc3NvLnZhcnNldApgYGAKCkZyb20gdGhlIG91dHB1dCBhYm92ZSwgd2Ugb2J0YWluIHRoZSBmb2xsb3dpbmcgc3VtbWFyeSBvZiBtb2RlbHMgb24gdGhlIHBhdGggb2YgTGFzc286CgotIHMwOiBpbnRlcmNlcHQgb25seSAoc2l6ZSA9IDApCi0gczE6IChCTUksIFM1KSBlbnRlcnMgKHNpemUgPSAyKQotIHM4OiBCUCBlbnRlcnMgKHNpemUgPSAzKQotIHMxMjogUzMgZW50ZXJzIChzaXplID0gNCkKLSBzMjI6IFNFWCBlbnRlcnMgKHNpemUgPSA1KQotIHMyNjogUzYgZW50ZXJzIChzaXplID0gNikKLSBzMjk6IFMxIGVudGVycyAoc2l6ZSA9IDcpCi0gczQyOiBTNCBlbnRlcnMgKHNpemUgPSA4KQotIHM1NjogUzIgZW50ZXJzIChzaXplID0gOSkKLSBzNTc6IEFHRSBlbnRlcnMgKHNpemUgPSAxMCkKLSBzNjc6IFMzIGxlYXZlcyAoc2l6ZSA9IDkpCi0gczczOiBTMyByZS1lbnRlcnMgKHNpemUgPSAxMCkKClMzIGlzIGFuIGludGVyZXN0aW5nIHZhcmlhYmxlOyBpdCBzZWVtcyB0byBiZSByZWxhdGVkIHRvIG90aGVyIFMgdmFyaWFibGVzLiBPbiBMYXNzbyBwYXRoLCBTMyBlbnRlcnMgdG8gc2l6ZSA0IG1vZGVsICg0IG5vbi1pbnRlcmNlcHQgcHJlZGljdG9ycyksIGFuZCBzdGF5cyBmb3IgYSBsb25nIHRpbWUgYW5kIG9ubHkgbGVhdmVzIGJyaWVmbHkgbGVhZGluZyB0byBhIHNpemUgOSBtb2RlbCB3aXRob3V0IFMzLgoKIyMgTm90ZSAyCgpOZXh0LCB5b3UnbGwgc2VlIHRoYXQgdGhlIG9wdGltYWwgbW9kZWwgc2VsZWN0ZWQgYnkgQUlDIGlzIG5vdCBvbiB0aGUgcGF0aCBwbG90LiAKCmBgYHtyfQpsaWJyYXJ5KGxlYXBzKQpSU1NsZWFwcyA9IHJlZ3N1YnNldHMoWCwgWSwgbmJlc3Q9MSwgbnZtYXg9MTApCnN1bWxlYXBzID0gc3VtbWFyeShSU1NsZWFwcywgbWF0cml4PVQpCnN1bWxlYXBzCmBgYAoKCklmIHlvdSBsb29rIGF0IG9wdGltYWwgc2l6ZSAod2l0aG91dCBpbnRlcmNlcHQpIDQsIDYsIDcsIDggbW9kZWxzIGZyb20gYHJlZ3N1YnNldHNgLCBub25lIG9mIHRoZW0gaGFzIFMzLiBUaGF0IG1lYW5zIHRoYXQgaWYgQUlDL0JJQyBlbmRzIHVwIHBpY2tpbmcgYSBtb2RlbCB3aXRoIHNpemUgNCwgNiwgNywgb3IgOCwgdGhlbiB0aGUgQUlDL0JJQyBvcHRpbWFsIG1vZGVsIGlzIG5vdCBvbiB0aGUgcGF0aCBvZiBMYXNzby4gCgpZb3UgY2FuIGNoZWNrIHdoaWNoIG1vZGVsIHJldHVybmVkIGJ5IEFJQy9CSUMuIE5vdGUgdGhhdCBgbXNpemUwYCBkZW5vdGVzIHRoZSBtb2RlbCBzaXplIGluY2x1ZGluZyB0aGUgaW50ZXJjZXB0IGFuZCBgbXNpemUgPSBtc2l6ZTAgLTFgIGRlbm90ZXMgdGhlIG1vZGVsIHNpemUgd2l0aG91dCB0aGUgaW50ZXJjZXB0LiAKCmBgYHtyfQpuID0gbGVuZ3RoKFkpCm1zaXplMCA9IGFwcGx5KHN1bWxlYXBzJHdoaWNoLCAxLCBzdW0pOwptc2l6ZSA9IG1zaXplMCAtIDEKQklDID0gc3VtbGVhcHMkYmljOyBCSUMgPSBCSUMgLSBtaW4oQklDKTsgQklDID0gQklDL21heChCSUMpOwpBSUMgPSBuKmxvZyhzdW1sZWFwcyRyc3MvbikgKyAyKm1zaXplMDsKQUlDID0gQUlDIC0gbWluKEFJQyk7IEFJQyA9IEFJQy9tYXgoQUlDKTsKY2JpbmQobXNpemUsIEFJQywgQklDKQpgYGAKCmBgYHtyfQpwbG90KHJhbmdlKG1zaXplKSwgYygwLCAwLjQpLCB0eXBlPSJuIiwgCiAgICAgeGxhYj0iTW9kZWwgU2l6ZSAod2l0aCBJbnRlcmNlcHQpIiwgeWxhYj0iTW9kZWwgU2VsZWN0aW9uIENyaXRlcmlhIikKcG9pbnRzKG1zaXplLCBBSUMsIGNvbD0iYmx1ZSIsIHBjaD0gMSkKcG9pbnRzKG1zaXplLCBCSUMsIGNvbD0iYmxhY2siLCBwY2ggPSAyKQpsZWdlbmQoInRvcHJpZ2h0IiwgbHR5PXJlcCgxLDMpLCBjb2w9YygiYmx1ZSIsICJibGFjayIpLCBsZWdlbmQ9YygiQUlDIiwgIkJJQyIpKQpgYGAK