The Prostate data can be found [here]. The data is to examine the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy.
lcavol
: log cancer volumelweight
: log prostate weightage
: in yearslbph
: log of the amount of benign prostatic
hyperplasiasvi
: seminal vesicle invasionlcp
: log of capsular penetrationgleason
: a numeric vectorpgg45
: percent of Gleason score 4 or 5lpsa
: responseThe first 8 columns are predictors; column 9 is the outcome/response. We do not use column 10, which indicates which 67 observations were used as the “training set” and which 30 as the test set, as described on page 48 in the ESL book. For convenience, we rename the response column with a generic name Y.
myData = read.table(file = "https://hastie.su.domains/ElemStatLearn/datasets/prostate.data", header = TRUE)
names(myData)
myData = myData[, -10]
names(myData)[9] = 'Y'
n = dim(myData)[1] # sample size
p = dim(myData)[2] - 1 # number of non-intercept predictors
full.model = lm( Y ~ ., data = myData);
summary(full.model)
Call:
lm(formula = Y ~ ., data = myData)
Residuals:
Min 1Q Median 3Q Max
-1.76644 -0.35510 -0.00328 0.38087 1.55770
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.181561 1.320568 0.137 0.89096
lcavol 0.564341 0.087833 6.425 6.55e-09 ***
lweight 0.622020 0.200897 3.096 0.00263 **
age -0.021248 0.011084 -1.917 0.05848 .
lbph 0.096713 0.057913 1.670 0.09848 .
svi 0.761673 0.241176 3.158 0.00218 **
lcp -0.106051 0.089868 -1.180 0.24115
gleason 0.049228 0.155341 0.317 0.75207
pgg45 0.004458 0.004365 1.021 0.31000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.6995 on 88 degrees of freedom
Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
First, for each size (from 1 to p), find the variable set (of that size) that produces the smallest RSS.
# install.packages("leaps")
library(leaps)
b = regsubsets(Y ~ ., data=myData, nvmax = p)
rs = summary(b)
rs$which
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Second, compute the model selection score (AIC/BIC) for those p models and find the one that achieves the the smallest score.
msize = 1:p;
par(mfrow=c(1,2))
Aic = n*log(rs$rss/n) + 2*msize;
Bic = n*log(rs$rss/n) + msize*log(n)
plot(msize, Aic, xlab="No. of Parameters", ylab = "AIC")
plot(msize, Bic, xlab="No. of Parameters", ylab = "BIC")
For this particular data set, AIC and BIC end up selecting different models. The model selected by AIC is larger than the one selected by BIC, which is common. (AIC favors larger models while BIC favors smaller models.) Although the model selected by BIC does not have the smallest AIC score, its AIC score is very close to the smallest one.
cbind(rs$which[which.min(Aic),], rs$which[which.min(Bic), ])
[,1] [,2]
(Intercept) TRUE TRUE
lcavol TRUE TRUE
lweight TRUE TRUE
age TRUE FALSE
lbph TRUE FALSE
svi TRUE TRUE
lcp FALSE FALSE
gleason FALSE FALSE
pgg45 FALSE FALSE
leaps
does not return AIC, but BIC. Its BIC differs from
what has been computed above, but the difference is a constant, so the
two BIC formulas (ours and the one used by leaps) are essentially the
same.
cbind(rs$bic, Bic, rs$bic - Bic)
What are the 2nd and 3rd best models in terms of AIC/BIC?
This cannot be answered by looking at the AIC/BIC plots shown above. Instead, we need to run the following code.
# ?regsubsets
b = regsubsets(Y ~ ., data=myData, nbest = 3, nvmax = p)
rs = summary(b)
rs$which
msize = apply(rs$which, 1, sum) - 1
par(mfrow=c(1,2))
Aic = n*log(rs$rss/n) + 2*msize
Bic = n*log(rs$rss/n) + msize*log(n)
plot(msize, Aic, xlab="No. of Parameters", ylab = "AIC")
plot(msize, Bic, xlab="No. of Parameters", ylab = "BIC")
plot(msize[msize > 2], Aic[msize > 2], ylim = c(-67, -62), xlab="No. of Parameters", ylab = "AIC")
plot(msize[msize > 2], Bic[msize > 2], ylim = c(-50, -45), xlab="No. of Parameters", ylab = "BIC")
# top three models by AIC
rs$which[order(Aic)[1:3],]
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
# top three models by BIC
rs$which[order(Bic)[1:3],]
(Intercept) lcavol lweight age lbph svi lcp gleason pgg45
3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
4 TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE FALSE
Set trace=0
if you do not want to see the intermediate
results
#stepAIC = step(full.model, trace=0, direction="both")
stepAIC = step(full.model, direction="both")
Start: AIC=-60.78
Y ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45
Df Sum of Sq RSS AIC
- gleason 1 0.0491 43.108 -62.668
- pgg45 1 0.5102 43.569 -61.636
- lcp 1 0.6814 43.740 -61.256
<none> 43.058 -60.779
- lbph 1 1.3646 44.423 -59.753
- age 1 1.7981 44.857 -58.810
- lweight 1 4.6907 47.749 -52.749
- svi 1 4.8803 47.939 -52.364
- lcavol 1 20.1994 63.258 -25.467
Step: AIC=-62.67
Y ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
Df Sum of Sq RSS AIC
- lcp 1 0.6684 43.776 -63.176
<none> 43.108 -62.668
- pgg45 1 1.1987 44.306 -62.008
- lbph 1 1.3844 44.492 -61.602
- age 1 1.7579 44.865 -60.791
+ gleason 1 0.0491 43.058 -60.779
- lweight 1 4.6429 47.751 -54.746
- svi 1 4.8333 47.941 -54.360
- lcavol 1 21.3191 64.427 -25.691
Step: AIC=-63.18
Y ~ lcavol + lweight + age + lbph + svi + pgg45
Df Sum of Sq RSS AIC
- pgg45 1 0.6607 44.437 -63.723
<none> 43.776 -63.176
+ lcp 1 0.6684 43.108 -62.668
- lbph 1 1.3329 45.109 -62.266
- age 1 1.4878 45.264 -61.934
+ gleason 1 0.0362 43.740 -61.256
- svi 1 4.1766 47.953 -56.336
- lweight 1 4.6553 48.431 -55.373
- lcavol 1 22.7555 66.531 -24.572
Step: AIC=-63.72
Y ~ lcavol + lweight + age + lbph + svi
Df Sum of Sq RSS AIC
<none> 44.437 -63.723
- age 1 1.1588 45.595 -63.226
+ pgg45 1 0.6607 43.776 -63.176
+ gleason 1 0.4767 43.960 -62.769
- lbph 1 1.5087 45.945 -62.484
+ lcp 1 0.1304 44.306 -62.008
- lweight 1 4.3140 48.751 -56.735
- svi 1 5.8509 50.288 -53.724
- lcavol 1 25.9427 70.379 -21.119
n = dim(myData)[1]
stepBIC = step(full.model, direction="both", k=log(n))
Start: AIC=-37.61
Y ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45
Df Sum of Sq RSS AIC
- gleason 1 0.0491 43.108 -42.071
- pgg45 1 0.5102 43.569 -41.039
- lcp 1 0.6814 43.740 -40.658
- lbph 1 1.3646 44.423 -39.155
- age 1 1.7981 44.857 -38.213
<none> 43.058 -37.606
- lweight 1 4.6907 47.749 -32.151
- svi 1 4.8803 47.939 -31.767
- lcavol 1 20.1994 63.258 -4.869
Step: AIC=-42.07
Y ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
Df Sum of Sq RSS AIC
- lcp 1 0.6684 43.776 -45.153
- pgg45 1 1.1987 44.306 -43.985
- lbph 1 1.3844 44.492 -43.579
- age 1 1.7579 44.865 -42.768
<none> 43.108 -42.071
+ gleason 1 0.0491 43.058 -37.606
- lweight 1 4.6429 47.751 -36.723
- svi 1 4.8333 47.941 -36.337
- lcavol 1 21.3191 64.427 -7.668
Step: AIC=-45.15
Y ~ lcavol + lweight + age + lbph + svi + pgg45
Df Sum of Sq RSS AIC
- pgg45 1 0.6607 44.437 -48.274
- lbph 1 1.3329 45.109 -46.818
- age 1 1.4878 45.264 -46.486
<none> 43.776 -45.153
+ lcp 1 0.6684 43.108 -42.071
- svi 1 4.1766 47.953 -40.888
+ gleason 1 0.0362 43.740 -40.658
- lweight 1 4.6553 48.431 -39.924
- lcavol 1 22.7555 66.531 -9.124
Step: AIC=-48.27
Y ~ lcavol + lweight + age + lbph + svi
Df Sum of Sq RSS AIC
- age 1 1.1588 45.595 -50.352
- lbph 1 1.5087 45.945 -49.610
<none> 44.437 -48.274
+ pgg45 1 0.6607 43.776 -45.153
+ gleason 1 0.4767 43.960 -44.746
+ lcp 1 0.1304 44.306 -43.985
- lweight 1 4.3140 48.751 -43.862
- svi 1 5.8509 50.288 -40.851
- lcavol 1 25.9427 70.379 -8.245
Step: AIC=-50.35
Y ~ lcavol + lweight + lbph + svi
Df Sum of Sq RSS AIC
- lbph 1 0.9730 46.568 -52.879
<none> 45.595 -50.352
+ age 1 1.1588 44.437 -48.274
- lweight 1 3.6907 49.286 -47.377
+ pgg45 1 0.3317 45.264 -46.486
+ gleason 1 0.2069 45.389 -46.218
+ lcp 1 0.1012 45.494 -45.993
- svi 1 5.7027 51.298 -43.496
- lcavol 1 24.9384 70.534 -12.607
Step: AIC=-52.88
Y ~ lcavol + lweight + svi
Df Sum of Sq RSS AIC
<none> 46.568 -52.879
+ lbph 1 0.9730 45.595 -50.352
+ age 1 0.6230 45.945 -49.610
+ pgg45 1 0.5007 46.068 -49.352
+ gleason 1 0.3445 46.224 -49.024
+ lcp 1 0.0694 46.499 -48.448
- svi 1 5.1737 51.742 -47.234
- lweight 1 7.1089 53.677 -43.673
- lcavol 1 24.7058 71.274 -16.169
How to retrieve the output from those step-wise procedures?
sel.var.AIC = attr(stepAIC$terms, "term.labels")
sel.var.BIC = attr(stepBIC$terms, "term.labels")
sel.var.AIC
length(sel.var.AIC)
length(sel.var.BIC)
c("age", "lcp") %in% sel.var.AIC
sel.var.BIC %in% sel.var.AIC