Before running a GBM model, decide on the following hyperparameters:
distribution: the loss function; for regression tasks, "gaussian" (squared error) is commonly used
n.trees: number of trees (default = 100)
shrinkage: shrinkage factor or learning rate (default = 0.1)
bag.fraction: fraction of the training data used for learning (default = 0.5)
cv.folds: number of folds for cross-validation (default = 0, i.e., no CV error returned)
interaction.depth: depth of individual trees (default = 1)
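As a reference, here is a minimal, self-contained sketch (using a small simulated data set, not the housing data analyzed below) that writes out the defaults above as explicit gbm() arguments:

library(gbm)

# Sketch only: the defaults listed above written out explicitly,
# applied to a small simulated regression data set.
set.seed(1)
train = data.frame(x1 = rnorm(200), x2 = rnorm(200))
train$Y = train$x1 + rnorm(200)

fit = gbm(Y ~ ., data = train,
          distribution = "gaussian",   # squared-error loss for regression
          n.trees = 100,               # default number of trees
          shrinkage = 0.1,             # default learning rate
          bag.fraction = 0.5,          # default subsampling fraction
          cv.folds = 0,                # default: no cross-validation error returned
          interaction.depth = 1)       # default: stumps (one split per tree)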
We first split the Boston Housing dataset into a 70% training set and a 30% test set.
library(gbm)
## Loaded gbm 2.1.8.1
= "https://liangfgithub.github.io/Data/HousingData.csv"
url = read.csv(url)
mydata = nrow(mydata)
n = round(n * 0.3)
ntest set.seed(1234)
= sample(1:n, ntest) test.id
myfit1 = gbm(Y ~ . , data = mydata[-test.id, ],
             distribution = "gaussian",
n.trees = 100,
shrinkage = 1,
interaction.depth = 3,
bag.fraction = 1,
cv.folds = 5)
myfit1
## gbm(formula = Y ~ ., distribution = "gaussian", data = mydata[-test.id,
## ], n.trees = 100, interaction.depth = 3, shrinkage = 1, bag.fraction = 1,
## cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 20.
## There were 14 predictors of which 13 had non-zero influence.
Plot the CV error to find the optimal number of trees and prevent overfitting. In our case, the optimal stopping point is about 20 trees.
opt.size = gbm.perf(myfit1)
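Since the model was fit with cv.folds = 5, gbm.perf() uses the cross-validation error by default. The same optimum can be read off the stored CV errors directly; a quick check using the cv.error component of the fitted object:

# The CV-optimal iteration obtained directly from the stored CV errors;
# this should agree with the value returned (and plotted) by gbm.perf() above.
which.min(myfit1$cv.error)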
We then evaluate the model's performance on the test data as a function of the number of trees. It turns out that stopping at the optimal opt.size gives reasonably good performance, even though it does not achieve the smallest test error.
size = 1:myfit1$n.trees
test.err = rep(0, length(size))
for(i in 1:length(size)){
  y.pred = predict(myfit1, mydata[test.id, ], n.trees = size[i])
  test.err[i] = sum((mydata$Y[test.id] - y.pred)^2)
}
plot(test.err, type = "n")
lines(size, test.err, lwd = 2)
abline(v = opt.size, lwd = 2, lty = 2, col = "blue")
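To make "reasonably good" concrete, the sketch below compares the test error at the CV-chosen size with the smallest test error over all tree counts, using the test.err vector computed above:

# Test SSE at the CV-selected number of trees vs. the best achievable value.
test.err[opt.size]     # error when stopping at opt.size
min(test.err)          # smallest test error over all tree counts
which.min(test.err)    # number of trees attaining that minimum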
We then construct another GBM model, this time with shrinkage (a learning rate of 0.1). Because each tree now contributes less, we need more trees, and we also set the subsampling rate (bag.fraction) to 50%. In this scenario, the optimal stopping point is over 200 trees.
myfit2 = gbm(Y ~ . , data = mydata[-test.id, ],
             distribution = "gaussian",
n.trees = 1000,
shrinkage = 0.1,
interaction.depth = 3,
bag.fraction = 0.5,
cv.folds = 5)
opt.size = gbm.perf(myfit2)
size = 1:myfit2$n.trees
test.err = rep(0, length(size))
for(i in 1:length(size)){
  y.pred = predict(myfit2, mydata[test.id, ], n.trees = size[i])
  test.err[i] = sum((mydata$Y[test.id] - y.pred)^2)
}
plot(test.err, type = "n")
lines(size, test.err, lwd = 2)
abline(v = opt.size, lwd = 2, lty = 2, col = "blue")
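As an optional side-by-side check (a sketch, not part of the original analysis), each model can be evaluated on the test set at its own CV-optimal number of trees; the optima are recomputed here because opt.size has been overwritten above:

# Compare the two fits at their respective CV-optimal sizes.
opt1 = gbm.perf(myfit1, method = "cv", plot.it = FALSE)
opt2 = gbm.perf(myfit2, method = "cv", plot.it = FALSE)
pred1 = predict(myfit1, mydata[test.id, ], n.trees = opt1)
pred2 = predict(myfit2, mydata[test.id, ], n.trees = opt2)
c(fit1 = sum((mydata$Y[test.id] - pred1)^2),
  fit2 = sum((mydata$Y[test.id] - pred2)^2))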
GBM in R provides variable importance measures similar to those of Random Forest, which you can access using the summary() function.
par(mfrow=c(1, 2))
summary(myfit1, cBars = 10,
method = relative.influence,
las = 2)
## var rel.inf
## lstat lstat 55.48719511
## rm rm 23.49623645
## crim crim 9.92391857
## lon lon 2.61359901
## dis dis 1.97426669
## nox nox 1.67166826
## lat lat 1.57319028
## age age 1.08397273
## ptratio ptratio 0.97920710
## tax tax 0.57316429
## chas chas 0.28845125
## zn zn 0.14235518
## rad rad 0.12114019
## indus indus 0.07163489
summary(myfit1, cBars = 10,
method = permutation.test.gbm,
las = 2)
## var rel.inf
## 1 lstat 46.9921210
## 2 rm 18.1468970
## 3 crim 12.9336936
## 4 lon 5.4035500
## 5 dis 5.0083208
## 6 age 3.9477929
## 7 nox 2.4677146
## 8 lat 2.3663326
## 9 ptratio 0.8346349
## 10 tax 0.7929412
## 11 indus 0.4274400
## 12 chas 0.2313825
## 13 rad 0.2285759
## 14 zn 0.2186029
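If you want the importance scores as a data frame rather than a bar plot (for example, to sort or subset them), summary() accepts plotit = FALSE; a minimal sketch:

# Retrieve the relative-influence table without drawing the bar plot.
imp = summary(myfit1, plotit = FALSE, method = relative.influence)
head(imp, 5)    # top five predictors by relative influence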