Before running a GBM model, decide on the following hyperparameters:

- distribution: the loss function; for regression tasks, "gaussian" (squared error) is commonly used
- n.trees: number of trees (default = 100)
- shrinkage: shrinkage factor or learning rate (default = 0.1)
- bag.fraction: fraction of the training data used for learning (default = 0.5)
- cv.folds: number of folds for cross-validation (default = 0, i.e., no CV error returned)
- interaction.depth: depth of individual trees (default = 1)
We first split the Boston Housing dataset into a 70% training set and a 30% test set.
library(gbm)
## Loaded gbm 2.1.8.1
= "https://liangfgithub.github.io/Data/HousingData.csv"
url = read.csv(url)
mydata = nrow(mydata)
n = round(n * 0.3)
ntest set.seed(1234)
= sample(1:n, ntest) test.id
myfit1 = gbm(Y ~ . , data = mydata[-test.id, ],
             distribution = "gaussian",
n.trees = 100,
shrinkage = 1,
interaction.depth = 3,
bag.fraction = 1,
cv.folds = 5)
myfit1
## gbm(formula = Y ~ ., distribution = "gaussian", data = mydata[-test.id,
## ], n.trees = 100, interaction.depth = 3, shrinkage = 1, bag.fraction = 1,
## cv.folds = 5)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 20.
## There were 14 predictors of which 13 had non-zero influence.
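The output notes that 13 of the 14 predictors had non-zero influence. As a quick sketch of how one might inspect this, the summary() method for gbm objects returns each variable's relative influence (here evaluated at the best CV iteration of 20 reported above):

# Relative influence of each predictor at the best CV iteration
summary(myfit1, n.trees = 20, plotit = FALSE)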
Plot the CV error to find the optimal number of trees and prevent overfitting. In our case, the optimal stopping point is about 20 trees.
opt.size = gbm.perf(myfit1)
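Having found the optimal number of trees, a natural next step is to evaluate the model on the 30% test set. The following is a minimal sketch; it assumes the response column is named Y and reuses the test.id split defined above:

# Predict on the held-out rows using only opt.size trees
yhat = predict(myfit1, newdata = mydata[test.id, ], n.trees = opt.size)
# Test MSE at the CV-optimal number of trees
mean((mydata$Y[test.id] - yhat)^2)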