Let’s discuss how to fit a Random Forest model in R. To begin, we
load the randomForest
package—note that the ‘F’ is
uppercase. We then load the housing dataset and split it into a training
set (70%) and a test set (30%).
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
= "https://liangfgithub.github.io/Data/HousingData.csv"
url = read.csv(url)
mydata = nrow(mydata)
n = round(n * 0.3)
ntest set.seed(134)
= sample(1:n, ntest) test.id
To fit the model, we use the randomForest
function,
employing syntax similar to standard predictive modeling functions in
R.
We aim to predict the variable Y using all other columns in the data
matrix as potential predictors. We set the importance
parameter to TRUE (which I’ll explain later) and specify that the forest
should consist of 400 trees.
names(mydata)
## [1] "Y" "chas" "lon" "lat" "crim" "zn" "indus"
## [8] "nox" "rm" "age" "dis" "rad" "tax" "ptratio"
## [15] "lstat"
= randomForest(Y ~ ., data = mydata[-test.id, ],
rfModel importance = T, ntree=400);
After fitting the model, various outputs are generated.
names(rfModel)
One key parameter is mtry
, which indicates the number of
variables selected for each split. In our example, mtry
is
set to 4, aligning with the recommended value for regression-based
Random Forest models: p/3, where p is the number of predictors.
## the default value for mtry is p/3 for regression
## p = ncol(mydata) - 1 = 14
## mtry = 14/3 = 4
$mtry rfModel
To generate predictions, we use the fitted model and provide the feature matrix (X) for the test set. We can then calculate the average test error, which, in our case, is approximately 3%.
= predict(rfModel, mydata[test.id, ])
yhat.test sum((mydata$Y[test.id] - yhat.test)^2)/length(test.id)
## [1] 0.03093106
It’s worth discussing the training error. Random Forest provides two types of training errors.
The first is obtained by calling the prediction function on the training data, resulting in an error rate, 0.45%, substantially lower than the test error. This is not surprising, as training errors are generally lower than test errors, especially for complex models.
= predict(rfModel, mydata[-test.id, ])
yhat.train sum((mydata$Y[-test.id] - yhat.train) ^2)/(n - ntest)
## [1] 0.004530652
However, Random Forest also provides a second type of training error—based on so-called out-of-bag samples. This error rate tends to be closer to the test error and serves as a more reliable estimator of the model’s generalization performance.
sum((mydata$Y[-test.id] - rfModel$predicted) ^2)/(n - ntest)
## [1] 0.02201958
What exactly are “Out-of-Bag” samples, and what is the meaning of ‘bag’ in this specific context?
The term “bag” simply refers to a collection or set where the bootstrap samples can be stored. Imagine your original training dataset as being stored in a box. A bootstrap sample is then formed by drawing, with replacement, from this box. These selected samples are placed into a metaphorical “bag.”
In the Random Forest procedure, each tree is built using one such bootstrap sample. Importantly, this bag is unlikely to contain every unique instance from the original training set. The data points not included in a specific bag are termed “Out-of-Bag” (OOB) samples for that tree. These samples serve as test points for evaluating that tree’s performance.
Here’s how it works: Let’s say your Random Forest consists of five trees. Assume that your first training instance, X_1, appears only in the bootstrap samples for the first three trees. When making predictions for X_1, only the last two trees will be used because X_1 is an OOB sample for these trees. Likewise, each data point will have its prediction made from an ensemble of trees for which it is an Out-of-Bag sample.
This prediction methodology makes the OOB error rate akin to a
cross-validation error, differing fundamentally from traditional
training error. In Random Forest’s output, there’s a component called
oob.times
that indicates, for each training instance, how
many times it was an OOB sample. For example, if the first entry in
oob.times
is 148, it means that the first training instance
was not used in building 148 out of the total 400 trees in the
forest.
$oob.times[1:5] rfModel
## [1] 148 152 134 150 141
# length(rfModel$oob)
To delve a bit deeper, the average number of times a sample becomes an OOB can be calculated. This figure is derived from the mathematical limit that arises when considering the probability of a single instance not being included in a bootstrap sample. The formula \[ (1 - \frac{1}{n} )^n\] approximates exp(-1) as n becomes large. In our case, multiplying this limit by the total number of trees (400) gives us an average Out-of-Bag count of 147, corroborating our observations.
## oob.times --> ntree * exp(-1) = ntree * 0.368
$ntree * exp(-1) rfModel
## [1] 147.1518
mean(rfModel$oob.times)
## [1] 146.5791
Next, let’s visually examine the performance of our Random Forest model using two plots.
The first is generated using the Random Forest’s built-in plotting function. The second is our custom reproduction of the first, created to ensure we fully understand what exactly is depicted in the default plot.
Both plots feature the number of trees on the x-axis and their corresponding mean square error (MSE) on the y-axis; each MSE value on the y-axis is calculated based on a forest with the specific number of trees marked on the x-axis. These MSE values are derived from OOB predictions, making them a reliable estimate more akin to cross-validation error than to simple training error.
The plots reveal that the model’s performance improves rapidly with the initial addition of trees but starts to plateau beyond a certain point. This suggests that adding more trees does not lead to overfitting, although one could potentially achieve similar performance with fewer trees, say, just the first 100.
par(mfrow=c(1, 2))
plot(rfModel)
= rfModel$mse
tmp plot(c(0, rfModel$ntree), range(tmp), type="n",
xlab = "Number of trees", ylab="Error")
lines(tmp)
In R, setting importance = TRUE
when fitting the Random
Forest model allows you to retrieve an importance table. This table
provides both importance measures for each predictor variable:
Both scaled and unscaled versions of these measures are available, along with their standard errors.
## default %IncMSE is normalized
$importance rfModel
## %IncMSE IncNodePurity
## chas 0.0007086229 0.2480448
## lon 0.0135125886 3.5880722
## lat 0.0042584017 1.0158025
## crim 0.0176796018 6.7019603
## zn 0.0009654224 0.2330893
## indus 0.0062395887 2.0610976
## nox 0.0140181455 3.1427440
## rm 0.0410363280 11.3062363
## age 0.0074238948 2.4463350
## dis 0.0086043495 3.2297331
## rad 0.0010435982 0.3092604
## tax 0.0092346131 3.4895404
## ptratio 0.0103262972 4.2626560
## lstat 0.0707457432 15.2265327
importance(rfModel, scale = F)
## %IncMSE IncNodePurity
## chas 0.0007086229 0.2480448
## lon 0.0135125886 3.5880722
## lat 0.0042584017 1.0158025
## crim 0.0176796018 6.7019603
## zn 0.0009654224 0.2330893
## indus 0.0062395887 2.0610976
## nox 0.0140181455 3.1427440
## rm 0.0410363280 11.3062363
## age 0.0074238948 2.4463350
## dis 0.0086043495 3.2297331
## rad 0.0010435982 0.3092604
## tax 0.0092346131 3.4895404
## ptratio 0.0103262972 4.2626560
## lstat 0.0707457432 15.2265327
#cbind(importance(rfModel, scale = TRUE),
# importance(rfModel, scale = F)[,1]/rfModel$importanceSD)
To visualize variable importance, you can use command
varImpPlot
that generate plots for each of the importance
measures. These plots help in understanding which variables are most
relevant, good, or mediocre in their predictive power.
par(mfrow = c(1,2))
varImpPlot(rfModel, type=1)
varImpPlot(rfModel, type=2)