(PSL) Random Forest for Regression

In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

Load the housing dataset.

In [67]:
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = pd.read_csv(url)
Housing.head()
Out[67]:
Y chas lon lat crim zn indus nox rm age dis rad tax ptratio lstat
0 3.178054 0 -70.955 42.2550 -5.064036 1.8 0.837248 -0.619897 1.883275 3.432567 1.408545 0.000000 5.690359 0.454865 2.231591
1 3.072693 0 -70.950 42.2875 -3.600502 0.0 1.955860 -0.757153 1.859574 5.529585 1.602836 0.693147 5.488938 1.236450 3.023243
2 3.546740 0 -70.936 42.2830 -3.601235 0.0 1.955860 -0.757153 1.971996 2.918119 1.602836 0.693147 5.488938 1.236450 2.007486
3 3.508556 0 -70.928 42.2930 -3.430523 0.0 0.779325 -0.780886 1.945624 1.419592 1.802073 1.098612 5.402677 1.772241 1.714643
4 3.589059 0 -70.922 42.2980 -2.672924 0.0 0.779325 -0.780886 1.966693 2.162710 1.802073 1.098612 5.402677 1.772241 2.308679

Split the housing data into a training set (70%) and a test set (30%).

In [68]:
random.seed(124)
X = Housing.drop("Y", axis=1)
y = Housing["Y"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Model Fitting

To fit the model, we use the RandomForest function from sklearn.

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

We aim to predict the variable Y using all other columns in the data matrix as potential predictors. We set the oob_score to TRUE (which I’ll explain later), max_features = 1.0/3, and specify that the forest should consist of 400 trees.

max_features, if float, indicates the fraction of features considered at each split. the number of variables selected for each split. The recommended value for regression-based Random Forest models is 1/3.

In [69]:
rfModel = RandomForestRegressor(n_estimators=400, oob_score=True, max_features = 1.0/3)
rfModel.fit(X_train, y_train);

Making Predictions

In [70]:
yhat = rfModel.predict(X_test)
np.mean((yhat - y_test) ** 2.0)
Out[70]:
0.03189071320016078

Understanding Training Errors

It’s worth discussing the training error. Random Forest provides two types of training errors.

The first is obtained by calling the prediction function on the training data, usually resulting in an error rate substantially lower than the test error. This is not surprising, as training errors are generally lower than test errors, especially for complex models.

In [71]:
yhat_train = rfModel.predict(X_train)
np.mean((yhat_train - y_train) ** 2.0)
Out[71]:
0.002657482754564075

However, Random Forest also provides a second type of training error—based on so-called out-of-bag samples, if setting oob_score to TRUE. This error rate tends to be closer to the test error and serves as a more reliable estimator of the model’s generalization performance.

In [72]:
yhat_train = rfModel.oob_prediction_
np.mean((yhat_train - y_train) ** 2.0)
Out[72]:
0.019421577663267978

Out-of-Bag Samples

What exactly are “Out-of-Bag” samples, and what is the meaning of ‘bag’ in this specific context?

The term “bag” simply refers to a collection or set where the bootstrap samples can be stored. Imagine your original training dataset as being stored in a box. A bootstrap sample is then formed by drawing, with replacement, from this box. These selected samples are placed into a metaphorical “bag.”

In the Random Forest procedure, each tree is built using one such bootstrap sample. Importantly, this bag is unlikely to contain every unique instance from the original training set. The data points not included in a specific bag are termed “Out-of-Bag” (OOB) samples for that tree. These samples serve as test points for evaluating that tree’s performance.

Here’s how it works: Let’s say your Random Forest consists of five trees. Assume that your first training instance, X_1, appears only in the bootstrap samples for the first three trees. When making predictions for X_1, only the last two trees will be used because X_1 is an OOB sample for these trees. Likewise, each data point will have its prediction made from an ensemble of trees for which it is an Out-of-Bag sample.

This prediction methodology makes the OOB error rate akin to a cross-validation error, differing fundamentally from traditional training error.

In R's Random Forest package, you'll find a useful component named 'oob.times'. This feature provides valuable insights by indicating the number of times each training instance was used as an Out-Of-Bag (OOB) sample. This can be helpful for understanding the generalization performance of the model on different subsets of your data.

However, scikit-learn's RandomForest implementation does not directly offer this 'oob.times' feature. While scikit-learn does allow you to calculate OOB error and OOB predictions (by setting oob_score=True or using the oob_prediction_ attribute, respectively) but doesn't provide the specific counts of how many times each training instance was used as an OOB sample.

Since the 'oob.times' information is not readily available in scikit-learn, I recommend students reviewing the corresponding section in our Rcode page, which could offer valuable insights.

Performance Plots

Next, let’s visually examine the performance of our Random Forest model.

The performance plot features the number of trees on the x-axis and their corresponding mean square error (MSE) on the y-axis; each MSE value on the y-axis is calculated based on a forest with the specific number of trees marked on the x-axis. These MSE values are derived from OOB predictions, making them a reliable estimate more akin to cross-validation error than to simple training error.

The plots reveal that the model’s performance improves rapidly with the initial addition of trees but starts to plateau beyond a certain point. This suggests that adding more trees does not lead to overfitting, although one could potentially achieve similar performance with fewer trees, say, just the first 100.

In [73]:
# ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_ensemble_oob.html
min_estimators = 20
max_estimators = 400

error_rate = []
clf = RandomForestRegressor(oob_score=True)
for i in range(min_estimators, max_estimators + 1):
    clf.set_params(n_estimators=i, warm_start=True)
    clf.fit(X_train, y_train)

    # Record the OOB error for each `n_estimators=i` setting.
    oob_error = np.mean((clf.oob_prediction_ - y_train) ** 2.0)
    error_rate.append((i, oob_error))
In [74]:
error_rate = np.array([*error_rate])

plt.plot(error_rate[:, 0], error_rate[:, 1])
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
Out[74]:
Text(0, 0.5, 'OOB error rate')

Variable Importance

The output feature_importances_ measures importance based on RSS gain.

In [77]:
# ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rfModel.feature_importances_
std = np.std(
    [tree.feature_importances_ for tree in rfModel.estimators_], axis=0)
forest_importances = pd.DataFrame({'Importance': importances}, index=X.columns)
forest_importances = forest_importances.sort_values(by=["Importance"])

fig, ax = plt.subplots()
forest_importances.plot.barh(y="Importance", ax=ax, legend=False)
ax.set_title("Feature importances using Mean Decrease in Impurity")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()

For permutation based importance, see sklearn.inspection.permutation_importance

In [ ]: