(PSL) GBM for Regression

In [21]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

Preliminaries

Before running a GBM model, decide on the following hyperparameters (set explicitly in the sketch after this list):

  • n_estimators: number of trees
  • learning_rate: shrinkage factor (default = 0.1)
  • subsample: fraction of the training data used to fit each tree (default = 1.0)
  • max_depth: depth of individual trees (default = 3)

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
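
As a quick reference, a minimal sketch instantiating the model with all four hyperparameters spelled out (the values shown are the scikit-learn defaults, not recommendations):

In [ ]:
gbm = GradientBoostingRegressor(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # shrinkage factor
    subsample=1.0,      # fraction of the training data used per tree
    max_depth=3,        # depth of individual trees
)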

Case Study: Housing Data

We load the Boston Housing data and split it into a 70% training set and a 30% test set.

In [22]:
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = pd.read_csv(url)
Housing.head()
Out[22]:
Y chas lon lat crim zn indus nox rm age dis rad tax ptratio lstat
0 3.178054 0 -70.955 42.2550 -5.064036 1.8 0.837248 -0.619897 1.883275 3.432567 1.408545 0.000000 5.690359 0.454865 2.231591
1 3.072693 0 -70.950 42.2875 -3.600502 0.0 1.955860 -0.757153 1.859574 5.529585 1.602836 0.693147 5.488938 1.236450 3.023243
2 3.546740 0 -70.936 42.2830 -3.601235 0.0 1.955860 -0.757153 1.971996 2.918119 1.602836 0.693147 5.488938 1.236450 2.007486
3 3.508556 0 -70.928 42.2930 -3.430523 0.0 0.779325 -0.780886 1.945624 1.419592 1.802073 1.098612 5.402677 1.772241 1.714643
4 3.589059 0 -70.922 42.2980 -2.672924 0.0 0.779325 -0.780886 1.966693 2.162710 1.802073 1.098612 5.402677 1.772241 2.308679
In [23]:
X = Housing.drop("Y", axis=1)
y = Housing["Y"]

# Pass random_state so the 70/30 split is reproducible; seeding Python's
# `random` module has no effect on train_test_split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=124)

Fit a GBM

In [24]:
# No shrinkage (learning_rate=1.0) and no subsampling: each tree is fit
# to the full training residuals at every iteration.
myfit1 = GradientBoostingRegressor(
    learning_rate=1.0, n_estimators=100, subsample=1.0)
myfit1.fit(X_train, y_train);

Performance Evaluation

We evaluate the model's MSE on the training and test data as a function of the number of trees, using staged_predict to obtain the prediction after each boosting iteration.

In [25]:
# ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html
train_mse = np.zeros(100, dtype=np.float64)
test_mse = np.zeros(100, dtype=np.float64)

for i, y_pred in enumerate(myfit1.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)

for i, y_pred in enumerate(myfit1.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)

best_iter = np.argmin(test_mse)
plt.plot(
    (np.arange(train_mse.shape[0]) + 1),
    train_mse,
    "-",
)

plt.plot(
    (np.arange(test_mse.shape[0]) + 1),
    test_mse,
    "-",
)
plt.axvline(x=best_iter, color='k', ls=":")
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
Out[25]:
Text(0, 0.5, 'MSE Loss')
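
To read the stopping point off numerically rather than from the plot, one can print the best iteration and its test MSE from the arrays just computed:

In [ ]:
print(f"Best iteration: {best_iter + 1}, test MSE: {test_mse[best_iter]:.4f}")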

Fit Another GBM

We then fit a second GBM, this time with shrinkage (learning_rate = 0.02). Because each tree now contributes only a small step, we need more trees; we also set the subsampling rate to 50%, so each tree is fit on a random half of the training data (stochastic gradient boosting).

In [26]:
myfit2 = GradientBoostingRegressor(
    learning_rate=0.02, n_estimators=1000, subsample=0.5)
myfit2.fit(X_train, y_train);
In [27]:
n_trees = myfit2.n_estimators
train_mse = np.zeros(n_trees, dtype=np.float64)
test_mse = np.zeros(n_trees, dtype=np.float64)

for i, y_pred in enumerate(myfit2.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)

for i, y_pred in enumerate(myfit2.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)

best_iter = np.argmin(test_mse)
iters = np.arange(1, n_trees + 1)
plt.plot(iters, train_mse, "-", label="Train")
plt.plot(iters, test_mse, "-", label="Test")
plt.axvline(x=best_iter + 1, color="k", ls=":")  # +1: iterations are 1-based
plt.legend()
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
Out[27]:
Text(0, 0.5, 'MSE Loss')
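
Rather than scanning the staged MSE curve by hand, scikit-learn can also stop boosting early once an internal validation score stops improving. A minimal sketch (the specific values here are illustrative, not tuned):

In [ ]:
# Early stopping: hold out 20% of the training data internally and stop
# once the validation loss has not improved for 10 consecutive iterations.
myfit3 = GradientBoostingRegressor(
    learning_rate=0.02, n_estimators=1000, subsample=0.5,
    validation_fraction=0.2, n_iter_no_change=10, random_state=124)
myfit3.fit(X_train, y_train)
myfit3.n_estimators_  # number of trees actually fit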

Variable Importance

Like Random Forest, GBM provides variable importance measures; scikit-learn reports impurity-based importances via feature_importances_.

In [28]:
gbm_importances = pd.Series(
    myfit1.feature_importances_, index=X.columns).sort_values()

fig, ax = plt.subplots()
gbm_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using Mean Decrease in Impurity")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
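
Impurity-based importances are computed on the training data and can favor high-cardinality features. As a cross-check, one might compute permutation importance on the test set (a sketch using sklearn.inspection; n_repeats=10 is an arbitrary choice):

In [ ]:
from sklearn.inspection import permutation_importance

# Permutation importance: average drop in test-set R^2 when one column
# is randomly shuffled (n_repeats shuffles per feature)
perm = permutation_importance(
    myfit1, X_test, y_test, n_repeats=10, random_state=124)
perm_importances = pd.Series(
    perm.importances_mean, index=X.columns).sort_values()
perm_importances.plot.bar()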