import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
Before running a GBM model, decide on the following hyperparameters:

- n_estimators: number of trees
- learning_rate: shrinkage factor (default = 0.1)
- subsample: fraction of the training data used for learning (default = 1.0)
- max_depth: depth of individual trees (default = 3)

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
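As a quick reference, these hyperparameters map directly onto the GradientBoostingRegressor constructor. The sketch below is not part of the original analysis; the values shown are simply the scikit-learn defaults listed above.

# A minimal sketch: a GBM with the scikit-learn defaults spelled out explicitly.
gbm_default = GradientBoostingRegressor(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # shrinkage factor
    subsample=1.0,      # fraction of training rows used per tree
    max_depth=3,        # depth of each individual tree
)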
We first split the Boston Housing dataset into a 70% training set and a 30% test set.
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = pd.read_csv(url)
Housing.head()
X = Housing.drop("Y", axis=1)
y = Housing["Y"]
# random_state makes the 70/30 split reproducible; Python's random.seed()
# has no effect on scikit-learn's train_test_split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=124)
# First fit: no shrinkage (learning_rate = 1.0) and no subsampling.
myfit1 = GradientBoostingRegressor(
    learning_rate=1.0, n_estimators=100, subsample=1)
myfit1.fit(X_train, y_train);
We evaluate the model’s performance on the training and the test data against the number of trees.
# ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html
# staged_predict() yields predictions after each boosting iteration,
# so we can track the MSE as trees are added one at a time.
train_mse = np.zeros(100, dtype=np.float64)
test_mse = np.zeros(100, dtype=np.float64)
for i, y_pred in enumerate(myfit1.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)
for i, y_pred in enumerate(myfit1.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)
best_iter = np.argmin(test_mse) + 1  # iterations on the plot are 1-indexed
plt.plot(np.arange(1, train_mse.shape[0] + 1), train_mse, "-", label="Train")
plt.plot(np.arange(1, test_mse.shape[0] + 1), test_mse, "-", label="Test")
plt.axvline(x=best_iter, color="k", ls=":")
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
plt.legend()
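To read off where the test error bottoms out, we can also print the selected iteration; this small print statement is an addition, not part of the original code.

# Report the iteration with the lowest test MSE (best_iter is 1-indexed above).
print(f"Best number of trees: {best_iter}, "
      f"test MSE: {test_mse[best_iter - 1]:.3f}")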
We then fit another GBM, this time with shrinkage (learning_rate = 0.02). Because each tree now contributes only a small update, we need many more trees; we also set the subsampling rate to 50%, so each tree is fit on a random half of the training data.
myfit2 = GradientBoostingRegressor(
learning_rate=0.02, n_estimators=1000, subsample=0.5)
myfit2.fit(X_train, y_train);
# Track the MSE over the 1000 boosting iterations, as before.
train_mse = np.zeros(1000, dtype=np.float64)
test_mse = np.zeros(1000, dtype=np.float64)
for i, y_pred in enumerate(myfit2.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)
for i, y_pred in enumerate(myfit2.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)
best_iter = np.argmin(test_mse) + 1  # iterations on the plot are 1-indexed
plt.plot(np.arange(1, train_mse.shape[0] + 1), train_mse, "-", label="Train")
plt.plot(np.arange(1, test_mse.shape[0] + 1), test_mse, "-", label="Test")
plt.axvline(x=best_iter, color="k", ls=":")
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
plt.legend()
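As a quick sanity check (an addition, not in the original code), we can compare the two fits by their test MSE using all of the trees in each; with shrinkage and subsampling, the second model typically generalizes better.

# Test MSE of each fitted model, using its full set of trees.
mse1 = np.mean((y_test - myfit1.predict(X_test)) ** 2)
mse2 = np.mean((y_test - myfit2.predict(X_test)) ** 2)
print(f"Test MSE, no shrinkage (100 trees): {mse1:.3f}")
print(f"Test MSE, shrinkage + subsampling (1000 trees): {mse2:.3f}")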
Like Random Forest, GBM also provides variable importance measures.
gbm_importances = pd.Series(
    myfit1.feature_importances_, index=X.columns).sort_values()
fig, ax = plt.subplots()
gbm_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using Mean Decrease in Impurity")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
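Impurity-based importances are computed on the training data and can favor high-cardinality features. As an alternative (an addition to the original analysis), scikit-learn's permutation_importance can be computed on the held-out test set; the sketch below assumes the fitted myfit1 from above.

from sklearn.inspection import permutation_importance

# Permutation importance on the test set: how much the model's score drops
# when a feature's values are randomly shuffled.
perm = permutation_importance(
    myfit1, X_test, y_test, n_repeats=10, random_state=124)
perm_importances = pd.Series(
    perm.importances_mean, index=X.columns).sort_values()

fig, ax = plt.subplots()
perm_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using permutation on the test set")
ax.set_ylabel("Mean decrease in R² score")
fig.tight_layout()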