import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
Before running a GBM model, decide on the following hyperparameters:

- n_estimators: number of trees
- learning_rate: shrinkage factor (default = 0.1)
- subsample: fraction of the training data used for learning (default = 1.0)
- max_depth: depth of individual trees (default = 3)

Ref: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
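As a quick reference, these hyperparameters map directly onto the GradientBoostingRegressor constructor. The sketch below is not part of the original analysis; the values shown are simply the scikit-learn defaults listed above.

# A minimal sketch: a GBM with the scikit-learn defaults spelled out explicitly.
gbm_default = GradientBoostingRegressor(
    n_estimators=100,   # number of trees
    learning_rate=0.1,  # shrinkage factor
    subsample=1.0,      # fraction of training rows used per tree
    max_depth=3,        # depth of each individual tree
)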
We first split the Boston Housing dataset into a 70% training set and a 30% test set.
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = pd.read_csv(url)
Housing.head()
X = Housing.drop("Y", axis=1)
y = Housing["Y"]
# random_state makes the 70/30 split reproducible; Python's random.seed()
# has no effect on scikit-learn's train_test_split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=124)
# First fit: no shrinkage (learning_rate = 1.0) and no subsampling.
myfit1 = GradientBoostingRegressor(
    learning_rate=1.0, n_estimators=100, subsample=1)
myfit1.fit(X_train, y_train);
We evaluate the model’s performance on the training and the test data against the number of trees.
# ref: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html
# staged_predict() yields predictions after each boosting iteration,
# so we can track the MSE as trees are added one at a time.
train_mse = np.zeros(100, dtype=np.float64)
test_mse = np.zeros(100, dtype=np.float64)
for i, y_pred in enumerate(myfit1.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)
for i, y_pred in enumerate(myfit1.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)
best_iter = np.argmin(test_mse) + 1  # iterations on the plot are 1-indexed
plt.plot(np.arange(1, train_mse.shape[0] + 1), train_mse, "-", label="Train")
plt.plot(np.arange(1, test_mse.shape[0] + 1), test_mse, "-", label="Test")
plt.axvline(x=best_iter, color="k", ls=":")
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
plt.legend()
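To read off where the test error bottoms out, we can also print the selected iteration; this small print statement is an addition, not part of the original code.

# Report the iteration with the lowest test MSE (best_iter is 1-indexed above).
print(f"Best number of trees: {best_iter}, "
      f"test MSE: {test_mse[best_iter - 1]:.3f}")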
We then fit another GBM, this time with shrinkage (learning_rate = 0.02). Because each tree now contributes only a small update, we need many more trees; we also set the subsampling rate to 50%, so each tree is fit on a random half of the training data.
myfit2 = GradientBoostingRegressor(
learning_rate=0.02, n_estimators=1000, subsample=0.5)
myfit2.fit(X_train, y_train);
# Track the MSE over the 1000 boosting iterations, as before.
train_mse = np.zeros(1000, dtype=np.float64)
test_mse = np.zeros(1000, dtype=np.float64)
for i, y_pred in enumerate(myfit2.staged_predict(X_train)):
    train_mse[i] = np.mean((y_train - y_pred) ** 2.0)
for i, y_pred in enumerate(myfit2.staged_predict(X_test)):
    test_mse[i] = np.mean((y_test - y_pred) ** 2.0)
best_iter = np.argmin(test_mse) + 1  # iterations on the plot are 1-indexed
plt.plot(np.arange(1, train_mse.shape[0] + 1), train_mse, "-", label="Train")
plt.plot(np.arange(1, test_mse.shape[0] + 1), test_mse, "-", label="Test")
plt.axvline(x=best_iter, color="k", ls=":")
plt.xlabel("Boosting Iterations")
plt.ylabel("MSE Loss")
plt.legend()
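As a quick sanity check (an addition, not in the original code), we can compare the two fits by their test MSE using all of the trees in each; with shrinkage and subsampling, the second model typically generalizes better.

# Test MSE of each fitted model, using its full set of trees.
mse1 = np.mean((y_test - myfit1.predict(X_test)) ** 2)
mse2 = np.mean((y_test - myfit2.predict(X_test)) ** 2)
print(f"Test MSE, no shrinkage (100 trees): {mse1:.3f}")
print(f"Test MSE, shrinkage + subsampling (1000 trees): {mse2:.3f}")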
Like Random Forest, GBM also provides variable importance measures.
gbm_importances = pd.Series(
    myfit1.feature_importances_, index=X.columns).sort_values()
fig, ax = plt.subplots()
gbm_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using Mean Decrease in Impurity")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
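Impurity-based importances are computed on the training data and can favor high-cardinality features. As an alternative (an addition to the original analysis), scikit-learn's permutation_importance can be computed on the held-out test set; the sketch below assumes the fitted myfit1 from above.

from sklearn.inspection import permutation_importance

# Permutation importance on the test set: how much the model's score drops
# when a feature's values are randomly shuffled.
perm = permutation_importance(
    myfit1, X_test, y_test, n_repeats=10, random_state=124)
perm_importances = pd.Series(
    perm.importances_mean, index=X.columns).sort_values()

fig, ax = plt.subplots()
perm_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using permutation on the test set")
ax.set_ylabel("Mean decrease in R² score")
fig.tight_layout()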