PSL: Polynomial Regression¶

Load Data and Packages¶

For linear regression, we utilize two primary packages:

  • statsmodels, when we require detailed statistical outputs such as t-tests and p-values.
  • sklearn, when our focus is solely on prediction.

Additionally, we include a poly function that generates orthogonal polynomials, mirroring R's poly function, and a summarize function adapted from the ISLP package. The latter takes a fitted statsmodels object and returns the coefficient estimates, their standard errors, t-statistics, and p-values.

In [131]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
In [125]:
# copy from ISLP package
def summarize(results,
              conf_int=False):
    """
       https://github.com/intro-stat-learning/ISLP/blob/main/ISLP/models/__init__.py
    """
    tab = results.summary().tables[1]
    results_table = pd.read_html(tab.as_html(),
                                 index_col=0,
                                 header=0)[0]
    if not conf_int:
        columns = ['coef', 'std err', 't', 'P>|t|']
        return results_table[columns]
    return results_table

# converted from R's poly()
# ref: https://stackoverflow.com/questions/39031172/how-poly-generates-orthogonal-polynomials-how-to-understand-the-coefs-ret
# compute orthogonal polynomials
def poly(x, degree):
    # center x, then build the raw design matrix [1, x, x^2, ..., x^degree]
    x_mean = x.mean(axis=0)
    x = x - x_mean
    Z = np.power.outer(x, np.arange(0, degree+1))

    # orthogonalize the columns via a QR decomposition
    x = x.reshape(-1, 1)
    qr = np.linalg.qr(Z, mode='complete')
    z = np.zeros_like(Z)
    np.fill_diagonal(z, np.diag(qr[1]))
    Z = qr[0] @ z
    norm2 = (Z ** 2.0).sum(axis=0)                         # squared column norms
    alpha = (x * Z ** 2.0).sum(axis=0) / norm2 + x_mean    # recurrence coefficients
    Z = Z / np.sqrt(norm2)                                 # scale columns to unit length
    norm2 = np.insert(norm2, 0, 1.0, axis=0)

    # drop the constant column; alpha and norm2 are needed later by poly_predict
    return Z[:, 1:], alpha[:-1], norm2


def poly_predict(x_new, degree, alpha, norm2):
    # evaluate the same orthogonal polynomial basis at new x values using the
    # three-term recurrence defined by alpha and norm2 (as R's predict.poly does)
    n = x_new.shape[0]
    Z = np.ones((n, degree + 1))
    Z[:, 1] = x_new - alpha[0]
    for i in range(1, degree):
        Z[:, i + 1] = (x_new - alpha[i]) * Z[:, i]
        Z[:, i + 1] -= (norm2[i + 1] / norm2[i]) * Z[:, i - 1]

    Z = Z / np.sqrt(norm2[1:])   # same scaling as in poly()

    return Z[:, 1:]              # drop the constant column
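
As a quick illustration of how these two helpers work together (a minimal sketch on synthetic data, not part of the original notebook; all _demo names are hypothetical): fit a model on the basis returned by poly, then transform any new x values with poly_predict, using the alpha and norm2 values returned at fitting time, before calling predict.

In [ ]:
# Minimal sketch on synthetic data: fit on poly(x, d), then evaluate the
# same orthogonal basis at new x values with poly_predict before predicting.
rng = np.random.default_rng(0)
x_demo = rng.uniform(20, 60, size=200)
y_demo = 5 + 0.1 * x_demo + rng.normal(size=200)

X_demo, alpha_demo, norm2_demo = poly(x_demo, 2)      # degree-2 orthogonal basis
fit_demo = LinearRegression().fit(X_demo, y_demo)
x_grid_demo = np.linspace(20, 60, 5)
fit_demo.predict(poly_predict(x_grid_demo, 2, alpha_demo, norm2_demo))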

Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.

In [ ]:
url = "https://liangfgithub.github.io/Data/Wage.csv"
Wage = pd.read_csv(url)
Wage.describe()
In [122]:
age = Wage['age'].to_numpy()
wage = Wage['wage']

Fit a Polynomial Regression Model¶

There are two ways to specify the design matrix for polynomial regression:

  • Method 1: Manually specify each term of the polynomial. For example, a cubic model includes the terms x, x^2, x^3.
  • Method 2: Use the poly function. This is more compact and efficient, especially when d is large. For example, poly(x, 3) directly creates a cubic polynomial basis.

When you use the poly function, such as poly(age, 3), you are not directly getting age, age^2, and age^3. Instead, you get three orthogonalized (and normalized) polynomial variables.

For example, poly(age, 3) returns an n-by-3 matrix:

  • Each column has mean zero and unit length, and the columns are mutually orthogonal (as the output of X2.T @ X2 below confirms).
  • The 1st column is a linear combination of age and the intercept, capturing the linear trend.
  • The 2nd column is a linear combination of age^2, age, and the intercept, capturing the quadratic trend.
  • The 3rd column is a linear combination of age^3, age^2, age, and the intercept, capturing the cubic trend. (A numerical check of this linear-combination structure follows the next few cells.)
In [173]:
X1 = np.power.outer(age, np.arange(1, 4))
M1 = sm.OLS(wage, sm.add_constant(X1)).fit()
summarize(M1)
Out[173]:
coef std err t P>|t|
const -75.2439 22.184 -3.392 0.001
x1 10.1900 1.605 6.348 0.000
x2 -0.1680 0.037 -4.559 0.000
x3 0.0008 0.000 3.143 0.002
In [174]:
X2, alpha, norm2 = poly(age, 3)
M2 = sm.OLS(wage, sm.add_constant(X2)).fit()
summarize(M2)
Out[174]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002
In [175]:
print(X2.shape)
print(X2.mean(axis=0))
print(np.round(X2.T @ X2, 4))
(3000, 3)
[-2.46330734e-19  9.47195158e-19 -2.37078875e-19]
[[ 1. -0. -0.]
 [-0.  1. -0.]
 [-0. -0.  1.]]
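
To back up the linear-combination claims above, here is a quick numerical check (not part of the original notebook): every column of X2 should lie in the column space spanned by the intercept and the raw powers age, age^2, age^3.

In [ ]:
# Regress each column of X2 on [1, age, age^2, age^3]; if X2 lies in that span,
# the least-squares reconstruction should match X2 up to floating-point error.
raw = sm.add_constant(np.power.outer(age, np.arange(1, 4)).astype(float))
coef, _, _, _ = np.linalg.lstsq(raw, X2, rcond=None)
print(np.abs(raw @ coef - X2).max())   # expected to be tiny (floating-point error only)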

Prediction¶

Method 1 and Method 2 produce different coefficients, but the fitted values (predictions) are identical.

In [179]:
fit1 = LinearRegression().fit(X1, wage)
fit1.intercept_, fit1.coef_
Out[179]:
(-75.24391422432832,
 array([ 1.01899915e+01, -1.68028587e-01,  8.49452197e-04]))
In [180]:
fit2 = LinearRegression().fit(X2, wage)
fit2.intercept_, fit2.coef_
Out[180]:
(111.70360820174345, array([ 447.06785276, -478.31580585,  125.52168603]))
In [183]:
age_new = np.array([82])
age_new = poly_predict(age_new, 3, alpha, norm2)
fit2.predict(age_new)
Out[183]:
array([98.87192947])
In [191]:
tmp = np.power.outer(82, np.arange(1, 4))
fit1.intercept_ + tmp @ fit1.coef_
Out[191]:
98.87192946555125
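
As a further sanity check (not in the original notebook), the two fits should produce identical fitted values on the training data, and poly_predict applied to the training ages should reproduce the orthogonal basis X2:

In [ ]:
# Both checks should print True (up to floating-point tolerance).
print(np.allclose(fit1.predict(X1), fit2.predict(X2)))
print(np.allclose(poly_predict(age, 3, alpha, norm2), X2))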

Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?

In linear regression, the t-test for a variable gauges its conditional contribution given all the other variables in the model. In both fit1 and fit2, the unique contribution of the cubic term of age comes from the last variable alone, whether it is represented as age^3 or as the third column of poly(age, 3). Thus the p-value for the last variable, which measures the cubic term's contribution after accounting for the other variables, is the same in both models.

Consequently, if your primary concern is the significance of the coefficient on the highest-order term, the choice of design matrix is inconsequential: whether you use the raw polynomial design matrix or the orthogonal one, the significance of the highest-order term is the same.

This insight becomes particularly valuable when employing forward and backward selection methods to determine the optimal d.
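
We can confirm this numerically (a quick check, not in the original notebook): the t-statistic of the last predictor is the same in M1 and M2.

In [ ]:
# The t-statistic (and hence the p-value) of the last predictor
# should match across the two parameterizations.
print(np.asarray(M1.tvalues)[-1], np.asarray(M2.tvalues)[-1])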

Fitted Curve¶

fit1 and fit2 return the identical fitted curve; below we plot it using fit2.

In [135]:
age_grid = np.arange(age.min(), age.max() + 1)
age_grid_poly = poly_predict(age_grid, 3, alpha, norm2)
y_grid = fit2.predict(age_grid_poly)

plt.plot(age_grid, y_grid)
plt.scatter(age, wage, marker='.', alpha=0.5, c='orange')
plt.title("Degree-3 Polynomial")
plt.xlabel("Age")
plt.ylabel("Wage")
Out[135]:
[Figure: scatter of wage versus age with the fitted degree-3 polynomial curve]

Next, we explore how to select the polynomial degree d using forward and backward approaches.

Forward Selection¶

  • Starting with a linear model, we introduce a quadratic term and assess the significance of this newly added variable. Given its significant p-value, we retain the quadratic term.

  • Proceeding, we introduce a cubic term. Once again, its significance necessitates its inclusion.

  • On adding the quartic term (d = 4), the p-value is about 5.1%, right at the edge of our significance threshold. With a strict 5% cutoff, we revert to the cubic model, since the quartic term fails to meet the criterion.

  • Hence, forward selection settles on the cubic regression model. (A compact loop automating this scan is sketched right after this list.)
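
For reference, here is a compact loop (not part of the original notebook) that automates this scan by printing the p-value of the highest-order term for each candidate degree; the same numbers drive the backward pass in the next section.

In [ ]:
# p-value of the highest-order term for each candidate degree d
for d in range(1, 7):
    Xd, _, _ = poly(age, d)
    fit_d = sm.OLS(wage, sm.add_constant(Xd)).fit()
    print(f"d = {d}: p-value of the degree-{d} term = {np.asarray(fit_d.pvalues)[-1]:.4f}")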

In [192]:
d = 2
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[192]:
coef std err t P>|t|
const 111.7036 0.730 152.985 0.0
x1 447.0679 39.993 11.179 0.0
x2 -478.3158 39.993 -11.960 0.0
In [193]:
d = 3
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[193]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002
In [194]:
d = 4
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[194]:
coef std err t P>|t|
const 111.7036 0.729 153.283 0.000
x1 447.0679 39.915 11.201 0.000
x2 -478.3158 39.915 -11.983 0.000
x3 125.5217 39.915 3.145 0.002
x4 -77.9112 39.915 -1.952 0.051

Backward Selection¶

  • In the backward approach, we initiate our analysis with a higher-degree polynomial, say d = 6.
  • Beginning with the highest-degree term, we evaluate the significance of its coefficient to decide whether to keep or drop it. For the sixth-degree term, the p-value is about 12%, suggesting its removal.
  • Reducing to d = 5, the p-value of the fifth-degree term is about 37%, again justifying removal.
  • Continuing this pattern, the quartic term's p-value (about 5.1%) prompts its removal under the strict 5% cutoff.
  • Consequently, both forward and backward approaches converge on d=3 for this dataset.
In [195]:
d = 6
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[195]:
coef std err t P>|t|
const 111.7036 0.729 153.316 0.000
x1 447.0679 39.906 11.203 0.000
x2 -478.3158 39.906 -11.986 0.000
x3 125.5217 39.906 3.145 0.002
x4 -77.9112 39.906 -1.952 0.051
x5 -35.8129 39.906 -0.897 0.370
x6 62.7077 39.906 1.571 0.116
In [196]:
d = 5
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[196]:
coef std err t P>|t|
const 111.7036 0.729 153.278 0.000
x1 447.0679 39.916 11.200 0.000
x2 -478.3158 39.916 -11.983 0.000
x3 125.5217 39.916 3.145 0.002
x4 -77.9112 39.916 -1.952 0.051
x5 -35.8129 39.916 -0.897 0.370
In [197]:
d = 4
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[197]:
coef std err t P>|t|
const 111.7036 0.729 153.283 0.000
x1 447.0679 39.915 11.201 0.000
x2 -478.3158 39.915 -11.983 0.000
x3 125.5217 39.915 3.145 0.002
x4 -77.9112 39.915 -1.952 0.051
In [198]:
d = 3
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[198]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002