PSL: Polynomial Regression¶

Load Data and Packages¶

For linear regression, we utilize two primary packages:

  • statsmodels, when we require detailed statistical outputs such as t-tests and p-values.
  • sklearn, when our focus is solely on prediction.

Additionally, we include a poly function that generates orthogonal polynomials, mirroring R's poly function, and a summarize function adapted from the ISLP package. The latter takes a fitted statsmodels object and returns the coefficient estimates, their standard errors, t-statistics, and p-values.

In [131]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
In [125]:
# copy from ISLP package
def summarize(results,
              conf_int=False):
    """
       https://github.com/intro-stat-learning/ISLP/blob/main/ISLP/models/__init__.py
    """
    tab = results.summary().tables[1]
    results_table = pd.read_html(tab.as_html(),
                                 index_col=0,
                                 header=0)[0]
    if not conf_int:
        columns = ['coef', 'std err', 't', 'P>|t|']
        return results_table[columns]
    return results_table

# converted from R's poly()
# ref: https://stackoverflow.com/questions/39031172/how-poly-generates-orthogonal-polynomials-how-to-understand-the-coefs-ret
# compute orthogonal polynomials
def poly(x, degree):
    # center x, then build the raw design matrix [1, x, x^2, ..., x^degree]
    x_mean = x.mean(axis=0)
    x = x - x_mean
    Z = np.power.outer(x, np.arange(0, degree+1))

    # orthogonalize the columns via a QR decomposition
    x = x.reshape(-1, 1)
    qr = np.linalg.qr(Z, mode='complete')
    z = np.zeros_like(Z)
    np.fill_diagonal(z, np.diag(qr[1]))
    Z = qr[0] @ z
    norm2 = (Z ** 2.0).sum(axis=0)                         # squared column norms
    alpha = (x * Z ** 2.0).sum(axis=0) / norm2 + x_mean    # recurrence coefficients
    Z = Z / np.sqrt(norm2)                                 # scale columns to unit length
    norm2 = np.insert(norm2, 0, 1.0, axis=0)

    # drop the constant column; alpha and norm2 are needed later by poly_predict
    return Z[:, 1:], alpha[:-1], norm2


def poly_predict(x_new, degree, alpha, norm2):
    # evaluate the same orthogonal polynomial basis at new x values using the
    # three-term recurrence defined by alpha and norm2 (as R's predict.poly does)
    n = x_new.shape[0]
    Z = np.ones((n, degree + 1))
    Z[:, 1] = x_new - alpha[0]
    for i in range(1, degree):
        Z[:, i + 1] = (x_new - alpha[i]) * Z[:, i]
        Z[:, i + 1] -= (norm2[i + 1] / norm2[i]) * Z[:, i - 1]

    Z = Z / np.sqrt(norm2[1:])   # same scaling as in poly()

    return Z[:, 1:]              # drop the constant column
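
As a quick illustration of how these two helpers work together (a minimal sketch on synthetic data, not part of the original notebook; all _demo names are hypothetical): fit a model on the basis returned by poly, then transform any new x values with poly_predict, using the alpha and norm2 values returned at fitting time, before calling predict.

In [ ]:
# Minimal sketch on synthetic data: fit on poly(x, d), then evaluate the
# same orthogonal basis at new x values with poly_predict before predicting.
rng = np.random.default_rng(0)
x_demo = rng.uniform(20, 60, size=200)
y_demo = 5 + 0.1 * x_demo + rng.normal(size=200)

X_demo, alpha_demo, norm2_demo = poly(x_demo, 2)      # degree-2 orthogonal basis
fit_demo = LinearRegression().fit(X_demo, y_demo)
x_grid_demo = np.linspace(20, 60, 5)
fit_demo.predict(poly_predict(x_grid_demo, 2, alpha_demo, norm2_demo))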

Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.

In [ ]:
url = "https://liangfgithub.github.io/Data/Wage.csv"
Wage = pd.read_csv(url)
Wage.describe()
In [122]:
age = Wage['age'].to_numpy()
wage = Wage['wage']

Fit a Polynomial Regression Model¶

There are two ways to specify the design matrix for polynomial regression:

  • Method 1: Manually specify each term of the polynomial. For example, a cubic model includes the terms x, x^2, x^3.
  • Method 2: Use the poly function. This is more compact and efficient, especially when d is large. For example, poly(x, 3) directly creates a cubic polynomial basis.

When you use the poly function, such as poly(age, 3), you are not directly getting age, age^2, and age^3. Instead, you get three orthogonalized (and normalized) polynomial variables.

For example, poly(age, 3) returns an n-by-3 matrix:

  • Each column has mean zero and unit length, and the columns are mutually orthogonal (as the output of X2.T @ X2 below confirms).
  • The 1st column is a linear combination of age and the intercept, capturing the linear trend.
  • The 2nd column is a linear combination of age^2, age, and the intercept, capturing the quadratic trend.
  • The 3rd column is a linear combination of age^3, age^2, age, and the intercept, capturing the cubic trend. (A numerical check of this linear-combination structure follows the next few cells.)
In [173]:
X1 = np.power.outer(age, np.arange(1, 4))
M1 = sm.OLS(wage, sm.add_constant(X1)).fit()
summarize(M1)
Out[173]:
coef std err t P>|t|
const -75.2439 22.184 -3.392 0.001
x1 10.1900 1.605 6.348 0.000
x2 -0.1680 0.037 -4.559 0.000
x3 0.0008 0.000 3.143 0.002
In [174]:
X2, alpha, norm2 = poly(age, 3)
M2 = sm.OLS(wage, sm.add_constant(X2)).fit()
summarize(M2)
Out[174]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002
In [175]:
print(X2.shape)
print(X2.mean(axis=0))
print(np.round(X2.T @ X2, 4))
(3000, 3)
[-2.46330734e-19  9.47195158e-19 -2.37078875e-19]
[[ 1. -0. -0.]
 [-0.  1. -0.]
 [-0. -0.  1.]]
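
To back up the linear-combination claims above, here is a quick numerical check (not part of the original notebook): every column of X2 should lie in the column space spanned by the intercept and the raw powers age, age^2, age^3.

In [ ]:
# Regress each column of X2 on [1, age, age^2, age^3]; if X2 lies in that span,
# the least-squares reconstruction should match X2 up to floating-point error.
raw = sm.add_constant(np.power.outer(age, np.arange(1, 4)).astype(float))
coef, _, _, _ = np.linalg.lstsq(raw, X2, rcond=None)
print(np.abs(raw @ coef - X2).max())   # expected to be tiny (floating-point error only)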

Prediction¶

Method 1 and Method 2 produce different coefficients, but the fitted values (predictions) are identical.

In [179]:
fit1 = LinearRegression().fit(X1, wage)
fit1.intercept_, fit1.coef_
Out[179]:
(-75.24391422432832,
 array([ 1.01899915e+01, -1.68028587e-01,  8.49452197e-04]))
In [180]:
fit2 = LinearRegression().fit(X2, wage)
fit2.intercept_, fit2.coef_
Out[180]:
(111.70360820174345, array([ 447.06785276, -478.31580585,  125.52168603]))
In [183]:
age_new = np.array([82])
age_new = poly_predict(age_new, 3, alpha, norm2)
fit2.predict(age_new)
Out[183]:
array([98.87192947])
In [191]:
tmp = np.power.outer(82, np.arange(1, 4))
fit1.intercept_ + tmp @ fit1.coef_
Out[191]:
98.87192946555125
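
As a further sanity check (not in the original notebook), the two fits should produce identical fitted values on the training data, and poly_predict applied to the training ages should reproduce the orthogonal basis X2:

In [ ]:
# Both checks should print True (up to floating-point tolerance).
print(np.allclose(fit1.predict(X1), fit2.predict(X2)))
print(np.allclose(poly_predict(age, 3, alpha, norm2), X2))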

Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?

In linear regression, the t-test for a variable gauges its conditional contribution given all the other variables in the model. In both fit1 and fit2, the unique contribution of the cubic term of age comes from the last variable alone, whether it is represented as age^3 or as the third column of poly(age, 3). Thus the p-value for the last variable, which measures the cubic term's contribution after accounting for the other variables, is the same in both models.

Consequently, if your primary concern is the significance of the coefficient on the highest-order term, the choice of design matrix is inconsequential: whether you use the raw polynomial design matrix or the orthogonal one, the significance of the highest-order term is the same.

This insight becomes particularly valuable when employing forward and backward selection methods to determine the optimal d.
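
We can confirm this numerically (a quick check, not in the original notebook): the t-statistic of the last predictor is the same in M1 and M2.

In [ ]:
# The t-statistic (and hence the p-value) of the last predictor
# should match across the two parameterizations.
print(np.asarray(M1.tvalues)[-1], np.asarray(M2.tvalues)[-1])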

Fitted Curve¶

fit1 and fit2 return the identical fitted curve; below we plot it using fit2.

In [135]:
age_grid = np.arange(age.min(), age.max() + 1)
age_grid_poly = poly_predict(age_grid, 3, alpha, norm2)
y_grid = fit2.predict(age_grid_poly)

plt.plot(age_grid, y_grid)
plt.scatter(age, wage, marker='.', alpha=0.5, c='orange')
plt.title("Degree-3 Polynomial")
plt.xlabel("Age")
plt.ylabel("Wage")
Out[135]:
[Figure: scatter of wage versus age with the fitted degree-3 polynomial curve]

Next, we explore how to select the polynomial degree d using forward and backward approaches.

Forward Selection¶

  • Starting with a linear model, we introduce a quadratic term and assess the significance of this newly added variable. Given its significant p-value, we retain the quadratic term.

  • Proceeding, we introduce a cubic term. Once again, its significance necessitates its inclusion.

  • On adding the quartic term (d = 4), the p-value is about 5.1%, right at the edge of our significance threshold. With a strict 5% cutoff, we revert to the cubic model, since the quartic term fails to meet the criterion.

  • Hence, forward selection settles on the cubic regression model. (A compact loop automating this scan is sketched right after this list.)
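
For reference, here is a compact loop (not part of the original notebook) that automates this scan by printing the p-value of the highest-order term for each candidate degree; the same numbers drive the backward pass in the next section.

In [ ]:
# p-value of the highest-order term for each candidate degree d
for d in range(1, 7):
    Xd, _, _ = poly(age, d)
    fit_d = sm.OLS(wage, sm.add_constant(Xd)).fit()
    print(f"d = {d}: p-value of the degree-{d} term = {np.asarray(fit_d.pvalues)[-1]:.4f}")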

In [192]:
d = 2
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[192]:
coef std err t P>|t|
const 111.7036 0.730 152.985 0.0
x1 447.0679 39.993 11.179 0.0
x2 -478.3158 39.993 -11.960 0.0
In [193]:
d = 3
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[193]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002
In [194]:
d = 4
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[194]:
coef std err t P>|t|
const 111.7036 0.729 153.283 0.000
x1 447.0679 39.915 11.201 0.000
x2 -478.3158 39.915 -11.983 0.000
x3 125.5217 39.915 3.145 0.002
x4 -77.9112 39.915 -1.952 0.051

Backward Selection¶

  • In the backward approach, we initiate our analysis with a higher-degree polynomial, say d = 6.
  • Beginning with the highest-degree term, we evaluate the significance of its coefficient to decide whether to keep or drop it. For the sixth-degree term, the p-value is about 12%, suggesting its removal.
  • Reducing to d = 5, the p-value of the fifth-degree term is about 37%, again justifying removal.
  • Continuing this pattern, the quartic term's p-value (about 5.1%) prompts its removal under the strict 5% cutoff.
  • Consequently, both forward and backward approaches converge on d=3 for this dataset.
In [195]:
d = 6
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[195]:
coef std err t P>|t|
const 111.7036 0.729 153.316 0.000
x1 447.0679 39.906 11.203 0.000
x2 -478.3158 39.906 -11.986 0.000
x3 125.5217 39.906 3.145 0.002
x4 -77.9112 39.906 -1.952 0.051
x5 -35.8129 39.906 -0.897 0.370
x6 62.7077 39.906 1.571 0.116
In [196]:
d = 5
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[196]:
coef std err t P>|t|
const 111.7036 0.729 153.278 0.000
x1 447.0679 39.916 11.200 0.000
x2 -478.3158 39.916 -11.983 0.000
x3 125.5217 39.916 3.145 0.002
x4 -77.9112 39.916 -1.952 0.051
x5 -35.8129 39.916 -0.897 0.370
In [197]:
d = 4
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[197]:
coef std err t P>|t|
const 111.7036 0.729 153.283 0.000
x1 447.0679 39.915 11.201 0.000
x2 -478.3158 39.915 -11.983 0.000
x3 125.5217 39.915 3.145 0.002
x4 -77.9112 39.915 -1.952 0.051
In [198]:
d = 3
tmp, alpha, norm2 = poly(age, d)
summarize(sm.OLS(wage, sm.add_constant(tmp)).fit())
Out[198]:
coef std err t P>|t|
const 111.7036 0.729 153.211 0.000
x1 447.0679 39.933 11.195 0.000
x2 -478.3158 39.933 -11.978 0.000
x3 125.5217 39.933 3.143 0.002