Load Data

Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.

url = "https://liangfgithub.github.io/Data/Wage.csv"
Wage = read.csv(url)
dim(Wage)
names(Wage)
attach(Wage)

Fit a Polynomial Regression Model

There are two ways to specify the design matrix for polynomial regression in R:

  • Method 1: Manually specify each term of the polynomial. For example, a cubic model includes the terms x, x^2, and x^3.
  • Method 2: Use the poly function in R. This is more compact and efficient, especially when the degree d is large. For example, poly(age, 3) directly creates a cubic polynomial model.

(The design matrix for Method 1 can also be generated by poly with the option raw = TRUE.)
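As a quick sanity check (illustrative, not part of the original lab; an arbitrary numeric vector stands in for age), poly with raw = TRUE reproduces the manually specified columns exactly:

```r
x = c(25, 30, 42, 57, 63)            # stand-in for the age variable
raw = poly(x, 3, raw = TRUE)         # columns are x, x^2, x^3
stopifnot(all.equal(unname(raw[, 1]), x),
          all.equal(unname(raw[, 2]), x^2),
          all.equal(unname(raw[, 3]), x^3))
```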

When you use the poly function, like poly(age, 3), you are not directly getting age, age^2, and age^3. Instead, you get three orthogonalized (and standardized) polynomial variables.

For example, poly(age, 3) returns an n-by-3 matrix:

  • Each column has mean zero and unit norm, and the columns are orthogonal to each other, so the cross-product t(tmp) %*% tmp computed below is the identity matrix.
  • The 1st column is a linear combination of age and the intercept, capturing the linear component of the trend.
  • The 2nd column is a linear combination of age^2, age, and the intercept, capturing the quadratic trend.
  • The 3rd column is a linear combination of age^3, age^2, age, and the intercept, capturing the cubic trend.
tmp = poly(age, 3)
dim(tmp)
colMeans(tmp)
round(t(tmp) %*% tmp, dig=4)

Method 1 and Method 2 produce different coefficients, but the fitted values (predictions) are identical.

fit1 = lm(wage ~ age + I(age^2) + I(age^3), data=Wage)
round(summary(fit1)$coef, dig = 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -75.244     22.184  -3.392    0.001
## age           10.190      1.605   6.348    0.000
## I(age^2)      -0.168      0.037  -4.559    0.000
## I(age^3)       0.001      0.000   3.143    0.002
fit2 = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit2)$coef, dig = 3)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    111.704      0.729 153.211    0.000
## poly(age, 3)1  447.068     39.933  11.195    0.000
## poly(age, 3)2 -478.316     39.933 -11.978    0.000
## poly(age, 3)3  125.522     39.933   3.143    0.002
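The equality of the fitted values is not special to the Wage data: the two design matrices span the same column space, so least squares produces the same projection. A minimal simulated check (x and y below are made-up stand-ins for age and wage):

```r
set.seed(1)
x = runif(100, 18, 80)                 # stand-in for age
y = 50 + 0.5 * x + rnorm(100, sd = 5)  # stand-in for wage
m1 = lm(y ~ x + I(x^2) + I(x^3))       # Method 1: raw polynomial terms
m2 = lm(y ~ poly(x, 3))                # Method 2: orthogonal polynomials
all.equal(fitted(m1), fitted(m2))      # TRUE: same column space, same fit
```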

Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?

In linear regression, the t-test for a variable gauges its conditional contribution, i.e., what it adds beyond the other variables in the model. In both fit1 and fit2, the unique contribution of the cubic term of age comes only from the last variable, whether it is represented as age^3 or as the third column of poly(age, 3). Thus the p-value for the last variable, which measures the cubic term's contribution after accounting for the lower-order terms, is the same in both models.

Thus, if your primary concern is the significance of the highest-order term, the choice of design matrix is inconsequential: whether you use the raw design matrix or the orthogonal polynomial design matrix, the significance of the highest-order term is the same.
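An equivalent way to see this (a supplementary check on simulated data, not part of the original lab): the t-test on the highest-order term is the same test as the partial F-test comparing the degree d-1 and degree d models, with F equal to the square of the t statistic.

```r
set.seed(1)
x = runif(200, 18, 80)
y = 50 + 0.5 * x - 0.005 * x^2 + rnorm(200, sd = 5)
fit_q = lm(y ~ poly(x, 2))              # degree-2 model
fit_c = lm(y ~ poly(x, 3))              # degree-3 model
p_t = summary(fit_c)$coef[4, 4]         # t-test p-value for the cubic term
p_F = anova(fit_q, fit_c)[2, "Pr(>F)"]  # partial F-test for adding the cubic term
all.equal(p_t, p_F)                     # TRUE
```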

This insight becomes particularly valuable when employing forward and backward selection methods to determine the optimal d. 

Prediction

Predict the wage at age = 82. fit1 and fit2 should give you the same answer.

c(predict(fit1, newdata = list(age=82)), 
  predict(fit2, newdata = list(age=82)))
##        1        1 
## 98.87193 98.87193

fit1 and fit2 should return an identical fitted curve.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
#preds = predict(fit1, newdata = list(age = age.grid), se=TRUE)
preds = predict(fit2, newdata = list(age = age.grid), se=TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col="darkgrey")
title("Degree-3 Polynomial")
lines(age.grid, preds$fit, lwd=2, col="blue")
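To also visualize uncertainty, a common extension is to overlay approximate 95% pointwise confidence bands (fitted curve plus/minus two standard errors). A self-contained sketch on simulated data (x and y are stand-ins; with the objects above, preds$se.fit and age.grid would be used the same way):

```r
set.seed(1)
x = runif(300, 18, 80)                       # stand-in for age
y = 0.5 * x + rnorm(300, sd = 10)            # stand-in for wage
fit.sim = lm(y ~ poly(x, 3))
x.grid = seq(min(x), max(x))
preds.sim = predict(fit.sim, newdata = list(x = x.grid), se = TRUE)
se.bands = cbind(preds.sim$fit + 2 * preds.sim$se.fit,   # upper band
                 preds.sim$fit - 2 * preds.sim$se.fit)   # lower band
plot(x, y, pch = '.', cex = 2, col = "darkgrey")
lines(x.grid, preds.sim$fit, lwd = 2, col = "blue")
matlines(x.grid, se.bands, lwd = 1, col = "blue", lty = 3)
```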

Next, we explore how to select the polynomial degree d using forward and backward approaches.

Forward Selection

  • Starting with a linear model, we introduce a quadratic term and assess the significance of this newly added variable. Given its significant p-value, we retain the quadratic term.

  • Proceeding, we introduce a cubic term. Once again, its significance necessitates its inclusion.

  • On adding the quartic term (where d=4), the p-value hovers around 5%, placing it at the edge of our significance threshold. If sticking to a strict 5% cutoff, we would revert to the cubic model, as the quartic term fails to meet the significance criterion.

  • Hence the forward selection concludes with a preference for the cubic regression model.

summary(lm(wage ~ poly(age, 2), data=Wage))$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)    111.7036   0.730162 152.98470 0.000000e+00
## poly(age, 2)1  447.0679  39.992617  11.17876 1.878131e-28
## poly(age, 2)2 -478.3158  39.992617 -11.96010 3.077420e-32
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.7036  0.7290826 153.211181 0.000000e+00
## poly(age, 3)1  447.0679 39.9334995  11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3  125.5217 39.9334995   3.143268 1.687063e-03
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
## poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
## poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02
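The manual steps above can be wrapped in a loop. Below is a sketch (forward_degree is a hypothetical helper, not part of the original lab): starting from d = 1, keep adding a degree as long as the newly added highest-order term is significant at level alpha.

```r
# Hypothetical helper automating forward selection of the polynomial degree.
forward_degree = function(x, y, max.d = 10, alpha = 0.05) {
  d = 1
  while (d < max.d) {
    fit = lm(y ~ poly(x, d + 1))
    p = summary(fit)$coef[d + 2, 4]   # p-value of the newly added term
    if (p > alpha) break              # new term not significant: stop
    d = d + 1
  }
  d
}
```

On the Wage data, forward_degree(age, wage) stops at d = 3, since the quartic term's p-value (about 0.051) exceeds the 5% cutoff, matching the manual selection above.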

Backward Selection

  • In the backward approach, we initiate our analysis with a higher-degree polynomial, say d = 6.
  • Beginning with the highest-degree term, we evaluate the significance of its coefficient to determine its retention or removal. For the sixth-degree term, the p-value is about 12%, suggesting its removal.
  • On reducing to d = 5, the fifth-degree term’s p-value is about 37%, again justifying removal.
  • Continuing this pattern, the quartic term’s p-value (about 5.1%) sits just above the 5% cutoff, prompting its removal as well.
  • Consequently, both forward and backward approaches converge on d=3 for this dataset.
summary(lm(wage ~ poly(age, 6), data=Wage))$coef
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)    111.70361  0.7285859 153.3156280 0.000000e+00
## poly(age, 6)1  447.06785 39.9062945  11.2029408 1.447983e-28
## poly(age, 6)2 -478.31581 39.9062945 -11.9859739 2.290183e-32
## poly(age, 6)3  125.52169 39.9062945   3.1454107 1.674811e-03
## poly(age, 6)4  -77.91118 39.9062945  -1.9523532 5.098940e-02
## poly(age, 6)5  -35.81289 39.9062945  -0.8974246 3.695646e-01
## poly(age, 6)6   62.70772 39.9062945   1.5713741 1.162015e-01
summary(lm(wage ~ poly(age, 5), data=Wage))$coef
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1  447.06785 39.9160847  11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3  125.52169 39.9160847   3.1446392 1.679213e-03
## poly(age, 5)4  -77.91118 39.9160847  -1.9518743 5.104623e-02
## poly(age, 5)5  -35.81289 39.9160847  -0.8972045 3.696820e-01
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
## poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
## poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)    111.7036  0.7290826 153.211181 0.000000e+00
## poly(age, 3)1  447.0679 39.9334995  11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3  125.5217 39.9334995   3.143268 1.687063e-03
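The backward pass can be automated in the same spirit (backward_degree is a hypothetical companion to the forward helper, not part of the original lab): start at a high degree and drop the highest-order term while it is insignificant.

```r
# Hypothetical helper automating backward selection of the polynomial degree.
backward_degree = function(x, y, start.d = 6, alpha = 0.05) {
  d = start.d
  while (d > 1) {
    fit = lm(y ~ poly(x, d))
    p = summary(fit)$coef[d + 1, 4]   # p-value of the highest-order term
    if (p <= alpha) break             # significant: keep degree d
    d = d - 1
  }
  d
}
```

On the Wage data, backward_degree(age, wage) also returns 3: the degree-6, degree-5, and degree-4 terms are dropped in turn, and the cubic term (p ≈ 0.0017) is retained.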