Load Data

Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.

url = "https://liangfgithub.github.io/Data/Wage.csv"
Wage = read.csv(url)
dim(Wage)
names(Wage)
attach(Wage)

Fit a Polynomial Regression Model

There are two methods to specify the design matrix for polynomial regression in R:

  • Method 1: Manually specify each term of the polynomial. For example, a cubic model would include the terms x, x^2, and x^3.
  • Method 2: Use the poly function in R. This is more compact and efficient, especially when the degree d is large. For example, poly(x, 3) directly creates a cubic polynomial basis.

(The design matrix for Method 1 can be generated by poly with option raw = TRUE.)
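As a quick sanity check (a minimal sketch; tmp.raw is just an illustrative name), you can verify that raw = TRUE returns the plain monomials:

# raw = TRUE returns the untransformed monomials age, age^2, age^3
tmp.raw = poly(age, 3, raw = TRUE)
head(tmp.raw)
max(abs(tmp.raw[, 2] - age^2))   # 0: the 2nd column is exactly age^2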

When you use the poly function, as in poly(age, 3), you are not directly getting age, age^2, and age^3. Instead, you get three orthogonalized (and normalized) polynomial variables.

For example, poly(age, 3) returns an n-by-3 matrix:

  • Each column has mean zero and unit Euclidean norm, and the columns are orthogonal to one another (so t(tmp) %*% tmp below is the identity matrix).
  • The 1st column is a linear combination of “age” and the intercept, capturing the linear component of the trend.
  • The 2nd column is a linear combination of “age^2”, “age”, and the intercept, capturing the quadratic trend.
  • The 3rd column is a linear combination of “age^3”, “age^2”, “age”, and the intercept, capturing the cubic trend.

tmp = poly(age, 3)
dim(tmp)                        # n-by-3
colMeans(tmp)                   # each column has mean (essentially) zero
round(t(tmp) %*% tmp, dig=4)    # identity matrix: orthonormal columns
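To verify the linear-combination claims above, here is a minimal sketch (fit.col2 is an illustrative name): regressing the 2nd column of tmp on age and age^2 reproduces it exactly, so it lies in the span of the intercept, age, and age^2.

# the 2nd column of poly(age, 3) lies in the span of {1, age, age^2}
fit.col2 = lm(tmp[, 2] ~ age + I(age^2))
max(abs(fit.col2$residuals))   # essentially zero (floating-point error)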

Method 1 and Method 2 produce different coefficients, but the fitted values (predictions) are the same.

fit1 = lm(wage ~ age + I(age^2) + I(age^3), data=Wage)
round(summary(fit1)$coef, dig = 3)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -75.244     22.184  -3.392    0.001
## age           10.190      1.605   6.348    0.000
## I(age^2)      -0.168      0.037  -4.559    0.000
## I(age^3)       0.001      0.000   3.143    0.002
fit2 = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit2)$coef, dig = 3)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    111.704      0.729 153.211    0.000
## poly(age, 3)1  447.068     39.933  11.195    0.000
## poly(age, 3)2 -478.316     39.933 -11.978    0.000
## poly(age, 3)3  125.522     39.933   3.143    0.002
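A quick numerical check (a one-line sketch) confirms that the two fits give identical predictions up to floating-point error:

# the two parameterizations give identical fitted values
max(abs(fit1$fitted.values - fit2$fitted.values))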

Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?

In linear regression, the t-test for a variable measures its conditional contribution, i.e., what the variable adds on top of all the other variables in the model. In both fit1 and fit2, the cubic component of age enters only through the last variable, whether it is represented as I(age^3) or as the 3rd column of poly(age, 3). Thus the p-value for the last variable, which measures the unique contribution of the cubic term after accounting for the other terms, is the same in both models.
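You can see this side by side (a small sketch): the last rows of the two coefficient tables share the same t value and p-value, and differ only in the estimate and standard error.

# same t value and p-value in the last row; only the estimate and SE differ
rbind(fit1 = summary(fit1)$coef[4, ],
      fit2 = summary(fit2)$coef[4, ])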

So if your primary concern is the significance of the coefficient of the highest-order term, the choice of design matrix is inconsequential: the raw design matrix and the orthogonal polynomial design matrix lead to the same test.

This insight is particularly valuable when employing forward or backward selection to determine the optimal degree d, as sketched below.
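For example, here is a sketch of a backward-selection-style check (fit.5 is an illustrative name): fit a single degree-5 orthogonal polynomial and read the sequence of tests off one coefficient table. Because of orthogonality, these t-tests agree with the F-tests from comparing nested models with anova(): for the degree-2 through degree-5 terms, the squared t values equal the corresponding F statistics.

# one fit of a degree-5 orthogonal polynomial gives a sequence of t-tests
fit.5 = lm(wage ~ poly(age, 5), data = Wage)
round(summary(fit.5)$coef, dig = 3)

# equivalent nested-model comparison: for the degree-2 through degree-5
# terms, the F statistics below equal the squared t values above
anova(lm(wage ~ poly(age, 1), data = Wage),
      lm(wage ~ poly(age, 2), data = Wage),
      lm(wage ~ poly(age, 3), data = Wage),
      lm(wage ~ poly(age, 4), data = Wage),
      lm(wage ~ poly(age, 5), data = Wage))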

Prediction

Predict the wage at age = 82; fit1 and fit2 should give the same answer. (For fit2, R stores the coefficients defining the orthogonal basis inside the fitted model, so predict applies the same transformation to the new data.)

c(predict(fit1, newdata = list(age=82)), 
  predict(fit2, newdata = list(age=82)))
##        1        1 
## 98.87193 98.87193

fit1 and fit2 should return an identical fitted curve.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
# preds = predict(fit1, newdata = list(age = age.grid), se=TRUE)  # same curve
preds = predict(fit2, newdata = list(age = age.grid), se=TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col = "darkgrey")
title("Degree-3 Polynomial")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
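Since predict was called with se=TRUE, you can also overlay approximate 95% pointwise confidence bands (a sketch; se.bands is an illustrative name):

# approximate 95% pointwise confidence bands: fit +/- 2 standard errors
se.bands = cbind(preds$fit + 2 * preds$se.fit,
                 preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)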