Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.
= "https://liangfgithub.github.io/Data/Wage.csv"
url = read.csv(url) Wage
dim(Wage)
names(Wage)
attach(Wage)
Two methods to specify the design matrix for polynomial regression in R:

Method 1: manually specify each power of age using I(), as in wage ~ age + I(age^2) + I(age^3).

Method 2: use the poly function in R. This is a more compact and efficient way, especially when d is large. For example, poly(age, 3) directly creates a cubic polynomial model. (The design matrix for Method 1 can be generated by poly with the option raw = TRUE.)
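For instance, the snippet below (a quick sketch; fit1b is an illustrative name, not from the original) confirms that raw = TRUE reproduces Method 1:

# With raw = TRUE, poly returns the plain powers age, age^2, age^3, so this
# fit should match Method 1's coefficients exactly. (fit1b is illustrative.)
fit1b = lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
round(summary(fit1b)$coef, dig = 3)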
When you use the poly function, like poly(age, 3), you're not directly getting age, age^2, age^3. Instead, you get three orthogonalized (and standardized) polynomial variables. For example, poly(age, 3) returns an n-by-3 matrix:
tmp = poly(age, 3)
dim(tmp)
colMeans(tmp)
round(t(tmp) %*% tmp, dig=4)
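If poly behaves as described, the columns should be orthonormal with mean zero; here is a quick sanity check (a sketch using base R only):

# Sanity check: column means are essentially zero, and t(tmp) %*% tmp is
# (up to rounding) the 3 x 3 identity matrix.
all.equal(as.numeric(colMeans(tmp)), rep(0, 3))
all.equal(round(t(tmp) %*% tmp, 4), diag(3), check.attributes = FALSE)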
Method 1 and Method 2 would produce different coefficients, but the fitted values (predictions) remain the same.
fit1 = lm(wage ~ age + I(age^2) + I(age^3), data = Wage)
round(summary(fit1)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.244 22.184 -3.392 0.001
## age 10.190 1.605 6.348 0.000
## I(age^2) -0.168 0.037 -4.559 0.000
## I(age^3) 0.001 0.000 3.143 0.002
fit2 = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit2)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.704 0.729 153.211 0.000
## poly(age, 3)1 447.068 39.933 11.195 0.000
## poly(age, 3)2 -478.316 39.933 -11.978 0.000
## poly(age, 3)3 125.522 39.933 3.143 0.002
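A quick way to verify the earlier claim that the two fits produce the same predictions (a sketch):

# Both design matrices span the same column space, so the fitted values agree.
all.equal(as.numeric(fitted(fit1)), as.numeric(fitted(fit2)))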
Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?
In the context of linear regression, the t-test for a variable gauges its conditional contribution, i.e., what it adds beyond the other variables in the model. In both fit1 and fit2, the unique contribution of the cubic term of age comes only from the last variable, irrespective of whether it is represented as I(age^3) or poly(age, 3)3. Thus, the p-value for the last variable in both models, which measures the unique contribution of the cubic term of age, isolated from the influence of the other variables, stays the same.
Thus, if your primary concern is gauging the significance of the coefficient of the highest-order term, the choice of design matrix is inconsequential: whether you use the raw design matrix or the orthogonal polynomial design matrix, the significance of the highest-order term remains the same.
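As a quick check (a sketch using the two fits above), compare the last row of each coefficient table directly:

# The t value and p-value of the highest-order term agree across fit1 and fit2.
round(summary(fit1)$coef[4, 3:4], dig = 3)
round(summary(fit2)$coef[4, 3:4], dig = 3)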
This insight becomes particularly valuable when employing forward and backward selection methods to determine the optimal d.
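For example, here is a minimal sketch of backward selection with orthogonal polynomials (the starting degree 5 and the name fit5 are illustrative choices, not from the original): fit a high-degree polynomial, inspect the p-value of the highest-order term, and drop that degree if it is insignificant, refitting with d - 1 until the top term is significant.

# Backward selection sketch: start at d = 5 (illustrative) and examine the
# last coefficient. Because the poly terms are orthogonal, each t-test
# isolates the unique contribution of its degree.
fit5 = lm(wage ~ poly(age, 5), data = Wage)
round(summary(fit5)$coef, dig = 3)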
Predict the wage at age = 82. fit1 and fit2 should give you the same answer.
c(predict(fit1, newdata = list(age=82)),
predict(fit2, newdata = list(age=82)))
## 1 1
## 98.87193 98.87193
fit1 and fit2 should return an identical fitted curve.
agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
# preds = predict(fit1, newdata = list(age = age.grid), se = TRUE)
preds = predict(fit2, newdata = list(age = age.grid), se = TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col = "darkgrey")
title("Degree-3 Polynomial")
lines(age.grid, preds$fit, lwd = 2, col = "blue")
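If you also want approximate pointwise confidence bands around the fitted curve, the se.fit component returned by predict can be used; a sketch (se.bands is an illustrative name):

# Approximate 95% pointwise bands: fitted curve plus/minus 2 standard errors.
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)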