Load the Wage data (from ISLR, Chapter 7), which contains income and demographic information for males who reside in the central Atlantic region of the United States. We will fit wage (in thousands of dollars) as a function of age.
= "https://liangfgithub.github.io/Data/Wage.csv"
url = read.csv(url) Wage
dim(Wage)
names(Wage)
attach(Wage)
Two methods to specify the design matrix for polynomial regression in R:

Method 1: include each power of age explicitly, e.g., wage ~ age + I(age^2) + I(age^3).

Method 2: use the poly function in R. This is a more compact and efficient way, especially when d is large. For example, poly(X, 3) directly creates a cubic polynomial model. (The design matrix for Method 1 can be generated by poly with the option raw = TRUE.)
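As a quick illustration (a sketch added here, not in the original notes), the two design matrices can be compared side by side; raw = TRUE reproduces the plain powers used in Method 1:

# Method 1 design matrix: plain powers of age
head(cbind(age, age^2, age^3))
head(poly(age, 3, raw = TRUE))   # same columns via poly with raw = TRUE
# Method 2 design matrix: orthogonalized polynomial columns
head(poly(age, 3))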
When you use the poly function, like poly(age, 3), you're not directly getting age, age^2, age^3. Instead, you get three orthogonalized (and standardized) polynomial variables. For example, poly(age, 3) returns an n-by-3 matrix:
tmp = poly(age, 3)
dim(tmp)
colMeans(tmp)
round(t(tmp) %*% tmp, dig=4)
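As a small sanity check (a sketch, not part of the original code), the columns of tmp are numerically orthonormal with mean zero, so the cross-product above is essentially the 3-by-3 identity matrix:

all.equal(t(tmp) %*% tmp, diag(3), check.attributes = FALSE)   # should be TRUE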
Method 1 and Method 2 would produce different coefficients, but the fitted values (predictions) remain the same.
fit1 = lm(wage ~ age + I(age^2) + I(age^3), data=Wage)
round(summary(fit1)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -75.244 22.184 -3.392 0.001
## age 10.190 1.605 6.348 0.000
## I(age^2) -0.168 0.037 -4.559 0.000
## I(age^3) 0.001 0.000 3.143 0.002
fit2 = lm(wage ~ poly(age, 3), data = Wage)
round(summary(fit2)$coef, dig = 3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.704 0.729 153.211 0.000
## poly(age, 3)1 447.068 39.933 11.195 0.000
## poly(age, 3)2 -478.316 39.933 -11.978 0.000
## poly(age, 3)3 125.522 39.933 3.143 0.002
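To confirm that the two parameterizations give the same fit, a quick check (a sketch added here, not in the original notes):

# Different coefficients, but identical fitted values
all.equal(fitted(fit1), fitted(fit2))
max(abs(fitted(fit1) - fitted(fit2)))   # essentially zero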
Note that although the coefficients in fit1 and fit2 are different, the t-value and p-value for the last predictor remain identical. Why is this the case?

In the context of linear regression, the t-test for a variable gauges its conditional contribution given all the other variables in the model. In both fit1 and fit2, the unique contribution of the cubic term of age comes only from the last variable, irrespective of whether it is represented as age^3 or as the third column of poly(age, 3). Thus, the p-value for the last variable in both models, which measures the unique contribution of the cubic term of age after accounting for the other variables, stays the same.
Thus, if your primary concern is gauging the significance of the coefficient of the highest-order term, the choice of design matrix is inconsequential: whether you use the direct design matrix or the orthogonal polynomial design matrix, the significance of the highest-order term remains the same.
This insight becomes particularly valuable when employing forward and backward selection methods to determine the optimal d.
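For a numerical illustration (a sketch, not in the original notes; the helper fits fit1_quad and fit2_quad are names introduced here), the conditional contribution of the cubic term can also be measured by a partial F-test comparing the quadratic and cubic models. For a single added term, F equals the square of the t-statistic of the last coefficient, under either parameterization:

# Partial F-tests for the cubic term: same F and p-value under both encodings
fit1_quad = lm(wage ~ age + I(age^2), data = Wage)
fit2_quad = lm(wage ~ poly(age, 2), data = Wage)
anova(fit1_quad, fit1)
anova(fit2_quad, fit2)
summary(fit2)$coef[4, "t value"]^2   # equals the F statistic above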
Predict the wage at age = 82. fit1 and fit2 should give you the same answer.
c(predict(fit1, newdata = list(age=82)),
predict(fit2, newdata = list(age=82)))
## 1 1
## 98.87193 98.87193
fit1 and fit2 should return an identical fitted curve.

agelims = range(age)
age.grid = seq(from = agelims[1], to = agelims[2])
#preds = predict(fit1, newdata = list(age = age.grid), se=TRUE)
preds = predict(fit2, newdata = list(age = age.grid), se=TRUE)
plot(age, wage, xlim = agelims, pch = '.', cex = 2, col="darkgrey")
title("Degree-3 Polynomial")
lines(age.grid, preds$fit, lwd=2, col="blue")
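Since the standard errors were requested via se=TRUE, approximate pointwise confidence bands can be added to the plot; a sketch (the object se.bands is a name introduced here):

# Approximate 95% pointwise confidence bands: fit +/- 2 standard errors
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, lwd = 1, col = "blue", lty = 3)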
Next, we explore how to select the polynomial degree d using forward and backward approaches.
Starting with a linear model, we introduce a quadratic term and assess the significance of this newly added variable. Given its significant p-value, we retain the quadratic term.
Proceeding, we introduce a cubic term. Once again, its significance necessitates its inclusion.
On adding the quartic term (where d=4), the p-value hovers around 5%, placing it at the edge of our significance threshold. If sticking to a strict 5% cutoff, we would revert to the cubic model, as the quartic term fails to meet the significance criterion.
Hence the forward selection concludes with a preference for the cubic regression model.
summary(lm(wage ~ poly(age, 2), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.730162 152.98470 0.000000e+00
## poly(age, 2)1 447.0679 39.992617 11.17876 1.878131e-28
## poly(age, 2)2 -478.3158 39.992617 -11.96010 3.077420e-32
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7290826 153.211181 0.000000e+00
## poly(age, 3)1 447.0679 39.9334995 11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3 125.5217 39.9334995 3.143268 1.687063e-03
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
For the backward approach, we start with a larger model, say d = 6, and repeatedly examine the highest-order term, dropping it and refitting whenever it is not significant. As shown below, the degree-6 and degree-5 terms are insignificant and the degree-4 term is borderline, so the backward selection also ends at the cubic model.
summary(lm(wage ~ poly(age, 6), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7285859 153.3156280 0.000000e+00
## poly(age, 6)1 447.06785 39.9062945 11.2029408 1.447983e-28
## poly(age, 6)2 -478.31581 39.9062945 -11.9859739 2.290183e-32
## poly(age, 6)3 125.52169 39.9062945 3.1454107 1.674811e-03
## poly(age, 6)4 -77.91118 39.9062945 -1.9523532 5.098940e-02
## poly(age, 6)5 -35.81289 39.9062945 -0.8974246 3.695646e-01
## poly(age, 6)6 62.70772 39.9062945 1.5713741 1.162015e-01
summary(lm(wage ~ poly(age, 5), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
## poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
## poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
summary(lm(wage ~ poly(age, 4), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287409 153.283015 0.000000e+00
## poly(age, 4)1 447.06785 39.9147851 11.200558 1.484604e-28
## poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
## poly(age, 4)3 125.52169 39.9147851 3.144742 1.678622e-03
## poly(age, 4)4 -77.91118 39.9147851 -1.951938 5.103865e-02
summary(lm(wage ~ poly(age, 3), data=Wage))$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.7036 0.7290826 153.211181 0.000000e+00
## poly(age, 3)1 447.0679 39.9334995 11.195309 1.570802e-28
## poly(age, 3)2 -478.3158 39.9334995 -11.977808 2.511784e-32
## poly(age, 3)3 125.5217 39.9334995 3.143268 1.687063e-03
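To wrap up, a compact sketch (added here, not in the original notes) that automates the forward search: keep increasing d as long as the newly added highest-order term is significant at the 5% level. Because the p-value of the highest-order term does not depend on the parameterization, poly() can be used throughout.

# Forward search over the polynomial degree d; ends at d = 3 for this data
d = 1
repeat {
  fit = lm(wage ~ poly(age, d + 1), data = Wage)
  pval = summary(fit)$coef[d + 2, 4]   # p-value of the (d+1)-th order term
  if (pval > 0.05) break
  d = d + 1
}
d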