Let’s delve into fitting a regression spline model in R. The foundational package we’ll use is splines, with two pivotal commands: bs and ns. These commands allow us to generate the design matrix for cubic splines and natural cubic splines, respectively.
library(splines);
library(ggplot2)
help(bs)
help(ns)
To get a grasp on what the spline basis functions resemble, consider the scenario with five knots. For a cubic spline this gives nine degrees of freedom (five knots plus four), implying we should expect nine basis functions. These knots are equally spaced, and their locations are printed below.
x = (1:199)/100;
n = length(x)
m = 5;
myknots = 2*(1:m)/(m+1)
myknots
## [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
Among the nine basis functions referenced in the notes, the initial four are the basis functions for a global cubic polynomial: the intercept, \(x\), \(x^2\), and \(x^3\). The latter five are truncated power functions \((x - \xi_i)_+^3\), one situated at each of the five knots: to the left of knot \(\xi_i\), the function remains at zero; to its right, it takes on a cubic form.
In the plot below, the dashed lines are the five truncated cubic functions (with the knots marked by solid dots), while the solid lines denote the global cubic polynomial basis functions.
X = cbind(1, x, x^2, x^3);
for(i in 1:m){
    tmp = (x - myknots[i])^3;
    tmp[tmp < 0] = 0;                       # truncate: zero to the left of the knot
    X = cbind(X, tmp);
}
plot(c(0,2), range(X), type="n", xlab="", ylab="")
title("Truncated Power Basis")
for(i in 1:(m+4)){
    tmp = X[,i];
    if (i <= 4) mylty = 1 else mylty = 2    # solid: global polys; dashed: truncated
    lines(x[tmp != 0], tmp[tmp != 0], col=i, lty=mylty, lwd=2)
}
for(i in 1:m){
    points(myknots[i], 0, pch=19, cex=2)    # mark the knot locations
}
When using the bs command in R to generate the design matrix for a cubic spline with these knots, R creates a matrix with nine columns, counting the intercept. Each column corresponds to a basis function, giving rise to the B-spline basis functions. An advantage of B-splines is their compact support: each basis function is non-zero only over a few adjacent intervals between knots. This results in a banded design matrix, where many elements are zero, streamlining computations.
F = bs(x, knots = myknots, intercept = TRUE)
dim(F)
## [1] 199 9
mydf = m+4;
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
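We can check the compact-support claim numerically: at any x, at most degree + 1 = 4 of the cubic B-spline basis functions are non-zero, and the B-spline basis spans the same nine-dimensional space as the truncated power basis X built earlier. A quick sketch, reusing the F and X from above:

mean(F == 0)             # a sizable fraction of the design matrix is zero
table(rowSums(F != 0))   # at most 4 basis functions are active at any x
qr(cbind(X, F))$rank     # rank 9: the two bases span the same space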
If intercept = TRUE is not set, bs will return 9 - 1 = 8 columns.
F = bs(x, knots = myknots)
dim(F)
mydf = m+3;
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
To inspect the basis functions of a natural cubic spline, utilize the ns command.
F = ns(x, knots=myknots, Boundary.knots=c(0,2), intercept=TRUE)
#dim(F)
mydf = 7
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
Moving on, we’ll apply regression splines to a dataset labeled “The Birthrate Data”. This dataset documents the number of births per 2,000 people. Notable patterns include the birthrate dip during the Great Depression, followed by a rise leading up to World War II, and a subsequent gradual decline. There are evident spikes, possibly reflecting birth surges from the ’60s impacting subsequent decades. An economist or social scientist might offer insights into this dataset’s narrative. For our purposes, it serves as a practical example.
From a statistical standpoint, it’s evident that a linear model isn’t optimal for mapping years against rates. Regression splines might be a better fit.
= "https://liangfgithub.github.io/Data/birthrates.csv"
url = read.csv(url)
birthrates names(birthrates) = c("year", "rate")
ggplot(birthrates, aes(x=year, y=rate)) +
geom_point() + geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
We can use bs or ns to generate the design matrix F.
In R, you can either specify the location of knots directly or provide the desired degrees of freedom (df).
However, R’s method of counting the df can be perplexing. In these commands, the df argument actually denotes the number of columns of the design matrix returned by the command. If an intercept isn’t included in the matrix (the default behavior), the df argument equals the actual degrees of freedom minus one.
Let’s consider several R commands to elucidate this.
The commands below fit four spline models; intriguingly, the first three are identical. The purpose is to illustrate how R counts the degrees of freedom: fit1, fit2, and fit3 all describe a cubic spline with two interior knots at the 1/3 and 2/3 quantiles of year (actual df = 2 + 4 = 6), whereas fit4, with df=5 and the intercept counted, has only one interior knot.
fit1 = lm(rate~bs(year, knots=quantile(year, c(1/3, 2/3))),
          data=birthrates);
fit2 = lm(rate~bs(year, df=5), data=birthrates);
fit3 = lm(rate~bs(year, df=6, intercept=TRUE), data=birthrates)
fit4 = lm(rate~bs(year, df=5, intercept=TRUE), data=birthrates)

plot(birthrates$year, birthrates$rate, ylim=c(90,280))
lines(spline(birthrates$year, predict(fit1)), col="red", lty=1)
lines(spline(birthrates$year, predict(fit2)), col="blue", lty=2)
lines(spline(birthrates$year, predict(fit3)), col="green", lty=3)
lines(spline(birthrates$year, predict(fit4)), lty=2, lwd=2)
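To confirm this directly, one could compare the fitted values and inspect the knots R chose (bs stores them in the "knots" attribute of the returned matrix); a quick sketch:

max(abs(predict(fit1) - predict(fit2)))   # ~ 0: identical fits
max(abs(predict(fit2) - predict(fit3)))   # ~ 0: identical fits
dim(bs(birthrates$year, df=5))                   # 5 columns, no intercept
dim(bs(birthrates$year, df=6, intercept=TRUE))   # 6 columns, with intercept
attr(bs(birthrates$year, df=5), "knots")  # the two interior knots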
Alternatively, you can predict the spline fit on a fine grid of points and then connect the predictions:
plot(birthrates$year, birthrates$rate, ylim=c(90,280))
year.grid = seq(from=min(birthrates$year), to=max(birthrates$year), length=200)
ypred = predict(fit1, data.frame(year=year.grid))
lines(year.grid, ypred, col="blue", lwd=2)
For natural cubic splines, there’s an intriguing detail to consider. Let’s examine an interval [a, b] populated with knots represented as \(\xi_1, \xi_2, \dots, \xi_m\).
The behavior of the functions at the two extreme intervals is uniquely determined by their neighboring functions. Specifically, the function spanning \([a, \xi_1]\) is governed by the behavior of the function within the interval \([\xi_1, \xi_2]\). Why is this the case?
Within the interval \([a, \xi_1]\), the function is linear and is determined by two parameters: its value and its first derivative at the boundary knot \(\xi_1\). Because spline functions are continuous up to the second derivative, both of these parameters can be inferred from the function on the adjacent interval \([\xi_1, \xi_2]\).
A similar relationship holds for the function on the interval \([\xi_m, b]\), which is shaped by the function’s behavior on \([\xi_{m-1}, \xi_m]\).
Hence, it’s unnecessary to position data points in these outer intervals. By default, R places boundary knots \(\xi_1\) and \(\xi_m\) at the dataset’s minimum and maximum values.
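This linearity beyond the boundary knots can be checked numerically. A small sketch, reusing the toy myknots from the basis-function example above: on an equally spaced grid to the right of the boundary knot at 2, the second differences of every ns basis function should vanish.

g = seq(2.2, 3, length=50)         # grid beyond the right boundary knot
Fg = ns(g, knots=myknots, Boundary.knots=c(0,2), intercept=TRUE)
max(abs(diff(Fg, differences=2)))  # ~ 0: every basis function is linear out here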
fit1 = lm(rate~ns(year, knots=quantile(year, (1:4)/5)), data=birthrates);
fit2 = lm(rate~ns(year, df=5), data=birthrates);
fit3 = lm(rate~ns(year, df=6, intercept=TRUE), data=birthrates)

plot(birthrates$year, birthrates$rate, ylim=c(90,280))
lines(spline(birthrates$year, predict(fit1)), col="red", lty=1)
lines(spline(birthrates$year, predict(fit2)), col="blue", lty=2)
lines(spline(birthrates$year, predict(fit3)), col="green", lty=3)
Make predictions outside the data range.
new = data.frame(year=1905:2015);
fit1 = lm(rate~bs(year, df=7), data=birthrates);
pred1 = predict(fit1, new);
## Warning in bs(year, degree = 3L, knots = c(`20%` = 1934.2, `40%` = 1951.4, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
fit2 = lm(rate~ns(year, df=7), data=birthrates);
pred2 = predict(fit2, new);
plot(birthrates$year, birthrates$rate, xlim=c(1905,2015),
     ylim=c(min(pred1,pred2), max(pred1,pred2)),
     ylab="Birth Rate", xlab="Year")
lines(new$year, pred1, col="red")
lines(new$year, pred2, col="blue")
legend("bottomleft", lty=rep(1,2), col=c("red", "blue"),
       legend=c("CS with df=7", "NCS with df=7"))
The placement of knots significantly influences the performance of a spline model. However, handpicking the precise locations of these knots can be computationally burdensome. A more efficient strategy involves positioning the knots at quantiles of x, allowing us to focus solely on determining the optimal number of knots or, equivalently, the degrees of freedom (df).
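For example, five interior knots at equally spaced quantiles of year can be obtained as follows (the same scheme the CV code below uses):

m = 5
quantile(birthrates$year, (1:m)/(m+1))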
Try cubic splines with different degrees of freedom:
plot(birthrates$year, birthrates$rate, ylim=c(90,280));
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=7), data=birthrates))), col="blue");
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=14), data=birthrates))), col="red");
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=19), data=birthrates))), col="black");
# the legend reports the actual df, i.e., the df argument plus one,
# since bs() without intercept=TRUE does not count the intercept
legend("topright", lty=rep(1,3), col=c("blue", "red", "black"),
       legend=c("df=8", "df=15", "df=20"))
Recall that we use hypothesis testing to determine the optimal polynomial degree. The process involves iteratively increasing the polynomial degree; at each step, we use a t-test (which here is equivalent to an F-test) to contrast two models: the current model and a bigger model with one more degree.
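As a refresher, that nested comparison looks like the following sketch (the degrees 3 and 4 are only for illustration):

# polynomial models of successive degrees ARE nested, so anova() applies
fit.a = lm(rate ~ poly(year, 3), data=birthrates)
fit.b = lm(rate ~ poly(year, 4), data=birthrates)
anova(fit.a, fit.b)   # F-test for adding the degree-4 term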
Can we utilize this same F-test approach to select the optimal number of knots in spline regression? The answer is no. The primary reason is that the models being compared are not nested: a spline with m knots placed at quantiles is generally not a special case of a spline with m + 1 knots placed at different quantiles. Consequently, the F-test, which relies on comparing nested models, is unsuitable for this purpose.
To address this, we employ 10-fold cross-validation (CV). We start by dividing the data into 10 folds.
set.seed(1234)
K = 10
n = nrow(birthrates)
fold.size = c(rep(9, 7), rep(8, 3))
fold.id = rep(1:K, fold.size)
fold.id
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
## [26] 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 6 6 6 6 6
## [51] 6 6 6 6 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9
## [76] 9 9 9 9 10 10 10 10 10 10 10 10
fold.id = fold.id[sample(1:n, n)]   # randomly shuffle the fold assignments
fold.id
## [1] 4 10 3 1 1 5 2 1 8 9 2 7 7 10 9 3 5 9 8 10 8 6 8 7 3
## [26] 6 5 1 3 5 6 6 4 1 4 5 5 1 2 10 1 4 6 9 3 2 7 5 10 4
## [51] 10 7 5 2 9 3 2 3 7 3 2 9 8 8 8 7 10 1 10 4 8 2 2 7 6
## [76] 1 9 4 6 4 3 4 5 9 6 7 6
mydf = 10:30
mycv = rep(0, length(mydf))

for(i in 1:length(mydf)){
    m = mydf[i] - 4;   # number of interior knots for this df
    for(k in 1:K){
        id = which(fold.id == k);
        myknots = quantile(birthrates$year[-id], (1:m)/(m+1))
        myfit = lm(rate ~ bs(year, knots=myknots),
                   data=birthrates[-id,])
        ypred = predict(myfit, newdata=birthrates[id,])
        mycv[i] = mycv[i] + sum((birthrates$rate[id] - ypred)^2)
    }
}
plot(mydf, mycv)
If you re-run the 10-fold CV (with a different random split), the plot of mydf versus mycv may vary, but it shouldn’t look too different.
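A natural follow-up, sketched under the same setup (the names best.df and fit.best are just for illustration): pick the df that minimizes the CV error and refit on the full data.

best.df = mydf[which.min(mycv)]   # df with the smallest CV error
m = best.df - 4
myknots = quantile(birthrates$year, (1:m)/(m+1))
fit.best = lm(rate ~ bs(year, knots=myknots), data=birthrates)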