Let’s delve into fitting a regression spline model in R. The foundational package we’ll use is splines, with two pivotal commands: bs and ns. These commands allow us to generate the design matrix for cubic splines and natural cubic splines, respectively.
library(splines);
library(ggplot2)
help(bs)
help(ns)
To get a grasp on what the spline basis functions resemble, consider the scenario with five knots. For a cubic spline this gives nine degrees of freedom (five knots plus four), implying we should expect nine basis functions. These knots are equally spaced, and their locations are printed below.
x = (1:199)/100;
n = length(x)
m = 5;
myknots = 2*(1:m)/(m+1)
myknots
## [1] 0.3333333 0.6666667 1.0000000 1.3333333 1.6666667
Among the nine basis functions referenced in the notes, the initial four are the basis functions for a global cubic polynomial: the intercept, \(x\), \(x^2\), and \(x^3\). The latter five are truncated power functions \((x - \xi_i)_+^3\), one situated at each of the five knots: to the left of knot \(\xi_i\), the function remains at zero; to its right, it takes on a cubic form.
In the plot below, the dashed lines are the five truncated cubic functions (with the knots marked by solid dots), while the solid lines denote the global cubic polynomial basis functions.
X = cbind(1, x, x^2, x^3);
for(i in 1:m){
    tmp = (x - myknots[i])^3;
    tmp[tmp < 0] = 0;                       # truncate: zero to the left of the knot
    X = cbind(X, tmp);
}
plot(c(0,2), range(X), type="n", xlab="", ylab="")
title("Truncated Power Basis")
for(i in 1:(m+4)){
    tmp = X[,i];
    if (i <= 4) mylty = 1 else mylty = 2    # solid: global polys; dashed: truncated
    lines(x[tmp != 0], tmp[tmp != 0], col=i, lty=mylty, lwd=2)
}
for(i in 1:m){
    points(myknots[i], 0, pch=19, cex=2)    # mark the knot locations
}
When using the bs command in R to generate the design matrix for a cubic spline with these knots, R creates a matrix with nine columns, counting the intercept. Each column corresponds to a basis function, giving rise to the B-spline basis functions. An advantage of B-splines is their compact support: each basis function is non-zero only over a few adjacent intervals between knots. This results in a banded design matrix, where many elements are zero, streamlining computations.
F = bs(x, knots = myknots, intercept = TRUE)
dim(F)
## [1] 199 9
mydf = m+4;
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
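We can check the compact-support claim numerically: at any x, at most degree + 1 = 4 of the cubic B-spline basis functions are non-zero, and the B-spline basis spans the same nine-dimensional space as the truncated power basis X built earlier. A quick sketch, reusing the F and X from above:

mean(F == 0)             # a sizable fraction of the design matrix is zero
table(rowSums(F != 0))   # at most 4 basis functions are active at any x
qr(cbind(X, F))$rank     # rank 9: the two bases span the same space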
If intercept = TRUE is not set, bs will return 9 - 1 = 8 columns.
F = bs(x, knots = myknots)
dim(F)
mydf = m+3;
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
To inspect the basis functions of a natural cubic spline, utilize the ns command.
F = ns(x, knots=myknots, Boundary.knots=c(0,2), intercept=TRUE)
#dim(F)
mydf = 7
tmpdata = data.frame(t = rep(1:n, mydf),
                     basisfunc = as.vector(F),
                     type = as.factor(rep(1:mydf, each=n)))
ggplot(tmpdata, aes(x=t, y=basisfunc, color=type)) +
    geom_path()
Moving on, we’ll apply regression splines to a dataset labeled “The Birthrate Data”. This dataset documents the number of births per 2,000 people. Notable patterns include the birthrate dip during the Great Depression, followed by a rise leading up to World War II, and a subsequent gradual decline. There are evident spikes, possibly reflecting birth surges from the ’60s impacting subsequent decades. An economist or social scientist might offer insights into this dataset’s narrative. For our purposes, it serves as a practical example.
From a statistical standpoint, it’s evident that a linear model isn’t optimal for mapping years against rates. Regression splines might be a better fit.
= "https://liangfgithub.github.io/Data/birthrates.csv"
url = read.csv(url)
birthrates names(birthrates) = c("year", "rate")
ggplot(birthrates, aes(x=year, y=rate)) +
geom_point() + geom_smooth(method="lm", se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
We can use bs or ns to generate the design matrix F.
In R, you can either specify the location of knots directly or provide the desired degrees of freedom (df).
However, R’s method of counting the df can be perplexing. In these commands, the df argument actually denotes the number of columns of the design matrix returned by the command. If an intercept isn’t included in the matrix (the default behavior), the df argument equals the actual degrees of freedom minus one.
Let’s consider several R commands to elucidate this.
The commands below fit four spline models; intriguingly, the first three are identical. The purpose is to illustrate how R counts the degrees of freedom: fit1, fit2, and fit3 all describe a cubic spline with two interior knots at the 1/3 and 2/3 quantiles of year (actual df = 2 + 4 = 6), whereas fit4, with df=5 and the intercept counted, has only one interior knot.
fit1 = lm(rate~bs(year, knots=quantile(year, c(1/3, 2/3))),
          data=birthrates);
fit2 = lm(rate~bs(year, df=5), data=birthrates);
fit3 = lm(rate~bs(year, df=6, intercept=TRUE), data=birthrates)
fit4 = lm(rate~bs(year, df=5, intercept=TRUE), data=birthrates)

plot(birthrates$year, birthrates$rate, ylim=c(90,280))
lines(spline(birthrates$year, predict(fit1)), col="red", lty=1)
lines(spline(birthrates$year, predict(fit2)), col="blue", lty=2)
lines(spline(birthrates$year, predict(fit3)), col="green", lty=3)
lines(spline(birthrates$year, predict(fit4)), lty=2, lwd=2)
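To confirm this directly, one could compare the fitted values and inspect the knots R chose (bs stores them in the "knots" attribute of the returned matrix); a quick sketch:

max(abs(predict(fit1) - predict(fit2)))   # ~ 0: identical fits
max(abs(predict(fit2) - predict(fit3)))   # ~ 0: identical fits
dim(bs(birthrates$year, df=5))                   # 5 columns, no intercept
dim(bs(birthrates$year, df=6, intercept=TRUE))   # 6 columns, with intercept
attr(bs(birthrates$year, df=5), "knots")  # the two interior knots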
Alternatively, you can predict the spline fit on a fine grid of points and then connect the predictions:
plot(birthrates$year, birthrates$rate, ylim=c(90,280))
year.grid = seq(from=min(birthrates$year), to=max(birthrates$year), length=200)
ypred = predict(fit1, data.frame(year=year.grid))
lines(year.grid, ypred, col="blue", lwd=2)
For natural cubic splines, there’s an intriguing detail to consider. Let’s examine an interval [a, b] populated with knots represented as \(\xi_1, \xi_2, \dots, \xi_m\).
The behavior of the functions at the two extreme intervals is uniquely determined by their neighboring functions. Specifically, the function spanning \([a, \xi_1]\) is governed by the behavior of the function within the interval \([\xi_1, \xi_2]\). Why is this the case?
Within the interval \([a, \xi_1]\), the function is linear and is determined by two parameters: its value and its first derivative at the boundary knot \(\xi_1\). Because spline functions are continuous up to the second derivative, both of these parameters can be inferred from the function on the adjacent interval \([\xi_1, \xi_2]\).
A similar relationship holds for the function on the interval \([\xi_m, b]\), which is shaped by the function’s behavior on \([\xi_{m-1}, \xi_m]\).
Hence, it’s unnecessary to position data points in these outer intervals. By default, R places boundary knots \(\xi_1\) and \(\xi_m\) at the dataset’s minimum and maximum values.
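This linearity beyond the boundary knots can be checked numerically. A small sketch, reusing the toy myknots from the basis-function example above: on an equally spaced grid to the right of the boundary knot at 2, the second differences of every ns basis function should vanish.

g = seq(2.2, 3, length=50)         # grid beyond the right boundary knot
Fg = ns(g, knots=myknots, Boundary.knots=c(0,2), intercept=TRUE)
max(abs(diff(Fg, differences=2)))  # ~ 0: every basis function is linear out here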
fit1 = lm(rate~ns(year, knots=quantile(year, (1:4)/5)), data=birthrates);
fit2 = lm(rate~ns(year, df=5), data=birthrates);
fit3 = lm(rate~ns(year, df=6, intercept=TRUE), data=birthrates)

plot(birthrates$year, birthrates$rate, ylim=c(90,280))
lines(spline(birthrates$year, predict(fit1)), col="red", lty=1)
lines(spline(birthrates$year, predict(fit2)), col="blue", lty=2)
lines(spline(birthrates$year, predict(fit3)), col="green", lty=3)
Make predictions outside the data range.
new = data.frame(year=1905:2015);
fit1 = lm(rate~bs(year, df=7), data=birthrates);
pred1 = predict(fit1, new);
## Warning in bs(year, degree = 3L, knots = c(`20%` = 1934.2, `40%` = 1951.4, :
## some 'x' values beyond boundary knots may cause ill-conditioned bases
fit2 = lm(rate~ns(year, df=7), data=birthrates);
pred2 = predict(fit2, new);
plot(birthrates$year, birthrates$rate, xlim=c(1905,2015),
     ylim=c(min(pred1,pred2), max(pred1,pred2)),
     ylab="Birth Rate", xlab="Year")
lines(new$year, pred1, col="red")
lines(new$year, pred2, col="blue")
legend("bottomleft", lty=rep(1,2), col=c("red", "blue"),
       legend=c("CS with df=7", "NCS with df=7"))
The placement of knots significantly influences the performance of a spline model. However, handpicking the precise locations of these knots can be computationally burdensome. A more efficient strategy involves positioning the knots at quantiles of x, allowing us to focus solely on determining the optimal number of knots or, equivalently, the degrees of freedom (df).
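For example, five interior knots at equally spaced quantiles of year can be obtained as follows (the same scheme the CV code below uses):

m = 5
quantile(birthrates$year, (1:m)/(m+1))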
Try cubic splines with different degrees of freedom:
plot(birthrates$year, birthrates$rate, ylim=c(90,280));
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=7), data=birthrates))), col="blue");
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=14), data=birthrates))), col="red");
lines(spline(birthrates$year, predict(lm(rate~bs(year, df=19), data=birthrates))), col="black");
# the legend reports the actual df, i.e., the df argument plus one,
# since bs() without intercept=TRUE does not count the intercept
legend("topright", lty=rep(1,3), col=c("blue", "red", "black"),
       legend=c("df=8", "df=15", "df=20"))
Recall that we use hypothesis testing to determine the optimal polynomial degree. The process involves iteratively increasing the polynomial degree; at each step, we use a t-test (which here is equivalent to an F-test) to contrast two models: the current model and a bigger model with one more degree.
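As a refresher, that nested comparison looks like the following sketch (the degrees 3 and 4 are only for illustration):

# polynomial models of successive degrees ARE nested, so anova() applies
fit.a = lm(rate ~ poly(year, 3), data=birthrates)
fit.b = lm(rate ~ poly(year, 4), data=birthrates)
anova(fit.a, fit.b)   # F-test for adding the degree-4 term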
Can we utilize this same F-test approach to select the optimal number of knots in spline regression? The answer is no. The primary reason is that the models being compared are not nested: a spline with m knots placed at quantiles is generally not a special case of a spline with m + 1 knots placed at different quantiles. Consequently, the F-test, which relies on comparing nested models, is unsuitable for this purpose.
To address this, we employ 10-fold cross-validation (CV). We start by dividing the data into 10 folds.
set.seed(1234)
K = 10
n = nrow(birthrates)
fold.size = c(rep(9, 7), rep(8, 3))
fold.id = rep(1:K, fold.size)
fold.id
## [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
## [26] 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 6 6 6 6 6
## [51] 6 6 6 6 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 9 9 9 9
## [76] 9 9 9 9 10 10 10 10 10 10 10 10
fold.id = fold.id[sample(1:n, n)]   # randomly shuffle the fold assignments
fold.id
## [1] 4 10 3 1 1 5 2 1 8 9 2 7 7 10 9 3 5 9 8 10 8 6 8 7 3
## [26] 6 5 1 3 5 6 6 4 1 4 5 5 1 2 10 1 4 6 9 3 2 7 5 10 4
## [51] 10 7 5 2 9 3 2 3 7 3 2 9 8 8 8 7 10 1 10 4 8 2 2 7 6
## [76] 1 9 4 6 4 3 4 5 9 6 7 6
mydf = 10:30
mycv = rep(0, length(mydf))

for(i in 1:length(mydf)){
    m = mydf[i] - 4;   # number of interior knots for this df
    for(k in 1:K){
        id = which(fold.id == k);
        myknots = quantile(birthrates$year[-id], (1:m)/(m+1))
        myfit = lm(rate ~ bs(year, knots=myknots),
                   data=birthrates[-id,])
        ypred = predict(myfit, newdata=birthrates[id,])
        mycv[i] = mycv[i] + sum((birthrates$rate[id] - ypred)^2)
    }
}
plot(mydf, mycv)
If you re-run the 10-fold CV (with a different random split), the plot of mydf versus mycv may vary, but it shouldn’t look too different.
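A natural follow-up, sketched under the same setup (the names best.df and fit.best are just for illustration): pick the df that minimizes the CV error and refit on the full data.

best.df = mydf[which.min(mycv)]   # df with the smallest CV error
m = best.df - 4
myknots = quantile(birthrates$year, (1:m)/(m+1))
fit.best = lm(rate ~ bs(year, knots=myknots), data=birthrates)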