Load Data

To illustrate local regression techniques, I’ve chosen three distinct data sets.

The first two are simulated.

  • In the first set exa, the true curve is smooth. Notably, on the left side, it remains relatively flat, while on the right, there’s a noticeable fluctuation.
  • In the second set exb, the true curve is a simple straight line, but two outliers may impact the estimated curve.

The third data set is derived from observations of the Old Faithful Geyser, a data set familiar from many statistics courses. Each data point denotes the duration of a specific eruption, while the y-axis represents the waiting time between eruptions. There’s an evident positive correlation between these two variables. While it might be tempting to fit a linear model, in this session we will explore non-linear modeling to uncover deeper insights the data might offer.

par(mfrow=c(1,3))
url = "https://liangfgithub.github.io/Data/Example_A.csv"
exa = read.csv(url)
plot(y ~ x, exa, main="Example A")
lines(m ~ x, exa)

url = "https://liangfgithub.github.io/Data/Example_B.csv"
exb = read.csv(url)
plot(y ~ x, exb, main="Example B")
lines(m ~ x, exb)

url = "https://liangfgithub.github.io/Data/faithful.dat"
faithful = read.table(url, header=TRUE)
plot(waiting ~ eruptions, faithful, main="Old Faithful")

Kernel Smoothing

We will employ the ksmooth command for kernel smoothing, a method we recently discussed. To get acquainted, you can refer to the ksmooth function documentation.

?ksmooth

Using the Old Faithful data as an example, we applied ksmooth with varying bandwidths. As the bandwidth increases, the kernel neighborhood around each point widens. Consequently, the fitted function becomes progressively flatter, since we’re averaging across a broader set of data points.

par(mfrow=c(1,3))
plot(waiting ~ eruptions, faithful, main="bandwidth=0.1", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting, "normal", 0.1), lwd=2)

plot(waiting ~ eruptions, faithful, main="bandwidth=0.5", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting, "normal", 0.5), lwd=2)

plot(waiting ~ eruptions, faithful, main="bandwidth=2", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting, "normal", 2), lwd=2)
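To make explicit what ksmooth computes, here is a minimal hand-rolled Nadaraya-Watson estimator with a Gaussian kernel. This is a sketch for illustration only: the nw() helper, the grid, and the sd value are my own choices, and note that ksmooth internally rescales its bandwidth argument (so the quartiles of the kernel sit at ±0.25*bandwidth), whereas here the standard deviation is used directly.

```r
# Nadaraya-Watson kernel regression: at each target point x0, the
# estimate is a weighted average of the responses, with weights given
# by a Gaussian kernel centered at x0.
nw = function(x0, x, y, sd) {
  w = dnorm(x0, mean = x, sd = sd)   # kernel weight for each data point
  sum(w * y) / sum(w)                # weighted average of the responses
}

grid = seq(min(faithful$eruptions), max(faithful$eruptions), length.out = 101)
fit  = sapply(grid, nw, x = faithful$eruptions, y = faithful$waiting, sd = 0.3)

plot(waiting ~ eruptions, faithful, col = "gray", cex = 0.5)
lines(grid, fit, lwd = 2)
```

A larger sd spreads the weights over more observations, reproducing the flattening effect seen above as the bandwidth grows.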

Bandwidth Selection

There’s a package available that assists in performing cross-validation to determine the best bandwidth.

library(sm)
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
?hcv
?sm.options  # check the options
?sm.regression

Old Faithful Data

For the Old Faithful data, the cross-validation (CV) error is minimized at a bandwidth of about 0.42, which produces the fitted curve displayed on the right. The aim is to select the bandwidth that minimizes the CV error.

par(mfrow=c(1,2))
hm = hcv(faithful$eruptions, faithful$waiting, display="lines")
print(hm)
## [1] 0.4243375
sm.regression(faithful$eruptions, faithful$waiting, h=hm,
              xlab="duration", ylab="waiting")

However, a word of caution: if you’re crafting your own algorithm for such smoothing procedures using cross-validation, I recommend considering the one-standard-error principle, which is inherently more stable. If you strictly select the minimum CV error, you may find the CV curve is relatively flat near its minimum, which means different runs of cross-validation could yield quite different bandwidth choices within that region.
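The one-standard-error rule can be sketched as follows. Everything here is hypothetical illustration: h_grid is a candidate bandwidth grid, cv_err the CV error at each bandwidth, and cv_se its standard error (e.g., estimated across folds); none of these come from the sm package.

```r
# One-standard-error rule: among all candidate bandwidths whose CV
# error is within one standard error of the minimum, pick the LARGEST
# (i.e., smoothest) one.
one_se_bandwidth = function(h_grid, cv_err, cv_se) {
  i_min = which.min(cv_err)
  threshold = cv_err[i_min] + cv_se[i_min]
  max(h_grid[cv_err <= threshold])
}

# Toy example with a flat region around the minimum:
h_grid = seq(0.1, 1, by = 0.1)
cv_err = c(9, 7, 5.1, 5.0, 5.05, 5.1, 6, 7, 8, 9)
cv_se  = rep(0.2, 10)
one_se_bandwidth(h_grid, cv_err, cv_se)   # picks 0.6, not the minimizer 0.4
```

Because every bandwidth in the flat region fits essentially equally well, the rule resolves the ambiguity by preferring the smoothest model, which is what makes it more stable across CV runs.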

Simulated Data

We then conducted cross-validation on the simulated data exa, determining that the optimal bandwidth is approximately 0.02. The corresponding fitted curve is depicted on the right.

par(mfrow=c(1,2))
hm = hcv(exa$x, exa$y, display="lines")
print(hm)
## [1] 0.02240222
sm.regression(exa$x, exa$y, h=hm, xlab="x", ylab="y")

However, when applying the cross-validation method to exb, the example with a straight line and two outliers, we consistently received error messages. These messages indicated that every candidate bandwidth was too large: the underlying cross-validation criterion keeps seeking even smaller bandwidths.

hcv(exb$x, exb$y)

# try a smaller hstart, still error
par(mfrow=c(1,2))
hm = hcv(exb$x, exb$y, display="lines", hstart=0.005)
hm
sm.regression(exb$x, exb$y, h=0.005)

The challenge with Example B is that the primary concern isn’t curve estimation—the actual curve is quite straightforward, with just a couple of outliers. For such tasks, local regression and kernel smoothing may not be the most suitable analytical approaches.

Loess

?loess

Defaults: span = 0.75 (the fraction of the data used in each local fit) and degree = 2 (local quadratic fitting).

par(mfrow=c(1,3))
plot(waiting ~ eruptions, faithful, col="gray", cex=0.5)
f = loess(waiting ~ eruptions, faithful)
i = order(faithful$eruptions)
lines(f$x[i], f$fitted[i], lwd=1.5, col="red")

plot(y ~ x, exa, col="gray", cex=0.5)
lines(exa$x, exa$m, lty=1)
f = loess(y ~ x, exa)
lines(f$x, f$fitted, lty=2, lwd=1.5, col="red")
f = loess(y ~ x, exa, span=0.22)
lines(f$x, f$fitted, lty=5, lwd=1.5, col="blue")

plot(y ~ x, exb, col="gray", cex=0.5)
lines(exb$x, exb$m, lty=1)
f = loess(y ~ x, exb)
lines(f$x, f$fitted, lty=2, lwd=1.5, col="red")
f = loess(y ~ x, exb, span=1)
lines(f$x, f$fitted, lty=5, lwd=1.5, col="blue")
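For data like exb, where the difficulty is the outliers rather than the shape of the curve, loess offers a robust fitting option: family = "symmetric" uses iterated re-weighting (Tukey’s biweight), so outlying points are downweighted instead of dragging the fit toward them. A sketch, reusing the exb data loaded above:

```r
# Robust loess on Example B: the two outliers receive little weight,
# so the fitted curve stays close to the true straight line.
url = "https://liangfgithub.github.io/Data/Example_B.csv"
exb = read.csv(url)

plot(y ~ x, exb, col = "gray", cex = 0.5)
lines(exb$x, exb$m, lty = 1)                            # true curve
f = loess(y ~ x, exb, span = 1, family = "symmetric")   # robust fit
lines(exb$x, f$fitted, lty = 2, lwd = 1.5, col = "red")
```

This is one way to stay within the local-regression framework while addressing the concern raised earlier that ordinary kernel smoothing is not well suited to Example B.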