To illustrate local regression techniques, I've chosen three distinct data sets. The first two are simulated. In Example A (exa), the true curve is smooth: relatively flat on the left side, with noticeable fluctuation on the right. In Example B (exb), the true curve is a simple straight line, but two outliers may impact the estimated curve.

The third data set is derived from observations of the Old Faithful Geyser. Familiar from many statistics courses, each data point denotes the duration of a specific eruption (on the x-axis), while the y-axis represents the waiting time until the next eruption. There is an evident positive correlation between these two variables. While it might be tempting to fit a linear model, in this session we will explore non-linear modeling to uncover deeper insights the data might offer.
par(mfrow=c(1,3))
= "https://liangfgithub.github.io/Data/Example_A.csv"
url = read.csv(url)
exa plot (y ~ x, exa, main="Example A")
lines(m ~ x, exa)
= "https://liangfgithub.github.io/Data/Example_B.csv"
url = read.csv(url)
exb plot(y ~ x, exb, main="Example B")
lines(m ~ x, exb)
= "https://liangfgithub.github.io/Data/faithful.dat"
url = read.table(url, header=TRUE)
faithful plot(waiting ~ eruptions, faithful,main="Old Faithful")
We will employ the ksmooth command for kernel smoothing, a method we recently discussed. To get acquainted, you can refer to the ksmooth function documentation.
?ksmooth
Using the Old Faithful data as an example, we applied
ksmooth
with varying bandwidths. As the bandwidth
increases, the kernel neighborhood around each point enlarges. Consequently,
the function becomes progressively flatter since we’re averaging across
a broader set of data points.
par(mfrow=c(1,3))
plot(waiting ~ eruptions, faithful, main="bandwidth=0.1", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting, "normal", 0.1), lwd=2)
plot(waiting ~ eruptions, faithful, main="bandwidth=0.5", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting,"normal", 0.5), lwd=2)
plot(waiting ~ eruptions, faithful, main="bandwidth=2", col="gray", cex=0.5)
lines(ksmooth(faithful$eruptions, faithful$waiting, "normal", 2), lwd=2)
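To see what such a kernel smoother is computing, here is a minimal hand-rolled sketch of the Nadaraya-Watson weighted average with a Gaussian kernel, applied to R's built-in copy of the faithful data. This is for intuition only and is not part of the original session; note that ksmooth rescales its bandwidth argument (so that the kernel quartiles sit at +/- 0.25*bandwidth), so its output will differ slightly from this plain version.

```r
# Nadaraya-Watson estimate at a point x0: a Gaussian-weighted average of y
nw = function(x0, xs, ys, h) {
  w = dnorm((xs - x0) / h)   # Gaussian weights centered at x0
  sum(w * ys) / sum(w)       # weighted average of the responses
}

x = faithful$eruptions       # faithful is a built-in R data set
y = faithful$waiting
grid = seq(min(x), max(x), length.out = 100)
fit  = sapply(grid, nw, xs = x, ys = y, h = 0.5)  # fitted curve on a grid
```

A larger h spreads the weights over more points, which is exactly why the fitted curve flattens as the bandwidth grows.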
The sm package assists in performing cross-validation to determine the best bandwidth.
library(sm)
## Package 'sm', version 2.2-5.7: type help(sm) for summary information
?hcv
# check the options
?sm.options
?sm.regression
In the context of the Old Faithful data, the cross-validation (CV) error is minimized at a bandwidth of about 0.42, which yields the fitted curve displayed on the right. The aim is to select the bandwidth that minimizes the CV error.
par(mfrow=c(1,2))
hm = hcv(faithful$eruptions, faithful$waiting, display="lines")
print(hm)
## [1] 0.4243375
sm.regression(faithful$eruptions, faithful$waiting, h=hm,
xlab="duration", ylab="waiting")
However, a word of caution: If you're crafting your own algorithm for such smoothing procedures using cross-validation, I recommend considering the one-standard-error principle. This approach is inherently more stable. If you strictly select based on the minimum CV error, you might find that the CV curve is relatively flat near its minimum, which means different runs of cross-validation could yield quite different bandwidth choices within this range.
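The one-standard-error idea can be sketched as follows. This is a generic outline, not code from the session: cv.err is a hypothetical matrix of per-fold CV errors (here filled with placeholder random numbers), with folds in rows and candidate bandwidths in columns.

```r
set.seed(1)
h.grid = seq(0.1, 1, by = 0.1)                     # candidate bandwidths
cv.err = matrix(rnorm(50, mean = 2, sd = 0.2), 5, 10)  # placeholder: 5 folds x 10 bandwidths

m  = colMeans(cv.err)                              # mean CV error per bandwidth
se = apply(cv.err, 2, sd) / sqrt(nrow(cv.err))     # standard error per bandwidth
best = which.min(m)
# one-SE rule: largest (smoothest) bandwidth whose mean CV error
# is within one standard error of the minimum
h.1se = max(h.grid[m <= m[best] + se[best]])
```

Choosing the largest qualifying bandwidth prefers the smoother fit among those that are statistically indistinguishable from the CV-optimal one.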
We then conducted cross-validation on the simulated data
exa
, determining that the optimal bandwidth is
approximately 0.02. The corresponding fitted curve is depicted on the
right.
par(mfrow=c(1,2))
hm = hcv(exa$x, exa$y, display="lines")
print(hm)
## [1] 0.02240222
sm.regression(exa$x, exa$y, h=hm, xlab="x", ylab="y")
However, when applying the cross-validation method to
exb
, the example with a straightforward line containing two
outliers, we consistently received error messages. These messages
indicated that the selected bandwidth was too large, suggesting that the
underlying cross-validation process consistently seeks even smaller
bandwidths.
hcv(exb$x, exb$y)
# try a smaller hstart, still error
par(mfrow=c(1,2))
hm = hcv(exb$x, exb$y, display="lines", hstart=0.005)
hm
sm.regression(exb$x, exb$y, h=0.005)
The challenge with Example B is that the primary concern isn’t curve estimation—the actual curve is quite straightforward, with just a couple of outliers. For such tasks, local regression and kernel smoothing may not be the most suitable analytical approaches.
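One option worth knowing about for data like Example B, though not covered in the session, is the robust mode of base R's loess: family="symmetric" iteratively downweights points with large residuals. Below is a sketch using simulated data that mimics exb (a straight line plus two injected outliers), so the snippet is self-contained.

```r
# simulated stand-in for Example B: straight line plus two outliers
set.seed(2)
x = seq(0, 1, length.out = 50)
y = x + rnorm(50, sd = 0.05)
y[c(20, 35)] = y[c(20, 35)] + 2            # inject two outliers

f.ls  = loess(y ~ x)                        # default least-squares fit
f.rob = loess(y ~ x, family = "symmetric")  # robust fit: downweights outliers

plot(x, y, col = "gray")
lines(x, fitted(f.ls),  col = "red")        # pulled toward the outliers
lines(x, fitted(f.rob), col = "blue")       # stays close to the line
```

The robust fit tracks the underlying line much more closely because the re-weighting iterations largely ignore the two aberrant points.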
Next, we turn to loess, R's local polynomial regression smoother. Its defaults are span = 0.75 and degree = 2.
?loess
par(mfrow=c(1,3))
plot(waiting ~ eruptions, faithful, col="gray", cex=0.5)
f = loess(waiting ~ eruptions, faithful)
i = order(faithful$eruptions)
lines(f$x[i], f$fitted[i], lwd=1.5, col="red")

plot(y ~ x, exa, col="gray", cex=0.5)
lines(exa$x, exa$m, lty=1)
f = loess(y ~ x, exa)
lines(f$x, f$fitted, lty=2, lwd=1.5, col="red")
f = loess(y ~ x, exa, span=0.22)
lines(f$x, f$fitted, lty=5, lwd=1.5, col="blue")

plot(y ~ x, exb, col="gray", cex=0.5)
lines(exb$x, exb$m, lty=1)
f = loess(y ~ x, exb)
lines(f$x, f$fitted, lty=2, lwd=1.5, col="red")
f = loess(y ~ x, exb, span=1)
lines(f$x, f$fitted, lty=5, lwd=1.5, col="blue")