(From ESL 5.2.3 Example: Phoneme Recognition)
In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling.
The Phoneme data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce), a widely used resource for research in speech recognition. The dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows: “sh” as in “she”, “dcl” as in “dark”, “iy” as the vowel in “she”, “aa” as the vowel in “dark”, and “ao” as the first vowel in “water”.
From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected; at a 16 kHz sampling rate, each frame is represented by 512 samples (16,000 × 0.032 = 512), and each frame represents one of the above five phonemes.
From each speech frame, a log-periodogram is computed, which is one of several widely used methods for casting speech data in a form suitable for speech recognition. Thus the data used in what follows consist of 4509 log-periodograms of length 256, with known class (phoneme) memberships.
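For intuition, here is a minimal sketch of how a log-periodogram might be computed from a single 512-sample frame. This follows one common convention (log of the squared FFT modulus, keeping the first half of the spectrum), not necessarily the exact preprocessing used to construct the dataset, and `frame` is a hypothetical stand-in for real digitized speech.

frame = rnorm(512)                            # hypothetical 512-sample speech frame
periodogram = Mod(fft(frame))[1:256]^2 / 512  # squared FFT modulus at the first 256 frequencies
log_periodogram = log(periodogram)
length(log_periodogram)                       # 256, matching the dataset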
= "https://hastie.su.domains/ElemStatLearn/datasets/phoneme.data"
url = read.csv(url, header = TRUE)
phoneme = phoneme[phoneme$g %in% c("aa", "ao"), ]
mydata rm(phoneme)
1:2, ]
mydata[dim(mydata)
= data.matrix(mydata[, -c(1, 258, 259)])
X = mydata[, 258]
Y table(Y)
## Y
## aa ao
## 695 1022
= ifelse(Y=="ao", 1, 0)
Y table(Y)
## Y
## 0 1
## 695 1022
The figure below displays a sample of 15 log-periodograms for each of the two phonemes “aa” and “ao”, measured at 256 frequencies. The goal is to use such data to classify a spoken phoneme. These two phonemes were chosen because they are difficult to separate.
id0 = which(Y == 0)[sample(1:500, 15)]
id1 = which(Y == 1)[sample(1:500, 15)]
plot(c(1, 256), range(X[c(id0, id1), ]), type="n",
xlab = "Frequency",
ylab = "Log-periodogram"
)
for(i in id0)
lines(X[i, ], col="green")
for(i in id1)
lines(X[i, ], col="orange")
legend("topright", c("aa", "ao"), lty=c(1,1),
col = c("green", "orange"))
As a baseline, we first fit an ordinary logistic regression model that uses all 256 raw frequency measurements as predictors: \[ \log \frac{P(ao \mid x)}{P(aa \mid x)} = \beta_0 + \sum_{j=1}^{256} x_j \beta_j \]
= glm(Y ~ X, family="binomial")
logfit1 = logfit1$coef[-1] coef1
Applications such as this permit a natural regularization. The 256 measurements for each sample are not measurements collected from 256 independent predictors; they are observations (at discretized frequencies) of a continuous log-periodogram function. Naturally we would expect the 256 coefficients \(\beta_j\) to vary continuously in the frequency domain as well. So we model the coefficients with natural cubic splines, where the \(h_m(v)\)’s are natural-cubic-spline basis functions defined on the set of frequencies:
\[ \beta(v) = \sum_{m=1}^M h_m(v) \theta_m, \quad v = 1, 2, \dots, 256. \] In practice, this means \(\beta_{p \times 1} = H \theta_{M \times 1}\), where \(p = 256\) and \(H\) is the \(p \times M\) matrix whose \(m\)-th column is \(h_m(v)\) evaluated at \(v = 1, 2, \dots, 256\).
\[ X_{n \times p} \color{blue}{\beta}_{p \times 1} = X_{n \times p} \color{blue}{H}_{p \times M} \color{blue}{\theta}_{M \times 1} = B_{n \times M} \color{blue}{\theta}_{M \times 1}. \] So we can simply fit a logistic regression model with design matrix \(B = X H\).
Here we used \(M = 12\) basis functions, with knots uniformly placed over the integers 1,2,…,256 representing the frequencies.
library(splines)

# H: 256 x 12 natural cubic spline basis evaluated at the 256 frequencies
H = ns(1:256, df = 12)
# B = XH: the n x 12 design matrix for the smoothed model
B = X %*% H
logfit2 = glm(Y ~ B, family = "binomial")
# map the 12 spline coefficients theta back to 256 frequency-wise coefficients beta = H theta
coef2 = H %*% logfit2$coef[-1]
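As a quick sanity check (not part of the original analysis), the attributes of the `ns()` basis confirm the dimensions and the uniform knot placement described above:

dim(H)                      # 256 x 12: one row per frequency, one column per basis function
attr(H, "knots")            # interior knots, evenly spread over 1..256
attr(H, "Boundary.knots")   # boundary knots at 1 and 256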
plot(1:256, coef1, type="n",
xlab="Frequency", ylab="Logistic Regression Coefficients")
lines(1:256, coef1, col="gray")
lines(1:256, coef2, col="red")
Next we conduct a simulation study to evaluate the classification accuracy of the two models:
T = 5                  # number of random train/test splits
n = dim(X)[1]
Err1 = matrix(0, T, 2) # col 1: training; col 2: test
Err2 = matrix(0, T, 2) # col 1: training; col 2: test
set.seed(123)
for(t in 1:T){
  testid = sample(1:n, round(n*0.2))

  # model 1: logistic regression on the 256 raw frequencies
  logfit1 = glm(Y[-testid] ~ X[-testid, ], family = "binomial")
  yhat1 = as.numeric(logfit1$fitted.values > 0.5)
  Err1[t, 1] = mean(yhat1 != Y[-testid])
  yhat1 = X[testid, ] %*% logfit1$coef[-1] + logfit1$coef[1]
  yhat1 = as.numeric(yhat1 > 0)  # log-odds > 0 corresponds to probability > 0.5
  Err1[t, 2] = mean(yhat1 != Y[testid])

  # model 2: logistic regression on the M = 12 spline-basis features
  logfit2 = glm(Y[-testid] ~ B[-testid, ], family = "binomial")
  yhat2 = as.numeric(logfit2$fitted.values > 0.5)
  Err2[t, 1] = mean(yhat2 != Y[-testid])
  yhat2 = B[testid, ] %*% logfit2$coef[-1] + logfit2$coef[1]
  yhat2 = as.numeric(yhat2 > 0)  # log-odds > 0 corresponds to probability > 0.5
  Err2[t, 2] = mean(yhat2 != Y[testid])
}
round(cbind(Err1, Err2), digits = 3)
## [,1] [,2] [,3] [,4]
## [1,] 0.100 0.262 0.158 0.242
## [2,] 0.112 0.192 0.172 0.195
## [3,] 0.095 0.236 0.170 0.210
## [4,] 0.103 0.251 0.175 0.195
## [5,] 0.100 0.222 0.168 0.210
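To summarize the comparison, one can average the five rows of each error matrix; the short snippet below is a convenience added here (not part of the original analysis) that labels and averages the four columns printed above.

# mean training/test error over the T splits
errs = colMeans(cbind(Err1, Err2))
names(errs) = c("raw.train", "raw.test", "spline.train", "spline.test")
round(errs, 3)

The unrestricted model fits the training data better (roughly 10% vs. 17% training error), but the spline-restricted model generalizes better, with test error around 21% versus 23%: the natural regularization pays off out of sample.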