(From ESL 5.2.3 Example: Phoneme Recognition)

In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling.

Phoneme Data

The Phoneme data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce) which is a widely used resource for research in speech recognition. A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows:

  • “sh” as in “she”
  • “dcl” as in “dark”
  • “iy” as the vowel in “she”
  • “aa” as the vowel in “dark”
  • “ao” as the first vowel in “water”.

From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected. Each frame is represented by 512 samples at a 16 kHz sampling rate and corresponds to one of the above five phonemes.

From each speech frame, a log-periodogram is computed, which is one of several widely used methods for casting speech data in a form suitable for speech recognition. Thus the data used in what follows consist of 4509 log-periodograms of length 256, with known class (phoneme) memberships.
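
As a rough illustration only (the exact TIMIT preprocessing, including any windowing and normalization, may differ), a log-periodogram can be computed from the squared modulus of the discrete Fourier transform of a frame; here frame is a hypothetical placeholder for one 512-sample speech frame:

frame = rnorm(512)                          # placeholder for one 32-msec speech frame
pgram = Mod(fft(frame))^2 / length(frame)   # raw periodogram
logpgram = log(pgram[1:256])                # keep 256 frequencies, on the log scale
length(logpgram)                            # 256, matching the dataset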

url = "https://hastie.su.domains/ElemStatLearn/datasets/phoneme.data"
phoneme = read.csv(url, header = TRUE)
mydata = phoneme[phoneme$g %in% c("aa", "ao"), ]
rm(phoneme)
mydata[1:2, ]
dim(mydata)
X = data.matrix(mydata[, -c(1, 258, 259)])
Y = mydata[, 258]
table(Y)
## Y
##   aa   ao 
##  695 1022
Y = ifelse(Y=="ao", 1, 0)   # recode: "ao" = 1, "aa" = 0
table(Y)
## Y
##    0    1 
##  695 1022

The figure below displays a sample of 15 log-periodograms for each of the two phonemes “aa” and “ao”, measured at 256 frequencies. The goal is to use such data to classify a spoken phoneme. These two phonemes were chosen because they are difficult to separate.

# randomly pick 15 log-periodograms from each of the two classes
id0 = which(Y==0)[sample(1:500, 15)]
id1 = which(Y==1)[sample(1:500, 15)]

plot(c(1, 256), range(X[c(id0, id1), ]), type="n", 
     xlab = "Frequency", 
     ylab = "Log-periodogram"
)

for(i in id0)
  lines(X[i, ], col="green")
for(i in id1)
  lines(X[i, ], col="orange")
legend("topright", c("aa", "ao"), lty=c(1,1),
        col = c("green", "orange"))

Fit a Logistic Regression Model

\[ \log \frac{P(ao | x)}{P(aa | x)} = \beta_0 + \sum_{j=1}^{256} x_j \beta_j \]

logfit1 = glm(Y ~ X, family="binomial")   # one coefficient per frequency (256 in total)
coef1 = logfit1$coef[-1]                  # drop the intercept

Force the Coefficients to be Smooth

Applications such as this permit a natural regularization. The 256 measurements for each sample are not the same as measurements collected from 256 independent predictors. They are observations (at discretized frequencies) from a continuous Log-periodogram function.

Naturally, we would expect the 256 coefficients \(\beta_j\) to vary smoothly in the frequency domain as well. We therefore model them with a natural cubic spline, where the \(h_m(v)\)’s are natural cubic spline basis functions defined on the set of frequencies:

\[ \beta(v) = \sum_{m=1}^M h_m(v) \theta_m, \quad v = 1, 2, \dots, 256. \] In practice, this means \(\beta_{p \times 1} = H \theta_{M \times 1}\), where \(p = 256\) and \(H\) is a \(p \times M\) matrix whose \(m\)-th column is \(h_m(v)\) evaluated at \(v = 1, 2, \dots, 256\).

\[ X_{n \times p} \color{blue}{\beta}_{p \times 1} = X_{n \times p} \color{blue}{H}_{p \times M} \color{blue}{\theta}_{M \times 1} = B_{n \times M} \color{blue}{\theta}_{M \times 1}. \] So we can simply fit a logistic regression model with design matrix \(B = X H\).

Here we used \(M = 12\) basis functions, with knots uniformly placed over the integers 1,2,…,256 representing the frequencies.
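
To see what these basis functions look like, one can plot the 12 natural cubic spline basis functions \(h_m(v)\) over the frequency grid (a quick visualization, separate from the fit below):

library(splines)
matplot(1:256, ns(1:256, df = 12), type = "l", lty = 1,
        xlab = "Frequency", ylab = "Natural cubic spline basis")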

library(splines)
H = ns(1:256, df = 12)            # 256 x 12 natural cubic spline basis matrix
B = X %*% H                       # reduced design matrix B = XH
logfit2 = glm(Y ~ B, family="binomial")
coef2 = H %*% logfit2$coef[-1]    # map the 12 fitted theta's back to 256 smooth beta's
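
As a quick sanity check (assuming the data were loaded as above), \(H\) should be \(256 \times 12\), \(B\) should have one row per speech frame, and since ns() places its interior knots at quantiles of 1:256, they are indeed roughly evenly spaced:

dim(H)                      # 256 x 12
dim(B)                      # one row per frame, 12 columns
attr(H, "knots")            # interior knots, roughly evenly spaced over 1..256
attr(H, "Boundary.knots")   # boundary knots at 1 and 256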

plot(1:256, coef1, type="n", 
      xlab="Frequency", ylab="Logistic Regression Coefficients")
lines(1:256, coef1, col="gray")
lines(1:256, coef2, col="red")
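
To make this plot easier to read, a legend can be added distinguishing the raw coefficients (gray) from the spline-smoothed coefficients (red):

legend("topright", legend = c("raw coefficients", "smoothed coefficients"),
       lty = 1, col = c("gray", "red"))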

Compare Classification Performance

Next we conduct a simulation study to evaluate the classification accuracy of the two models:

  • M1: A logistic regression model with 256 regression coefficients, one per frequency.
  • M2: A logistic regression model in which the 256 coefficients are modeled by a natural cubic spline (NCS) basis with 12 degrees of freedom.

T = 5
n = dim(X)[1]
Err1 = matrix(0, T, 2)  # col 1: training; col 2: test
Err2 = matrix(0, T, 2)  # col 1: training; col 2: test
set.seed(123)
for(t in 1:T){
  # hold out a random 20% of the frames as the test set
  testid = sample(1:n, round(n*0.2))
  
  # M1: logistic regression on all 256 frequencies
  logfit1 = glm(Y[-testid] ~ X[-testid, ], family="binomial")
  yhat1 = as.numeric(logfit1$fitted.values>0.5)
  Err1[t, 1] = mean(yhat1 != Y[-testid])
  yhat1 = X[testid, ] %*% logfit1$coef[-1] + logfit1$coef[1]
  yhat1 = as.numeric(yhat1>0.5)
  Err1[t, 2] = mean(yhat1 != Y[testid])
  
  # M2: logistic regression on the 12-column spline design matrix B = XH
  logfit2 = glm(Y[-testid] ~ B[-testid, ], family="binomial")
  yhat2 = as.numeric(logfit2$fitted.values>0.5)
  Err2[t, 1] = mean(yhat2 != Y[-testid])
  yhat2 = B[testid, ] %*% logfit2$coef[-1] + logfit2$coef[1]
  yhat2 = as.numeric(yhat2>0.5)
  Err2[t, 2] = mean(yhat2 != Y[testid])
}

round(cbind(Err1, Err2), dig=3)
##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.100 0.262 0.158 0.242
## [2,] 0.112 0.192 0.172 0.195
## [3,] 0.095 0.236 0.170 0.210
## [4,] 0.103 0.251 0.175 0.195
## [5,] 0.100 0.222 0.168 0.210
In the table above, the first two columns are the training and test errors for M1, and the last two columns are the training and test errors for M2.

  • M1 attains lower training errors, mainly because its 256 free coefficients give it far more capacity to fit the training data.
  • However, M2 generally achieves lower test errors than M1, indicating that the smoother, more constrained model generalizes better to unseen data; the averages over the five splits can be computed as sketched below.
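
A minimal sketch for summarizing the simulation, averaging the training and test error rates of each model over the five splits:

# average error over the T = 5 splits; column 1 = training, column 2 = test
rbind(M1 = colMeans(Err1), M2 = colMeans(Err2))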