(From ESL 5.2.3 Example: Phoneme Recognition)

In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling.

Phoneme Data

The Phoneme data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce) which is a widely used resource for research in speech recognition. A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows:

  • “sh” as in “she”
  • “dcl” as in “dark”
  • “iy” as the vowel in “she”
  • “aa” as the vowel in “dark”
  • “ao” as the first vowel in “water”.

From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected. Each frame is represented by 512 samples at a 16 kHz sampling rate and corresponds to one of the above five phonemes.

From each speech frame, a log-periodogram is computed, which is one of several widely used methods for casting speech data in a form suitable for speech recognition. Thus the data used in what follows consist of 4509 log-periodograms of length 256, with known class (phoneme) memberships.
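
As a rough illustration only (the exact TIMIT preprocessing, including any windowing and normalization, may differ), a log-periodogram can be computed from the squared modulus of the discrete Fourier transform of a frame; here frame is a hypothetical placeholder for one 512-sample speech frame:

frame = rnorm(512)                          # placeholder for one 32-msec speech frame
pgram = Mod(fft(frame))^2 / length(frame)   # raw periodogram
logpgram = log(pgram[1:256])                # keep 256 frequencies, on the log scale
length(logpgram)                            # 256, matching the dataset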

url = "https://hastie.su.domains/ElemStatLearn/datasets/phoneme.data"
phoneme = read.csv(url, header = TRUE)
mydata = phoneme[phoneme$g %in% c("aa", "ao"), ]
rm(phoneme)
mydata[1:2, ]
dim(mydata)
X = data.matrix(mydata[, -c(1, 258, 259)])
Y = mydata[, 258]
table(Y)
## Y
##   aa   ao 
##  695 1022
Y = ifelse(Y=="ao", 1, 0)   # recode: "ao" = 1, "aa" = 0
table(Y)
## Y
##    0    1 
##  695 1022

The figure below displays a sample of 15 log-periodograms for each of the two phonemes “aa” and “ao”, measured at 256 frequencies. The goal is to use such data to classify a spoken phoneme. These two phonemes were chosen because they are difficult to separate.

# randomly pick 15 log-periodograms from each of the two classes
id0 = which(Y==0)[sample(1:500, 15)]
id1 = which(Y==1)[sample(1:500, 15)]

plot(c(1, 256), range(X[c(id0, id1), ]), type="n", 
     xlab = "Frequency", 
     ylab = "Log-periodogram"
)

for(i in id0)
  lines(X[i, ], col="green")
for(i in id1)
  lines(X[i, ], col="orange")
legend("topright", c("aa", "ao"), lty=c(1,1),
        col = c("green", "orange"))

Fit a Logistic Regression Model

\[ \log \frac{P(ao | x)}{P(aa | x)} = \beta_0 + \sum_{j=1}^{256} x_j \beta_j \]

logfit1 = glm(Y ~ X, family="binomial")   # one coefficient per frequency (256 in total)
coef1 = logfit1$coef[-1]                  # drop the intercept

Force the Coefficients to be Smooth

Applications such as this permit a natural regularization. The 256 measurements for each sample are not the same as measurements collected from 256 independent predictors. They are observations (at discretized frequencies) from a continuous Log-periodogram function.

Naturally, we would expect the 256 coefficients \(\beta_j\) to vary smoothly in the frequency domain as well. We therefore model them with a natural cubic spline, where the \(h_m(v)\)’s are natural cubic spline basis functions defined on the set of frequencies:

\[ \beta(v) = \sum_{m=1}^M h_m(v) \theta_m, \quad v = 1, 2, \dots, 256. \] In practice, this means \(\beta_{p \times 1} = H \theta_{M \times 1}\), where \(p = 256\) and \(H\) is a \(p \times M\) matrix whose \(m\)-th column is \(h_m(v)\) evaluated at \(v = 1, 2, \dots, 256\).

\[ X_{n \times p} \color{blue}{\beta}_{p \times 1} = X_{n \times p} \color{blue}{H}_{p \times M} \color{blue}{\theta}_{M \times 1} = B_{n \times M} \color{blue}{\theta}_{M \times 1}. \] So we can simply fit a logistic regression model with design matrix \(B = X H\).

Here we used \(M = 12\) basis functions, with knots uniformly placed over the integers 1,2,…,256 representing the frequencies.
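
To see what these basis functions look like, one can plot the 12 natural cubic spline basis functions \(h_m(v)\) over the frequency grid (a quick visualization, separate from the fit below):

library(splines)
matplot(1:256, ns(1:256, df = 12), type = "l", lty = 1,
        xlab = "Frequency", ylab = "Natural cubic spline basis")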

library(splines)
H = ns(1:256, df = 12)            # 256 x 12 natural cubic spline basis matrix
B = X %*% H                       # reduced design matrix B = XH
logfit2 = glm(Y ~ B, family="binomial")
coef2 = H %*% logfit2$coef[-1]    # map the 12 fitted theta's back to 256 smooth beta's
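
As a quick sanity check (assuming the data were loaded as above), \(H\) should be \(256 \times 12\), \(B\) should have one row per speech frame, and since ns() places its interior knots at quantiles of 1:256, they are indeed roughly evenly spaced:

dim(H)                      # 256 x 12
dim(B)                      # one row per frame, 12 columns
attr(H, "knots")            # interior knots, roughly evenly spaced over 1..256
attr(H, "Boundary.knots")   # boundary knots at 1 and 256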

plot(1:256, coef1, type="n", 
      xlab="Frequency", ylab="Logistic Regression Coefficients")
lines(1:256, coef1, col="gray")
lines(1:256, coef2, col="red")
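
To make this plot easier to read, a legend can be added distinguishing the raw coefficients (gray) from the spline-smoothed coefficients (red):

legend("topright", legend = c("raw coefficients", "smoothed coefficients"),
       lty = 1, col = c("gray", "red"))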

Compare Classification Performance

Next we conduct a simulation study to evaluate the classification accuracy of the two models:

  • M1: A logistic regression model with 256 regression coefficients, one per frequency.
  • M2: A logistic regression model in which the 256 coefficients are modeled by a natural cubic spline (NCS) basis with 12 degrees of freedom.

T = 5
n = dim(X)[1]
Err1 = matrix(0, T, 2)  # col 1: training; col 2: test
Err2 = matrix(0, T, 2)  # col 1: training; col 2: test
set.seed(123)
for(t in 1:T){
  # hold out a random 20% of the frames as the test set
  testid = sample(1:n, round(n*0.2))
  
  # M1: logistic regression on all 256 frequencies
  logfit1 = glm(Y[-testid] ~ X[-testid, ], family="binomial")
  yhat1 = as.numeric(logfit1$fitted.values>0.5)
  Err1[t, 1] = mean(yhat1 != Y[-testid])
  yhat1 = X[testid, ] %*% logfit1$coef[-1] + logfit1$coef[1]
  yhat1 = as.numeric(yhat1>0.5)
  Err1[t, 2] = mean(yhat1 != Y[testid])
  
  # M2: logistic regression on the 12-column spline design matrix B = XH
  logfit2 = glm(Y[-testid] ~ B[-testid, ], family="binomial")
  yhat2 = as.numeric(logfit2$fitted.values>0.5)
  Err2[t, 1] = mean(yhat2 != Y[-testid])
  yhat2 = B[testid, ] %*% logfit2$coef[-1] + logfit2$coef[1]
  yhat2 = as.numeric(yhat2>0.5)
  Err2[t, 2] = mean(yhat2 != Y[testid])
}

round(cbind(Err1, Err2), dig=3)
##       [,1]  [,2]  [,3]  [,4]
## [1,] 0.100 0.262 0.158 0.242
## [2,] 0.112 0.192 0.172 0.195
## [3,] 0.095 0.236 0.170 0.210
## [4,] 0.103 0.251 0.175 0.195
## [5,] 0.100 0.222 0.168 0.210
In the table above, the first two columns are the training and test errors for M1, and the last two columns are the training and test errors for M2.

  • M1 attains lower training errors, mainly because its 256 free coefficients give it far more capacity to fit the training data.
  • However, M2 generally achieves lower test errors than M1, indicating that the smoother, more constrained model generalizes better to unseen data; the averages over the five splits can be computed as sketched below.
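
A minimal sketch for summarizing the simulation, averaging the training and test error rates of each model over the five splits:

# average error over the T = 5 splits; column 1 = training, column 2 = test
rbind(M1 = colMeans(Err1), M2 = colMeans(Err2))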