(From ESL 5.2.3 Example: Phoneme Recognition)
In this example we use splines to reduce flexibility rather than increase it; the application comes under the general heading of functional modeling.
The Phoneme data were extracted from the TIMIT database (TIMIT Acoustic-Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce), a widely used resource for research in speech recognition. A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows: "sh" as in "she", "dcl" as in "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and "ao" as the first vowel in "water".
From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected. Each frame is represented by 512 samples at a 16 kHz sampling rate and corresponds to one of the five phonemes above.
From each speech frame, a log-periodogram is computed, which is one of several widely used methods for casting speech data in a form suitable for speech recognition. Thus the data used in what follows consist of 4509 log-periodograms of length 256, with known class (phoneme) memberships.
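The periodograms in the dataset were computed upstream, but the transform itself is simple: take the squared magnitude of the discrete Fourier transform of a frame and move to the log scale. A minimal sketch under these assumptions (the simulated `frame` is a stand-in, not actual TIMIT data):

```python
import numpy as np

# Hypothetical illustration: a 512-sample frame (32 ms at 16 kHz) yields
# 256 usable frequency bins, matching the 256 features in the dataset.
frame = np.random.randn(512)                     # stand-in for a real speech frame
spectrum = np.fft.fft(frame)[:256]               # first half of the DFT
log_periodogram = np.log(np.abs(spectrum) ** 2)  # log of the squared magnitude
```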
```python
import pandas as pd
import numpy as np

# Load the phoneme data and keep only the two classes "aa" and "ao"
url = "https://hastie.su.domains/ElemStatLearn/datasets/phoneme.data"
df = pd.read_csv(url, index_col=0)
mydata = df.loc[(df['g'] == "aa") | (df['g'] == "ao")]

# Peek at the first row: a few leading and trailing frequencies, plus class and speaker
mydata.iloc[0, np.concatenate([np.arange(5), np.arange(249, 258)])]
```
```
x.1                   12.96705
x.2                   13.69454
x.3                   14.91182
x.4                   18.22292
x.5                    18.4539
x.250                  7.58624
x.251                  6.65202
x.252                  7.69109
x.253                  6.93683
x.254                    7.036
x.255                  7.01278
x.256                  8.52197
g                           aa
speaker    train.dr1.mcpm0.sa1
Name: 5, dtype: object
```
```python
X = mydata.drop(['g', 'speaker'], axis=1).astype(float)
Y = mydata['g'].copy()
np.stack(np.unique(Y, return_counts=True))
```
```
array([['aa', 'ao'],
       [695, 1022]], dtype=object)
```
```python
# Recode the labels: "ao" -> 1, "aa" -> 0
Y[Y == 'ao'] = 1
Y[Y == 'aa'] = 0
Y = Y.astype(float)
np.stack(np.unique(Y, return_counts=True))
```
```
array([[0.000e+00, 1.000e+00],
       [6.950e+02, 1.022e+03]])
```
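As a rough baseline (an aside, not part of the original example), always predicting the majority class "ao" would misclassify the 695 "aa" frames:

```python
# Baseline error rate if we always predict the majority class "ao"
baseline_err = 695 / (695 + 1022)
print(round(baseline_err, 3))   # 0.405
```

Both models fit below should beat this handily.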
```python
import matplotlib.pyplot as plt

# Plot 15 randomly chosen log-periodograms from each class
id0 = np.random.choice(np.where(Y == 0)[0], size=15, replace=False)
id1 = np.random.choice(np.where(Y == 1)[0], size=15, replace=False)

plt.figure(figsize=(8, 6))
plt.xlabel("Frequency")
plt.ylabel("Log-periodogram")
plt.plot([], color='#00FF00', linewidth=2, label='aa')   # empty line, legend entry only
for i in id0:
    plt.plot(X.iloc[i], linewidth=0.5, color='#00FF00')
plt.plot([], color='#FF8C00', linewidth=2, label='ao')   # empty line, legend entry only
for i in id1:
    plt.plot(X.iloc[i], linewidth=0.5, color='#FF8C00')
plt.legend(loc="upper right")
custom_ticks = [0, 50, 100, 150, 200, 250]
plt.xticks(custom_ticks, custom_ticks)
plt.show()
```
```python
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant

# Regular fit: logistic regression on all 256 raw frequencies
logfit1 = sm.GLM(Y, add_constant(X), family=sm.families.Binomial()).fit()
coef1 = logfit1.params[1:]   # drop the intercept: one coefficient per frequency
```
Applications such as this permit a natural regularization. The 256 measurements for each sample are not measurements collected from 256 independent predictors; they are observations (at discretized frequencies) of a continuous log-periodogram function. It is therefore natural to expect the 256 coefficients $\beta_j$ to vary smoothly with frequency as well, so we model the coefficient function $\beta(v)$ with natural cubic splines, where the $h_m(v)$ are natural-cubic-spline basis functions defined on the set of frequencies:
$$ \beta(v) = \sum_{m=1}^M h_m(v) \theta_m, \quad v = 1, 2, \dots, 256. $$

In practice, this means $\beta_{p \times 1} = H \theta_{M \times 1}$, where $p = 256$ and $H$ is a $p \times M$ matrix whose $m$-th column is $h_m(v)$ evaluated at $v = 1, 2, \dots, 256$.
$$ X_{n \times p} \color{blue}{\beta}_{p \times 1} = X_{n \times p} \color{blue}{H}_{p \times M} \color{blue}{\theta}_{M \times 1} = B_{n \times M} \color{blue}{\theta}_{M \times 1}. $$

So we can simply fit a logistic regression model with design matrix $B = XH$.
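Equivalently, at the level of a single input $x \in \mathbb{R}^{256}$, the smoothness assumption simply reparametrizes the log-odds:

$$ \log \frac{P(\text{ao} \mid x)}{P(\text{aa} \mid x)} = x^\top \beta = x^\top H \theta = (H^\top x)^\top \theta, $$

so each raw input is filtered down to the $M$-dimensional vector $H^\top x$ before the logistic regression is fit.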
Here we used $M = 12$ basis functions, with knots uniformly placed over the integers $1, 2, \dots, 256$ representing the frequencies. The helper below constructs this basis by mimicking R's `splines::ns()`.
```python
from scipy.interpolate import splev

def ns(x, df=None, knots=None, boundary_knots=None, include_intercept=False):
    """Natural cubic spline basis, mimicking R's splines::ns()."""
    degree = 3
    if boundary_knots is None:
        boundary_knots = [np.min(x), np.max(x)]
    else:
        boundary_knots = np.sort(boundary_knots).tolist()

    oleft = x < boundary_knots[0]
    oright = x > boundary_knots[1]
    outside = oleft | oright
    inside = ~outside

    if df is not None:
        # df - 1 - intercept interior knots, placed at quantiles of x
        nIknots = df - 1 - include_intercept
        if nIknots < 0:
            nIknots = 0
        if nIknots > 0:
            knots = np.linspace(0, 1, num=nIknots + 2)[1:-1]
            knots = np.quantile(x[~outside], knots)

    if knots is None:
        knots = []
    # boundary knots repeated 4 times (order = degree + 1)
    Aknots = np.sort(np.concatenate((boundary_knots * 4, knots)))
    n_bases = len(Aknots) - (degree + 1)

    if any(outside):
        basis = np.empty((x.shape[0], n_bases), dtype=float)
        e = 1 / 4  # in theory anything in (0, 1); was (implicitly) 0 in R <= 3.2.2 (unused in this port)
        if any(oleft):
            # linear extrapolation to the left of the boundary via a first-order
            # Taylor expansion at the boundary knot
            k_pivot = boundary_knots[0]
            xl = x[oleft] - k_pivot
            xl = np.c_[np.ones(xl.shape[0]), xl]
            # equivalent to R's splineDesign(Aknots, rep(k.pivot, ord), ord, derivs)
            tt = np.empty((xl.shape[1], n_bases), dtype=float)
            for j in range(xl.shape[1]):
                for i in range(n_bases):
                    coefs = np.zeros((n_bases,))
                    coefs[i] = 1
                    tt[j, i] = splev(k_pivot, (Aknots, coefs, degree), der=j)
            basis[oleft, :] = xl @ tt
        if any(oright):
            # linear extrapolation to the right of the boundary
            k_pivot = boundary_knots[1]
            xr = x[oright] - k_pivot
            xr = np.c_[np.ones(xr.shape[0]), xr]
            tt = np.empty((xr.shape[1], n_bases), dtype=float)
            for j in range(xr.shape[1]):
                for i in range(n_bases):
                    coefs = np.zeros((n_bases,))
                    coefs[i] = 1
                    tt[j, i] = splev(k_pivot, (Aknots, coefs, degree), der=j)
            basis[oright, :] = xr @ tt
        if any(inside):
            xi = x[inside]
            tt = np.empty((len(xi), n_bases), dtype=float)
            for i in range(n_bases):
                coefs = np.zeros((n_bases,))
                coefs[i] = 1
                tt[:, i] = splev(xi, (Aknots, coefs, degree))
            basis[inside, :] = tt
    else:
        basis = np.empty((x.shape[0], n_bases), dtype=float)
        for i in range(n_bases):
            coefs = np.zeros((n_bases,))
            coefs[i] = 1
            basis[:, i] = splev(x, (Aknots, coefs, degree))

    # second derivatives of each B-spline at the boundary knots; the
    # natural-spline constraint requires these to vanish
    const = np.empty((2, n_bases), dtype=float)
    for i in range(n_bases):
        coefs = np.zeros((n_bases,))
        coefs[i] = 1
        const[:, i] = splev(boundary_knots, (Aknots, coefs, degree), der=2)

    if not include_intercept:
        basis = basis[:, 1:]
        const = const[:, 1:]

    # project the B-spline basis onto the null space of the boundary constraints
    qr_const = np.linalg.qr(const.T, mode='complete')[0]
    basis = (qr_const.T @ basis.T).T[:, 2:]
    return basis
```
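As a quick sanity check (an addition here, not part of the original example): the returned basis should have `df` columns, and beyond the boundary knots a natural cubic spline is linear, so second differences of extrapolated values should vanish:

```python
# Evaluate the basis on 1..260 with boundary knots at [1, 256]; the last
# four points are extrapolated and must lie on a straight line per column.
Hchk = ns(np.arange(1, 261), df=12, boundary_knots=[1, 256])
print(Hchk.shape)                                      # (260, 12)
print(np.abs(np.diff(Hchk[255:], n=2, axis=0)).max())  # ~0 up to float error
```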
```python
# 256 x 12 spline basis H, and the reduced n x 12 design matrix B = XH
H = ns(np.arange(1, 257), df=12)
B = X @ H

# Fit on the 12 spline coordinates, then map theta back to the 256
# frequency coefficients: beta = H theta
logfit2 = sm.GLM(Y, add_constant(B), family=sm.families.Binomial()).fit()
coef2 = H @ logfit2.params.to_numpy()[1:]
```
```python
import matplotlib as mpl
import seaborn as sns

mpl.rcParams['figure.dpi'] = 250
sns.set()

# Overlay the raw and spline-smoothed coefficient curves
plt.plot(np.arange(1, 257), coef1, color='gray', label='Regular')
plt.plot(np.arange(1, 257), coef2, color='red', label='Spline Smoothed')
plt.legend()
plt.xlabel("Frequency")
plt.ylabel("Logistic Regression Coefficients")
```
Next we conduct a simulation study to evaluate the classification accuracy of the two models:
```python
T = 5                     # number of random train/test splits
n = len(X)
Err1 = np.zeros((T, 2))   # regular model: (train error, test error) per split
Err2 = np.zeros((T, 2))   # spline model: (train error, test error) per split
np.random.seed(0)
for t in range(T):
    # hold out 20% of the samples as a test set
    mask = np.isin(np.arange(n), np.random.choice(n, int(n * 0.2), replace=False))

    # regular logistic regression on all 256 frequencies
    logfit1 = sm.GLM(Y[~mask], add_constant(X)[~mask], family=sm.families.Binomial()).fit()
    yhat1 = (logfit1.fittedvalues > 0.5).astype(float)
    Err1[t, 0] = (yhat1 != Y[~mask]).mean()
    yhat1 = (logfit1.predict(add_constant(X)[mask]) > 0.5).astype(float)
    Err1[t, 1] = (yhat1 != Y[mask]).mean()

    # spline-smoothed logistic regression on the 12 basis coordinates
    logfit2 = sm.GLM(Y[~mask], add_constant(B)[~mask], family=sm.families.Binomial()).fit()
    yhat2 = (logfit2.fittedvalues > 0.5).astype(float)
    Err2[t, 0] = (yhat2 != Y[~mask]).mean()
    yhat2 = (logfit2.predict(add_constant(B)[mask]) > 0.5).astype(float)
    Err2[t, 1] = (yhat2 != Y[mask]).mean()

pd.DataFrame(np.hstack([Err1, Err2]),
             columns=['Regular Train Error', 'Regular Test Error',
                      'Spline Train Error', 'Spline Test Error'])
```
|   | Regular Train Error | Regular Test Error | Spline Train Error | Spline Test Error |
|---|---|---|---|---|
| 0 | 0.101068 | 0.256410 | 0.168683 | 0.211538 |
| 1 | 0.110163 | 0.254839 | 0.180526 | 0.174194 |
| 2 | 0.120397 | 0.249180 | 0.178470 | 0.186885 |
| 3 | 0.114731 | 0.186885 | 0.177054 | 0.177049 |
| 4 | 0.100000 | 0.250814 | 0.176596 | 0.182410 |
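As the table shows, the regular model attains lower training error (about 0.10–0.12 versus 0.17–0.18) but generally higher test error (about 0.19–0.26 versus 0.17–0.21). Restricting $\beta(v)$ to a smooth 12-parameter family trades a little bias for a substantial reduction in variance, which is exactly the regularization this example set out to demonstrate.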