library(MASS)
#help(lda)
#help(qda)

MNIST Data

We illustrate Discriminant Analysis using a digits dataset I created: 40 images from each of the 10 classes, for a total of 400 training samples and 400 test samples. Each data point is a 16x16 image, so the feature vector has length 256.

githubURL = "https://liangfgithub.github.io/Data/digits.rdata"
load(url(githubURL))
ls()
dim(X)
dim(Xtest)
table(Y)
table(Ytest)
par(mfrow=c(2,5), mai = c(0.2, 0.2, 0.2, 0.2))
for(i in 0:9){
  # display the first training image of each digit class
  x = matrix(X[40*i+1,], 16, 16)
  image(x[,16:1], axes=FALSE)
}

LDA

First, we run Linear Discriminant Analysis (LDA) and form the confusion table on the test data. The test accuracy is approximately 75%.

dig.lda = lda(X,Y)
Ytest.pred = predict(dig.lda, Xtest)$class
table(Ytest, Ytest.pred)
1 - sum(Ytest != Ytest.pred) / length(Ytest)  # test accuracy
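For comparison, you can also compute the accuracy on the training data; given the overfitting discussed below, it is expected to be noticeably higher than the test accuracy. A minimal sketch:

Ytrain.pred = predict(dig.lda, X)$class   # predictions on the training data
mean(Ytrain.pred == Y)                    # training accuracy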

QDA

Attempting Quadratic Discriminant Analysis (QDA) produces an error because the covariance matrix estimated for each group is rank deficient: each class has only 40 training samples but 256 features. The qda function in the MASS package cannot handle this situation.

dig.qda = qda(X,Y) # error message
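One possible workaround, not pursued here, is to first reduce the dimension (e.g., with PCA) so that each group's covariance estimate is no longer rank deficient. A rough sketch follows; the names pc30 and dig.qda.pc are just for illustration, and the choice of 30 components is arbitrary (it must stay below the per-class sample size of 40):

pc30 = eigen(cov(X))$vectors[, 1:30]   # top 30 PCA directions of the training data
dig.qda.pc = qda(X %*% pc30, Y)        # QDA on the 30-dimensional PC scores
mean(predict(dig.qda.pc, Xtest %*% pc30)$class == Ytest)  # test accuracy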

FDA

Next, we move on to Fisher’s Discriminant Analysis (FDA). We project the data onto the FDA directions returned by the lda function; with 10 classes there are at most 10 - 1 = 9 such directions.

FDA.dir = dig.lda$scaling
dim(FDA.dir)  # at most 10-1 = 9 directions
## [1] 256   9
F = X%*%FDA.dir
par(mfrow=c(1,2))
plot(F[,1],F[,2], type="n", xlab="", ylab="")
text(F[,1], F[,2], Y, col=Y+1)

Ftest = Xtest%*%dig.lda$scaling;
plot(Ftest[,1], Ftest[,2], type="n", xlab="", ylab="")
text(Ftest[,1], Ftest[,2], Ytest, col=Ytest+1)

For illustration, we display the projection of the data onto the first two FDA directions. On the left, the training data are well separated into the ten classes by these two directions. On the right, however, the test data do not show the same separation. This discrepancy arises from overfitting to the training data.

Please be cautious: demonstrating effective class separation on the training data (much like reporting a low training error) does not guarantee good performance; it is essential to evaluate class separation on the test data.

You might wonder if considering different combinations of the nine directions would yield better results. I encourage you to experiment with the code, but you’ll find that the overfitting issue persists. While the training data separates effectively, the test data does not replicate the same pattern.

xid = c(1,1,1,2); yid = c(2,3,4,3);
par(mfrow=c(2,2))
for(i in 1:4){
  plot(Ftest[,xid[i]], Ftest[,yid[i]], type="n", xlab="", ylab="");
  text(Ftest[,xid[i]], Ftest[,yid[i]], Ytest, col=Ytest+1);
}
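For comparison, the sketch below plots the same four pairs of directions for the training data, where the classes still appear well separated:

par(mfrow=c(2,2))
for(i in 1:4){
  plot(F[,xid[i]], F[,yid[i]], type="n", xlab="", ylab="")
  text(F[,xid[i]], F[,yid[i]], Y, col=Y+1)
}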

Computation Details

The code below shows how to compute the FDA directions step by step.

  • Compute W (within-group covariance matrix)
tmp.lm = lm(X ~ as.factor(Y) - 1); # residuals = each x_i minus its group mean
W = cov(tmp.lm$res)
eigenW = eigen(W)
summary(eigenW)
##         Length Class  Mode   
## values    256  -none- numeric
## vectors 65536  -none- numeric
A = eigenW$ve %*% diag(1/sqrt(eigenW$va)) %*% t(eigenW$ve)  # A = W^{-1/2}
  • Compute B (between-group covariance matrix)
B = cov(tmp.lm$fitted)
  • Extract the FDA directions. Since A = W^{-1/2}, pre-multiplying the leading eigenvectors of A B A by A yields the eigenvectors of W^{-1} B, i.e., the FDA directions. You’ll find that only the first 9 eigenvalues are non-zero.
tmp.eigen = eigen(A %*% B %*% t(A)) 
#round(tmp.eigen$va, dig=5)  
tmp = A %*% tmp.eigen$ve[, 1:9]

The nine directions we computed, stored in tmp, are essentially the same as those returned by lda, up to a scale (and possibly sign) change; each correlation below should therefore be close to ±1.

for(j in 1:9)
  print(cor(tmp[, j], FDA.dir[, j]))
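As a further check (a minimal sketch of the claim above), rescale each column to unit length and align the signs; the two sets of directions should then agree up to numerical error:

for(j in 1:9){
  a = tmp[, j] / sqrt(sum(tmp[, j]^2))          # our direction, unit length
  b = FDA.dir[, j] / sqrt(sum(FDA.dir[, j]^2))  # lda direction, unit length
  print(max(abs(a - sign(sum(a * b)) * b)))     # should be close to zero
}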

FDA vs PCA

First, obtain the top 100 principal component (PC) directions.

Sigma = cov(X)
dig.eigen=eigen(Sigma);
PCA.dir = dig.eigen$ve[, 1:100]  # pick top 100 PCA directions
dim(PCA.dir)
## [1] 256 100

Note that directions from PCA are orthogonal, but directions from FDA are not; instead, FDA directions are orthogonal with respect to the within-group covariance matrix W, as the checks below confirm.

round(t(FDA.dir) %*% FDA.dir, dig=5)
round(t(PCA.dir[, 1:10]) %*% PCA.dir[, 1:10], dig=5)  
round(t(FDA.dir) %*% W %*% FDA.dir, dig=5)

FDA after PCA

For this digits dataset, applying Principal Component Analysis (PCA) to reduce the dimensionality from the original 256 pixel values to 100, and then running Linear Discriminant Analysis (LDA) on the 100 PC scores, improves the classification accuracy: the number of misclassified test images drops from over 100 to 51.

newX = X %*% PCA.dir[, 1:100];
newdig.lda = lda(newX, Y);
newXtest = Xtest %*% PCA.dir[, 1:100]
Ytest.pred = predict(newdig.lda, newXtest)$class
table(Ytest, Ytest.pred)
##      Ytest.pred
## Ytest  0  1  2  3  4  5  6  7  8  9
##     0 39  0  0  1  0  0  0  0  0  0
##     1  0 40  0  0  0  0  0  0  0  0
##     2  1  0 33  2  1  2  0  0  1  0
##     3  0  0  0 35  0  1  0  3  1  0
##     4  0  3  2  0 31  0  1  0  0  3
##     5  2  0  2  4  0 29  0  0  2  1
##     6  0  0  2  0  1  0 35  0  2  0
##     7  0  0  0  0  3  0  0 35  0  2
##     8  0  0  0  2  1  2  0  0 34  1
##     9  0  0  0  1  0  0  0  1  0 38
sum(Ytest != Ytest.pred)
## [1] 51
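Equivalently, the test accuracy rises to about 87% (1 - 51/400), compared with roughly 75% for LDA on the raw 256 pixels:

1 - sum(Ytest != Ytest.pred) / length(Ytest)  # test accuracy after PCA + LDA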

However, the separation of the training data in these new projections is not as pronounced as before. This is expected, given the substantial dimension reduction from 256 to 100. Although the training data separate less cleanly, the projections are now more consistent between the training and test sets, which helps mitigate the overfitting issue.

# project the training and test data onto the new FDA directions (computed in the 100-dimensional PCA space)
newF = newX %*% newdig.lda$scaling
newFtest = Xtest %*% dig.eigen$ve[, 1:100] %*% newdig.lda$scaling

par(mfrow=c(1,2))
plot(newF[,1],newF[,2], type="n", xlab="", ylab="")
text(newF[,1], newF[,2], Y, col= Y+1)
plot(newFtest[,1], newFtest[,2], type="n", xlab="", ylab="")
text(newFtest[,1], newFtest[,2], Ytest, col=Ytest+1)