library(MASS)
#help(lda)
#help(qda)
We illustrate discriminant analysis using the digits dataset. I've created my own dataset with 40 samples from each class, for a total of 400 training samples and 400 test samples. Each data point represents a 16x16 image, resulting in a feature vector of length 256. Specifically, there are 40 images for each of the 10 classes in both the training and test sets.
= "https://liangfgithub.github.io/Data/digits.rdata"
githubURL load(url(githubURL))
ls()
dim(X)
dim(Xtest)
table(Y)
table(Ytest)
par(mfrow=c(2,5), mai = c(0.2, 0.2, 0.2, 0.2))
for(i in 0:9){
  x = matrix(X[40*i+1,], 16, 16)
  image(x[,16:1], axes=FALSE)
}
First, we run Linear Discriminant Analysis (LDA) and generate a prediction table on the test data. The test accuracy achieved is approximately 75%.
dig.lda = lda(X, Y)
Ytest.pred = predict(dig.lda, Xtest)$class
table(Ytest, Ytest.pred)
1 - sum(Ytest != Ytest.pred) / length(Ytest)
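For comparison (the lines below are my addition, not part of the original code), you can also compute the training accuracy; with 256 features and only 400 training images, expect it to be much higher than the roughly 75% test accuracy, a first hint of the overfitting discussed below.
Ytrain.pred = predict(dig.lda, X)$class   # predictions on the training data (Ytrain.pred is my name)
1 - sum(Y != Ytrain.pred) / length(Y)     # training accuracy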
Attempting Quadratic Discriminant Analysis (QDA) produces an error message because the covariance (sigma) matrix for each group is rank deficient, and the qda function in the MASS package cannot handle this case.
dig.qda = qda(X, Y)  # error message
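To see why qda fails, note that each class has only 40 training images but 256 features, so each per-class covariance matrix has rank at most 39 and is therefore singular. A quick check (my own sketch, not from the original code):
for(k in 0:9){
  Sk = cov(X[Y == k, ])      # 256 x 256 covariance estimated from only 40 images
  print(c(k, qr(Sk)$rank))   # rank is far below 256, so Sk cannot be inverted
}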
Next, we move on to Fisher's Discriminant Analysis (FDA). We project the data onto the FDA directions returned by the lda function; with 10 classes, there are at most nine (10 - 1 = 9) such directions.
FDA.dir = dig.lda$scaling
dim(FDA.dir)  # at most 10-1 = 9 directions
## [1] 256 9
F = X %*% FDA.dir
par(mfrow=c(1,2))
plot(F[,1],F[,2], type="n", xlab="", ylab="")
text(F[,1], F[,2], Y, col=Y+1)
Ftest = Xtest %*% dig.lda$scaling
plot(Ftest[,1], Ftest[,2], type="n", xlab="", ylab="")
text(Ftest[,1], Ftest[,2], Ytest, col=Ytest+1)
For illustration purposes, we display the projection of the data onto the first two FDA directions. On the left, the training data is well separated into the ten classes by these two directions. However, the test data on the right does not show the same separation. This discrepancy arises from overfitting to the training data.
Please be cautious: demonstrating effective class separation on the training data (similar to showing a low training error) does not guarantee model performance; it’s essential to evaluate class separation on the test data to assess its practical applicability.
You might wonder if considering different combinations of the nine directions would yield better results. I encourage you to experiment with the code, but you’ll find that the overfitting issue persists. While the training data separates effectively, the test data does not replicate the same pattern.
xid = c(1,1,1,2); yid = c(2,3,4,3);
par(mfrow=c(2,2))
for(i in 1:4){
plot(Ftest[,xid[i]], Ftest[,yid[i]], type="n", xlab="", ylab="");
text(Ftest[,xid[i]], Ftest[,yid[i]], Ytest, col=Ytest+1);
}
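For contrast, you can plot the training projections onto the same pairs of directions (a small addition of mine); the training data separates well on every pair even though the test data above does not.
par(mfrow=c(2,2))
for(i in 1:4){
  plot(F[,xid[i]], F[,yid[i]], type="n", xlab="", ylab="")
  text(F[,xid[i]], F[,yid[i]], Y, col=Y+1)
}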
The code below details how to compute the FDA directions ourselves.
tmp.lm = lm(X ~ as.factor(Y) - 1)  # residuals = each x_i minus its group mean
W = cov(tmp.lm$res)                # within-class covariance
eigenW = eigen(W)
summary(eigenW)
## Length Class Mode
## values 256 -none- numeric
## vectors 65536 -none- numeric
A = eigenW$ve %*% diag(1/sqrt(eigenW$va)) %*% t(eigenW$ve)  # A = W^(-1/2)
B = cov(tmp.lm$fitted)                                      # between-class covariance
tmp.eigen = eigen(A %*% B %*% t(A))
#round(tmp.eigen$va, dig=5)
tmp = A %*% tmp.eigen$ve[, 1:9]
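As a reminder of what this computation implements (the notation below is mine): FDA looks for directions a that maximize the between-class variance relative to the within-class variance,

$$
\max_a \; \frac{a^\top B \, a}{a^\top W \, a}.
$$

Substituting b = W^{1/2} a turns this into maximizing b^T W^{-1/2} B W^{-1/2} b over unit vectors b, so the optimal b's are the leading eigenvectors of W^{-1/2} B W^{-1/2}, and the FDA directions are a = W^{-1/2} b. This is exactly tmp = A %*% tmp.eigen$ve[, 1:9] in the code, with A playing the role of W^{-1/2}.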
The 9 directions computed by us, stored in tmp, are essentially the same as the ones returned by lda, up to some scale change.
for(j in 1:9)
print(cor(tmp[, j], FDA.dir[, j]))
Next, we obtain the top 100 principal components (PCs).
Sigma = cov(X)
dig.eigen = eigen(Sigma)
PCA.dir = dig.eigen$ve[, 1:100]  # pick top 100 PCA directions
dim(PCA.dir)
## [1] 256 100
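As a quick sanity check (not part of the original code), you can see how much of the total variance the top 100 PCs retain:
sum(dig.eigen$va[1:100]) / sum(dig.eigen$va)  # proportion of total variance explained by the top 100 PCs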
Note that directions from PCA are orthogonal, but directions from FDA are not. FDA directions are orthogonal with respect to the within-class covariance matrix W.
round(t(FDA.dir) %*% FDA.dir, dig=5)
round(t(PCA.dir[, 1:10]) %*% PCA.dir[, 1:10], dig=5)
round(t(FDA.dir) %*% W %*% FDA.dir, dig=5)
For this digits dataset, applying Principal Component Analysis (PCA) to reduce the dimensionality from the original 256 pixel values to 100, and then running Linear Discriminant Analysis (LDA) on those 100 PCA directions, leads to improved classification accuracy. Specifically, the number of misclassified test images decreases from over 100 to 51.
newX = X %*% PCA.dir[, 1:100]
newdig.lda = lda(newX, Y)
newXtest = Xtest %*% PCA.dir[, 1:100]
Ytest.pred = predict(newdig.lda, newXtest)$class
table(Ytest, Ytest.pred)
## Ytest.pred
## Ytest 0 1 2 3 4 5 6 7 8 9
## 0 39 0 0 1 0 0 0 0 0 0
## 1 0 40 0 0 0 0 0 0 0 0
## 2 1 0 33 2 1 2 0 0 1 0
## 3 0 0 0 35 0 1 0 3 1 0
## 4 0 3 2 0 31 0 1 0 0 3
## 5 2 0 2 4 0 29 0 0 2 1
## 6 0 0 2 0 1 0 35 0 2 0
## 7 0 0 0 0 3 0 0 35 0 2
## 8 0 0 0 2 1 2 0 0 34 1
## 9 0 0 0 1 0 0 0 1 0 38
sum(Ytest != Ytest.pred)
## [1] 51
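For a direct comparison with the earlier ~75% LDA accuracy, the corresponding test accuracy (this line is my addition) is 1 - 51/400, i.e., about 87%:
1 - sum(Ytest != Ytest.pred) / length(Ytest)   # roughly 0.87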
However, it's worth noting that the class separation on the training data is no longer as pronounced as before, which is expected given the substantial dimension reduction from 256 to 100. In exchange, this approach produces more consistent results between the training and test sets, which helps mitigate the overfitting issue.
newF = newX %*% newdig.lda$scaling
newFtest = Xtest %*% dig.eigen$ve[, 1:100] %*% newdig.lda$scaling
par(mfrow=c(1,2))
plot(newF[,1],newF[,2], type="n", xlab="", ylab="")
text(newF[,1], newF[,2], Y, col= Y+1)
plot(newFtest[,1], newFtest[,2], type="n", xlab="", ylab="")
text(newFtest[,1], newFtest[,2], Ytest, col=Ytest+1)
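To back up the consistency claim numerically (a small sketch of mine, not in the original code), you can compare the training and test misclassification counts for the PCA-then-LDA model:
Ytrain.pred = predict(newdig.lda, newX)$class  # predictions on the PCA-reduced training data (Ytrain.pred is my name)
sum(Y != Ytrain.pred)                          # training misclassifications
sum(Ytest != Ytest.pred)                       # test misclassifications (51, as shown above)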