Multidimensional Scaling

To elucidate the concept of multidimensional scaling, let’s consider a dataset from a study that aimed to determine which pairs of letters people frequently confuse. The table below is from Wolford and Hollingsworth (1974). The set comprises eight letters: C, D, G, H, M, N, Q, and W. This results in an 8×8 matrix. Each cell in this matrix indicates the frequency with which people confuse the corresponding row and column letters.

# build the 8x8 confusion matrix: fill the lower triangle, then symmetrize
D = matrix(0, 8, 8)
letters = c("C", "D", "G", "H", "M", "N", "Q", "W")
colnames(D) = letters
rownames(D) = letters
D[2:8, 1] = c(5, 12, 2, 2, 2, 9, 1)
D[3:8, 2] = c(2, 4, 3, 4, 20, 5)
D[4:8, 3] = c(3, 2, 1, 9, 2)
D[5:8, 4] = c(19, 18, 1, 5)
D[6:8, 5] = c(16, 2, 18)
D[7:8, 6] = c(8, 13)
D[8, 7] = 4
D = D + t(D)  # make the matrix symmetric
D
##    C  D  G  H  M  N  Q  W
## C  0  5 12  2  2  2  9  1
## D  5  0  2  4  3  4 20  5
## G 12  2  0  3  2  1  9  2
## H  2  4  3  0 19 18  1  5
## M  2  3  2 19  0 16  2 18
## N  2  4  1 18 16  0  8 13
## Q  9 20  9  1  2  8  0  4
## W  1  5  2  5 18 13  4  0

The matrix \(\mathbf{D}\) is symmetric and represents similarity, not distance. A higher value signifies that the two letters (row and column) are easily confused, indicating their similarity.

First, we convert the similarity matrix \(\mathbf{D}\) into a distance matrix \(\mathbf{D}_0\) via the transformation \(21-\mathbf{D}\). This reverses the ordering, so that frequently confused (similar) letter pairs receive small distances, and because the largest entry of \(\mathbf{D}\) is 20, all off-diagonal entries of \(\mathbf{D}_0\) stay positive. We also set the diagonal of \(\mathbf{D}_0\) to zero, since the distance from a letter to itself should be zero.

D0 = 21 - D
diag(D0) = 0

Applying multidimensional scaling to this distance matrix (by default, cmdscale embeds the data in a two-dimensional space) gives a visual representation of the eight letters. An alternative similarity-to-distance transformation, \(41-\mathbf{D}\), was also tried; the results were largely consistent, differing only slightly in scale.

tmp = cmdscale(D0)
par(mfrow=c(1,2))
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="", 
     xlim = c(-15, 15), ylim=c(-15, 15))
text(tmp[, 1], tmp[, 2], labels = letters)

D1 = 41 - D
diag(D1) = 0
tmp = cmdscale(D1)
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="", 
     xlim = c(-20, 20), ylim=c(-20, 20))
text(tmp[, 1], tmp[, 2], labels = letters)

From the plot, it becomes evident that the letters D and Q are easily confused with each other, as are C and G, while the remaining four letters H, M, N, and W appear to form another group of mutually similar letters.
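
As a quick numerical check (not part of the original analysis; it reuses the D0 and letters objects defined above, and the names d0, pairs, and ord are just illustrative), we can list the letter pairs with the smallest distances, i.e., the pairs confused most often:

d0 = as.dist(D0)                        # lower-triangle distances
pairs = t(combn(letters, 2))            # letter pairs, in the same order as as.dist()
ord = order(as.vector(d0))
head(data.frame(pair = paste0(pairs[, 1], "-", pairs[, 2]),
                dist = as.vector(d0))[ord, ])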

K-means

In our study, we apply clustering algorithms in R to the dataset “sportsranks”. The dataset has 130 rows and seven columns, one per sport; each row records one individual’s ranking of the seven sports, using the numbers 1 through 7 with no ties.

url = "https://raw.githubusercontent.com/liangfgithub/liangfgithub.github.io/master/Data/sportsranks.csv"
sportsranks = read.csv(url)
head(sportsranks)
##   Baseball Football Basketball Tennis Cycling Swimming Jogging
## 1        1        3          7      2       4        5       6
## 2        1        3          2      5       4        7       6
## 3        1        3          2      5       4        7       6
## 4        4        7          3      1       5        6       2
## 5        2        3          1      7       6        5       4
## 6        6        3          4      7       2        1       5

To analyze this data, we employed the K-means clustering method with K set to 2 and 3.

km2=kmeans(sportsranks, centers=2, nstart=10);
km3=kmeans(sportsranks, centers=3, nstart=10);

To visualize the results, we utilized a multidimensional scaling (MDS) representation, plotted in a two-dimensional space. This MDS was calculated using the cmdscale command in R.

D=dist(sportsranks)
Xmds=cmdscale(D);

par(mfrow=c(1,2));
plot(Xmds[,1], Xmds[,2], type="n", xlab="", ylab="")
points(Xmds[,1], Xmds[,2], pch=km2$cluster, col=km2$cluster);
title("MDS plot for K=2")
plot(Xmds[,1], Xmds[,2], type="n", xlab="", ylab="");
points(Xmds[,1], Xmds[,2], pch=km3$cluster, col=km3$cluster);
title("MDS plot for K=3")

Moreover, we computed the MDS manually: square the distances, apply double centering, and then take the singular value decomposition, keeping the first two coordinates as the two-dimensional embedding. A correlation check between our computed coordinates and those returned by cmdscale shows that they are identical up to a sign flip, which is expected since MDS coordinates are only determined up to reflection.
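
For reference, the standard classical MDS construction is as follows: let \(\mathbf{D}^{(2)}\) denote the matrix of squared distances and \(\mathbf{J} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^T\) the centering matrix; then form

\[
\mathbf{B} = -\frac{1}{2}\,\mathbf{J}\,\mathbf{D}^{(2)}\,\mathbf{J},
\]

and take the top two eigenvectors of \(\mathbf{B}\), scaled by the square roots of the corresponding eigenvalues, as the two-dimensional coordinates. This is exactly what the SVD-based code below computes.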

D2 = as.matrix(D^2)    # squared distances, as a matrix
n = dim(D2)[1]
# double centering: subtract row and column means, add back the grand mean, scale by -1/2
tmp = D2 - matrix(colMeans(D2), nrow=n, ncol=n, byrow=TRUE) - 
  matrix(rowMeans(D2), nrow=n, ncol=n, byrow=FALSE)
tmp = -(tmp + mean(D2))/2
tmp.svd = svd(tmp)
F = tmp.svd$u[, 1:2]  %*% diag(sqrt(tmp.svd$d[1:2]))  # two-dimensional MDS coordinates
cor(F, Xmds)
##               [,1]          [,2]
## [1,]  1.000000e+00 -6.602771e-17
## [2,] -1.697386e-16 -1.000000e+00

The plots indicate that setting K to 2 might be sufficient since using K = 3 began to display overlapping clusters. To verify our choice of K, we employed the gap statistic method.

Gap Statistics

In R, there’s a specific command clusGap in the cluster package for computing this statistic. Our analysis using this method indicated that the optimal K is indeed 2.

library(cluster)
set.seed(123)
gap_stat = clusGap(sportsranks, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sportsranks, FUNcluster = kmeans, K.max = 10, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 2
##           logW   E.logW       gap     SE.sim
##  [1,] 5.429866 5.647963 0.2180970 0.01818769
##  [2,] 5.229886 5.547803 0.3179171 0.01742745
##  [3,] 5.157134 5.474076 0.3169413 0.01756228
##  [4,] 5.099957 5.411790 0.3118326 0.01790591
##  [5,] 5.045671 5.358850 0.3131785 0.01753858
##  [6,] 4.997477 5.313087 0.3156102 0.01756705
##  [7,] 4.946452 5.271046 0.3245940 0.01776266
##  [8,] 4.909225 5.232372 0.3231469 0.01785827
##  [9,] 4.870644 5.197057 0.3264129 0.01841879
## [10,] 4.839642 5.163225 0.3235833 0.01786732
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 2, lty = 2)

Next, we break down the computation of the gap statistic into its individual steps.
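
For reference, the gap statistic (Tibshirani, Walther, and Hastie, 2001) compares the observed within-cluster sum of squares with its average over reference datasets that have no cluster structure:

\[
\mathrm{SS}(K) = \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \bar{\mathbf{x}}_k\|^2,
\qquad
\mathrm{Gap}(K) = \frac{1}{B}\sum_{b=1}^{B} \log \mathrm{SS}_0^{(b)}(K) - \log \mathrm{SS}(K),
\]

where \(C_k\) is the \(k\)-th cluster, \(\bar{\mathbf{x}}_k\) its center, and \(\mathrm{SS}_0^{(b)}(K)\) is the within-cluster sum of squares computed on the \(b\)-th reference dataset.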

  • Step 1: Calculate the within-cluster sum of squares, SS(K), for the observed data. For each K, we run K-means with K clusters and nstart = 10 random initializations, keeping the solution with the smallest within-cluster sum of squares.
n = 130
SS = rep(0, 10)
SS[1] = sum(diag(var(sportsranks)))*(n-1)  # total SS, i.e., the K = 1 case
for(k in 2:10){
  kms = kmeans(sportsranks, centers=k, nstart=10)
  SS[k] = sum(kms$withinss)
}
lss = log(SS)
  • Step 2: Calculate SS_0(K) under the reference distribution. Each reference dataset is generated by letting every row be an independent random permutation of the numbers 1 through 7, so the reference data carries no inherent clustering structure. We generate B = 50 such datasets and compute the within-cluster sum of squares for each.
n = dim(sportsranks)[1] 
m = 7  # each row is a permutation of the numbers 1 to 7
B = 50 # number of reference datasets
lss0 = matrix(0,B,10)
for(b in 1:B){
  xstar=NULL
  for(i in 1:n){
    xstar = rbind(xstar, sample(m))
  }
  lss0[b,1] = log(sum(diag(var(xstar)))*(n-1))
  for(k in 2:10){
    kms = kmeans(xstar, centers=k, nstart=10)
    lss0[b,k] = log(sum(kms$withinss))
  }
}

Plotting log(SS) against K, the red curve (the observed data) decreases steadily, while the 50 grey curves (the reference data) also decrease but show a somewhat different pattern. This is the point of the gap statistic: the within-cluster sum of squares always shrinks as K increases, so it cannot be used to choose K on its own, but the gap between the reference curves and the observed curve need not keep growing.

plot(1:10, lss, type="l", col="red", xlab="k", 
     ylab="log(Within SS)")
for(b in 1:B){
  points(1:10, lss0[b,], type="l", col=8)
}

Finally, we plot the gap curve. Applying the one-standard-error rule (pick the smallest K such that \(\mathrm{Gap}(K) \ge \mathrm{Gap}(K+1) - s_{K+1}\)) selects K = 2. Thus the gap statistic also points to K = 2, confirming our earlier visual observations; a programmatic version of this rule is sketched after the plotting code below.

lss0m = apply(lss0, 2, mean)
lss0sd = sqrt(apply(lss0, 2, var))*sqrt(1+1/B)
diff = lss0m-lss
matplot(1:10, cbind(diff, diff+lss0sd, diff-lss0sd), type="l",
        xlab="k", ylab="Gap")
points(1:10, diff)
points(1:10, diff-lss0sd, pch="+", col="green")
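
As a sketch, the one-standard-error rule can also be applied programmatically, reusing the diff and lss0sd vectors computed above (K.selected is just an illustrative name):

gap = diff      # Gap(K) for K = 1..10, as computed above
se  = lss0sd    # simulation standard errors
# smallest K with Gap(K) >= Gap(K+1) - s_{K+1}
K.selected = min(which(gap[1:9] >= gap[2:10] - se[2:10]))
K.selected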

Silhouette Plots

Silhouette plots offer another way to assess cluster quality: each observation receives a silhouette width that compares its average distance to the other members of its own cluster with its average distance to the nearest neighboring cluster.
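
For reference, the silhouette width of observation \(i\) is

\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\]

where \(a(i)\) is the average distance from \(i\) to the other members of its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the members of any other cluster. Values near 1 indicate well-clustered observations, while values near 0 indicate observations lying between two clusters.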

sil2=silhouette(km2$cluster,D)
sil3=silhouette(km3$cluster,D)

par(mfrow=c(1,2))
plot(sil2)
plot(sil3)

Compute and plot average silhouette over a range of K values.

k.max = 10
sil = rep(0, k.max)

# Compute the average silhouette width for 
# k = 2 to k = 10
for(i in 2:k.max){
  tmp = kmeans(sportsranks, centers = i, nstart = 10)
  ss <- silhouette(tmp$cluster, dist(sportsranks))
  sil[i] <- mean(ss[, 3])
}

# Plot the  average silhouette width
plot(1:k.max, sil, type = "b", pch = 19, 
     frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)

Prediction Strength

Prediction strength (Tibshirani and Walther, 2005) evaluates a candidate K by splitting the data in two, clustering each half, and measuring how well the clustering of one half predicts which pairs of points in the other half fall in the same cluster; one typically keeps the largest K whose prediction strength stays above a cutoff (the dashed red line below).

library(fpc)
ps = prediction.strength(sportsranks, Gmax=10,
                          clustermethod=kmeansCBI)
plot(1:10, ps$mean.pred, type='b', ylim=c(0,1), 
     xlab='Number of clusters', ylab='Prediction strength')
abline(h=ps$cutoff, col="red", lty=2, lwd=2)

PAM

PAM (partitioning around medoids) is similar to K-means, except that each cluster is represented by a medoid, an actual observation in the data, which makes the method applicable to any dissimilarity matrix and less sensitive to outliers.

pam2=pam(sportsranks, k=2);
plot(pam2)
pam3=pam(sportsranks, k=3);
plot(pam3)

Visualize the clustering results

par(mfrow=c(1,2));
plot(Xmds[,1], Xmds[,2], type="n");
points(Xmds[,1], Xmds[,2], pch=pam2$clustering, col=pam2$clustering);
title("MDS plot for K=2")
plot(Xmds[,1], Xmds[,2], type="n");
points(Xmds[,1], Xmds[,2], pch=pam3$clustering, col=pam3$clustering);
title("MDS plot for K=3")

Check how PAM results differ from the ones from K-means.

table(pam2$clustering, km2$cluster)
table(pam3$clustering, km3$cluster)

Hierarchical Clustering

Hierarchical clustering builds a dendrogram by successively merging the closest clusters; below we try complete linkage (the default in hclust), single linkage, and average linkage, and then use cutree to cut the tree into a given number of clusters.

plot(hclust(dist(sportsranks)), xlab="Complete Linkage", sub="");
plot(hclust(dist(sportsranks), method="single"), xlab="Single Linkage", sub="");
plot(hclust(dist(sportsranks), method="average"), xlab="Average Linkage", sub="");
clusters = hclust(dist(sportsranks), method="average")
clusterCut = cutree(clusters, 3)
table(clusterCut)

Check how hclust results differ from the ones from K-means.

table(cutree(clusters, 2), km2$cluster)
table(cutree(clusters, 3), km3$cluster)
clusters = hclust(dist(sportsranks))
table(cutree(clusters, 2), km2$cluster)
table(cutree(clusters, 3), km3$cluster)

Vector Quantization

Let’s demonstrate an interesting application of k-means: image compression. Take, for instance, this image of my son alongside his cherished toy, Curious George.

X = read.table("../image.txt")
X = as.matrix(X)
X = t(X[270:1, ])   # reverse the rows and transpose so image() shows the picture upright
image(X, col=gray((0:128)/128))

To compress this image, we segment it into small 6x6 patches. For each patch, we extract its grayscale values and represent them as a 36x1 vector. Consequently, our data matrix, which I refer to as Z, has dimensions 2700x36; in other words, the original image has been divided into 2,700 individual patches.

p = 36                # each 6x6 patch has 36 pixels
Z = NULL 
n = (270*360)/36      # number of patches: 2700
for(i in 1:(360/6))
  for(j in 1:(270/6)){
    k = 6*(i-1); l = 6*(j-1)
    Z = rbind(Z, as.vector(X[(k+1):(k+6), (l+1):(l+6)]))
  }
dim(Z)  # 2700x36 
## [1] 2700   36

Every row in matrix Z represents a patch, depicted as a vector in this 36-dimensional space. We then apply K-means to Z with K = 40, yielding 40 distinct cluster centers. Each center is a 36x1 vector, which can also be visualized as a 6x6 patch (a sketch of this is given at the end of the section).

With these centers in hand, we’ll reconstruct the image, replacing the original 6x6 patches with their nearest cluster centers. The end result is an image reconstructed using only 40 unique patches. Naturally, this compressed image isn’t as detailed as the original, which utilized 2,700 distinct patches, but it significantly reduces the file size. Despite this compression, the main features remain discernible: we can still identify both my son’s face and Curious George. And impressively, the entire compression process hinges on the simple K-means algorithm.
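
To make the size reduction concrete, here is a rough back-of-the-envelope count; it only counts stored numbers and ignores how labels and grayscale values would actually be encoded:

original   = 2700 * 36       # 97,200 grayscale values in the patch representation
compressed = 40 * 36 + 2700  # 1,440 values for the 40 centers + one label per patch
original / compressed        # roughly a 23-fold reduction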

# run k-means with K=40 centers
mykm = kmeans(Z, centers=40, nstart=10)

# Use the 40 center patches to replace the 2700 patches in the original image
# The new image uses only 40 distinct patches
newX = matrix(0,360,270)
m = 0
for(i in 1:(360/6))
    for(j in 1:(270/6)){
        k = 6*(i-1)
        l = 6*(j-1)
        m = m+1;
        tmp = mykm$centers[mykm$cluster[m],]
        newX[(k+1):(k+6), (l+1):(l+6)] = matrix(tmp,6,6);
    }
image(newX, col=gray((0:128)/128))
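
As an optional sketch (not part of the original analysis), the 40 centers found by K-means can themselves be viewed as 6x6 patches; the 5-by-8 grid layout below is an arbitrary choice.

# display the 40 cluster centers as 6x6 patches
par(mfrow=c(5, 8), mar=rep(0.5, 4))
for(k in 1:40){
  patch = matrix(mykm$centers[k, ], 6, 6)
  image(patch, col=gray((0:128)/128), axes=FALSE)
}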