To elucidate the concept of multidimensional scaling, let’s consider a dataset from a study that aimed to determine which pairs of letters people frequently confuse. The table below is from Wolford and Hollingsworth (1974). The set comprises eight letters: C, D, G, H, M, N, Q, and W. This results in an 8×8 matrix. Each cell in this matrix indicates the frequency with which people confuse the corresponding row and column letters.
D = matrix(0, 8, 8)
letters = c("C", "D", "G", "H", "M", "N", "Q", "W")
colnames(D) = letters
rownames(D) = letters
D[2:8, 1] = c(5, 12, 2, 2, 2, 9, 1)
D[3:8, 2] = c(2, 4, 3, 4, 20, 5)
D[4:8, 3] = c(3, 2, 1, 9, 2)
D[5:8, 4] = c(19, 18, 1, 5)
D[6:8, 5] = c(16, 2, 18)
D[7:8, 6] = c(8, 13)
D[8, 7] = 4
D = (D + t(D))
D
## C D G H M N Q W
## C 0 5 12 2 2 2 9 1
## D 5 0 2 4 3 4 20 5
## G 12 2 0 3 2 1 9 2
## H 2 4 3 0 19 18 1 5
## M 2 3 2 19 0 16 2 18
## N 2 4 1 18 16 0 8 13
## Q 9 20 9 1 2 8 0 4
## W 1 5 2 5 18 13 4 0
The matrix \(\mathbf{D}\) is symmetric and represents similarity, not distance. A higher value signifies that the two letters (row and column) are easily confused, indicating their similarity.
First, we change the similarity matrix \(\mathbf{D}\) to a distance matrix \(\mathbf{D}_0\) using the formula \(21-\mathbf{D}\). This essentially inverts the sign, ensuring all entries remain positive (given the highest value in \(\mathbf{D}\) is 20). The resultant matrix, \(\mathbf{D}_0\), serves as our distance matrix. We also set the diagonal of \(\mathbf{D}_0\) to zero since the distance between identical letters should always be zero.
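As a quick sanity check (an illustrative snippet, not part of the original analysis), we can confirm that \(\mathbf{D}\) is symmetric and that its largest entry is 20, so the conversion keeps all off-diagonal distances positive.
isSymmetric(D)   # TRUE
max(D)           # 20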
D0 = 21 - D
diag(D0) = 0
Applying multidimensional scaling to this matrix, and embedding the data in a two-dimensional space (the default), we obtain the visual representation below. We also tried an alternative similarity-to-distance conversion, \(41-\mathbf{D}\); the results were largely consistent, differing mainly in scale.
tmp = cmdscale(D0)
par(mfrow=c(1,2))
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="",
     xlim = c(-15, 15), ylim = c(-15, 15))
text(tmp[, 1], tmp[, 2], label = letters)

D1 = 41 - D
diag(D1) = 0
tmp = cmdscale(D1)
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="",
     xlim = c(-20, 20), ylim = c(-20, 20))
text(tmp[, 1], tmp[, 2], label = letters)
From the visual representation, it becomes evident that the letters D and Q are easily confused, as are C and G. Additionally, the remaining four letters (H, M, N, and W) appear to form a loose cluster of mutually similar letters.
Next, we apply clustering algorithms in R to the dataset “sportsranks”. This dataset contains 130 rows and seven columns, with each column corresponding to a specific sport. Each row records an individual’s ranking of these seven sports, from 1 to 7, without any ties.
url = "https://raw.githubusercontent.com/liangfgithub/liangfgithub.github.io/master/Data/sportsranks.csv"
sportsranks = read.csv(url)
head(sportsranks)
## Baseball Football Basketball Tennis Cycling Swimming Jogging
## 1 1 3 7 2 4 5 6
## 2 1 3 2 5 4 7 6
## 3 1 3 2 5 4 7 6
## 4 4 7 3 1 5 6 2
## 5 2 3 1 7 6 5 4
## 6 6 3 4 7 2 1 5
To analyze this data, we employed the K-means clustering method with K set to 2 and 3.
km2 = kmeans(sportsranks, centers=2, nstart=10)
km3 = kmeans(sportsranks, centers=3, nstart=10)
To visualize the results, we utilized a multidimensional scaling (MDS) representation, plotted in a two-dimensional space. The MDS coordinates were calculated using the cmdscale command in R.
D = dist(sportsranks)
Xmds = cmdscale(D)
par(mfrow=c(1,2));
plot(Xmds[,1], Xmds[,2], type="n", xlab="", ylab="")
points(Xmds[,1], Xmds[,2], pch=km2$cluster, col=km2$cluster);
title("MDS plot for K=2")
plot(Xmds[,1], Xmds[,2], type="n", xlab="", ylab="");
points(Xmds[,1], Xmds[,2], pch=km3$cluster, col=km3$cluster);
title("MDS plot for K=3")
Moreover, we manually computed the MDS by first applying double-centering, followed by singular value decomposition. We extracted the first two columns, essentially representing the two-dimensional space. A correlation check between our computed two-dimensional MDS and that returned by R revealed that they were virtually identical.
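For reference, this is the classical MDS construction: letting \(\mathbf{D}^{(2)}\) denote the matrix of squared distances and \(\mathbf{C} = \mathbf{I} - \frac{1}{n}\mathbf{1}\mathbf{1}^\top\) the centering matrix, we form
\[
\mathbf{B} = -\tfrac{1}{2}\,\mathbf{C}\,\mathbf{D}^{(2)}\,\mathbf{C},
\]
which the code below computes by subtracting the row and column means of \(\mathbf{D}^{(2)}\), adding back the grand mean, and multiplying by \(-1/2\); the two-dimensional embedding then consists of the top two singular vectors of \(\mathbf{B}\), scaled by the square roots of the corresponding singular values.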
D2 = as.matrix(D^2)
n = dim(D2)[1]
tmp = D2 - matrix(colMeans(D2), nrow=n, ncol=n, byrow=TRUE) -
      matrix(rowMeans(D2), nrow=n, ncol=n, byrow=FALSE)
tmp = -(tmp + mean(D2))/2
tmp.svd = svd(tmp)
F = tmp.svd$u[, 1:2] %*% diag(sqrt(tmp.svd$d[1:2]))
cor(F, Xmds)
## [,1] [,2]
## [1,] 1.000000e+00 -6.602771e-17
## [2,] -1.697386e-16 -1.000000e+00
The plots indicate that setting K to 2 might be sufficient since using K = 3 began to display overlapping clusters. To verify our choice of K, we employed the gap statistic method.
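Recall the definition of the gap statistic: with \(W_K\) denoting the total within-cluster sum of squares for \(K\) clusters,
\[
\mathrm{Gap}(K) = E^*\!\left[\log W_K\right] - \log W_K,
\]
where the expectation \(E^*\) is estimated by averaging \(\log W_K\) over \(B\) datasets drawn from a reference (null) distribution, and the reported standard error is inflated by the factor \(\sqrt{1+1/B}\). This is exactly the quantity computed manually later in this section.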
In R, the clusGap command in the cluster package computes this statistic. Our analysis using this method indicated that the optimal K is indeed 2.
library(cluster)
set.seed(123)
gap_stat = clusGap(sportsranks, FUN = kmeans, nstart = 25,
                   K.max = 10, B = 50)
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = sportsranks, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 2
## logW E.logW gap SE.sim
## [1,] 5.429866 5.647963 0.2180970 0.01818769
## [2,] 5.229886 5.547803 0.3179171 0.01742745
## [3,] 5.157134 5.474076 0.3169413 0.01756228
## [4,] 5.099957 5.411790 0.3118326 0.01790591
## [5,] 5.045671 5.358850 0.3131785 0.01753858
## [6,] 4.997477 5.313087 0.3156102 0.01756705
## [7,] 4.946452 5.271046 0.3245940 0.01776266
## [8,] 4.909225 5.232372 0.3231469 0.01785827
## [9,] 4.870644 5.197057 0.3264129 0.01841879
## [10,] 4.839642 5.163225 0.3235833 0.01786732
plot(gap_stat, frame = FALSE, xlab = "Number of clusters k")
abline(v = 2, lty = 2)
We further broke down the process for computing the gap statistic. First, we compute log(SS) on the original data for K = 1 to 10; for each K, kmeans is run with nstart = 10, selecting the solution yielding the smallest within-cluster sum of squares.
n = 130
SS = rep(0, 10)
SS[1] = sum(diag(var(sportsranks)))*(n-1)
for(k in 2:10){
  kms = kmeans(sportsranks, centers=k, nstart=10)
  SS[k] = sum(kms$withinss)
}
lss = log(SS)
Next, we compute log(SS) for B = 50 reference datasets, in which each row is an independent random permutation of the numbers 1 to 7.
n = dim(sportsranks)[1]
m = 7  # each row is a permutation of the numbers 1 to 7
B = 50 # number of reference datasets
lss0 = matrix(0, B, 10)
for(b in 1:B){
  xstar = NULL
  for(i in 1:n){
    xstar = rbind(xstar, sample(m))
  }
  lss0[b, 1] = log(sum(diag(var(xstar)))*(n-1))
  for(k in 2:10){
    kms = kmeans(xstar, centers=k, nstart=10)
    lss0[b, k] = log(sum(kms$withinss))
  }
}
Upon plotting the log(SS) against K values, we observed that the red curve, representing the log(SS) from the original data, consistently decreased. In contrast, the 50 grey curves, representing the log(SS) from the reference data, displayed a more varied pattern. This difference in patterns highlighted that while the sum of squares always diminishes with an increasing K, the gap doesn’t necessarily follow the same trend.
plot(1:10, lss, type="l", col="red", xlab="k",
ylab="log(Within SS)");
for(b in 1:B){
points(1:10, lss0[b,], type="l", col=8)
}
Plot the gap curve. Based on the one-standard-error rule, the gap at K=2 is already within one standard error of the later gap values, so we concluded that K=2 is the optimal choice based on the gap statistic, confirming our earlier visual observations.
lss0m = apply(lss0, 2, mean)
lss0sd = sqrt(apply(lss0, 2, var))*sqrt(1+1/B)
diff = lss0m - lss
matplot(1:10, cbind(diff, diff+lss0sd, diff-lss0sd), type="l",
        xlab="k", ylab="Gap")
points(1:10, diff)
points(1:10, diff-lss0sd, pch="+", col="green")
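As a minimal sketch (our addition, not part of the original notes), the selection rule can also be applied programmatically: choose the smallest K whose gap is at least the next gap minus its standard error.
khat = which(diff[1:9] >= (diff - lss0sd)[2:10])[1]
khat  # expected to be 2 here, consistent with the clusGap output above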
We also compute silhouette widths for the K-means solutions with K=2 and K=3, and display the corresponding silhouette plots.
sil2 = silhouette(km2$cluster, D)
sil3 = silhouette(km3$cluster, D)
par(mfrow=c(1,2))
plot(sil2)
plot(sil3)
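For reference, the silhouette width of an observation \(i\) is
\[
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\]
where \(a(i)\) is the average distance from \(i\) to the other points in its own cluster and \(b(i)\) is the smallest average distance from \(i\) to the points of any other cluster; values close to 1 indicate that the observation sits well inside its cluster.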
Compute and plot average silhouette over a range of K values.
k.max = 10
sil = rep(0, k.max)

# Compute the average silhouette width for
# k = 2 to k = 10
for(i in 2:k.max){
  tmp = kmeans(sportsranks, centers = i, nstart = 10)
  ss <- silhouette(tmp$cluster, dist(sportsranks))
  sil[i] <- mean(ss[, 3])
}
# Plot the average silhouette width
plot(1:k.max, sil, type = "b", pch = 19,
frame = FALSE, xlab = "Number of clusters k")
abline(v = which.max(sil), lty = 2)
Another way to choose K is prediction strength, available through the prediction.strength function in the fpc package; K values whose prediction strength exceeds the cutoff are considered well supported.
library(fpc)
ps = prediction.strength(sportsranks, Gmax=10,
                         clustermethod=kmeansCBI)
plot(1:10, ps$mean.pred, type='b', ylim=c(0,1),
     xlab='Number of clusters', ylab='Prediction strength')
abline(h=ps$cutoff, col="red", lty=2, lwd=2)
We also run PAM (partitioning around medoids) with K=2 and K=3.
pam2 = pam(sportsranks, k=2)
plot(pam2)
pam3 = pam(sportsranks, k=3)
plot(pam3)
Visualize the PAM clustering results in the MDS space.
par(mfrow=c(1,2));
plot(Xmds[,1], Xmds[,2], type="n");
points(Xmds[,1], Xmds[,2], pch=pam2$clustering, col=pam2$clustering);
title("MDS plot for K=2")
plot(Xmds[,1], Xmds[,2], type="n");
points(Xmds[,1], Xmds[,2], pch=pam3$clustering, col=pam3$clustering);
title("MDS plot for K=3")
Check how PAM results differ from the ones from K-means.
table(pam2$clustering, km2$cluster)
table(pam3$clustering, km3$cluster)
Next, we apply hierarchical clustering with complete, single, and average linkage, and plot the resulting dendrograms.
plot(hclust(dist(sportsranks)), xlab="Complete Linkage", sub="")
plot(hclust(dist(sportsranks), method="single"), xlab="Single Linkage", sub="")
plot(hclust(dist(sportsranks), method="average"), xlab="Average Linkage", sub="")
clusters = hclust(dist(sportsranks), method="average")
clusterCut = cutree(clusters, 3)
table(clusterCut)
Check how hclust results differ from the ones from K-means.
table(cutree(clusters, 2), km2$cluster)
table(cutree(clusters, 3), km3$cluster)
clusters = hclust(dist(sportsranks))
table(cutree(clusters, 2), km2$cluster)
table(cutree(clusters, 3), km3$cluster)
Let’s demonstrate an interesting application of k-means: image compression. Take, for instance, this image of my son alongside his cherished toy, Curious George.
X = read.table("../image.txt")
X = as.matrix(X)
X = t(X[270:1, ])
image(X, col=gray((0:128)/128))
To compress this image, we segment it into small 6x6 patches. For each patch, we extract its grayscale values, representing them as a 36x1 vector. Consequently, our data matrix, which I refer to as Z, has dimensions 2700x36. This means the original image has been divided into nearly 3,000 individual patches.
p = 36
Z = NULL
n = (270*360)/36
for(i in 1:(360/6))
  for(j in 1:(270/6)){
    k = 6*(i-1); l = 6*(j-1)
    Z = rbind(Z, as.vector(X[(k+1):(k+6), (l+1):(l+6)]))
  }
dim(Z)  # 2700x36
## [1] 2700 36
Every row in matrix Z represents a patch, depicted as a vector in this 36-dimensional space. We’ll then apply K-means to matrix Z, setting k=40. Consequently, we’ll identify 40 distinct cluster centers. Each of these centers is represented by a 36x1 vector, which can also be visualized as a 6x6 patch.
With these centers in hand, we’ll reconstruct the image, replacing the original 6x6 patches with their nearest cluster centers. The end result is an image reconstructed using only 40 unique patches. Naturally, this compressed image isn’t as detailed as the original, which utilized 2,700 distinct patches, but it significantly reduces the file size. Despite this compression, the main features remain discernible: we can still identify both my son’s face and Curious George. And impressively, the entire compression process hinges on the simple K-means algorithm.
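As a rough back-of-the-envelope illustration (our addition, not a claim from the original notes), we can count how many numbers need to be stored.
360*270                       # 97,200 gray values in the original image
40*36 + 2700                  # 1,440 center values plus 2,700 cluster labels
(360*270) / (40*36 + 2700)    # roughly a 23-fold reduction, before any encoding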
# run k-means with K=40 centers
mykm = kmeans(Z, centers=40, nstart=10)

# Use the 40 patches to replace the 2700 patches in the original image;
# the new image uses only 40 distinct patches
newX = matrix(0, 360, 270)
m = 0
for(i in 1:(360/6))
  for(j in 1:(270/6)){
    k = 6*(i-1)
    l = 6*(j-1)
    m = m + 1
    tmp = mykm$center[mykm$cluster[m], ]
    newX[(k+1):(k+6), (l+1):(l+6)] = matrix(tmp, 6, 6)
  }
image(newX, col=gray((0:128)/128))