To elucidate the concept of multidimensional scaling, let’s consider a dataset from a study that aimed to determine which pairs of letters people frequently confuse. The table below is from Wolford and Hollingsworth (1974). The set comprises eight letters: C, D, G, H, M, N, Q, and W. This results in an 8×8 matrix. Each cell in this matrix indicates the frequency with which people confuse the corresponding row and column letters.
D = matrix(0, 8, 8)
letters = c("C", "D", "G", "H", "M", "N", "Q", "W")
colnames(D) = letters
rownames(D) = letters
# Fill the lower triangle, one column at a time
D[2:8, 1] = c(5, 12, 2, 2, 2, 9, 1)
D[3:8, 2] = c(2, 4, 3, 4, 20, 5)
D[4:8, 3] = c(3, 2, 1, 9, 2)
D[5:8, 4] = c(19, 18, 1, 5)
D[6:8, 5] = c(16, 2, 18)
D[7:8, 6] = c(8, 13)
D[8, 7] = 4
# Mirror the lower triangle so that D is symmetric (the diagonal stays 0)
D = (D + t(D))
D
##    C  D  G  H  M  N  Q  W
## C  0  5 12  2  2  2  9  1
## D  5  0  2  4  3  4 20  5
## G 12  2  0  3  2  1  9  2
## H  2  4  3  0 19 18  1  5
## M  2  3  2 19  0 16  2 18
## N  2  4  1 18 16  0  8 13
## Q  9 20  9  1  2  8  0  4
## W  1  5  2  5 18 13  4  0
The matrix \(\mathbf{D}\) is symmetric and represents similarity, not distance. A higher value signifies that the two letters (row and column) are easily confused, indicating their similarity.
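As a quick check of these two statements, the sketch below rebuilds the matrix from the printed table, confirms that it is symmetric, and looks up the most frequently confused pair. The names `labs`, `sym_ok`, `idx`, and `top_pair` are illustrative and not part of the original analysis.

```r
# Rebuild the confusion matrix from the printed table above.
labs = c("C", "D", "G", "H", "M", "N", "Q", "W")
D = matrix(c(
   0,  5, 12,  2,  2,  2,  9,  1,
   5,  0,  2,  4,  3,  4, 20,  5,
  12,  2,  0,  3,  2,  1,  9,  2,
   2,  4,  3,  0, 19, 18,  1,  5,
   2,  3,  2, 19,  0, 16,  2, 18,
   2,  4,  1, 18, 16,  0,  8, 13,
   9, 20,  9,  1,  2,  8,  0,  4,
   1,  5,  2,  5, 18, 13,  4,  0),
  nrow = 8, byrow = TRUE, dimnames = list(labs, labs))

# Symmetry: D should equal its transpose.
sym_ok = isSymmetric(D)

# Most-confused pair: the largest entry, searched in the lower triangle
# only so that each pair is reported once.
idx = which(D == max(D) & lower.tri(D), arr.ind = TRUE)
top_pair = c(rownames(D)[idx[1, "row"]], colnames(D)[idx[1, "col"]])

print(sym_ok)    # TRUE
print(top_pair)  # "Q" "D", confused 20 times
```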
First, we convert the similarity matrix \(\mathbf{D}\) to a distance matrix \(\mathbf{D}_0\) by computing \(21-\mathbf{D}\). Subtracting from 21 reverses the ordering, so that frequently confused letters end up close together, while keeping every off-diagonal entry positive (the largest value in \(\mathbf{D}\) is 20, so the smallest distance is 1). We then set the diagonal of \(\mathbf{D}_0\) to zero, since the distance from a letter to itself should be zero.
D0 = 21 - D
diag(D0) = 0
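The conversion can also be wrapped in a small helper, which makes the role of the constant explicit. This is only a sketch: `sim_to_dist` and its `offset` argument are hypothetical names, and the toy matrix is the C, D, Q sub-block of the confusion matrix rather than the full data.

```r
# Hypothetical helper: turn a symmetric similarity matrix into a distance
# matrix by subtracting it from a constant larger than its maximum.
sim_to_dist = function(S, offset = max(S) + 1) {
  stopifnot(isSymmetric(S), offset > max(S))
  out = offset - S
  diag(out) = 0          # a letter is at distance 0 from itself
  out
}

# Toy input: the C, D, Q sub-block of the confusion matrix.
labs = c("C", "D", "Q")
S = matrix(c(0,  5,  9,
             5,  0, 20,
             9, 20,  0),
           nrow = 3, byrow = TRUE, dimnames = list(labs, labs))

D0_small = sim_to_dist(S)       # offset defaults to max(S) + 1 = 21
D1_small = sim_to_dist(S, 41)   # a larger constant

print(D0_small["D", "Q"])       # 1: the most-confused pair is the closest
print(D0_small["C", "D"])       # 16
```

A larger constant adds the same amount to every off-diagonal distance (here 20), so the ranking of pairs from closest to farthest is unchanged.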
Applying classical multidimensional scaling to this matrix with cmdscale, which by default embeds the data in a two-dimensional space, gives a configuration of points that we can plot. We also tried an alternative similarity-to-distance conversion, \(41-\mathbf{D}\). The results were largely consistent, albeit with slight variations in scale.
tmp = cmdscale(D0)
par(mfrow=c(1,2))
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="",
     xlim = c(-15, 15), ylim=c(-15, 15))
text(tmp[, 1], tmp[, 2], labels = letters)

D1 = 41 - D
diag(D1) = 0
tmp = cmdscale(D1)
plot(tmp[, 1], tmp[, 2], type="n", xlab="", ylab="",
     xlim = c(-20, 20), ylim=c(-20, 20))
text(tmp[, 1], tmp[, 2], labels = letters)
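To see what cmdscale does with the distance matrix, the following is a minimal sketch of classical multidimensional scaling written out by hand: square the distances, double-center them, and scale the leading eigenvectors by the square roots of their eigenvalues. The helper name `classical_mds` is hypothetical, and the sketch assumes the leading `k` eigenvalues are positive; it is meant as an illustration of the technique, not a replacement for cmdscale.

```r
# Rebuild the confusion counts from the printed table, then the 21 - D distances.
labs = c("C", "D", "G", "H", "M", "N", "Q", "W")
D = matrix(c(
   0,  5, 12,  2,  2,  2,  9,  1,
   5,  0,  2,  4,  3,  4, 20,  5,
  12,  2,  0,  3,  2,  1,  9,  2,
   2,  4,  3,  0, 19, 18,  1,  5,
   2,  3,  2, 19,  0, 16,  2, 18,
   2,  4,  1, 18, 16,  0,  8, 13,
   9, 20,  9,  1,  2,  8,  0,  4,
   1,  5,  2,  5, 18, 13,  4,  0),
  nrow = 8, byrow = TRUE, dimnames = list(labs, labs))
D0 = 21 - D
diag(D0) = 0

# Hypothetical helper: classical MDS via double centering and an
# eigendecomposition. Assumes the top k eigenvalues are positive.
classical_mds = function(dmat, k = 2) {
  n = nrow(dmat)
  J = diag(n) - matrix(1 / n, n, n)     # centering matrix I - (1/n) 11'
  B = -0.5 * J %*% (dmat^2) %*% J       # double-centered squared distances
  e = eigen(B, symmetric = TRUE)        # eigenvalues in decreasing order
  X = e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(e$values[1:k]), k)
  rownames(X) = rownames(dmat)
  X
}

X = classical_mds(D0)   # 8 x 2 matrix of coordinates, one row per letter
```

Because eigenvectors are only determined up to sign, the coordinates in `X` may be a reflection of those returned by `cmdscale(D0)`; the distances between the embedded points are the same in both.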