Stat542: Model-based Clustering

Load the old-faithful geyser data.

data=read.table("faithful.dat", header=TRUE)
plot(data[,1], data[,2], xlab="Interval", ylab="Duration")

Load the `mclust` package

library(mclust)

## Package 'mclust' version 5.2.3

## Type 'citation("mclust")' for citing this R package in publications.

?em
?mclustModelNames

Different Gaussian mixture models:

“EII”: spherical, equal volume
“VII”: spherical, unequal volume
“EEI”: diagonal, equal volume and shape
“VEI”: diagonal, varying volume, equal shape
“EVI”: diagonal, equal volume, varying shape
“VVI”: diagonal, varying volume and shape
“EEE”: ellipsoidal, equal volume, shape, and orientation
“EEV”: ellipsoidal, equal volume and equal shape
“VEV”: ellipsoidal, equal shape
“VVV”: ellipsoidal, varying volume, shape, and orientation

Try K=2 with model “VVI”

K=2; n=length(data[,1])
z=matrix(0,n, K); 
id=sample(1:K, n, replace=TRUE);
for(i in 1:n) z[i,id[i]]=1;

tmp=mstep(modelName="VVI", data, z)
myout=em(modelName="VVI", data, parameters=tmp$parameters);

surfacePlot(data, parameters=myout$parameters, nlevels=10)
points(data, col="blue")

Try K=3 with model “VVI”

K=3; n=length(data[,1])
z=matrix(0,n, K); 
id=sample(1:K, n, replace=TRUE);
for(i in 1:n) z[i,id[i]]=1;

tmp=mstep(modelName="VVI", data, z)
myout=em(modelName="VVI", data, parameters=tmp$parameters);
surfacePlot(data, parameters=myout$parameters)
points(data, col="blue")

Use BIC to select both K and model.

geyserBIC=mclustBIC(data, G=1:5)
plot(geyserBIC)

summary.mclustBIC(geyserBIC, data)

## Best BIC values:
##          EEE,3     EEE,4     VVE,2
## BIC      -2314 -2320.207 -2320.433
## BIC diff     0    -5.821    -6.047
## 
## Classification table for model (EEE,3):
##   1   2   3 
## 130  97  45

Plot the best model selected by BIC

BICbest=mclustModel(data, geyserBIC)
coordProj(data,parameters=BICbest$parameters, what=c("classification"), z=BICbest$z)

surfacePlot(data, parameters=BICbest$parameters)
points(data, col="blue")

The top two models selected by BIC are “EEE”, however, when we use the EEE model to cluster a subset of the data, we find the clustering result is not stable.

mysubset=sample(1:n, n/2);
newdata=data[mysubset,]
newbest=Mclust(newdata, modelNames="EEE")
par(mfrow=c(1,2))
plot(newbest, newdata, what="BIC")
coordProj(newdata,parameters=newbest$parameters, what=c("classification"), z=newbest$z)

The VVV model more stable than the EEE model. So I would suggest to choose a two-component VVV model.