Spam Data

spam = read.table("https://liangfgithub.github.io/Data/spam.txt")

spam.name=read.table("https://liangfgithub.github.io/Data/spam_name.txt")
for(i in 1:(ncol(spam)-1)) names(spam)[i]=as.character(spam.name[i,]);

names(spam)[ncol(spam)]="Y"
spam$Y = as.factor(spam$Y)
n=dim(spam)[1];
test.id = sample(1:n, round(n*0.2))
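Since sample is random, the train/test split changes from run to run. To make it reproducible, you can set a seed before sampling; the seed value below is arbitrary.

set.seed(134)  # arbitrary seed, chosen only for reproducibility
test.id = sample(1:n, round(n*0.2))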

Fit a Classification Tree

In our code for analyzing the spam data, I use the tree package, which differs from the rpart package we employed for regression trees. Note that rpart can also be used for classification tasks; I encourage you to explore the rpart package yourself.
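For instance, a roughly comparable classification fit with rpart might look like the sketch below; the cp and minsplit values are illustrative, not tuned.

library(rpart)
# Sketch only: cp and minsplit are illustrative choices, not tuned values
spam.rp = rpart(Y~., data=spam[-test.id,], method="class",
                control=rpart.control(cp=0.001, minsplit=2))
printcp(spam.rp)  # rpart's complexity table, used for pruning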

library(tree)

First, we fit a large classification tree on the Spam Data. You can use the summary function to obtain some summary information about the tree and understand the outputs it returns.

spam.tr=tree(Y~., spam[-test.id,], mindev=0.005, minsize=2); # Grow a BIG tree
summary(spam.tr)
## 
## Classification tree:
## tree(formula = Y ~ ., data = spam[-test.id, ], mindev = 0.005, 
##     minsize = 2)
## Variables actually used in tree construction:
##  [1] "Cdollar"    "Wremove"    "Cexclam"    "Whp"        "CAPlongest"
##  [6] "Wfree"      "Wour"       "W1999"      "Wedu"       "W650"      
## [11] "CAPave"     "Wbusiness"  "W85"        "Wpm"        "Wgeorge"   
## [16] "Wmeeting"   "Wemail"    
## Number of terminal nodes:  24 
## Residual mean deviance:  0.3669 = 1342 / 3657 
## Misclassification error rate: 0.06031 = 222 / 3681
#spam.tr  # uncomment to print every node of the fitted tree
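For a quick visual check of the unpruned tree, you can plot it directly; with 24 leaves the labels will be crowded, so shrink the text.

plot(spam.tr)
text(spam.tr, cex=0.3)  # small labels, since the tree is large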

For example, let's look at how the 'df' and 'deviance' reported by summary are computed.

leaf.nodes is a 24-by-6 data frame, with one row per terminal node, matching the 24 leaves reported above. The column 'n' gives the number of training samples falling in each node.

myout=spam.tr$frame
leaf.nodes = myout[myout$var == "<leaf>",]
head(leaf.nodes)
dim(leaf.nodes)
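The sixth column, yprob, is itself a two-column matrix holding the class proportions within each leaf; we will use it below to reproduce the deviance.

head(leaf.nodes$yprob)  # per-leaf proportions of the two classes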

As a sanity check, the 'n' column should sum to the size of the training set.

c(sum(leaf.nodes$n), nrow(spam) - length(test.id))

Each leaf node contributes one estimated parameter, so the degrees of freedom ('df') equal the number of training samples minus the number of leaf nodes.

c(sum(leaf.nodes$n) - summary(spam.tr)$size, summary(spam.tr)$df)

Next, let's explore how the 'deviance' is computed. For a classification tree it equals -2 * sum over leaves t and classes k of n_t * p_tk * log(p_tk), where n_t is the number of training samples in leaf t and p_tk is the proportion of class k within that leaf. The quantity mydev, computed below, should agree with the output from summary(spam.tr).

myp = leaf.nodes[,6];  # column 'yprob': a two-column matrix of class proportions per leaf
id = apply(myp, 1, min) > 0;  # drop pure leaves, whose contribution 0*log(0) counts as 0
mydev = 2*sum(leaf.nodes$n[id]*(myp[id,1]*log(1/myp[id,1])+myp[id,2]*log(1/myp[id,2])))
c(mydev, summary(spam.tr)$dev)
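An equivalent computation, sketched below, avoids treating the two classes separately and works for any number of classes; ifelse is used so that 0*log(0) counts as 0.

p = leaf.nodes$yprob
mydev2 = -2*sum(leaf.nodes$n * rowSums(ifelse(p > 0, p*log(p), 0)))
c(mydev2, summary(spam.tr)$dev)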

Optimal Tree Size via CV

Cross-validation based on deviance, the default criterion for cv.tree.

cv.spam1=cv.tree(spam.tr)  
names(cv.spam1)
cv.spam1
plot(cv.spam1$size, cv.spam1$dev, type="b")
cv.spam1$size[which.min(cv.spam1$dev)]
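cv.tree assigns samples to random folds (K = 10 by default), so the curve varies across runs. Setting a seed first makes it reproducible; the seed below is arbitrary.

set.seed(42)  # arbitrary seed; fold assignment in cv.tree is random
cv.spam1 = cv.tree(spam.tr, K=10)  # K=10 is the default number of folds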

Cross-validation based on the misclassification rate. Note that the output component is still named 'dev', but with FUN = prune.misclass it records the number of misclassified samples.

cv.spam2=cv.tree(spam.tr, FUN = prune.misclass)
par(mfrow=c(1,2))
plot(cv.spam2$size, cv.spam2$dev, type="b")
plot(cv.spam2$k, cv.spam2$dev, type="b")

cv.spam2$size[which.min(cv.spam2$dev)]
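When several sizes attain the minimal CV error, a common convention is to take the smallest such tree; a minimal sketch:

min(cv.spam2$size[cv.spam2$dev == min(cv.spam2$dev)])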

Pruning

Prune the tree to the desired size using two different criteria, deviance and misclassification; the resulting trees may differ slightly.

bestsize=14;
spam.tr1=prune.tree(spam.tr, best=bestsize);
spam.tr2=prune.misclass(spam.tr, best=bestsize); # should be the same as: 
#spam.tr2=prune.tree(spam.tr, method="misclass", best=bestsize);

par(mfrow=c(1,2));
plot(spam.tr1);
text(spam.tr1, cex=0.5);
plot(spam.tr2);
text(spam.tr2, cex=0.5);
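As a follow-up not in the original analysis, we can compare the two pruned trees on the held-out test set.

yhat1 = predict(spam.tr1, spam[test.id,], type="class")
yhat2 = predict(spam.tr2, spam[test.id,], type="class")
c(mean(yhat1 != spam$Y[test.id]),  # test error, deviance-pruned tree
  mean(yhat2 != spam$Y[test.id]))  # test error, misclass-pruned tree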