Spam Data

# Read in the spam data and the variable names
spam = read.table("https://liangfgithub.github.io/Data/spam.txt")

spam.name = read.table("https://liangfgithub.github.io/Data/spam_name.txt")
for(i in 1:(ncol(spam)-1)) names(spam)[i] = as.character(spam.name[i,]);

# The last column is the class label; treat it as a factor
names(spam)[ncol(spam)] = "Y"
spam$Y = as.factor(spam$Y)

# Hold out roughly 20% of the observations as a test set
# (no seed is set, so the split and the numbers below vary from run to run)
n = dim(spam)[1];
test.id = sample(1:n, round(n*0.2))

Fit a Classification Tree

In this analysis of the spam data, I use the tree package, which differs from the rpart package we used for regression trees. Note that rpart can also be used for classification; I encourage you to explore it on your own.
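
For reference, here is a minimal sketch of the analogous rpart call; method = "class" requests a classification tree, and the cp and minsplit values are purely illustrative, not taken from the analysis below.

# Sketch only: a classification tree with rpart (illustrative control parameters)
library(rpart)
spam.rpart = rpart(Y ~ ., data = spam[-test.id, ], method = "class",
                   cp = 0.005, minsplit = 2)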

library(tree)

First, we grow a large classification tree on the spam data. The summary function reports basic information about the fitted tree; let's look at what its outputs mean.

spam.tr=tree(Y~., spam[-test.id,], mindev=0.005, minsize=2); # Grow a BIG tree
summary(spam.tr)
## 
## Classification tree:
## tree(formula = Y ~ ., data = spam[-test.id, ], mindev = 0.005, 
##     minsize = 2)
## Variables actually used in tree construction:
##  [1] "Cdollar"    "Wremove"    "Cexclam"    "Whp"        "CAPlongest"
##  [6] "Wfree"      "Wour"       "W1999"      "Wedu"       "W650"      
## [11] "CAPave"     "Wbusiness"  "W85"        "Wpm"        "Wgeorge"   
## [16] "Wmeeting"   "Wemail"    
## Number of terminal nodes:  24 
## Residual mean deviance:  0.3669 = 1342 / 3657 
## Misclassification error rate: 0.06031 = 222 / 3681
# spam.tr  # uncomment to print the full tree structure

For example, let's look at how the 'df' and 'deviance' in this output are computed.

leaf.nodes, extracted below, is a data frame with one row per terminal node; here it has 24 rows and 6 columns, matching the 24 terminal nodes reported by summary. The column 'n' gives the number of training samples falling in each leaf.

myout = spam.tr$frame                           # one row per node of the tree
leaf.nodes = myout[myout$var == "<leaf>", ]     # keep only the terminal nodes
head(leaf.nodes)
dim(leaf.nodes)

As a check, the sum of the 'n' column should equal the size of the training set.

c(sum(leaf.nodes$n), nrow(spam) - length(test.id))

The number of leaf nodes equals the number of parameters in the tree model, so the degrees of freedom ('df') are the number of training samples minus the number of leaf nodes.

c(sum(leaf.nodes$n) - summary(spam.tr)$size, summary(spam.tr)$df)

Next, let's check how the 'deviance' is computed. The quantity mydev, computed below, should agree with the deviance reported by summary(spam.tr).
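
For reference (standard notation, not from the original notes), the residual deviance of a two-class tree is

$$
D = -2 \sum_{t \in \text{leaves}} \sum_{k=1}^{2} n_{tk} \log \hat p_{tk}
  = 2 \sum_{t \in \text{leaves}} n_t \left( \hat p_{t1} \log \frac{1}{\hat p_{t1}} + \hat p_{t2} \log \frac{1}{\hat p_{t2}} \right),
$$

where n_t is the number of training samples in leaf t, p_tk is the proportion of class k in that leaf, and n_tk = n_t * p_tk. The code below implements the right-hand side.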

myp = leaf.nodes[, 6]          # column 6 of the frame is 'yprob': the two class proportions in each leaf
id = apply(myp, 1, min) > 0    # keep leaves with both proportions positive (pure leaves contribute 0, and log(0) is avoided)
mydev = 2*sum(leaf.nodes$n[id]*(myp[id,1]*log(1/myp[id,1]) + myp[id,2]*log(1/myp[id,2])))
c(mydev, summary(spam.tr)$dev)
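
Alternatively, assuming (as documented for the tree package) that the dev column of the frame stores each node's deviance, summing it over the leaf rows should give the same total.

sum(leaf.nodes$dev)   # should also match summary(spam.tr)$dev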

Optimal Tree Size via CV

Cross-validation based on deviance.

cv.spam1 = cv.tree(spam.tr)     # default FUN = prune.tree, i.e., pruning guided by deviance
names(cv.spam1)
cv.spam1
plot(cv.spam1$size, cv.spam1$dev, type="b")
cv.spam1$size[which.min(cv.spam1$dev)]   # tree size with the smallest CV deviance

Cross-validation based on misclassification rate.

cv.spam2 = cv.tree(spam.tr, FUN = prune.misclass)   # CV guided by the misclassification rate
par(mfrow=c(1,2))
plot(cv.spam2$size, cv.spam2$dev, type="b")   # here 'dev' holds the CV misclassification counts
plot(cv.spam2$k, cv.spam2$dev, type="b")      # k is the cost-complexity parameter

cv.spam2$size[which.min(cv.spam2$dev)]

Pruning

Prune the tree to the desired size using two different criteria: misclassification rate and deviance. The resulting trees may differ slightly.

bestsize=14;
spam.tr1=prune.tree(spam.tr, best=bestsize);
spam.tr2=prune.misclass(spam.tr, best=bestsize); # should be the same as: 
#spam.tr2=prune.tree(spam.tr, method="misclass", best=bestsize);
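
Instead of hard-coding bestsize = 14, one could take the size selected by cross-validation; for example (illustrative, using the misclassification-based CV above):

# Illustrative alternative: let CV pick the tree size before pruning
bestsize = cv.spam2$size[which.min(cv.spam2$dev)]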

par(mfrow=c(1,2));
plot(spam.tr1);
text(spam.tr1, cex=0.5);
plot(spam.tr2);
text(spam.tr2, cex=0.5);

Note that in the left tree (pruned based on deviance), there is a node whose two leaf children make the same prediction. Such a branch would be pruned if we used the misclassification rate as the pruning criterion, since collapsing it leaves the misclassification rate unchanged. However, such a branch can still contribute substantially to reducing the deviance.
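
As a quick check, one can scan the frame of the deviance-pruned tree for sibling leaves that predict the same class. This sketch assumes the usual node numbering of the tree package, where the children of node i are stored under row names 2i and 2i+1.

# Sketch: list sibling leaf pairs and their predicted classes in spam.tr1
f1 = spam.tr1$frame
node.id = as.integer(rownames(f1))
leaves = node.id[f1$var == "<leaf>"]
parents = leaves %/% 2
for (p in unique(parents[duplicated(parents)])) {   # parents whose both children are leaves
  kids = leaves[parents == p]
  cat("parent node", p, "-> leaf predictions:",
      as.character(f1$yval[match(kids, node.id)]), "\n")
}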

Prediction

Check the documentation to see how to obtain predictions from the fitted classification tree.

??predict.tree

The training and test misclassification errors are given below. prune.misclass appears to perform better when performance is measured by 0/1 loss, but as mentioned before, it is better to use the Gini index or deviance when growing the tree.

# Confusion matrices on the training data
train.pred1 = predict(spam.tr1, spam[-test.id,], type="class")   # deviance-pruned tree
table(train.pred1, spam$Y[-test.id])

train.pred2 = predict(spam.tr2, spam[-test.id,], type="class")   # misclass-pruned tree
table(train.pred2, spam$Y[-test.id])

# Confusion matrices on the test data
test.pred1 = predict(spam.tr1, spam[test.id,], type="class")
table(test.pred1, spam$Y[test.id])

test.pred2 = predict(spam.tr2, spam[test.id,], type="class")
table(test.pred2, spam$Y[test.id])
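
To turn these tables into single numbers, the 0/1 error rates can be computed directly (a small addition to the original code):

# Misclassification rates for the two pruned trees on the training and test data
c(train1 = mean(train.pred1 != spam$Y[-test.id]),
  train2 = mean(train.pred2 != spam$Y[-test.id]),
  test1  = mean(test.pred1 != spam$Y[test.id]),
  test2  = mean(test.pred2 != spam$Y[test.id]))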
