In our code for analyzing the spam data, we use the tree package, which differs from the rpart package we used for regression trees. Note that rpart can also be used for classification tasks; I encourage you to explore the rpart package yourself.
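Since rpart can also fit classification trees, here is a minimal sketch (assuming the same spam data frame and test.id index used below; the cp value is illustrative):

```r
library(rpart)

# rpart fits a classification tree when method = "class" is given
# (or when the response Y is a factor); cp controls tree size,
# playing a role loosely similar to mindev in tree()
spam.rpart = rpart(Y ~ ., data = spam[-test.id, ],
                   method = "class", cp = 0.005)
printcp(spam.rpart)  # cross-validated error at each complexity value
```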
First, we fit a large classification tree on the spam data. You can use the summary function to obtain summary information about the tree and understand the outputs it returns.
library(tree)
spam.tr = tree(Y ~ ., spam[-test.id, ], mindev = 0.005, minsize = 2)  # grow a BIG tree
summary(spam.tr)
##
## Classification tree:
## tree(formula = Y ~ ., data = spam[-test.id, ], mindev = 0.005,
## minsize = 2)
## Variables actually used in tree construction:
## [1] "Cdollar" "Wremove" "Cexclam" "Whp" "CAPlongest"
## [6] "Wfree" "Wour" "W1999" "Wedu" "W650"
## [11] "CAPave" "Wbusiness" "W85" "Wpm" "Wgeorge"
## [16] "Wmeeting" "Wemail"
## Number of terminal nodes: 24
## Residual mean deviance: 0.3669 = 1342 / 3657
## Misclassification error rate: 0.06031 = 222 / 3681
For example, let's look at how the 'df' and 'deviance' are computed. leaf.nodes is a 24-by-6 matrix (one row per terminal node, matching the 24 terminal nodes reported above). The column 'n' gives the number of training samples in that node, so the 'n' column sums to the size of the training data. Each leaf node contributes one parameter, so the degrees of freedom ('df') equal the number of training samples minus the number of leaf nodes.
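The leaf.nodes matrix referenced above is not shown in the code; one way it can be constructed from the fitted tree (a sketch, assuming it is extracted from spam.tr$frame) is:

```r
# Terminal nodes are the rows of the tree frame whose split variable is "<leaf>"
leaf.nodes = spam.tr$frame[spam.tr$frame$var == "<leaf>", ]

dim(leaf.nodes)    # one row per leaf; columns: var, n, dev, yval, splits, yprob
sum(leaf.nodes$n)  # should equal the training-set size (3681 here)

# df = training size minus number of leaves: 3681 - 24 = 3657
sum(leaf.nodes$n) - nrow(leaf.nodes)
```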
Next, let's see how the 'deviance' is computed. The quantity mydev, computed below, should agree with the deviance reported by summary(spam.tr).
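The code for mydev is not shown; a sketch of the hand computation (assuming leaf.nodes is taken from spam.tr$frame as above): each leaf contributes -2 * sum_k n_k * log(p_k), where p_k is the fitted class proportion and n_k the class count in that leaf.

```r
leaf.nodes = spam.tr$frame[spam.tr$frame$var == "<leaf>", ]

p  = leaf.nodes$yprob   # matrix of class proportions, one row per leaf
n  = leaf.nodes$n       # training samples per leaf
nk = p * n              # class counts per leaf (each row of p scaled by its n)

# na.rm = TRUE drops the NaN from 0 * log(0), which contributes 0 to the deviance
mydev = -2 * sum(nk * log(p), na.rm = TRUE)

mydev                   # should match sum(leaf.nodes$dev) and the summary output
mydev / (sum(n) - nrow(leaf.nodes))  # residual mean deviance: deviance / df
```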
Cross-validation based on deviance.
cv.spam1 = cv.tree(spam.tr)
names(cv.spam1)
cv.spam1
plot(cv.spam1$size, cv.spam1$dev, type = "b")
cv.spam1$size[which.min(cv.spam1$dev)]
Cross-validation based on misclassification rate.
Cut the tree to the desired size using two different criteria, misclassification and deviance; the resulting trees may differ slightly.
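A sketch of the misclassification-based cross-validation and the two pruning criteria (the pruned size is whatever the CV curve selects; variable names here are illustrative):

```r
# Cross-validate using misclassification rate instead of deviance
cv.spam2 = cv.tree(spam.tr, FUN = prune.misclass)
plot(cv.spam2$size, cv.spam2$dev, type = "b")  # dev holds misclass counts here
opt.size = cv.spam2$size[which.min(cv.spam2$dev)]

# Prune to the chosen size under each criterion; the trees may differ slightly
spam.tr.misclass = prune.misclass(spam.tr, best = opt.size)
spam.tr.dev      = prune.tree(spam.tr, best = opt.size)  # default: deviance
```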