library(ggplot2)
library(rpart)
library(rpart.plot)
library(tree)
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = read.csv(url)
There are two R packages for tree models, tree
and
rpart
. We will mainly use rpart
. The package
tree
is called for its command partition.tree
,
which we use to generate the first figure.
The tree Package
The syntax for fitting a regression tree with tree
is
similar to that used for linear regression models. In this example, we
use just two predictors, longitude and latitude, from the Housing data
to predict the variable Y.
trfit= tree(Y ~ lon + lat, data = Housing)
small.tree = prune.tree(trfit, best = 7)
small.tree
node), split, n, deviance, yval
* denotes terminal node
1) root 506 84.1800 3.035
2) lon < -71.0667 202 16.8800 3.297 *
3) lon > -71.0667 304 44.1400 2.860
6) lon < -71.0155 185 32.8300 2.752
12) lat < 42.241 147 27.4400 2.671
24) lat < 42.1698 18 0.8337 3.104 *
25) lat > 42.1698 129 22.7700 2.611
50) lon < -71.0332 102 15.4200 2.707
100) lat < 42.2011 51 3.5830 2.525 *
101) lat > 42.2011 51 8.4700 2.888 *
51) lon > -71.0332 27 2.8910 2.250 *
13) lat > 42.241 38 0.6923 3.066 *
7) lon > -71.0155 119 5.8160 3.028 *
par(mfrow = c(1, 2))
plot(small.tree)
text(small.tree, cex = .75)
price.quantiles = cut(Housing$Y, quantile(Housing$Y, 0: 20 / 20),
include.lowest = TRUE)
plot(Housing$lat, Housing$lon, col = grey(20: 1 / 21)[price.quantiles],
pch = 20, ylab = "Longitude", xlab = "Latitude")
partition.tree(small.tree, ordvars = c("lat", "lon"), add = TRUE)
Detach the tree
package and switch to the
rpart
package.
The rpart Package
Fit a Tree
The syntax for fitting a regression tree with rpart
closely resembles that used for linear regression models. In this
example, we use the housing dataset to build a regression tree that
predicts the target variable Y, using all other available columns as
potential predictors. The default tree plot may not be visually
appealing, so for better visualization, we can use the
rpart.plot
package.
set.seed(1234)
tr1 = rpart(Y~., data = Housing)
par(mfrow = c(1, 2))
plot(tr1)
rpart.plot(tr1)
Understanding the Fitted Tree:
Starting at the root node, the average value for all observations is
3, representing 100% of the data points. The first split occurs based on
the variable lsat
. The left node contains roughly 35% of
the samples with an average of 2.7, while the right node contains the
remaining 65% with an average of 3.2. Leaf nodes appear at the bottom,
and their color shading indicates the magnitude of their
predictions.
Cross-validation
Cross-validation is automatically performed when using the
rpart
command. To inspect the cross-validation for pruning,
use the printcp
command, which provides a table.
CP
means Complexity Parameter, which
is a scaled version of \(\alpha\) (the
price or penalty we pay for keeping a split)
nsplit
means the number of splits; the number of leaf
nodes (or size of a tree) is equal to nsplit
+ 1
rel error
and xerror
are the training
error (RSS) and CV error, but these values are relative to a reference
training error at the root node.
Connection between \(\alpha\) and
CP
\[
RSS(T) + \alpha |T|, \quad \frac{RSS(T)}{RSS(\text{root})} + CP \cdot
|T|
\]
Regression tree:
rpart(formula = Y ~ ., data = Housing)
Variables actually used in tree construction:
[1] age crim lstat nox rm
Root node error: 84.178/506 = 0.16636
n= 506
CP nsplit rel error xerror xstd
1 0.464749 0 1.00000 1.00266 0.074595
2 0.157620 1 0.53525 0.55009 0.037433
3 0.077637 2 0.37763 0.39504 0.034672
4 0.034542 3 0.29999 0.33653 0.031127
5 0.022416 4 0.26545 0.31014 0.031366
6 0.020665 5 0.24304 0.28779 0.027946
7 0.016709 6 0.22237 0.27085 0.026521
8 0.012136 7 0.20566 0.25969 0.025669
9 0.011932 8 0.19353 0.25558 0.025637
10 0.010000 9 0.18159 0.25558 0.025637
Pruning
We can employ the prune
command to obtain the optimal
subtree by providing specific Complexity Parameter (CP) values.
prune(tr1, cp = 0.3)
prune(tr1, cp = 0.2)
prune(tr1, cp = 0.156)
A CV path plot
If you’re uncertain whether the cross-validation error has reached
its minimum, consider starting with a larger tree.
tr2 = rpart(Y~., data = Housing,
control = list(cp = 0, xval = 10))
printcp(tr2)
plot(tr2)
plotcp(tr2)
Optimal CP Value
We can use either the minimum cross-validation error or the One
Standard Error (1-SE) rule to choose the optimal CP value for
pruning.
# index of CP with lowest xerror
opt = which.min(tr2$cptable[, "xerror"])
# the optimal CP value
tr2$cptable[opt, 4]
# upper bound for equivalent optimal xerror
tr2$cptable[opt, 4] + tr2$cptable[opt, 5]
# row IDs for CPs whose xerror are equivalent to min(xerror)
tmp.id = which(tr2$cptable[, 4] <= tr2$cptable[opt, 4] +
tr2$cptable[opt, 5])
tmp.id = min(tmp.id)
# CP.1se = any value between row(tmp.id) and(tmp.id - 1)
CP.1se = (tr2$cptable[tmp.id -1, 1] + tr2$cptable[tmp.id, 1]) / 2
# Prune tree with CP.1se
tr3 = prune(tr2, cp = CP.1se)
More on the CP Table
Understand the relationship between the 1st column and the 3rd column
of the CP table.
The difference between adjacent rel error
in the table
is equivalent to the corresponding CP values. This difference signifies
the improvement gained in RSS by adding an additional split, effectively
setting the “cost” of that split.
cbind(tr2$cptable[, 1], c(-diff(tr2$cptable[, 3]), 0))
Making Predictions
After the tree is fit, the predict
function from the
rpart
package can be used for making predictions on new or
existing data.
LS0tCnRpdGxlOiAiKFBTTCkgUmVncmVzc2lvbiBUcmVlIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIHRoZW1lOiByZWFkYWJsZQogICAgdG9jOiB5ZXMKICAgIHRvY19mbG9hdDogeWVzCiAgICBoaWdobGlnaHQ6IHRhbmdvCi0tLQoKCmBgYHtyfQpsaWJyYXJ5KGdncGxvdDIpCmxpYnJhcnkocnBhcnQpCmxpYnJhcnkocnBhcnQucGxvdCkKbGlicmFyeSh0cmVlKQpgYGAKCmBgYHtyfQp1cmwgPSAiaHR0cHM6Ly9saWFuZ2ZnaXRodWIuZ2l0aHViLmlvL0RhdGEvSG91c2luZ0RhdGEuY3N2IgpIb3VzaW5nID0gcmVhZC5jc3YodXJsKQpgYGAKCgpUaGVyZSBhcmUgdHdvIFIgcGFja2FnZXMgZm9yIHRyZWUgbW9kZWxzLCBgdHJlZWAgYW5kIGBycGFydGAuIFdlIHdpbGwgbWFpbmx5IHVzZSBgcnBhcnRgLiBUaGUgcGFja2FnZSBgdHJlZWAgaXMgY2FsbGVkIGZvciBpdHMgY29tbWFuZCBgcGFydGl0aW9uLnRyZWVgLCB3aGljaCB3ZSB1c2UgdG8gZ2VuZXJhdGUgdGhlIGZpcnN0IGZpZ3VyZS4KCiMjIFRoZSB0cmVlIFBhY2thZ2UKClRoZSBzeW50YXggZm9yIGZpdHRpbmcgYSByZWdyZXNzaW9uIHRyZWUgd2l0aCBgdHJlZWAgaXMgc2ltaWxhciB0byB0aGF0IHVzZWQgZm9yIGxpbmVhciByZWdyZXNzaW9uIG1vZGVscy4gSW4gdGhpcyBleGFtcGxlLCB3ZSB1c2UganVzdCB0d28gcHJlZGljdG9ycywgbG9uZ2l0dWRlIGFuZCBsYXRpdHVkZSwgZnJvbSB0aGUgSG91c2luZyBkYXRhIHRvIHByZWRpY3QgdGhlIHZhcmlhYmxlIFkuCgoKYGBge3J9CnRyZml0PSB0cmVlKFkgfiBsb24gKyBsYXQsIGRhdGEgPSBIb3VzaW5nKQpzbWFsbC50cmVlID0gcHJ1bmUudHJlZSh0cmZpdCwgYmVzdCA9IDcpCnNtYWxsLnRyZWUKYGBgCgpgYGB7cn0KcGFyKG1mcm93ID0gYygxLCAyKSkKcGxvdChzbWFsbC50cmVlKQp0ZXh0KHNtYWxsLnRyZWUsIGNleCA9IC43NSkKCnByaWNlLnF1YW50aWxlcyA9IGN1dChIb3VzaW5nJFksIHF1YW50aWxlKEhvdXNpbmckWSwgMDogMjAgLyAyMCksCiAgaW5jbHVkZS5sb3dlc3QgPSBUUlVFKQpwbG90KEhvdXNpbmckbGF0LCBIb3VzaW5nJGxvbiwgY29sID0gZ3JleSgyMDogMSAvIDIxKVtwcmljZS5xdWFudGlsZXNdLAogIHBjaCA9IDIwLCB5bGFiID0gIkxvbmdpdHVkZSIsIHhsYWIgPSAiTGF0aXR1ZGUiKQpwYXJ0aXRpb24udHJlZShzbWFsbC50cmVlLCBvcmR2YXJzID0gYygibGF0IiwgImxvbiIpLCBhZGQgPSBUUlVFKQpgYGAKRGV0YWNoIHRoZSBgdHJlZWAgcGFja2FnZSBhbmQgc3dpdGNoIHRvIHRoZSBgcnBhcnRgIHBhY2thZ2UuIAoKYGBge3J9CmRldGFjaCgicGFja2FnZTp0cmVlIikKYGBgCgojIyBUaGUgcnBhcnQgUGFja2FnZQoKIyMjIEZpdCBhIFRyZWUKClRoZSBzeW50YXggZm9yIGZpdHRpbmcgYSByZWdyZXNzaW9uIHRyZWUgd2l0aCBgcnBhcnRgIGNsb3NlbHkgcmVzZW1ibGVzIHRoYXQgdXNlZCBmb3IgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWxzLiBJbiB0aGlzIGV4YW1wbGUsIHdlIHVzZSB0aGUgaG91c2luZyBkYXRhc2V0IHRvIGJ1aWxkIGEgcmVncmVzc2lvbiB0cmVlIHRoYXQgcHJlZGljdHMgdGhlIHRhcmdldCB2YXJpYWJsZSBZLCB1c2luZyBhbGwgb3RoZXIgYXZhaWxhYmxlIGNvbHVtbnMgYXMgcG90ZW50aWFsIHByZWRpY3RvcnMuIFRoZSBkZWZhdWx0IHRyZWUgcGxvdCBtYXkgbm90IGJlIHZpc3VhbGx5IGFwcGVhbGluZywgc28gZm9yIGJldHRlciB2aXN1YWxpemF0aW9uLCB3ZSBjYW4gdXNlIHRoZSBgcnBhcnQucGxvdGAgcGFja2FnZS4KCgpgYGB7cn0Kc2V0LnNlZWQoMTIzNCkKdHIxID0gcnBhcnQoWX4uLCBkYXRhID0gSG91c2luZykKcGFyKG1mcm93ID0gYygxLCAyKSkKcGxvdCh0cjEpCnJwYXJ0LnBsb3QodHIxKQpgYGAKCgotIFVuZGVyc3RhbmRpbmcgdGhlIEZpdHRlZCBUcmVlOgogIAogIFN0YXJ0aW5nIGF0IHRoZSByb290IG5vZGUsIHRoZSBhdmVyYWdlIHZhbHVlIGZvciBhbGwgb2JzZXJ2YXRpb25zIGlzIDMsIHJlcHJlc2VudGluZyAxMDAlIG9mIHRoZSBkYXRhIHBvaW50cy4gVGhlIGZpcnN0IHNwbGl0IG9jY3VycyBiYXNlZCBvbiB0aGUgdmFyaWFibGUgYGxzYXRgLiBUaGUgbGVmdCBub2RlIGNvbnRhaW5zIHJvdWdobHkgMzUlIG9mIHRoZSBzYW1wbGVzIHdpdGggYW4gYXZlcmFnZSBvZiAyLjcsIHdoaWxlIHRoZSByaWdodCBub2RlIGNvbnRhaW5zIHRoZSByZW1haW5pbmcgNjUlIHdpdGggYW4gYXZlcmFnZSBvZiAzLjIuIExlYWYgbm9kZXMgYXBwZWFyIGF0IHRoZSBib3R0b20sIGFuZCB0aGVpciBjb2xvciBzaGFkaW5nIGluZGljYXRlcyB0aGUgbWFnbml0dWRlIG9mIHRoZWlyIHByZWRpY3Rpb25zLgoKIyMjIENyb3NzLXZhbGlkYXRpb24KCkNyb3NzLXZhbGlkYXRpb24gaXMgYXV0b21hdGljYWxseSBwZXJmb3JtZWQgd2hlbiB1c2luZyB0aGUgYHJwYXJ0YCBjb21tYW5kLiBUbyBpbnNwZWN0IHRoZSBjcm9zcy12YWxpZGF0aW9uIGZvciBwcnVuaW5nLCB1c2UgdGhlIGBwcmludGNwYCBjb21tYW5kLCB3aGljaCBwcm92aWRlcyBhIHRhYmxlLiAKCi0gYENQYCBtZWFucyAqKkNvbXBsZXhpdHkgUGFyYW1ldGVyKiosIHdoaWNoIGlzIGEgc2NhbGVkIHZlcnNpb24gb2YgJFxhbHBoYSQgKHRoZSBwcmljZSBvciBwZW5hbHR5IHdlIHBheSBmb3Iga2VlcGluZyBhIHNwbGl0KQotIGBuc3BsaXRgIG1lYW5zIHRoZSBudW1iZXIgb2Ygc3BsaXRzOyB0aGUgbnVtYmVyIG9mIGxlYWYgbm9kZXMgKG9yIHNpemUgb2YgYSB0cmVlKSBpcyBlcXVhbCB0byBgbnNwbGl0YCArIDEgCi0gYHJlbCBlcnJvcmAgYW5kIGB4ZXJyb3JgIGFyZSB0aGUgdHJhaW5pbmcgZXJyb3IgKFJTUykgYW5kIENWIGVycm9yLCBidXQgdGhlc2UgdmFsdWVzIGFyZSByZWxhdGl2ZSB0byBhIHJlZmVyZW5jZSB0cmFpbmluZyBlcnJvciBhdCB0aGUgcm9vdCBub2RlLiAKCkNvbm5lY3Rpb24gYmV0d2VlbiAkXGFscGhhJCBhbmQgQ1AKCiQkClJTUyhUKSArIFxhbHBoYSB8VHwsIFxxdWFkIFxmcmFje1JTUyhUKX17UlNTKFx0ZXh0e3Jvb3R9KX0gKyBDUCBcY2RvdCB8VHwKJCQKCmBgYHtyfQpwcmludGNwKHRyMSkKYGBgCgojIyMgUHJ1bmluZwoKV2UgY2FuIGVtcGxveSB0aGUgYHBydW5lYCBjb21tYW5kIHRvIG9idGFpbiB0aGUgb3B0aW1hbCBzdWJ0cmVlIGJ5IHByb3ZpZGluZyBzcGVjaWZpYyBDb21wbGV4aXR5IFBhcmFtZXRlciAoQ1ApIHZhbHVlcy4KCmBgYHtyLCBldmFsPUZBTFNFfQpwcnVuZSh0cjEsIGNwID0gMC4zKQpwcnVuZSh0cjEsIGNwID0gMC4yKQpwcnVuZSh0cjEsIGNwID0gMC4xNTYpCmBgYAoKCkEgQ1YgcGF0aCBwbG90CgpgYGB7cn0KcGxvdGNwKHRyMSkKYGBgCgoKSWYgeW91J3JlIHVuY2VydGFpbiB3aGV0aGVyIHRoZSBjcm9zcy12YWxpZGF0aW9uIGVycm9yIGhhcyByZWFjaGVkIGl0cyBtaW5pbXVtLCBjb25zaWRlciBzdGFydGluZyB3aXRoIGEgbGFyZ2VyIHRyZWUuCgpgYGB7ciwgZXZhbD1GQUxTRX0KdHIyID0gcnBhcnQoWX4uLCBkYXRhID0gSG91c2luZywKICBjb250cm9sID0gbGlzdChjcCA9IDAsIHh2YWwgPSAxMCkpCnByaW50Y3AodHIyKQpwbG90KHRyMikKcGxvdGNwKHRyMikKYGBgCgoKIyMjIE9wdGltYWwgQ1AgVmFsdWUKCldlIGNhbiB1c2UgZWl0aGVyIHRoZSBtaW5pbXVtIGNyb3NzLXZhbGlkYXRpb24gZXJyb3Igb3IgdGhlIE9uZSBTdGFuZGFyZCBFcnJvciAoMS1TRSkgcnVsZSB0byBjaG9vc2UgdGhlIG9wdGltYWwgQ1AgdmFsdWUgZm9yIHBydW5pbmcuCiAKYGBge3IsIGV2YWw9RkFMU0V9CiMgaW5kZXggb2YgQ1Agd2l0aCBsb3dlc3QgeGVycm9yCm9wdCA9IHdoaWNoLm1pbih0cjIkY3B0YWJsZVssICJ4ZXJyb3IiXSkgCiMgdGhlIG9wdGltYWwgQ1AgdmFsdWUKdHIyJGNwdGFibGVbb3B0LCA0XQpgYGAKCgpgYGB7ciwgZXZhbD1GQUxTRX0KIyB1cHBlciBib3VuZCBmb3IgZXF1aXZhbGVudCBvcHRpbWFsIHhlcnJvcgp0cjIkY3B0YWJsZVtvcHQsIDRdICsgdHIyJGNwdGFibGVbb3B0LCA1XQoKIyByb3cgSURzIGZvciBDUHMgd2hvc2UgeGVycm9yIGFyZSBlcXVpdmFsZW50IHRvIG1pbih4ZXJyb3IpCnRtcC5pZCA9IHdoaWNoKHRyMiRjcHRhYmxlWywgNF0gPD0gdHIyJGNwdGFibGVbb3B0LCA0XSArCiAgdHIyJGNwdGFibGVbb3B0LCA1XSkKdG1wLmlkID0gbWluKHRtcC5pZCkKIyBDUC4xc2UgPSBhbnkgdmFsdWUgYmV0d2VlbiByb3codG1wLmlkKSBhbmQodG1wLmlkIC0gMSkKQ1AuMXNlID0gKHRyMiRjcHRhYmxlW3RtcC5pZCAtMSwgMV0gKyB0cjIkY3B0YWJsZVt0bXAuaWQsIDFdKSAvIDIKCiMgUHJ1bmUgdHJlZSB3aXRoIENQLjFzZQp0cjMgPSBwcnVuZSh0cjIsIGNwID0gQ1AuMXNlKQpgYGAKCgojIyMgTW9yZSBvbiB0aGUgQ1AgVGFibGUKClVuZGVyc3RhbmQgdGhlIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIHRoZSAxc3QgY29sdW1uIGFuZCB0aGUgM3JkIGNvbHVtbiBvZiB0aGUgQ1AgdGFibGUuCgpUaGUgZGlmZmVyZW5jZSBiZXR3ZWVuIGFkamFjZW50IGByZWwgZXJyb3JgIGluIHRoZSB0YWJsZSBpcyBlcXVpdmFsZW50IHRvIHRoZSBjb3JyZXNwb25kaW5nIENQIHZhbHVlcy4gVGhpcyBkaWZmZXJlbmNlIHNpZ25pZmllcyB0aGUgaW1wcm92ZW1lbnQgZ2FpbmVkIGluIFJTUyBieSBhZGRpbmcgYW4gYWRkaXRpb25hbCBzcGxpdCwgZWZmZWN0aXZlbHkgc2V0dGluZyB0aGUgImNvc3QiIG9mIHRoYXQgc3BsaXQuCgoKYGBge3IsIGV2YWw9RkFMU0V9CmNiaW5kKHRyMiRjcHRhYmxlWywgMV0sIGMoLWRpZmYodHIyJGNwdGFibGVbLCAzXSksIDApKQpgYGAKCgojIyMgTWFraW5nIFByZWRpY3Rpb25zCgoKQWZ0ZXIgdGhlIHRyZWUgaXMgZml0LCB0aGUgYHByZWRpY3RgIGZ1bmN0aW9uIGZyb20gdGhlIGBycGFydGAgcGFja2FnZSBjYW4gYmUgdXNlZCBmb3IgbWFraW5nIHByZWRpY3Rpb25zIG9uIG5ldyBvciBleGlzdGluZyBkYXRhLgoKYGBge3IsIGV2YWw9RkFMU0V9Cj9wcmVkaWN0LnJwYXJ0IApgYGAK