library(ggplot2)
library(rpart)
library(rpart.plot)
library(tree)
url = "https://liangfgithub.github.io/Data/HousingData.csv"
Housing = read.csv(url)

There are two R packages for tree models, tree and rpart. We will mainly use rpart. The package tree is called for its command partition.tree, which we use to generate the first figure.

The tree Package

The syntax for fitting a regression tree with tree is similar to that used for linear regression models. In this example, we use just two predictors, longitude and latitude, from the Housing data to predict the variable Y.

trfit= tree(Y ~ lon + lat, data = Housing)
small.tree = prune.tree(trfit, best = 7)
small.tree
node), split, n, deviance, yval
      * denotes terminal node

  1) root 506 84.1800 3.035  
    2) lon < -71.0667 202 16.8800 3.297 *
    3) lon > -71.0667 304 44.1400 2.860  
      6) lon < -71.0155 185 32.8300 2.752  
       12) lat < 42.241 147 27.4400 2.671  
         24) lat < 42.1698 18  0.8337 3.104 *
         25) lat > 42.1698 129 22.7700 2.611  
           50) lon < -71.0332 102 15.4200 2.707  
            100) lat < 42.2011 51  3.5830 2.525 *
            101) lat > 42.2011 51  8.4700 2.888 *
           51) lon > -71.0332 27  2.8910 2.250 *
       13) lat > 42.241 38  0.6923 3.066 *
      7) lon > -71.0155 119  5.8160 3.028 *
par(mfrow = c(1, 2))
plot(small.tree)
text(small.tree, cex = .75)

price.quantiles = cut(Housing$Y, quantile(Housing$Y, 0: 20 / 20),
  include.lowest = TRUE)
plot(Housing$lat, Housing$lon, col = grey(20: 1 / 21)[price.quantiles],
  pch = 20, ylab = "Longitude", xlab = "Latitude")
partition.tree(small.tree, ordvars = c("lat", "lon"), add = TRUE)

Detach the tree package and switch to the rpart package.

detach("package:tree")

The rpart Package

Fit a Tree

The syntax for fitting a regression tree with rpart closely resembles that used for linear regression models. In this example, we use the housing dataset to build a regression tree that predicts the target variable Y, using all other available columns as potential predictors. The default tree plot may not be visually appealing, so for better visualization, we can use the rpart.plot package.

set.seed(1234)
tr1 = rpart(Y~., data = Housing)
par(mfrow = c(1, 2))
plot(tr1)
rpart.plot(tr1)

  • Understanding the Fitted Tree:

    Starting at the root node, the average value for all observations is 3, representing 100% of the data points. The first split occurs based on the variable lsat. The left node contains roughly 35% of the samples with an average of 2.7, while the right node contains the remaining 65% with an average of 3.2. Leaf nodes appear at the bottom, and their color shading indicates the magnitude of their predictions.

Cross-validation

Cross-validation is automatically performed when using the rpart command. To inspect the cross-validation for pruning, use the printcp command, which provides a table.

  • CP means Complexity Parameter, which is a scaled version of \(\alpha\) (the price or penalty we pay for keeping a split)
  • nsplit means the number of splits; the number of leaf nodes (or size of a tree) is equal to nsplit + 1
  • rel error and xerror are the training error (RSS) and CV error, but these values are relative to a reference training error at the root node.

Connection between \(\alpha\) and CP

\[ RSS(T) + \alpha |T|, \quad \frac{RSS(T)}{RSS(\text{root})} + CP \cdot |T| \]

printcp(tr1)

Regression tree:
rpart(formula = Y ~ ., data = Housing)

Variables actually used in tree construction:
[1] age   crim  lstat nox   rm   

Root node error: 84.178/506 = 0.16636

n= 506 

         CP nsplit rel error  xerror     xstd
1  0.464749      0   1.00000 1.00266 0.074595
2  0.157620      1   0.53525 0.55009 0.037433
3  0.077637      2   0.37763 0.39504 0.034672
4  0.034542      3   0.29999 0.33653 0.031127
5  0.022416      4   0.26545 0.31014 0.031366
6  0.020665      5   0.24304 0.28779 0.027946
7  0.016709      6   0.22237 0.27085 0.026521
8  0.012136      7   0.20566 0.25969 0.025669
9  0.011932      8   0.19353 0.25558 0.025637
10 0.010000      9   0.18159 0.25558 0.025637

Pruning

We can employ the prune command to obtain the optimal subtree by providing specific Complexity Parameter (CP) values.

prune(tr1, cp = 0.3)
prune(tr1, cp = 0.2)
prune(tr1, cp = 0.156)

A CV path plot

plotcp(tr1)

If you’re uncertain whether the cross-validation error has reached its minimum, consider starting with a larger tree.

tr2 = rpart(Y~., data = Housing,
  control = list(cp = 0, xval = 10))
printcp(tr2)
plot(tr2)
plotcp(tr2)

Optimal CP Value

We can use either the minimum cross-validation error or the One Standard Error (1-SE) rule to choose the optimal CP value for pruning.

# index of CP with lowest xerror
opt = which.min(tr2$cptable[, "xerror"]) 
# the optimal CP value
tr2$cptable[opt, 4]
# upper bound for equivalent optimal xerror
tr2$cptable[opt, 4] + tr2$cptable[opt, 5]

# row IDs for CPs whose xerror are equivalent to min(xerror)
tmp.id = which(tr2$cptable[, 4] <= tr2$cptable[opt, 4] +
  tr2$cptable[opt, 5])
tmp.id = min(tmp.id)
# CP.1se = any value between row(tmp.id) and(tmp.id - 1)
CP.1se = (tr2$cptable[tmp.id -1, 1] + tr2$cptable[tmp.id, 1]) / 2

# Prune tree with CP.1se
tr3 = prune(tr2, cp = CP.1se)

More on the CP Table

Understand the relationship between the 1st column and the 3rd column of the CP table.

The difference between adjacent rel error in the table is equivalent to the corresponding CP values. This difference signifies the improvement gained in RSS by adding an additional split, effectively setting the “cost” of that split.

cbind(tr2$cptable[, 1], c(-diff(tr2$cptable[, 3]), 0))

Making Predictions

After the tree is fit, the predict function from the rpart package can be used for making predictions on new or existing data.

?predict.rpart 
LS0tCnRpdGxlOiAiKFBTTCkgUmVncmVzc2lvbiBUcmVlIgpvdXRwdXQ6CiAgaHRtbF9ub3RlYm9vazoKICAgIHRoZW1lOiByZWFkYWJsZQogICAgdG9jOiB5ZXMKICAgIHRvY19mbG9hdDogeWVzCiAgICBoaWdobGlnaHQ6IHRhbmdvCi0tLQoKCmBgYHtyfQpsaWJyYXJ5KGdncGxvdDIpCmxpYnJhcnkocnBhcnQpCmxpYnJhcnkocnBhcnQucGxvdCkKbGlicmFyeSh0cmVlKQpgYGAKCmBgYHtyfQp1cmwgPSAiaHR0cHM6Ly9saWFuZ2ZnaXRodWIuZ2l0aHViLmlvL0RhdGEvSG91c2luZ0RhdGEuY3N2IgpIb3VzaW5nID0gcmVhZC5jc3YodXJsKQpgYGAKCgpUaGVyZSBhcmUgdHdvIFIgcGFja2FnZXMgZm9yIHRyZWUgbW9kZWxzLCBgdHJlZWAgYW5kIGBycGFydGAuIFdlIHdpbGwgbWFpbmx5IHVzZSBgcnBhcnRgLiBUaGUgcGFja2FnZSBgdHJlZWAgaXMgY2FsbGVkIGZvciBpdHMgY29tbWFuZCBgcGFydGl0aW9uLnRyZWVgLCB3aGljaCB3ZSB1c2UgdG8gZ2VuZXJhdGUgdGhlIGZpcnN0IGZpZ3VyZS4KCiMjIFRoZSB0cmVlIFBhY2thZ2UKClRoZSBzeW50YXggZm9yIGZpdHRpbmcgYSByZWdyZXNzaW9uIHRyZWUgd2l0aCBgdHJlZWAgaXMgc2ltaWxhciB0byB0aGF0IHVzZWQgZm9yIGxpbmVhciByZWdyZXNzaW9uIG1vZGVscy4gSW4gdGhpcyBleGFtcGxlLCB3ZSB1c2UganVzdCB0d28gcHJlZGljdG9ycywgbG9uZ2l0dWRlIGFuZCBsYXRpdHVkZSwgZnJvbSB0aGUgSG91c2luZyBkYXRhIHRvIHByZWRpY3QgdGhlIHZhcmlhYmxlIFkuCgoKYGBge3J9CnRyZml0PSB0cmVlKFkgfiBsb24gKyBsYXQsIGRhdGEgPSBIb3VzaW5nKQpzbWFsbC50cmVlID0gcHJ1bmUudHJlZSh0cmZpdCwgYmVzdCA9IDcpCnNtYWxsLnRyZWUKYGBgCgpgYGB7cn0KcGFyKG1mcm93ID0gYygxLCAyKSkKcGxvdChzbWFsbC50cmVlKQp0ZXh0KHNtYWxsLnRyZWUsIGNleCA9IC43NSkKCnByaWNlLnF1YW50aWxlcyA9IGN1dChIb3VzaW5nJFksIHF1YW50aWxlKEhvdXNpbmckWSwgMDogMjAgLyAyMCksCiAgaW5jbHVkZS5sb3dlc3QgPSBUUlVFKQpwbG90KEhvdXNpbmckbGF0LCBIb3VzaW5nJGxvbiwgY29sID0gZ3JleSgyMDogMSAvIDIxKVtwcmljZS5xdWFudGlsZXNdLAogIHBjaCA9IDIwLCB5bGFiID0gIkxvbmdpdHVkZSIsIHhsYWIgPSAiTGF0aXR1ZGUiKQpwYXJ0aXRpb24udHJlZShzbWFsbC50cmVlLCBvcmR2YXJzID0gYygibGF0IiwgImxvbiIpLCBhZGQgPSBUUlVFKQpgYGAKRGV0YWNoIHRoZSBgdHJlZWAgcGFja2FnZSBhbmQgc3dpdGNoIHRvIHRoZSBgcnBhcnRgIHBhY2thZ2UuIAoKYGBge3J9CmRldGFjaCgicGFja2FnZTp0cmVlIikKYGBgCgojIyBUaGUgcnBhcnQgUGFja2FnZQoKIyMjIEZpdCBhIFRyZWUKClRoZSBzeW50YXggZm9yIGZpdHRpbmcgYSByZWdyZXNzaW9uIHRyZWUgd2l0aCBgcnBhcnRgIGNsb3NlbHkgcmVzZW1ibGVzIHRoYXQgdXNlZCBmb3IgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWxzLiBJbiB0aGlzIGV4YW1wbGUsIHdlIHVzZSB0aGUgaG91c2luZyBkYXRhc2V0IHRvIGJ1aWxkIGEgcmVncmVzc2lvbiB0cmVlIHRoYXQgcHJlZGljdHMgdGhlIHRhcmdldCB2YXJpYWJsZSBZLCB1c2luZyBhbGwgb3RoZXIgYXZhaWxhYmxlIGNvbHVtbnMgYXMgcG90ZW50aWFsIHByZWRpY3RvcnMuIFRoZSBkZWZhdWx0IHRyZWUgcGxvdCBtYXkgbm90IGJlIHZpc3VhbGx5IGFwcGVhbGluZywgc28gZm9yIGJldHRlciB2aXN1YWxpemF0aW9uLCB3ZSBjYW4gdXNlIHRoZSBgcnBhcnQucGxvdGAgcGFja2FnZS4KCgpgYGB7cn0Kc2V0LnNlZWQoMTIzNCkKdHIxID0gcnBhcnQoWX4uLCBkYXRhID0gSG91c2luZykKcGFyKG1mcm93ID0gYygxLCAyKSkKcGxvdCh0cjEpCnJwYXJ0LnBsb3QodHIxKQpgYGAKCgotIFVuZGVyc3RhbmRpbmcgdGhlIEZpdHRlZCBUcmVlOgogIAogIFN0YXJ0aW5nIGF0IHRoZSByb290IG5vZGUsIHRoZSBhdmVyYWdlIHZhbHVlIGZvciBhbGwgb2JzZXJ2YXRpb25zIGlzIDMsIHJlcHJlc2VudGluZyAxMDAlIG9mIHRoZSBkYXRhIHBvaW50cy4gVGhlIGZpcnN0IHNwbGl0IG9jY3VycyBiYXNlZCBvbiB0aGUgdmFyaWFibGUgYGxzYXRgLiBUaGUgbGVmdCBub2RlIGNvbnRhaW5zIHJvdWdobHkgMzUlIG9mIHRoZSBzYW1wbGVzIHdpdGggYW4gYXZlcmFnZSBvZiAyLjcsIHdoaWxlIHRoZSByaWdodCBub2RlIGNvbnRhaW5zIHRoZSByZW1haW5pbmcgNjUlIHdpdGggYW4gYXZlcmFnZSBvZiAzLjIuIExlYWYgbm9kZXMgYXBwZWFyIGF0IHRoZSBib3R0b20sIGFuZCB0aGVpciBjb2xvciBzaGFkaW5nIGluZGljYXRlcyB0aGUgbWFnbml0dWRlIG9mIHRoZWlyIHByZWRpY3Rpb25zLgoKIyMjIENyb3NzLXZhbGlkYXRpb24KCkNyb3NzLXZhbGlkYXRpb24gaXMgYXV0b21hdGljYWxseSBwZXJmb3JtZWQgd2hlbiB1c2luZyB0aGUgYHJwYXJ0YCBjb21tYW5kLiBUbyBpbnNwZWN0IHRoZSBjcm9zcy12YWxpZGF0aW9uIGZvciBwcnVuaW5nLCB1c2UgdGhlIGBwcmludGNwYCBjb21tYW5kLCB3aGljaCBwcm92aWRlcyBhIHRhYmxlLiAKCi0gYENQYCBtZWFucyAqKkNvbXBsZXhpdHkgUGFyYW1ldGVyKiosIHdoaWNoIGlzIGEgc2NhbGVkIHZlcnNpb24gb2YgJFxhbHBoYSQgKHRoZSBwcmljZSBvciBwZW5hbHR5IHdlIHBheSBmb3Iga2VlcGluZyBhIHNwbGl0KQotIGBuc3BsaXRgIG1lYW5zIHRoZSBudW1iZXIgb2Ygc3BsaXRzOyB0aGUgbnVtYmVyIG9mIGxlYWYgbm9kZXMgKG9yIHNpemUgb2YgYSB0cmVlKSBpcyBlcXVhbCB0byBgbnNwbGl0YCArIDEgCi0gYHJlbCBlcnJvcmAgYW5kIGB4ZXJyb3JgIGFyZSB0aGUgdHJhaW5pbmcgZXJyb3IgKFJTUykgYW5kIENWIGVycm9yLCBidXQgdGhlc2UgdmFsdWVzIGFyZSByZWxhdGl2ZSB0byBhIHJlZmVyZW5jZSB0cmFpbmluZyBlcnJvciBhdCB0aGUgcm9vdCBub2RlLiAKCkNvbm5lY3Rpb24gYmV0d2VlbiAkXGFscGhhJCBhbmQgQ1AKCiQkClJTUyhUKSArIFxhbHBoYSB8VHwsIFxxdWFkIFxmcmFje1JTUyhUKX17UlNTKFx0ZXh0e3Jvb3R9KX0gKyBDUCBcY2RvdCB8VHwKJCQKCmBgYHtyfQpwcmludGNwKHRyMSkKYGBgCgojIyMgUHJ1bmluZwoKV2UgY2FuIGVtcGxveSB0aGUgYHBydW5lYCBjb21tYW5kIHRvIG9idGFpbiB0aGUgb3B0aW1hbCBzdWJ0cmVlIGJ5IHByb3ZpZGluZyBzcGVjaWZpYyBDb21wbGV4aXR5IFBhcmFtZXRlciAoQ1ApIHZhbHVlcy4KCmBgYHtyLCBldmFsPUZBTFNFfQpwcnVuZSh0cjEsIGNwID0gMC4zKQpwcnVuZSh0cjEsIGNwID0gMC4yKQpwcnVuZSh0cjEsIGNwID0gMC4xNTYpCmBgYAoKCkEgQ1YgcGF0aCBwbG90CgpgYGB7cn0KcGxvdGNwKHRyMSkKYGBgCgoKSWYgeW91J3JlIHVuY2VydGFpbiB3aGV0aGVyIHRoZSBjcm9zcy12YWxpZGF0aW9uIGVycm9yIGhhcyByZWFjaGVkIGl0cyBtaW5pbXVtLCBjb25zaWRlciBzdGFydGluZyB3aXRoIGEgbGFyZ2VyIHRyZWUuCgpgYGB7ciwgZXZhbD1GQUxTRX0KdHIyID0gcnBhcnQoWX4uLCBkYXRhID0gSG91c2luZywKICBjb250cm9sID0gbGlzdChjcCA9IDAsIHh2YWwgPSAxMCkpCnByaW50Y3AodHIyKQpwbG90KHRyMikKcGxvdGNwKHRyMikKYGBgCgoKIyMjIE9wdGltYWwgQ1AgVmFsdWUKCldlIGNhbiB1c2UgZWl0aGVyIHRoZSBtaW5pbXVtIGNyb3NzLXZhbGlkYXRpb24gZXJyb3Igb3IgdGhlIE9uZSBTdGFuZGFyZCBFcnJvciAoMS1TRSkgcnVsZSB0byBjaG9vc2UgdGhlIG9wdGltYWwgQ1AgdmFsdWUgZm9yIHBydW5pbmcuCiAKYGBge3IsIGV2YWw9RkFMU0V9CiMgaW5kZXggb2YgQ1Agd2l0aCBsb3dlc3QgeGVycm9yCm9wdCA9IHdoaWNoLm1pbih0cjIkY3B0YWJsZVssICJ4ZXJyb3IiXSkgCiMgdGhlIG9wdGltYWwgQ1AgdmFsdWUKdHIyJGNwdGFibGVbb3B0LCA0XQpgYGAKCgpgYGB7ciwgZXZhbD1GQUxTRX0KIyB1cHBlciBib3VuZCBmb3IgZXF1aXZhbGVudCBvcHRpbWFsIHhlcnJvcgp0cjIkY3B0YWJsZVtvcHQsIDRdICsgdHIyJGNwdGFibGVbb3B0LCA1XQoKIyByb3cgSURzIGZvciBDUHMgd2hvc2UgeGVycm9yIGFyZSBlcXVpdmFsZW50IHRvIG1pbih4ZXJyb3IpCnRtcC5pZCA9IHdoaWNoKHRyMiRjcHRhYmxlWywgNF0gPD0gdHIyJGNwdGFibGVbb3B0LCA0XSArCiAgdHIyJGNwdGFibGVbb3B0LCA1XSkKdG1wLmlkID0gbWluKHRtcC5pZCkKIyBDUC4xc2UgPSBhbnkgdmFsdWUgYmV0d2VlbiByb3codG1wLmlkKSBhbmQodG1wLmlkIC0gMSkKQ1AuMXNlID0gKHRyMiRjcHRhYmxlW3RtcC5pZCAtMSwgMV0gKyB0cjIkY3B0YWJsZVt0bXAuaWQsIDFdKSAvIDIKCiMgUHJ1bmUgdHJlZSB3aXRoIENQLjFzZQp0cjMgPSBwcnVuZSh0cjIsIGNwID0gQ1AuMXNlKQpgYGAKCgojIyMgTW9yZSBvbiB0aGUgQ1AgVGFibGUKClVuZGVyc3RhbmQgdGhlIHJlbGF0aW9uc2hpcCBiZXR3ZWVuIHRoZSAxc3QgY29sdW1uIGFuZCB0aGUgM3JkIGNvbHVtbiBvZiB0aGUgQ1AgdGFibGUuCgpUaGUgZGlmZmVyZW5jZSBiZXR3ZWVuIGFkamFjZW50IGByZWwgZXJyb3JgIGluIHRoZSB0YWJsZSBpcyBlcXVpdmFsZW50IHRvIHRoZSBjb3JyZXNwb25kaW5nIENQIHZhbHVlcy4gVGhpcyBkaWZmZXJlbmNlIHNpZ25pZmllcyB0aGUgaW1wcm92ZW1lbnQgZ2FpbmVkIGluIFJTUyBieSBhZGRpbmcgYW4gYWRkaXRpb25hbCBzcGxpdCwgZWZmZWN0aXZlbHkgc2V0dGluZyB0aGUgImNvc3QiIG9mIHRoYXQgc3BsaXQuCgoKYGBge3IsIGV2YWw9RkFMU0V9CmNiaW5kKHRyMiRjcHRhYmxlWywgMV0sIGMoLWRpZmYodHIyJGNwdGFibGVbLCAzXSksIDApKQpgYGAKCgojIyMgTWFraW5nIFByZWRpY3Rpb25zCgoKQWZ0ZXIgdGhlIHRyZWUgaXMgZml0LCB0aGUgYHByZWRpY3RgIGZ1bmN0aW9uIGZyb20gdGhlIGBycGFydGAgcGFja2FnZSBjYW4gYmUgdXNlZCBmb3IgbWFraW5nIHByZWRpY3Rpb25zIG9uIG5ldyBvciBleGlzdGluZyBkYXRhLgoKYGBge3IsIGV2YWw9RkFMU0V9Cj9wcmVkaWN0LnJwYXJ0IApgYGAK