To demonstrate tree classifiers we will start with the very simple example we used for GLM.

x <- 1:10
y <- c(0,0,0,0,1,0,1,1,1,1)
plot(x,y)

As you can see, at low values of x, species y is usually absent; at high values
of x it is usually present. With a tree classifier, we try and find a series of
splits that separates the presences from absences. In this example the dataset
is artificially small; by default `tree()` will not create nodes with fewer
than 5 samples. So to make things work on this example we have to override some
of the defaults as shown below.

library(tree)   # load the tree package if it is not already loaded
test <- tree(factor(y)~x,control=tree.control(nobs=10,mincut=3))

Notice that in contrast to GLM or GAM we did not have to specify a family or
link function; because the response is a factor, `tree()` fits a classification
tree directly. To see the tree, simply type its name.

test

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 10 13.860 0 ( 0.5000 0.5000 )
  2) x < 4.5 4  0.000 0 ( 1.0000 0.0000 ) *
  3) x > 4.5 6  5.407 1 ( 0.1667 0.8333 ) *

For each node, the output shows:

- the node (branch) number (e.g. 2))
- the split (e.g. x < 4.5)
- the number of samples going along that split (e.g. 4)
- the deviance associated with that split (e.g. 0.000)
- the predicted value (e.g. 0=absent)
- the fraction of samples in that branch that are absent and present (e.g. ( 1.0000 0.0000 ))
- and, IF it is a terminal node (or leaf), a "*"

The "root" node is equivalent to the null deviance model in GLM or GAM; if you made no splits, what would the deviance be? In this case we have 5 presences and 5 absences, so

$$ -2 \times \Bigl(n_i \times \log(n_i/N) + (N-n_i) \times \log\bigl((N-n_i)/N\bigr)\Bigr) $$

where $n_i$ is the number of presences and $N$ is the total number of samples. Here, with $n_i = 5$ and $N = 10$, that reduces to

-2 * 10 * log(0.5) = 13.86
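
As a quick check, here is the same calculation in R (with $n_i = 5$ presences out of $N = 10$ samples):

# null (root) deviance for 5 presences and 5 absences out of 10
n.i <- 5; N <- 10
-2 * (n.i * log(n.i/N) + (N - n.i) * log((N - n.i)/N))

[1] 13.86294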

The second branch has 4 samples, all absences, so the deviance associated with it is 0.

2) x < 4.5 4 0.000 0 ( 1.0000 0.0000 ) *

The tree keeps splitting "nodes" or branch points to make the resulting branches or leaves more "pure" (lower deviance).

There is a summary function associated with trees.

summary(test)

Classification tree:
tree(formula = factor(y) ~ x, control = tree.control(nobs = 10, mincut = 3))
Number of terminal nodes:  2
Residual mean deviance:  0.6758 = 5.407 / 8
Misclassification error rate: 0.1 = 1 / 10

The `summary()` function shows

- the formula
- the number of terminal nodes
- the residual mean deviance (the residual deviance divided by the residual degrees of freedom)
- the misclassification error rate
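
To see which of the ten samples is the one misclassified, we can tabulate the tree's fitted classes against the observations. This is a small sanity check using `predict()` with type = "class" (with no new data, the fitted values for the training samples are returned):

# fitted classes vs observed presence/absence for the toy tree
table(predicted = predict(test, type = "class"), observed = y)

The single off-diagonal count is the plot at x = 6, which is absent but falls on the presence side of the split.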

The following plot shows the split identified by the tree.
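
To draw that yourself (4.5 is the split value reported in the tree output above), something like this works:

# replot the toy data and mark the split found by the tree
plot(x, y)
abline(v = 4.5, lty = 2)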

Now let's try a real example from the `bryceveg` data, modeling the presence/absence of *Berberis repens* (`berrep`) as a function of `elev`, `slope`, and `av`.

demo <- tree(factor(berrep>0)~elev+slope+av)
summary(demo)

Classification tree:
tree(formula = factor(berrep > 0) ~ elev + slope + av)
Number of terminal nodes:  15
Residual mean deviance:  0.4907 = 71.15 / 145
Misclassification error rate: 0.1125 = 18 / 160

demo

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 160 218.200 FALSE ( 0.57500 0.42500 )
    2) elev < 8020 97 80.070 FALSE ( 0.85567 0.14433 )
      4) slope < 20 89 62.550 FALSE ( 0.88764 0.11236 )
        8) slope < 2.5 35 39.900 FALSE ( 0.74286 0.25714 )
         16) av < 0.785 22 28.840 FALSE ( 0.63636 0.36364 )
           32) elev < 7660 11 15.160 TRUE ( 0.45455 0.54545 )
             64) elev < 7370 6 7.638 FALSE ( 0.66667 0.33333 ) *
             65) elev > 7370 5 5.004 TRUE ( 0.20000 0.80000 ) *
           33) elev > 7660 11 10.430 FALSE ( 0.81818 0.18182 )
             66) av < 0.455 6 0.000 FALSE ( 1.00000 0.00000 ) *
             67) av > 0.455 5 6.730 FALSE ( 0.60000 0.40000 ) *
         17) av > 0.785 13 7.051 FALSE ( 0.92308 0.07692 ) *
        9) slope > 2.5 54 9.959 FALSE ( 0.98148 0.01852 )
         18) av < 0.7 36 0.000 FALSE ( 1.00000 0.00000 ) *
         19) av > 0.7 18 7.724 FALSE ( 0.94444 0.05556 )
           38) elev < 7490 5 5.004 FALSE ( 0.80000 0.20000 ) *
           39) elev > 7490 13 0.000 FALSE ( 1.00000 0.00000 ) *
      5) slope > 20 8 11.090 FALSE ( 0.50000 0.50000 ) *
    3) elev > 8020 63 51.670 TRUE ( 0.14286 0.85714 )
      6) av < 0.055 7 9.561 TRUE ( 0.42857 0.57143 ) *
      7) av > 0.055 56 38.140 TRUE ( 0.10714 0.89286 )
       14) av < 0.835 36 9.139 TRUE ( 0.02778 0.97222 )
         28) av < 0.215 8 6.028 TRUE ( 0.12500 0.87500 ) *
         29) av > 0.215 28 0.000 TRUE ( 0.00000 1.00000 ) *
       15) av > 0.835 20 22.490 TRUE ( 0.25000 0.75000 )
         30) av < 0.965 12 16.300 TRUE ( 0.41667 0.58333 )
           60) slope < 2.5 6 7.638 FALSE ( 0.66667 0.33333 ) *
           61) slope > 2.5 6 5.407 TRUE ( 0.16667 0.83333 ) *
         31) av > 0.965 8 0.000 TRUE ( 0.00000 1.00000 ) *

Notice how the same variables are used multiple times, often repeatedly along a single branch. With this many nodes the printed output is hard to follow, so it helps to plot the tree.

plot(demo)
text(demo)

The height of the vertical lines in the plot is proportional to the reduction in deviance, so not all lines are equal, and you can see the important ones immediately.
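
If you would rather see just the topology, without branch lengths scaled by deviance, `plot.tree()` also accepts a uniform spacing option:

# equal branch lengths instead of deviance-proportional ones
plot(demo, type = "uniform")
text(demo)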

To get a more honest estimate of how well the tree generalizes, we can cross-validate it with `cv.tree()`,

demo.cv <- cv.tree(demo)

and then plot the result. E.g.

plot(demo.cv)

The curve (or jagged line) shows where the minimum deviance occurred for the cross-validated tree. Many people find these curves difficult to evaluate, because the stair-step shape looks ambiguous at any single value. For example, what is the value at X = 4: ~150 or ~180? These curves are read as the last point along the line for any given X: thus at X = 2, Y = 147 (not 220), and at X = 4, Y = 180 (not 147).
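
If the step plot seems ambiguous, you can also print the sizes that were actually tested and their cross-validated deviances straight from the components of the cv.tree object:

# tree sizes tested and their cross-validated deviances
cbind(size = demo.cv$size, deviance = demo.cv$dev)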

In this case, the best tree would appear to be 2-3 terminal nodes. Actually,
there was no test for 3 terminal nodes, only 2 and 4, so the straight line from 2 to 4
is a little misleading. So, we'll get the best tree with two terminal nodes.
To get that tree, we use the `prune.tree()` function with a
`best=n` argument, where n = the number of terminal nodes we want.

demo.x <- prune.tree(demo,best=2)

This "pruned" tree is a tree object just like the others, and can be plotted or
summarized.
plot(demo.x)
text(demo.x)

demo.x

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

1) root 160 218.20 FALSE ( 0.5750 0.4250 )
  2) elev < 8020 97  80.07 FALSE ( 0.8557 0.1443 ) *
  3) elev > 8020 63  51.67 TRUE ( 0.1429 0.8571 ) *

summary(demo.x)

Classification tree:
snip.tree(tree = demo, nodes = c(3, 2))
Variables actually used in tree construction:
[1] "elev"
Number of terminal nodes:  2
Residual mean deviance:  0.8338 = 131.7 / 158
Misclassification error rate: 0.1438 = 23 / 160

In addition, note that the misclassification rate has gone up noticeably (23/160 versus 18/160 for the full tree). The previous number was an optimistic, over-fit value; this one is much more realistic.

As you can see, the tree is remarkably simple, with only two terminal nodes, and yet it is still fairly accurate.

Arguably, based on parsimony, we would want the smallest tree that has the minimum cross-validated deviance. As a special case, if the minimum deviance occurs for a tree with one node, then none of your models are better than random, although some might be no worse. Alternatively, in an exploratory mode, we might want the most highly-resolved tree among those with the lowest deviance (as long as that does not include the null tree with one node). For the sake of example, let's go with a four-node tree.
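
If you would rather pick that size programmatically than read it off the plot, a minimal sketch using the components of the cv.tree object is:

# smallest tree size among those achieving the minimum cross-validated deviance
# (if this is 1, the null tree, none of the models beat random)
min(demo.cv$size[demo.cv$dev == min(demo.cv$dev)])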

demo.x <- prune.tree(demo,best=4)
demo.x

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 160 218.200 FALSE ( 0.57500 0.42500 )
    2) elev < 8020 97 80.070 FALSE ( 0.85567 0.14433 )
      4) slope < 20 89 62.550 FALSE ( 0.88764 0.11236 )
        8) slope < 2.5 35 39.900 FALSE ( 0.74286 0.25714 ) *
        9) slope > 2.5 54 9.959 FALSE ( 0.98148 0.01852 ) *
      5) slope > 20 8 11.090 FALSE ( 0.50000 0.50000 ) *
    3) elev > 8020 63 51.670 TRUE ( 0.14286 0.85714 ) *

summary(demo.x)

Classification tree:
snip.tree(tree = demo, nodes = c(9, 8, 3))
Variables actually used in tree construction:
[1] "elev"  "slope"
Number of terminal nodes:  4
Residual mean deviance:  0.722 = 112.6 / 156
Misclassification error rate: 0.1438 = 23 / 160

Notice that while the residual mean deviance decreased a little (0.722 vs 0.8338), the classification accuracy did not improve at all (still 23/160 misclassified). Despite the apparent increase in detail, the larger tree is arguably no better.
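
To see that concretely, tabulate the pruned tree's fitted classes against the observations (this assumes `bryceveg` is still attached so that `berrep` is visible); the off-diagonal counts are the 23 misclassified plots:

# resubstitution confusion table for the four-leaf tree
table(predicted = predict(demo.x, type = "class"), observed = factor(berrep > 0))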

Let's try an example with topographic position, this time using
*Arctostaphylos patula*.

demo.2 <- tree(factor(arcpat>0)~elev+slope+pos)
demo.2

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 160 220.900 FALSE ( 0.53750 0.46250 )
    2) elev < 7980 96 105.700 FALSE ( 0.76042 0.23958 )
      4) elev < 6930 14 0.000 FALSE ( 1.00000 0.00000 ) *
      5) elev > 6930 82 97.320 FALSE ( 0.71951 0.28049 )
       10) elev < 7460 26 35.890 TRUE ( 0.46154 0.53846 )
         20) pos: ridge,up_slope 12 13.500 FALSE ( 0.75000 0.25000 ) *
         21) pos: bottom,low_slope,mid_slope 14 14.550 TRUE ( 0.21429 0.78571 ) *
       11) elev > 7460 56 49.380 FALSE ( 0.83929 0.16071 )
         22) slope < 2.5 22 28.840 FALSE ( 0.63636 0.36364 )
           44) elev < 7845 12 16.300 TRUE ( 0.41667 0.58333 )
             88) pos: mid_slope,ridge 6 7.638 FALSE ( 0.66667 0.33333 ) *
             89) pos: bottom,low_slope,up_slope 6 5.407 TRUE ( 0.16667 0.83333 ) *
           45) elev > 7845 10 6.502 FALSE ( 0.90000 0.10000 ) *
         23) slope > 2.5 34 9.023 FALSE ( 0.97059 0.02941 )
           46) elev < 7890 24 0.000 FALSE ( 1.00000 0.00000 ) *
           47) elev > 7890 10 6.502 FALSE ( 0.90000 0.10000 ) *
    3) elev > 7980 64 64.600 TRUE ( 0.20312 0.79688 ) *

Notice that splits on the factor `pos` are shown as the set of levels that go down that branch (e.g. `pos: ridge,up_slope`) rather than as a numeric threshold. We can cross-validate and prune this tree exactly as before.

demo2.cv <- cv.tree(demo.2)
plot(demo2.cv)

demo2.x <- prune.tree(demo.2,best=4)
plot(demo2.x)
text(demo2.x)

demo2.x

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

  1) root 160 220.900 FALSE ( 0.53750 0.46250 )
    2) elev < 7980 96 105.700 FALSE ( 0.76042 0.23958 )
      4) elev < 6930 14 0.000 FALSE ( 1.00000 0.00000 ) *
      5) elev > 6930 82 97.320 FALSE ( 0.71951 0.28049 )
       10) elev < 7460 26 35.890 TRUE ( 0.46154 0.53846 ) *
       11) elev > 7460 56 49.380 FALSE ( 0.83929 0.16071 )
         22) slope < 2.5 22 28.840 FALSE ( 0.63636 0.36364 ) *
         23) slope > 2.5 34 9.023 FALSE ( 0.97059 0.02941 ) *
    3) elev > 7980 64 64.600 TRUE ( 0.20312 0.79688 ) *

summary(demo2.x)

Classification tree:
snip.tree(tree = demo.2, nodes = c(23, 22, 10))
Variables actually used in tree construction:
[1] "elev"  "slope"
Number of terminal nodes:  5
Residual mean deviance:  0.8926 = 138.4 / 155
Misclassification error rate: 0.2125 = 34 / 160

Note that although we asked for a four-leaf tree, there was no subtree of exactly that size in the cost-complexity pruning sequence, so `prune.tree()` returned the next larger one, with five terminal nodes.

Trees can also be used for regression, predicting a continuous response rather than a class. So, let's detach our `bryceveg` data frame and attach the cover-class midpoint data frame from Lab 1.

detach(bryceveg)
attach(cover)

Because `arcpat` now holds numeric cover values rather than a factor, `tree()` automatically fits a regression tree.

demo.3 <- tree(arcpat~elev+slope+pos)
demo.3

node), split, n, deviance, yval
      * denotes terminal node

  1) root 160 101800.0 14.600
    2) elev < 7910 90 33000.0  7.289
      4) slope < 20 83 26870.0  5.795
        8) slope < 2.5 33 22500.0 12.940
         16) elev < 7370 12   203.1  1.375 *
         17) elev > 7370 21 19780.0 19.550
           34) elev < 7690 6  5243.0 55.420 *
           35) elev > 7690 15  3726.0  5.200 *
        9) slope > 2.5 50  1574.0  1.080 *
      5) slope > 20 7  3750.0 25.000 *
    3) elev > 7910 70 57810.0 24.000
      6) elev < 8650 53 48230.0 28.460
       12) pos: bottom,low_slope,up_slope 19 11240.0 15.510 *
       13) pos: mid_slope,ridge 34 32020.0 35.690
         26) elev < 8135 7  6373.0 18.000 *
         27) elev > 8135 27 22890.0 40.280
           54) elev < 8325 8  6000.0 60.000 *
           55) elev > 8325 19 12470.0 31.970
            110) slope < 2.5 6  1710.0 13.670 *
            111) slope > 2.5 13  7820.0 40.420
              222) elev < 8490 5  2938.0 27.500 *
              223) elev > 8490 8  3526.0 48.500 *
      7) elev > 8650 17  5258.0 10.120 *

For a regression tree, yval is the mean of the response within that node, so the yval at the root is simply the overall mean cover of `arcpat`:

mean(arcpat)

[1] 14.60125
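
For a regression tree, the deviance of a node is the sum of squared deviations of the response about the node mean, so the root deviance (~101800) can also be checked directly:

# root deviance = total sum of squares about the overall mean
sum((arcpat - mean(arcpat))^2)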

Exactly as we did for classification trees, we can cross-validate the regression tree as well.

demo3.cv <- cv.tree(demo.3)
plot(demo3.cv)

Based on the cross-validation plot, we'll prune to 11 terminal nodes.

demo3.x <- prune.tree(demo.3,best=11)
summary(demo3.x)

Regression tree:
snip.tree(tree = demo.3, nodes = 111)
Number of terminal nodes:  11
Residual mean deviance:  355 = 52900 / 149
Distribution of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
-6.000e+01 -7.118e+00 -1.080e+00  1.588e-15 -5.626e-01  6.949e+01
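
As a quick visual check of the pruned regression tree, you can plot observed cover against the fitted node means (with no new data, `predict()` returns the fitted values for the training samples):

# observed vs fitted cover for the pruned regression tree
plot(predict(demo3.x), arcpat, xlab = "fitted cover", ylab = "observed cover")
abline(0, 1)   # 1:1 line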