Random Forests – trees everywhere
(3) Grow a single decision tree using the data and features selected;
don’t bother with pruning or finding the optimal size or complexity of
the tree.
(4) For this single tree above, use the OOB data to estimate the error value of
the tree.
(5) Do steps (1) through (4) again (and again); the default number of
repetitions is often 500, but more can be grown if necessary; the collec-
tion of trees is called a random forest.
(6) To classify a new case drop it down all the trees in the forest, and
take the simple majority vote across all trees to make a decision for
that case.
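As a concrete illustration of steps (1) through (6), here is a minimal sketch using the randomForest package in R (the package used later in this chapter); the data frame x and the factor y are invented purely for illustration.

```r
# Minimal sketch of the forest-growing scheme; x and y are invented data.
library(randomForest)

set.seed(42)
x <- data.frame(matrix(rnorm(200 * 10), nrow = 200))  # 200 cases, 10 features
y <- factor(sample(c("case", "control"), 200, replace = TRUE))

# Steps (1)-(5): each tree is grown on a bootstrap sample of the cases,
# with a random subset of features tried at each node; trees are unpruned.
# The default of 500 trees matches the text.
rf <- randomForest(x, y, ntree = 500)

# Step (4): the OOB estimate of the error, aggregated over all trees.
rf$err.rate[rf$ntree, "OOB"]

# Step (6): drop new cases down all the trees; majority vote decides.
new_cases <- data.frame(matrix(rnorm(5 * 10), nrow = 5))
predict(rf, new_cases)
```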
There continue to be additions and important refinements to the scheme
above, but RF is basically just that simple; see also Strobl et al. (2008, 2009). It
is not clear from this description why such a simple method should prove to
be proficient in so many practical applications to biomedical (and other)
data. We take up this question shortly, but first let’s go over the steps in more detail.
This method seems sound enough, but there is a problem: after the first boot draw, the remaining cases form two groups with a size ratio that is different from the original group ratio. This means that the case testings (predictions) are done on groups whose relative sizes are not representative of the original groups.
As an illustration, suppose the two groups are of size 25 and 300. A boot draw of size 325 will select about two-thirds of the patients, that is, about 16 cases, and about two-thirds of the controls, that is, about 200 cases. Thus we now have a group ratio of 200/16 = 12.5, not 300/25 = 12. When the groups are both smaller, the discrepancy is more pronounced. When presented with grossly unequal sample sizes, most learning machines can be expected to make this standard mistake: they limit the overall error rate by classifying most cases as members of the larger group, at the expense of the smaller one.
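One standard remedy, used later in this chapter for the stroke data, is to force each tree’s bootstrap sample to be balanced across the groups. Here is a hedged sketch with the randomForest package’s strata and sampsize arguments, on invented data with the group sizes from the example above.

```r
# Sketch of the imbalance problem and one remedy; the data are invented,
# with group sizes 25 and 300 as in the illustration above.
library(randomForest)

set.seed(1)
x <- data.frame(matrix(rnorm(325 * 5), nrow = 325))
y <- factor(c(rep("patient", 25), rep("control", 300)))

# Unbalanced: the forest tends to call nearly everything "control",
# holding down the overall error rate at the patients' expense.
rf_raw <- randomForest(x, y, ntree = 500)
rf_raw$confusion

# Balanced: draw 25 cases from each group for every tree.
rf_bal <- randomForest(x, y, ntree = 500,
                       strata = y, sampsize = c(25, 25))
rf_bal$confusion
```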
Finding subsets in the data using proximities
(1) Some subjects in Group A can be closer to parts of Group B than they are
to other Group A subjects. This suggests some less-than-obvious com-
monality in these subsets. Since each subject is uniquely identified in
these MDS plots, we can go back and study the clinical or biological
record to see why this might occur.
(2) The clustering evidenced in these MDS plots is based on using all the
predictors as chosen by RF in its tree-building activities. If we witness
distinct subsets of subjects in the MDS plots, we can ask: is it possible that the subsets might be better separated by using different lists, that is, sublists of predictors, where the sublists were not necessarily selected as most important by the original forest in its attempt to model the data? Simply put, it is plausible that one subset of subjects might be better separated from other subsets of subjects by using different sets of predictors. Having identified possible subsets of subjects, it might be useful to see if RF can be more accurate as a prediction engine when it is given just that subset upon which to work. This process is an example of cycling from a supervised learning machine, RF, to an unsupervised machine, MDS, and then back to a supervised one, RF, now acting on a different subset (a sketch follows below). Such cycling is not limited to RF and clustering through MDS plots: it is to be encouraged when using machines quite generally.
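A sketch of one pass through this cycle, assuming the invented x and y from the first sketch; the subset rule (Dim 1 > 0) is arbitrary and stands in for whatever the analyst sees in the plot.

```r
# Supervised step: grow a forest and record case-by-case proximities.
library(randomForest)

rf <- randomForest(x, y, ntree = 500, proximity = TRUE)

# Unsupervised step: classical MDS on 1 - proximity as a distance.
mds <- cmdscale(1 - rf$proximity, k = 2)
plot(mds, col = as.integer(y), xlab = "Dim 1", ylab = "Dim 2")

# Back to supervised: refit RF on a subset suggested by the plot.
# The rule "Dim 1 > 0" is purely illustrative.
keep <- mds[, 1] > 0
rf_subset <- randomForest(x[keep, ], factor(y[keep]), ntree = 500)
```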
Applying Random Forests to the Stroke data

Definitions
Groups are defined by the Barthel Index total score after 90 days (v394):
Group 1 = Complete restitution: v394 ≥ 95
Group 0 = Partial restitution: v394 < 95.
Subjects who died were excluded from the analysis.
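In R this construction might look as follows; the data frame stroke and the death indicator died are hypothetical names, since the text does not give them.

```r
# Sketch of the outcome definition; `stroke` and `died` are hypothetical.
stroke <- subset(stroke, died == 0)      # subjects who died are excluded
stroke$group <- factor(ifelse(stroke$v394 >= 95,
                              1,         # complete restitution
                              0))        # partial restitution
```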
The analysis of the training dataset goes as follows.
Training results
The larger group (complete restitution) was randomly sampled without
replacement to a size equal to the smaller group (partial restitution). The
total N analyzed was 1114, with 557 in each group. The analysis file had 38 candidate predictor variables. The analysis included 100 forests, each with 1000 trees. (Software versions: R 2.9.2 (2009-08-24), randomForest 4.5-33.) The mean error rates are given in Table 7.1.

Table 7.1 Mean errors on training data averaged over all forests, ± 2 standard errors

Figure 7.1 Variable importance plots for all 38 features, using training data. The two panels order the features as follows, from most to least important:

Left panel: V324, V605, V310, V297, V299, V296, V211, V298, V295, V291, NEURSG_K, V603, V292, V021, V307, DV103, V306, V020, V293, V300, INTSG_K, V018, V309, V010, DV107, V290, DV102, V025, V294, V059.

Right panel: V605, V324, V310, V295, V297, V299, V296, V307, V603, V298, V305, V021, V020, V306, V291, V294, V300, V211, V010, DV102, V025, V292, V293, NEURSG_K, DV103, V018, V309, DV107, INTSG_K, V290.
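A hedged sketch of this training scheme, continuing the hypothetical stroke frame from above.

```r
# Balance the groups once by downsampling the larger group, then grow
# 100 forests of 1000 trees each; `stroke` holds `group` plus the 38
# candidate predictors.
library(randomForest)

set.seed(7)
stroke$v394 <- NULL                       # drop the outcome source itself
g0 <- which(stroke$group == "0")          # partial restitution
g1 <- which(stroke$group == "1")          # complete restitution (larger group)
balanced <- stroke[c(g0, sample(g1, length(g0))), ]

err <- numeric(100)
imp <- vector("list", 100)
for (k in 1:100) {
  rf <- randomForest(group ~ ., data = balanced,
                     ntree = 1000, importance = TRUE)
  err[k] <- rf$err.rate[rf$ntree, "OOB"]  # OOB error of forest k
  imp[[k]] <- importance(rf, type = 1)    # mean decrease in accuracy
}
mean(err)    # Table 7.1-style mean error over the 100 forests
```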
Next, we use the first forest (in the set of 100) to generate a variable
importance plot, shown in Figure 7.1, and a multidimensional scaling plot
(Figure 7.2).

Figure 7.2 Multidimensional scaling (MDS) plots using the complete set of 38 features for the stroke training data. Light gray = partial restitution (0); dark gray = complete restitution (1).
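The MDS plot itself can be produced directly from a forest grown with proximities; a sketch using the first forest, refit here with proximity = TRUE.

```r
# Figure 7.2-style plot: MDSplot runs cmdscale on 1 - proximity.
rf1 <- randomForest(group ~ ., data = balanced,
                    ntree = 1000, proximity = TRUE)
MDSplot(rf1, balanced$group, k = 2)
```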
Table 7.2 Variable importance rankings for all 38 features standardized to 0–100
scale and averaged over all forests
The variable importances using just the top 10 features are presented in
Figure 7.3 and in Table 7.4.
Note that the ordering of the features in the top 10 list is not precisely the same as in the listing for all 38 features: v310 and v324 are switched, while the arm and leg features are sorted differently as well. This is an example of two processes at work.
Evidently using just the top 10 features results in error drift, from about
23.7% (using all 38 features) to 26.4%. Next, we display the MDS plots for the
top 10 features, Figure 7.4.
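A sketch of the refit on the top 10 features, using the per-forest importances collected in the training sketch above.

```r
# Rank features by importance averaged over the 100 forests, keep the
# top 10, and grow a new forest on just those features.
avg_imp <- rowMeans(do.call(cbind, imp))
top10   <- names(sort(avg_imp, decreasing = TRUE))[1:10]

rf_top10 <- randomForest(x = balanced[, top10], y = balanced$group,
                         ntree = 1000, importance = TRUE)
rf_top10$err.rate[rf_top10$ntree, "OOB"]  # compare with the 38-feature error
varImpPlot(rf_top10)                      # Figure 7.3-style plot
```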
Figure 7.3 Variable importance plots using the top 10 features on the training data. The two panels order the features as follows, from most to least important:

Left panel: V605, V324, V310, V297, V296, V298, V299, V291, V603, V295.

Right panel: V605, V310, V324, V295, V297, V296, V299, V603, V298, V291.
Careful inspection of the MDS plots, those for all 38 features and those for the top 10 features, together with the table of error rates, tells the same story: there is more overlap between the two groups when the smaller feature list is used. That is, the two groups, complete and partial restitution, are harder to separate using only the top 10 features. It is good practice to check that the two methods of presenting the RF results – error table and MDS plot – report similar outcomes.
Finally we apply RF to the Stroke-B data, which we use as a validation
step. For this we use the first forest generated in the set of 100, and all 38
features. The error rates are given in Table 7.5. The first forest resulting
from the training file analysis, using all 38 features, was applied to the
Stroke-B dataset. The predict function in randomForest was invoked, and
prior to testing the prediction, the group sizes in the validation file were equalized, as in the training analysis, by randomly sampling the larger group down to the size of the smaller.
Table 7.4 Variable importance rankings for top 10 features on training data
standardized to 0–100 scale and averaged over all 100 forests
Figure 7.4 Multidimensional scaling (MDS) plots using the top 10 features for the stroke training data (axes: Dim 2 and Dim 3).
Table 7.5 Validation results using all 38 features. Random Forests was
first applied to the Stroke-A data and then tested on the Stroke-B data
Sensitivity: 72.2%
Specificity: 77.8%
Error rate: 25.0%
Precision: 75.0%
Table 7.6 Validation results using the top 10 features. Random Forests
was applied to the Stroke-A data, the top 10 features were identified,
and then this machine was applied to the Stroke-B data
Sensitivity: 68.2%
Specificity: 77.2%
Error rate: 27.3%
Precision: 72.7%
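A sketch of how these validation figures can be computed, assuming a hypothetical data frame strokeB holding the Stroke-B cases with the same columns as the training file, and rf1 the first training forest from the sketches above.

```r
# Apply the trained forest to the held-out Stroke-B cases and tabulate.
pred <- predict(rf1, newdata = strokeB)
tab  <- table(observed = strokeB$group, predicted = pred)

sensitivity <- tab["1", "1"] / sum(tab["1", ])  # complete restitution found
specificity <- tab["0", "0"] / sum(tab["0", ])  # partial restitution found
error_rate  <- 1 - sum(diag(tab)) / sum(tab)
precision   <- tab["1", "1"] / sum(tab[, "1"])
```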
Random Forests in the universe of machines

The random elements of the scheme – the bootstrap draws of cases, the random subsets of features, the search for a best split for each feature at each node, and more – complicate the analysis. On the other hand, recent work of Biau and his many very able collaborators shows that strong progress is being made; see Biau (2010), Biau and Devroye (2010).
Thus, it is known that some simple versions of RF are not Bayes consis-
tent. For example, splitting to purity, such that each terminal node has only
subjects of the same group, is not universally Bayes consistent. A practical
message here is that terminal node sizes should be sufficiently large, and should generally increase as the dataset grows. The code for Random Forests and Random Jungles allows the user to make this choice.
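In the randomForest package that choice is the nodesize argument; a one-line sketch on the hypothetical balanced frame from the earlier sketches:

```r
# Keep terminal nodes from splitting to purity: require at least 25
# subjects per terminal node.
rf_nodes <- randomForest(group ~ ., data = balanced,
                         ntree = 1000, nodesize = 25)
```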
In the other direction, Random Forests can be understood as a layered
nearest neighbor (LNN) scheme; see DGL (problem 11.6), as well as Lin and Jeon (2006).
Notes
(1) Random Forests has undergone several incarnations and is freely avail-
able in the R library.
(a) A starting point is Leo Breiman’s still-maintained site: www.stat.berkeley.edu/~breiman/RandomForests/.
(b) A newer, faster version is Random Jungles, written by Daniel Schwarz, which implements several methods for feature selection and variable importance evaluation. For more on RJ see http://randomjungle.com; still more detail is given in Schwarz et al. (2010).