
Random Forests

classification, variable selection and consistency
Mikhail Traskin

University of Pennsylvania
The Wharton School
Department of Statistics

Random Forests, Stat 900, November 26, 2007 – p. 1/26


Random Forests
Ensemble classification (and regression) algorithm
Proposed by Leo Breiman in 1999
Easy to implement
Very effective in applications, with good generalization properties
The algorithm outputs more information than just the class label



Breiman’s Experiments

Test set error (%)

Dataset         AdaBoost   RF     Time ratio
Votes           4.8        4.1    N/A
German credit   23.5       24.4   N/A
Letters         3.4        3.5    N/A
Sat-images      8.8        8.6    N/A
Zip-code        6.2        6.3    0.025
Waveform        17.8       17.2   N/A
Twonorm         4.9        3.9    N/A



Classification or Regression Problem
We are given
$S_n = \{(X_i, Y_i)\}_{i=1}^n$, a set of i.i.d. observations distributed as $P$.
$X_i \in \mathcal{X}$: predictors
$Y_i \in \mathcal{Y}$: responses
Goal: find $f_n = A(S_n)$ such that $E(\ell(f_n(X), Y))$ is minimized.



Abstract Definition
Breiman (2001) defines a random forest as follows.
Definition 1 A random forest is a classifier consisting of a
collection of tree-structured classifiers $\{h(x, \Theta_k),\ k = 1, \ldots\}$,
where the $\Theta_k$ are independent identically distributed
random vectors and each tree casts a unit vote for the
most popular class at input $x$.



The Random Forests Algorithm
1. Choose T, the number of trees to grow.
2. Choose m, the number of variables used to split each node, with
m ≪ M, where M is the number of input variables. m is held
constant while growing the forest.
3. Grow T trees. When growing each tree do the following.
(a) Construct a bootstrap sample of size n, drawn from $S_n$ with
replacement, and grow a tree from this bootstrap sample.
(b) At each node, select m variables at random and use them to
find the best split.
(c) Grow the tree to its maximal extent. There is no pruning.
4. To classify a point X, collect votes from every tree in the forest and
use majority voting to decide on the class label.
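
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Breiman's reference implementation: splits are scored by weighted Gini impurity, a node whose m sampled variables are all constant becomes a majority-vote leaf, class labels must not be tuples, and the names `grow_tree`, `random_forest` and `predict_forest` are made up for this sketch.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Best (variable, threshold) among `features` by weighted Gini, or None."""
    best, best_score = None, float("inf")
    for j in features:
        values = sorted(set(row[j] for row in X))
        for t in values[:-1]:  # split "x[j] <= t"; both sides are non-empty
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def grow_tree(X, y, m, rng):
    """Step 3(b)-(c): unpruned tree, m randomly chosen variables per node."""
    if len(set(y)) == 1:
        return y[0]  # pure leaf
    split = best_split(X, y, rng.sample(range(len(X[0])), m))
    if split is None:  # assumption of this sketch: fall back to a majority leaf
        return Counter(y).most_common(1)[0][0]
    j, t = split
    left = [i for i, row in enumerate(X) if row[j] <= t]
    right = [i for i, row in enumerate(X) if row[j] > t]
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, rng))

def predict_tree(tree, x):
    """Follow the splits until a leaf (a bare class label) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree

def random_forest(X, y, T, m, seed=0):
    """Steps 1-3: grow T trees, each on a bootstrap sample of size n."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]  # step 3(a)
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, x):
    """Step 4: majority vote over all trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

With M = 2 input variables one might call `random_forest(X, y, T=25, m=1)` and classify a new point with `predict_forest(forest, x)`.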



Compare to: Bagging
Breiman, 1996
Works with any classification algorithm
Like Random Forests, uses bootstrapping
Treats the underlying classification algorithm as a
"black box"
Variance reduction technique



Compare to: Random Split Selection
Dietterich, 2000
Grow multiple trees
When splitting, choose split uniformly at random from
K best splits
Can be used with or without pruning
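
The randomized step of this method can be sketched as follows. The function name `random_split` and the convention that a lower score means a better split are assumptions made for illustration; the slide does not say how splits are scored.

```python
import random

def random_split(candidates, K, rng):
    """Pick one split uniformly at random from the K best candidates.

    candidates: list of (score, split) pairs; lower score = better split
    (an assumption of this sketch).
    """
    best_k = sorted(candidates, key=lambda c: c[0])[:K]
    return rng.choice(best_k)[1]
```

With K = 1 this reduces to the usual deterministic choice of the single best split.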



Compare to: Random Subspace
Ho, 1998
Grow multiple trees
Each tree is grown using a fixed subset of variables
Do a majority vote or averaging to combine votes from
different trees



RF and Error Estimation
1. For each pair $(x_i, y_i)$ in the training sample
(a) Select only the trees whose bootstrap sample does not contain the pair
(b) Classify the pair with each of the selected trees
(c) Compute the misclassification rate for the pair
2. Average the computed estimates
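
A minimal sketch of this out-of-bag estimate, following the two steps literally. To keep the example short and self-contained, the base learner is a 1-nearest-neighbour rule on scalar inputs standing in for a fully grown tree; that stand-in and the names `fit_1nn` and `oob_error` are assumptions of the sketch. A common variant instead takes a majority vote over the selected trees for each pair before averaging.

```python
import random

def fit_1nn(Xb, yb):
    """Stand-in base learner (an assumption): 1-nearest neighbour on scalars."""
    def predict(x):
        i = min(range(len(Xb)), key=lambda k: abs(Xb[k] - x))
        return yb[i]
    return predict

def oob_error(X, y, fit, T, seed=0):
    """Average, over training pairs, of the error rate of the models
    whose bootstrap sample did not contain the pair."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append((set(idx), fit([X[i] for i in idx], [y[i] for i in idx])))
    rates = []
    for i in range(n):
        # Step 1: keep only the models that never saw pair i.
        oob = [predict for in_bag, predict in models if i not in in_bag]
        if oob:  # skip pairs contained in every bootstrap sample
            wrong = sum(predict(X[i]) != y[i] for predict in oob)
            rates.append(wrong / len(oob))
    # Step 2: average the per-pair misclassification rates.
    return sum(rates) / len(rates)
```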



RF and Variable Selection
1. For each tree in the forest
(a) Classify the out-of-bag cases and count the number of
correct votes
(b) Permute the values of variable m in the out-of-bag sample
(c) Classify the permuted out-of-bag sample and count the
number of correct votes
(d) Compute the difference between the unpermuted
and permuted counts
2. Compute the average and sd of the differences over the trees
3. Compute the z-statistic
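
The procedure can be sketched as below. The slide does not spell out the z-statistic; this sketch assumes the mean difference divided by its standard error, sd / sqrt(T). The function name `permutation_z` and the input layout (one `(predict, X_oob, y_oob)` triple per tree, with rows stored as lists) are also assumptions.

```python
import math
import random

def permutation_z(trees, var, seed=0):
    """Return (mean, sd, z) of the per-tree drop in correct out-of-bag votes
    after permuting variable `var`. Needs at least two trees."""
    rng = random.Random(seed)
    diffs = []
    for predict, X_oob, y_oob in trees:
        correct = sum(predict(x) == yi for x, yi in zip(X_oob, y_oob))
        column = [x[var] for x in X_oob]
        rng.shuffle(column)  # permute the variable among out-of-bag cases
        X_perm = [x[:var] + [v] + x[var + 1:] for x, v in zip(X_oob, column)]
        permuted = sum(predict(x) == yi for x, yi in zip(X_perm, y_oob))
        diffs.append(correct - permuted)
    T = len(diffs)
    mean = sum(diffs) / T
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (T - 1))
    if sd == 0:
        z = 0.0 if mean == 0 else math.copysign(float("inf"), mean)
    else:
        z = mean / (sd / math.sqrt(T))  # assumed form of the z-statistic
    return mean, sd, z
```

A variable that the trees never use leaves every count unchanged, so its mean difference and z-statistic are zero.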



RF and Interactions
Compute the Gini importance of each variable
Rank the Gini importance scores within each tree
For each pair of variables, compute the average rank
difference over all trees



Unsupervised Learning
(Dis)similarity measure
For each tree, put the whole training sample down the tree
For each pair of observations, compute the fraction of trees
$s_{ij}$ in which they end up in the same node
Compute the dissimilarity as $d_{ij} = \sqrt{1 - s_{ij}}$
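
A small sketch of this computation, assuming the forest has already been applied and that `leaves[t][i]` holds the identifier of the terminal node reached by observation i in tree t. That input layout and the name `rf_dissimilarity` are assumptions of the sketch.

```python
import math

def rf_dissimilarity(leaves):
    """leaves[t][i]: terminal node of observation i in tree t.
    Returns the n x n matrix d[i][j] = sqrt(1 - s_ij)."""
    T, n = len(leaves), len(leaves[0])
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # s_ij: fraction of trees in which i and j share a terminal node
            s_ij = sum(leaves[t][i] == leaves[t][j] for t in range(T)) / T
            d[i][j] = d[j][i] = math.sqrt(1.0 - s_ij)
    return d
```

For example, two observations that share a terminal node in 3 of 4 trees have $s_{ij} = 0.75$ and $d_{ij} = 0.5$.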



Unsupervised Learning
Synthetic datasets
Mark observed data as “observed”
Generate a synthetic sample from the product of the
marginal distributions of the observed data
Mark generated data as “unobserved”
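
One way to sample from the product of the marginals is to draw each variable independently, with replacement, from its observed values. The sketch below does exactly that. The function names, and the choice to generate as many synthetic rows as observed rows, are assumptions.

```python
import random

def synthetic_from_marginals(X, seed=0):
    """Draw each variable independently from its empirical marginal,
    which destroys any dependence between the variables."""
    rng = random.Random(seed)
    n, M = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(M)]
    return [[rng.choice(columns[j]) for j in range(M)] for _ in range(n)]

def labelled_two_class(X, seed=0):
    """Stack observed and synthetic rows with their class marks."""
    synth = synthetic_from_marginals(X, seed)
    data = X + synth
    labels = ["observed"] * len(X) + ["unobserved"] * len(synth)
    return data, labels
```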



Unsupervised Learning
Clustering
Train a random forest on the combined data (observed vs. synthetic)
Use the forest to compute the dissimilarity measure
only for the observed data
Use any clustering algorithm with the computed
dissimilarity measure



Universal Consistency
Assume i.i.d. data $(X, Y)$, $S_n = \{(X_i, Y_i)\}_{i=1}^n$ from
$\mathcal{X} \times \mathcal{Y}$, with $\mathcal{Y} = \{-1, 1\}$.
Consider a method $f_n = A(S_n)$, for example
$f_n = \mathrm{AdaBoost}(S_n, t_n)$.
Definition 2 A method is universally consistent if for any
distribution $P$
$$L(f_n) \xrightarrow{\text{a.s.}} L^*,$$
where $L$ is the risk and $L^*$ is the Bayes risk:
$$L(f_n) = P(\operatorname{sign}(f_n(X)) \neq Y \mid S_n), \qquad L^* = \inf_f L(f).$$



Is Random Forests Consistent?
Breiman (2001) wrote:
Section 2 gives some theoretical background for random
forests. Use of the Strong Law of Large Numbers shows
that they always converge so that overfitting is not a
problem.
···
This result explains why random forests do not overfit as
more trees are added, but produce a limiting value of the
generalization error.



One-Dimensional Case
Theorem 3 Consider a binary classification problem. If
$\mathcal{X} = \mathbb{R}$, then the classification Random Forests algorithm is
equivalent to the 1-nearest neighbor classifier and hence is not
consistent.
Theorem 4 Consider a binary classification problem. If
$\mathcal{X} = \mathbb{R}$ and the bootstrap sample size $k \to \infty$ such that $k = o(n)$, then
the classification Random Forests algorithm is consistent.



One-Dimensional Case
$\mathcal{X} = [0, 1]$, $\eta(x) = P(Y = 1 \mid x) = 0.25 + 0.5\, I_{\{x \ge 0.5\}}$,
$L_{1NN} = 0.375$
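
The value 0.375 can be checked by hand, assuming $L_{1NN}$ denotes the asymptotic risk of the 1-nearest neighbor rule, which by the classical Cover and Hart result equals $E[2\eta(X)(1 - \eta(X))]$. Here $\eta(X)$ takes only the values 0.25 and 0.75, so neither expectation below depends on the distribution of $X$ on $[0, 1]$:

```latex
\begin{align*}
L_{1NN} &= \mathbb{E}\bigl[2\,\eta(X)\,(1 - \eta(X))\bigr] = 2 \cdot 0.25 \cdot 0.75 = 0.375, \\
L^*     &= \mathbb{E}\bigl[\min\{\eta(X),\, 1 - \eta(X)\}\bigr] = 0.25 .
\end{align*}
```

The limiting 1-NN risk therefore exceeds the Bayes risk by 0.125 in this example.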



One-Dimensional Case



Two-Dimensional Case



Two-Dimensional Case



Four-Dimensional Case



Eight-Dimensional Case



Four-Dimensional Case
Decision boundary: hyperplane



Other versions of ensemble classifiers
Biau et al. (2007)
Consistency of the purely random forest
Consistency of bagged nearest neighbor rules
Consistency of forests of trees based on
partitioning the space into nested rectangles
