Building many trees
1. Each tree is built from a bagged (bootstrapped) sample of the data
2. Each tree can split on only a random subset of the variables
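The tree-building loop of a random forest can be sketched in plain Python (the course itself uses R's randomForest package; every name below is illustrative, not a library API). Each tree is grown on its own bagged sample and restricted to a random subset of the variables:

```python
import random

def build_forest(data, features, ntree, m, seed=0):
    """Sketch of the forest-building loop: every tree gets its own
    bagged sample, and its splits may use only m randomly chosen
    variables. (A real implementation re-draws the variable subset
    at each split; one draw per tree keeps the sketch short.)"""
    rng = random.Random(seed)
    forest = []
    for _ in range(ntree):
        n = len(data)
        sample = [data[rng.randrange(n)] for _ in range(n)]  # with replacement
        allowed = rng.sample(features, m)                    # variable subset
        forest.append((sample, allowed))
    return forest

forest = build_forest(list(range(100)), ["x1", "x2", "x3", "x4"], ntree=10, m=2)
```

Because each bagged sample is drawn with replacement, different trees see different (overlapping) subsets of the observations, which is what makes averaging their predictions useful.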
Random Forest parameters
•Minimum number of observations in a subset (like minbucket in CART)
•In R, this is controlled by the parameter nodesize
•A smaller value of nodesize leads to bigger trees, which take longer to build in R
•Random Forests is much more computationally intensive than CART.
•Number of trees
•In R, this is the parameter ntree
•Should not be too small, because the bagging procedure may then miss some observations entirely
•More trees take longer to build
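A quick stdlib-Python simulation (the helper name is hypothetical) of why ntree should not be too small: any single bagged sample misses roughly 1/e ≈ 37% of the observations, so with very few trees some observations are never used at all.

```python
import random

def fraction_never_sampled(n, ntree, seed=0):
    """Estimated fraction of the n observations that appear in none
    of the ntree bootstrap samples, i.e. that no tree trains on."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(ntree):
        for _ in range(n):          # one bootstrap sample of size n
            seen.add(rng.randrange(n))
    return 1 - len(seen) / n

few = fraction_never_sampled(1000, ntree=1)    # roughly 1/e of them missed
many = fraction_never_sampled(1000, ntree=10)  # almost none missed
```

With more trees the missed fraction shrinks geometrically (about 0.37 to the power of ntree), which is why a moderate ntree already covers essentially every observation.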
•In CART, the value of minbucket can affect the model’s out-of-sample accuracy
•if minbucket is too small, over-fitting might occur
•if minbucket is too large, the model might be too simple
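One way to see this trade-off is to count how many candidate split points a given minbucket even admits; a stdlib-Python sketch (function name hypothetical, not rpart's API):

```python
def admissible_splits(n, minbucket):
    """For n ordered observations, count the split points that leave
    at least minbucket observations in each resulting bucket."""
    return sum(1 for i in range(1, n)
               if i >= minbucket and n - i >= minbucket)

# Small minbucket permits almost every split (risk of over-fitting);
# large minbucket forbids most splits (model may be too simple).
```

For example, with n = 100, minbucket = 1 admits 99 split points, minbucket = 25 admits 51, and minbucket = 60 admits none.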
•Use k−1 folds to estimate a model, and test model on remaining one fold
("validation set") for each candidate parameter value
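The k-fold procedure can be sketched in plain Python (the `score` callback stands in for fitting and evaluating a CART model with a given parameter value; all names are illustrative):

```python
def kfold_indices(n, k):
    """Partition observation indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(n, k, candidates, score):
    """For each candidate parameter value, estimate the model on k-1
    folds and test it on the held-out fold ("validation set"), then
    average the k scores."""
    folds = kfold_indices(n, k)
    results = {}
    for value in candidates:
        fold_scores = []
        for j in range(k):
            valid = folds[j]
            train = [i for jj, f in enumerate(folds) if jj != j for i in f]
            fold_scores.append(score(train, valid, value))
        results[value] = sum(fold_scores) / k
    return results
```

The parameter value with the best average validation score is then used to refit the model on all the training data.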
•Like Adjusted R²
•Measures the trade-off between model complexity and accuracy on the training set
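A simplified sketch of that trade-off (this mirrors the idea behind a per-leaf complexity penalty, not rpart's exact cp formula; the subtree numbers are made up):

```python
def penalized_error(train_error, num_leaves, cp):
    """Training-set error plus a complexity charge per leaf: like
    Adjusted R^2, fit is rewarded but extra complexity is penalized."""
    return train_error + cp * num_leaves

# Hypothetical candidate subtrees: (training error, number of leaves).
subtrees = [(0.30, 2), (0.22, 5), (0.20, 12)]
best = min(subtrees, key=lambda t: penalized_error(t[0], t[1], cp=0.01))
# A positive penalty prefers the mid-sized tree over the biggest one,
# even though the biggest tree fits the training set best.
```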
An unusual property of the CART trees that Martin and his colleagues developed:
they use the predictions of some trees as independent variables for other trees. (The
first split is whether or not Justice Ginsburg's predicted decision is liberal.)
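That dependency can be sketched as feature augmentation (the stub predictor and all names here are illustrative, not Martin et al.'s actual model):

```python
def augmented_features(case, peer_trees):
    """Append the other justices' predicted votes to a case's own
    features, so a tree can split on those predictions."""
    return case + [tree(case) for tree in peer_trees]

# Stub standing in for Justice Ginsburg's tree:
# 1 = predicted liberal decision, 0 = predicted conservative.
ginsburg = lambda case: 1 if case[0] > 0 else 0

x = augmented_features([0.4, 1.0], [ginsburg])
# The last variable of x is Ginsburg's predicted decision, so another
# tree's first split can test whether that prediction is liberal.
```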
•Experts were asked to predict only within their area of expertise; more than one
expert was assigned to each case
•A model that predicts these decisions is both more accurate and faster
than experts
•A CART model based on very high-level details of a case beats experts
who can process much more detailed and complex information