
• Pros of DT
  – Highly interpretable, even for non-statisticians.
  – Cognitively (presumably) closer to the process of human decision-making.
  – Handle qualitative predictors efficiently, without the need to create dummy variables.
• Cons of DT
  – Relatively low predictive accuracy compared with other regression and classification models.

7.2 Bagging

Bagging is a model-building method that aims at reducing variance. Ideally, we would build n models on n independent training sets and take the average of their predictions as the final prediction. If each model has variance σ_i^2, the variance of the averaged prediction is σ̄^2/n ≤ σ_i^2. In practice, obtaining n independent training sets is often impractical, so we use bootstrapping to generate B different training sets by sampling with replacement from the single training set we have. The prediction is then made as follows:

    \hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)    (7.8)

For a classification task, we let the B models "vote" with a classification decision and take the majority vote as the final classification. One nice property of bagging is that increasing B does not lead to an increase in variance, i.e. a larger B does not cause overfitting; in practice we use a value of B sufficiently large that the error has settled down.
that the error has settled down. The equivalent of LOOCV cross-validation in bagging is called Out-Of-
Bag (OOB), where each training set is left out in model-building and predictionmaking once as the
test set. The accuracy of the prediction at the round is then tested on it. The overall prediction
accuracy is obtained by averaging over all prediction accuracies recorded. Performing bagging on a
DT comes with the price of reduced interpretability. Specifically, with bagging, it is not clear which
predictor(s) are more important in the process of decision-making. The following method remedy
this problem: • RSS-reduction: Record the decrease in RSS at each split over a given predictor, and
average over all B trees. • Gini-reduction: Same as RSS-reduction, only this time the recorded is Gini
Indices at each split for a predictor.
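The randomForest package reports both the OOB error and these importance measures directly. A brief sketch, reusing the lab's Boston setup below (mtry = 13 means every predictor is considered at each split, i.e. plain bagging):

library(randomForest)
library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

bag.boston <- randomForest(medv ~ ., data = Boston, subset = train,
                           mtry = 13, importance = TRUE)
tail(bag.boston$mse, 1)      # OOB MSE after all 500 (default) trees
head(bag.boston$predicted)   # OOB predictions for the training observations
importance(bag.boston)       # %IncMSE and IncNodePurity per predictor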

7.3 Random Forest

A serious problem that prevents bagging from effectively reducing variance is that the bagged trees are highly correlated when there are a few strong predictors among all the predictors: the trees built from the B bootstrap training sets then end up very similar to each other. Random Forest overcomes this problem by decorrelating the bagged trees with a clever splitting technique: at each split, m predictors are sampled from all p predictors (empirically m ≃ √p gives good performance), and the split is allowed to use only one of these m predictors. Procedurally, RF is otherwise very similar to bagging, with a moderate increase in variance reduction.
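To make the m ≃ √p rule concrete, here is a small sketch on the Boston data used in the lab; the explicit floor(sqrt(p)) choice is illustrative, since randomForest's own default for regression is p/3 and the lab below uses mtry = 6:

library(randomForest)
library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)

p <- ncol(Boston) - 1                        # 13 predictors (medv is the response)
m <- floor(sqrt(p))                          # m ≃ √p, here 3
rf.boston <- randomForest(medv ~ ., data = Boston, subset = train, mtry = m)
yhat.rf <- predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - Boston[-train, "medv"])^2)   # test MSE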
7.4 Boosting

There are two main differences between bagging and boosting:
• Fitting target: boosting fits to the residuals rather than to the response per se.
• Fitting procedure: boosting builds trees sequentially rather than by simultaneous sampling.

The following algorithm describes the boosting method:
• Set f̂(x) = 0 and r_i = y_i for all i in the training set.
• For b = 1, 2, ..., B, repeat:
  – Fit a tree f̂^b with d splits (d + 1 terminal nodes) to the training data (X, r).
  – Update f̂ by adding in a shrunken version of the new tree:

        \hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)    (7.9)

  – Update the residuals:

        r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)    (7.10)

• Output the boosted model:

        \hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)    (7.11)

The idea behind boosting is to learn slowly, improving f̂ in the areas where it does not yet perform well, areas that are in a way highlighted by the residuals. The shrinkage parameter λ (typically 0.001 ≤ λ ≤ 0.01) tunes the process by slowing it down further, which allows a more fine-tuned attack on the residuals with differently shaped trees. The parameter d controls the depth of the subtree added at each iteration (step 2 of the boosting algorithm); usually d = 1 works well, in which case each subtree is a stump.
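The algorithm above can be written out almost literally in R. The sketch below uses rpart stumps (maxdepth = 1, i.e. d = 1) in place of the tree package, and the values B = 1000 and λ = 0.01 are illustrative; the gbm package used in the lab implements the same idea far more efficiently:

library(rpart)       # rpart stumps stand in for the d-split trees
library(MASS)
set.seed(1)
train <- sample(1:nrow(Boston), nrow(Boston) / 2)
X <- Boston[train, setdiff(names(Boston), "medv")]   # predictors only
y <- Boston[train, "medv"]

B <- 1000            # number of boosting iterations (illustrative)
lambda <- 0.01       # shrinkage parameter
r <- y               # step 1: fhat = 0, residuals start equal to the response
fhat <- rep(0, length(y))

for (b in 1:B) {     # step 2: fit stumps sequentially to the current residuals
  stump <- rpart(r ~ ., data = data.frame(X, r = r),
                 control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  pred <- predict(stump, newdata = X)
  fhat <- fhat + lambda * pred     # (7.9): add a shrunken version of the new tree
  r <- r - lambda * pred           # (7.10): update the residuals
}
mean((fhat - y)^2)                 # training MSE of the boosted model, eq. (7.11)

To evaluate on held-out data, one would keep the B stumps and accumulate λ times their predictions on the test set, exactly as in (7.11).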

7.5 Lab Code

# Decision (classification) tree
library(tree)                                  # R version 3.2.3
library(ISLR)
attach(Carseats)
High = as.factor(ifelse(Sales <= 8, "No", "Yes"))   # discretize Sales (threshold 8) into a binary response
Carseats = data.frame(Carseats, High)
tree.carseats = tree(High ~ . - Sales, Carseats)    # DT for High using all predictors except Sales
summary(tree.carseats)                         # number of terminal nodes, residual mean deviance,
                                               # misclassification error rate, predictors actually used
plot(tree.carseats)
text(tree.carseats, pretty = 0)                # plot the tree and print node labels

set.seed(2)
train = sample(1:nrow(Carseats), 200)          # half/half train-test split
Carseats.test = Carseats[-train, ]
High.test = High[-train]
tree.carseats = tree(High ~ . - Sales, Carseats, subset = train)
tree.pred = predict(tree.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
(86 + 57) / 200                                # = 0.715, (No~No + Yes~Yes) / total

set.seed(3)
cv.carseats = cv.tree(tree.carseats, FUN = prune.misclass)
                                               # 10-fold CV by default; reports trees of different
                                               # sizes and the corresponding misclassification
                                               # counts ($dev); the 9-node tree does best
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")   # tree size vs. CV error
plot(cv.carseats$k, cv.carseats$dev, type = "b")      # cost-complexity parameter k vs. CV error

prune.carseats = prune.misclass(tree.carseats, best = 9)
plot(prune.carseats)
text(prune.carseats, pretty = 0)               # plot the best (pruned) tree
tree.pred = predict(prune.carseats, Carseats.test, type = "class")
table(tree.pred, High.test)
(94 + 60) / 200                                # = 0.77, the pruned tree performs better!

# Regression tree
library(MASS)
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston) / 2)
tree.boston = tree(medv ~ ., Boston, subset = train)
summary(tree.boston)                           # number of terminal nodes, residual mean deviance,
                                               # distribution of residuals
plot(tree.boston)
text(tree.boston, pretty = 0)                  # plot the tree
cv.boston = cv.tree(tree.boston)
plot(cv.boston$size, cv.boston$dev, type = "b")
prune.boston = prune.tree(tree.boston, best = 5)
plot(prune.boston)
text(prune.boston, pretty = 0)                 # prune to the best number of nodes
yhat = predict(tree.boston, newdata = Boston[-train, ])
boston.test = Boston[-train, "medv"]
plot(yhat, boston.test)
abline(0, 1)
mean((yhat - boston.test)^2)                   # test MSE

# Bagging and random forest
library(randomForest)
set.seed(1)
bag.boston = randomForest(medv ~ ., data = Boston, subset = train,
                          mtry = 13, importance = TRUE)
                                               # mtry = 13: consider all 13 predictors at each split,
                                               # i.e. bagging (the m = p random forest);
                                               # importance: assess importance of predictors;
                                               # ntree = 500 by default
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])
plot(yhat.bag, boston.test)
abline(0, 1)
mean((yhat.bag - boston.test)^2)               # test MSE greatly reduced compared with the single regression tree

bag.boston = randomForest(medv ~ ., data = Boston, subset = train, mtry = 13, ntree = 25)
yhat.bag = predict(bag.boston, newdata = Boston[-train, ])   # re-predict with the smaller forest
mean((yhat.bag - boston.test)^2)               # MSE increases a bit, with much lower computational cost

set.seed(1)
rf.boston = randomForest(medv ~ ., data = Boston, subset = train,
                         mtry = 6, importance = TRUE)
yhat.rf = predict(rf.boston, newdata = Boston[-train, ])
mean((yhat.rf - boston.test)^2)                # MSE lower than bagging
importance(rf.boston)                          # %IncMSE and node-impurity reduction for each predictor
varImpPlot(rf.boston)                          # plot predictor importance

# Boosting
library(gbm)
set.seed(1)
boost.boston = gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
                   n.trees = 5000, interaction.depth = 4)
summary(boost.boston)                          # relative influence statistics
par(mfrow = c(1, 2))
plot(boost.boston, i = "rm")
plot(boost.boston, i = "lstat")                # partial dependence of the response on the two
                                               # most important predictors
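The notes stop at the partial dependence plots; to mirror the earlier evaluations, one could also compute the boosted model's test MSE (a small addition, not in the original lab code):

yhat.boost = predict(boost.boston, newdata = Boston[-train, ], n.trees = 5000)
mean((yhat.boost - boston.test)^2)             # test MSE of the boosted model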
