Bagging and Random Forest Presentation1
Prepared By:
1. Endale Daba SGS/0005/2011A
2. Birhanu Mesfin SGS/0002/2011A
3. Senait T/markos SGS/0034/2011A
June, 2019
Outline
Introduction
Bagging
Bagging Algorithm
Random Forest
Boosting
AdaBoost
Machine learning studies automatic techniques for learning to make accurate
predictions based on past observations.
Machine learning was defined by Arthur Samuel in 1959 as the field of study that
gives computers the ability to learn without being explicitly programmed.
Supervised machine learning is concerned with learning from labeled data so that a
certain pattern or function can be deduced from that data.
◦ Classification: Majority vote within the region.
◦ Regression: Mean of training data within the region.
◦ CART: Classification and regression trees.
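The two leaf-prediction rules above can be illustrated with a toy example (the labels and values below are made up for illustration):

```python
import statistics

# Hypothetical training points that fell into one region (leaf) of a tree.
leaf_labels = ["spam", "ham", "spam", "spam"]   # classification targets
leaf_values = [2.0, 3.0, 4.0, 7.0]              # regression targets

class_pred = statistics.mode(leaf_labels)   # majority vote -> "spam"
reg_pred = statistics.mean(leaf_values)     # mean response -> 4.0
print(class_pred, reg_pred)
```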
Bootstrap aggregation or bagging is a general-purpose procedure for reducing the variance
of a statistical learning method. It is frequently used in the context of decision trees.
Bagging is short for bootstrap aggregating.
Bootstrapping is a statistical resampling technique that involves randomly sampling a
dataset with replacement. It is also a means of quantifying the uncertainty in a machine
learning model.
The idea is to repeatedly sample data with replacement from the original training set in order
to produce multiple separate training sets.
Bagging seems to work especially well for high variance, low bias procedures such as trees.
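A single bootstrap sample can be drawn in a few lines (the data values here are illustrative):

```python
import random

rng = random.Random(0)                 # fixed seed for reproducibility
data = [1.2, 3.4, 2.2, 5.0, 4.1]       # illustrative original training set

# One bootstrap sample: draw n points *with replacement* from the data,
# so some points appear more than once and others not at all.
bootstrap = [rng.choice(data) for _ in range(len(data))]

# Repeating this B times produces B separate training sets of the same size.
B = 3
training_sets = [[rng.choice(data) for _ in range(len(data))] for _ in range(B)]
print(bootstrap)
```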
Goal
Improve the accuracy of a model by combining multiple copies of it.
We build a separate prediction model using each training set and
average the resulting predictions.
While bagging can improve predictions for many regression methods, it is
particularly useful for decision trees.
Here is how to apply bagging to regression trees:
Construct B regression trees using B bootstrapped training sets.
We then average the predictions.
These trees are grown deep and are not pruned.
Each tree has a high variance with low bias. Averaging the B trees brings down the
variance.
This amounts to combining hundreds or thousands of trees in a single procedure.
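The procedure above can be sketched in a few lines. The depth-1 "stump" base learner below is a simple stand-in for a deep, unpruned regression tree, and the data are made up for illustration:

```python
import random
import statistics

def fit_stump(X, y):
    """Depth-1 regression tree: split at the median of X and predict the
    mean response on each side (a stand-in for a deep, unpruned tree)."""
    t = statistics.median(X)
    left = [yi for xi, yi in zip(X, y) if xi <= t]
    right = [yi for xi, yi in zip(X, y) if xi > t] or left
    lm, rm = statistics.mean(left), statistics.mean(right)
    return lambda x: lm if x <= t else rm

def bagged_predict(X, y, x_new, B=50, seed=0):
    """Bagging: fit B learners on B bootstrap samples, then average."""
    rng = random.Random(seed)
    n = len(X)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]     # sample with replacement
        tree = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        preds.append(tree(x_new))
    return statistics.mean(preds)                      # aggregate by averaging

X = [1, 2, 3, 4, 5, 6]                                 # illustrative data
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
print(bagged_predict(X, y, x_new=5.5))
```

Averaging the B individual predictions is the "aggregating" half of bagging; the per-learner variance largely cancels out in the mean.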
Algorithm 1: Bagging
Random Forest is a supervised learning algorithm.
It provides an improvement over bagged trees by way of a small tweak
that decorrelates the trees. As in bagging, we build a number of decision
trees on bootstrapped training samples.
Random forests are a substantial modification of bagging that builds a large
collection of de-correlated trees and then averages them.
Random Forest is a very flexible and easy-to-use machine learning
algorithm.
Due to its simplicity and the fact that it can be used for both regression
and classification tasks, the RF algorithm is widely used.
The pseudo code for the random forest algorithm can be split into two stages.
• Random forest creation pseudo code.
• Pseudo code to perform prediction from the created random forest
classifier.
Random forest creation pseudo code:
◦ 1. Randomly select “k” features from total “m” features.
Where k < m
◦ 2. Among the “k” features, calculate the node “d” using the best split
point.
◦ 3. Split the node into daughter nodes using the best split.
◦ 4. Repeat 1 to 3 steps until “l” number of nodes has been reached.
◦ 5. Build forest by repeating steps 1 to 4 for “n” number times to
create “n” number of trees.
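Step 1 of the creation pseudo code, the part that distinguishes a random forest from plain bagging, can be sketched as follows (the feature counts are illustrative):

```python
import random

def random_feature_subset(m, k, rng):
    """Step 1: randomly select k of the m features (k < m) to consider at
    the current node's split; redrawing this subset at every split is what
    decorrelates the trees of a random forest."""
    return rng.sample(range(m), k)

rng = random.Random(7)
m, k = 16, 4          # illustrative sizes; k is often about sqrt(m)
print(random_feature_subset(m, k, rng))
```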
Prediction with the trained random forest algorithm uses the pseudo code below:
1. Take the test features, use the rules of each randomly created
decision tree to predict the outcome, and store the predicted
outcome (target).
2. Calculate the votes for each predicted target.
3. Consider the highest-voted predicted target as the final
prediction from the random forest algorithm.
To perform prediction using the trained random forest algorithm, we pass the
test features through the rules of each randomly created tree. Suppose we
formed 100 random decision trees to form the random forest.
Each tree will predict a (possibly different) target (outcome) for the same test
features. Votes are then tallied over the predicted targets.
Suppose the 100 random decision trees predict 3 unique targets x, y, z; then
the votes for x are simply the number of trees, out of 100, whose prediction
is x, and likewise for the other 2 targets (y, z).
Say 60 of the 100 random decision trees predict that the target will be x.
Since x receives the most votes, the random forest returns x as the final
predicted target.
This concept of voting is known as majority voting.
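The majority-voting step itself is a one-liner with a vote counter; the 60/25/15 split below is hypothetical (the source only fixes x at 60 votes):

```python
from collections import Counter

# Hypothetical predictions from 100 trees for one test point.
tree_predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15

votes = Counter(tree_predictions)          # tally votes per target
final, count = votes.most_common(1)[0]     # majority-voted target
print(final, count)                        # -> x 60
```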
Banking: finding loyal customers and detecting fraudulent
customers.
Medicine: identifying the correct combination of components to
validate a medicine. The random forest algorithm is also helpful for
identifying a disease by analyzing the patient's medical records.
Stock market: identifying stock behavior as well as the expected
loss or profit from purchasing a particular stock.
E-commerce: used in a small segment of the recommendation
engine, to identify the likelihood of a customer liking the
recommended products based on similar kinds of customers.
Algorithm 2. Random Forest Algorithm
Random Forest has tremendous potential to become a popular
technique for future classifiers because its performance has been
found to be comparable with that of the ensemble techniques
bagging and boosting.
Unlike bagging, in classical boosting the subset creation is not random; it depends on
the performance of the previous models: every new subset contains the elements that
were (likely to be) misclassified by previous models.
Boosting can reduce variance (the same as bagging), but it can also reduce the high
bias of the weak learner (unlike bagging).
Boosting works by primarily reducing bias in the early stages and primarily
reducing variance in the later stages.
Boosting:
◦ is a sequential ensemble: new models are added that do well where previous models fall short
◦ aims to decrease bias rather than variance
◦ is suitable for low-variance, high-bias models
AdaBoost is used with short decision trees. After the first tree is created, its
performance on each training instance is used to weight how much attention
the next tree to be created should pay to each training instance.
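This reweighting idea can be sketched as a minimal AdaBoost with one-split decision stumps, assuming ±1 labels and a tiny made-up 1-D dataset:

```python
import math

def stump(t, s):
    """Decision stump: predict s for x <= t and -s otherwise (labels are +1/-1)."""
    return lambda x: s if x <= t else -s

def weighted_error(h, w, X, y):
    """Weighted fraction of training points the stump h misclassifies."""
    return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)

def fit_adaboost(X, y, T=5):
    """Each round fits the stump with the lowest *weighted* error, then
    up-weights the points that stump got wrong, so the next stump pays
    more attention to them."""
    n = len(X)
    w = [1.0 / n] * n                          # start with uniform weights
    ensemble = []                              # (alpha, stump) pairs
    thresholds = [min(X) - 1] + sorted(X)      # min(X)-1 yields constant stumps
    for _ in range(T):
        best = min((stump(t, s) for t in thresholds for s in (1, -1)),
                   key=lambda h: weighted_error(h, w, X, y))
        err = min(max(weighted_error(best, w, X, y), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)    # vote weight of this stump
        ensemble.append((alpha, best))
        # Misclassified points get factor exp(+alpha), correct ones exp(-alpha).
        w = [wi * math.exp(-alpha * yi * best(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]                   # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote: sign of the alpha-weighted sum of stumps."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) >= 0 else -1

X = [1, 2, 3, 4, 5, 6]
y = [1, 1, 1, -1, -1, 1]      # the point x=6 breaks any single-stump fit
model = fit_adaboost(X, y, T=5)
print([predict(model, x) for x in X])
```

No single stump classifies this data correctly, but after a few rounds of reweighting the alpha-weighted vote of the stumps does.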
Bagging, random forests and boosting are good methods for improving the
prediction accuracy of trees.
They work by growing many trees on the training data and then combining the
predictions of the resulting ensemble of trees.
The latter two methods, random forests and boosting, are among the state-of-
the-art methods for supervised learning.
Combining multiple learners has been a popular topic in machine learning
since the early 1990s and research has been going on ever since.
Thank You!!