
Concept of Ensembling:

• Bagging
• Boosting
• Random Forest
• Cross validation

• An ensemble technique combines multiple models to increase overall accuracy or performance.
• Ensembles are machine learning methods for combining the predictions of multiple separate models.
• The idea is to create a diverse set of models for prediction rather than relying on just one.
• The central motivation is rooted in the belief that a committee of experts working together can perform better than a single expert.

Simple techniques..
• Averaging
• Max voting
• Weighted Average

Averaging

Max voting

Weighted Average
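
A minimal sketch of these three combination rules, assuming three hypothetical models whose predictions are already available as NumPy arrays; the probabilities and weights below are illustrative, not from the slides.

```python
import numpy as np

# Hypothetical positive-class probabilities from three classifiers for four samples
p1 = np.array([0.6, 0.2, 0.8, 0.4])
p2 = np.array([0.7, 0.3, 0.6, 0.5])
p3 = np.array([0.5, 0.1, 0.9, 0.3])

# Averaging: simple mean of the predicted probabilities
avg = (p1 + p2 + p3) / 3

# Weighted average: weight each model by an assumed measure of its quality
w = np.array([0.5, 0.3, 0.2])          # assumed weights, summing to 1
weighted = w[0] * p1 + w[1] * p2 + w[2] * p3

# Max voting: each model casts a hard class label; the majority label wins
votes = np.vstack([(p1 > 0.5), (p2 > 0.5), (p3 > 0.5)]).astype(int)
majority = (votes.sum(axis=0) >= 2).astype(int)

print(avg, weighted, majority)
```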

Advanced ensemble techniques:

• Bagging & Boosting


Bagging (Bootstrap aggregating)
• Bagging is a technique for merging the outputs of several models to obtain a final result. However, if the different models are all fed the same input data, there is a high probability that they will produce the same results.
• This problem can be mitigated by a technique known as bootstrapping.
• The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.
• It re-samples a dataset with replacement to create new samples of observations.
• A bootstrap sample is the same size as the original dataset. As a result, some observations are represented multiple times in the bootstrap sample while others are not selected at all (see the sketch below).
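
A minimal sketch of bootstrapping with NumPy, assuming a small made-up dataset chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([2.3, 4.1, 3.8, 5.0, 4.4, 3.2, 4.9, 2.8])

# Draw one bootstrap sample: same size as the original, sampled with replacement,
# so some observations appear multiple times and others not at all
sample = rng.choice(data, size=len(data), replace=True)
print("bootstrap sample:", sample)

# Estimate a population quantity (here the mean) by averaging over many bootstrap samples
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]
print("bootstrap estimate of the mean:", np.mean(boot_means))
```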

Boosting..
• Boosting is a sequential process in which each new model tries to minimize the prediction errors made by the previous model.
• This method differs from bagging in that each succeeding model depends on the previous one.
• Various subsets are created from the original dataset. At first, all data points are given equal weights and a first model is trained on one subset of the data.
• The incorrectly predicted data points are then given higher weights, and a second model is trained on the re-weighted data.
• Models are trained sequentially, each one reducing the errors of the last. The final model is a strong learner whose prediction is the weighted mean of all the weak learners (see the sketch below).
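
A hedged sketch of this idea using scikit-learn's AdaBoostClassifier (one common boosting implementation); the synthetic dataset and parameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each successive weak learner focuses on the samples the previous ones got wrong;
# the final prediction is a weighted combination of all the weak learners
booster = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
booster.fit(X_train, y_train)
print("test accuracy:", booster.score(X_test, y_test))
```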
Bagging..

[Figure: initial dataset → bootstrap samples → weak learners → ensemble model]
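
A minimal sketch of this pipeline using scikit-learn's BaggingClassifier (which bags decision trees by default); the dataset and parameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each weak learner (a decision tree by default) is trained on its own bootstrap
# sample, and their predictions are aggregated by voting into the ensemble output
bagger = BaggingClassifier(n_estimators=20, bootstrap=True, random_state=0)
bagger.fit(X, y)
print("training accuracy:", bagger.score(X, y))
```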
Random Forest..

[Figure: initial dataset → bootstrap samples + selected features → deep trees → ensemble model]
Random Forest..
• Random Forest is a bagging method with decision trees as the weak learners. Each tree is fitted on a bootstrap sample, considering only a randomly chosen subset of the variables.
• Because of this attribute sampling, the trees do not all look at exactly the same information, which reduces the correlation between their outputs and makes the random forest more robust to missing data.
• RF combines the concepts of bagging and random feature-subspace selection to create more robust models (see the sketch below).
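
A hedged sketch of a random forest with scikit-learn, on an illustrative synthetic dataset; the parameter values are assumptions, with max_features standing in for the random feature-subspace selection described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative synthetic dataset with 20 features
X, y = make_classification(n_samples=500, n_features=20, n_informative=8, random_state=0)

# Each tree is fitted on a bootstrap sample and considers only a random subset of
# the features (sqrt(20) ≈ 4 here), which decorrelates the trees
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Feature importances can be inspected to support feature selection
print(forest.feature_importances_.round(3))
```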

Random Forest
Advantages
• It can be used for both classification and regression tasks.
• It is among the most accurate algorithms available because of the number of decision trees taking part in the process.
• Random Forest is far less prone to overfitting than a single decision tree.
• It ranks features by relative importance and therefore helps with feature selection.

Disadvantages
• The Random Forest algorithm is slow compared to others because it computes a prediction from every decision tree on every sub-sample and then aggregates them by voting, which is time-consuming.
• It is a more complex model.

Random Forest ~ Decision Tree


• Random Forest is a collection of multiple decision trees.
• Decision trees are computationally faster than Random Forest.
• Random Forest is far less prone to overfitting, while a single decision tree overfits easily.
Boosting..

[Figure: the boosting loop — train a weak model and aggregate it into the ensemble model, then update the training-dataset weights based on the results of the current ensemble model.]
Bagging vs Boosting: Comparison

Similarities:
• Both use voting to combine predictions.
• Both combine models of the same type.
• Both provide high model scalability.

Differences:
• Bagging: individual models are built separately (parallel ensembling). Boosting: each new model is influenced by the performance of the previous model (sequential ensembling).
• Bagging: equal weight is given to all models. Boosting: a model's contribution is weighted by its performance.
• Bagging: random sampling with replacement. Boosting: random sampling with replacement over weighted data.
• Bagging: reduces variance and solves the problem of overfitting. Boosting: reduces bias but is more prone to overfitting, which can be further reduced by parameter tuning.

Examples:
• Bagging: Random Forest. Boosting: AdaBoost, Gradient Boosting, XGBoost.
Cross Validation
• Cross-validation is a technique for evaluating ML models: several models are trained on subsets of the input dataset and evaluated on the complementary subset of the data.
• It is used to detect overfitting, i.e., failing to generalize a pattern.

Example: k-fold cross-validation

k-fold cross-validation..
• Split the input data into k subsets, known as folds.
• Train an ML model on all but one (k-1) of the folds, and then evaluate the model on the fold that was not used for training.
• Repeat this process k times, with a different fold reserved for evaluation (and excluded from training) each time (see the sketch below).

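A minimal sketch of this procedure with scikit-learn's KFold; the model and dataset are assumptions chosen only to illustrate the loop.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Illustrative synthetic dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, eval_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[eval_idx], y[eval_idx]))

print("per-fold accuracy:", scores)
```
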
4-fold cross validation..

[Figure: the dataset split into four folds (1, 2, 3, 4); in each round, one fold is the evaluation set and the complement (the remaining three folds) is the training set.]
• Model one uses the first 25 percent of data for
evaluation, and the remaining 75 percent for
training. Model two uses the second subset of
25 percent (25 percent to 50 percent) for
evaluation, and the remaining three subsets of
the data for training, and so on.
• For example, in a binary classification problem, each of the four evaluations reports an area under the curve (AUC) metric. The overall performance can be measured by computing the average of the four AUC metrics (see the sketch below).
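
A hedged sketch of this 4-fold AUC evaluation using scikit-learn's cross_val_score; the classifier and dataset are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative synthetic binary-classification dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# One AUC value per fold; the overall performance is their average
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=4, scoring="roc_auc")
print("per-fold AUC:", aucs)
print("mean AUC:", aucs.mean())
```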
