Ensemble Learning
Works on the principle of correcting the mistakes of the previous learner with the
next learner.
It relies on the intuition that the best possible next model, when combined with the
previous models, minimizes the overall prediction error.
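To make this concrete, here is a minimal sketch of the boosting loop on a toy regression problem (the data, tree depth, and step size are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # toy inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)  # toy targets

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean
trees = []
for _ in range(50):
    residuals = y - prediction          # mistakes of the previous learners
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)              # next learner targets those mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)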
AdaBoost:
- 'Shortcomings' (of previous weak learners) are identified by high-weight data points.
- The trees are usually grown as decision stumps.
- Each classifier has a different weight assigned to the final prediction, based on its performance.

Gradient Boosting:
- 'Shortcomings' (of previous weak learners) are identified by gradients.
- The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
- All classifiers are weighed equally, and their predictive capacity is restricted with a learning rate to increase accuracy.
Quadratic Loss
A gradient descent procedure is used to minimize the loss when adding trees.
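For the Quadratic Loss this gradient step has a simple closed form. A sketch of the standard update, writing $F_m$ for the model after $m$ trees and $\nu$ for the learning rate:

$$L(y, F(x)) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2, \qquad -\frac{\partial L}{\partial F(x)} = y - F(x)$$

The negative gradient is exactly the Residual, so each new tree $h_m$ is fit to the Residuals and the model takes a small step:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$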
Is This Awesome???
✗ No, the model fits the training data too
well.
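One common way to see this overfitting is to track training and validation error as trees are added. A minimal sketch, assuming scikit-learn's GradientBoostingRegressor (staged_predict yields the prediction after each boosting iteration):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)

# Training error keeps falling; validation error flattens or rises again.
for i, (p_tr, p_val) in enumerate(
        zip(model.staged_predict(X_tr), model.staged_predict(X_val)), start=1):
    if i % 100 == 0:
        print(i, mean_squared_error(y_tr, p_tr), mean_squared_error(y_val, p_val))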
Gradient Boosting Step by Step
[Figure: worked table of Observed values, Predictions, and Residuals as Trees are added. Annotation: with each small step, the Residuals get smaller; another step closer.]
Gradient Boosting Step by Step
Build another Tree to predict the new Residuals, and repeat until adding additional Trees does not significantly reduce the size of the Residuals.
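This stopping rule can be automated. A hedged sketch, assuming scikit-learn's built-in early stopping (n_iter_no_change holds out validation_fraction of the data and stops once the score stops improving by tol):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound; early stopping usually ends sooner
    learning_rate=0.1,
    n_iter_no_change=10,      # stop after 10 non-improving iterations
    validation_fraction=0.1,  # internal hold-out used to judge improvement
    tol=1e-4,
)
model.fit(X, y)
print(model.n_estimators_)    # number of Trees actually fit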
[Figure: a Second Tree is added to the Initial Prediction. Initial Prediction: log(odds) = 0.7, giving a Predicted Probability of 0.6667; each Residual is the Observed value minus the Predicted value.]
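A small worked check of these numbers, assuming the 0.7 is the initial log(odds) of the class counts (e.g. 4 people who love the film to 2 who do not):

import numpy as np

log_odds = np.log(4 / 2)                     # ~ 0.7
prob = np.exp(log_odds) / (1 + np.exp(log_odds))
print(round(log_odds, 1), round(prob, 4))    # 0.7 0.6667

# Residual for one person with Observed = 1 (loves the film):
print(round(1 - prob, 4))                    # 0.3333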
[Annotation: a small step in the direction of the Probability that this person Loves Troll 2 (earlier Probability = 0.7).]
Save Predicted Probability
Save the new Predicted Probability, repeat for all the training data.
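This per-sample update takes only a few lines. A minimal sketch with hypothetical arrays log_odds (the current predictions) and tree_output (the new Tree's leaf output for each training sample):

import numpy as np

learning_rate = 0.1
log_odds = np.array([0.7, 0.7, 0.7])       # hypothetical current predictions
tree_output = np.array([1.4, 1.4, -2.1])   # hypothetical leaf outputs

# Take a small step, then save the new Predicted Probability for
# all the training data at once.
log_odds = log_odds + learning_rate * tree_output
prob = 1 / (1 + np.exp(-log_odds))
print(prob)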
log(odds) Prediction that someone Loves Troll 2 = 2.3.
Convert it to a Probability and classify people using the pre-specified threshold of 0.5.
YES!!!
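Checking that final classification in code (the 2.3 log(odds) and the 0.5 threshold come from the slide):

import numpy as np

log_odds = 2.3
prob = 1 / (1 + np.exp(-log_odds))           # ~ 0.91
threshold = 0.5
print("YES" if prob > threshold else "NO")   # YES, since 0.91 > 0.5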
Most of the time, the predictive accuracy of the gradient boosting algorithm is on the higher side.
It provides lots of flexibility: it can optimize different loss functions, and it offers several hyperparameter tuning options that make the function fit very flexible.
It continues improving to minimize all errors. This can overemphasize outliers and cause overfitting; cross-validation must be used to neutralize this.
It is computationally very expensive: it often requires many trees (>1000), which can be time- and memory-intensive.
The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning, as sketched below.
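A hedged sketch of such a grid search, assuming scikit-learn's GridSearchCV (the parameter values are illustrative only):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],   # regularization knobs (subsample, ...) could be added
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,        # cross-validation, as recommended above
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)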
It can perform poorly on high-dimensional sparse data, e.g. bag-of-words features.
Thanks!!!