
Lecture 5

Ensemble Learning

Kazi Shah Nawaz Ripon | Faculty of Computer Sciences 10.02.2021


Why Boosting?
A method of converting weak learners into strong learners.

Works on the principle of improving on the mistakes of the previous learner through the
next learner.

One of the very first boosting algorithms developed was AdaBoost.

Gradient boosting improved upon some of the features of AdaBoost to create a
stronger and more efficient algorithm.



Gradient Boosting
Introduced by Jerome Friedman in 1999.

Gradient Boosting iteratively constructs an ensemble of weak decision tree learners
through boosting.

Used for both classification and regression problems.

It works by building simple (weak) prediction models sequentially, where each
model tries to predict the error left over by the previous model.

It relies on the intuition that the best possible next model, when combined with the
previous models, minimizes the overall prediction error.
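As a rough illustration of this idea (not code from the lecture), here is a minimal Python sketch in which each new tree is fitted to the errors left over by the model so far; the data, tree depth, and learning rate are made-up values for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))              # made-up regression data
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    prediction = np.full_like(y, y.mean())             # start from a constant prediction
    learning_rate, trees = 0.1, []
    for _ in range(100):
        residuals = y - prediction                     # error left over by the model so far
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)  # small correction step
        trees.append(tree)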



Gradient Boosting and AdaBoost
AdaBoost:
'Shortcomings' (of previous weak learners) are identified by high-weight data points.
The trees are usually grown as decision stumps.
Each classifier has a different weight assigned to the final prediction, based on its performance.

Gradient Boost:
'Shortcomings' (of previous weak learners) are identified by gradients.
The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
All classifiers are weighed equally, and their predictive capacity is restricted with a learning rate to increase accuracy.
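The contrast can also be seen in how the two are typically configured in scikit-learn (a hedged sketch; the parameter values are illustrative, not from the lecture):

    from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

    # AdaBoost: the default base learner is a decision stump (depth-1 tree),
    # and each stump gets its own weight in the final vote.
    ada = AdaBoostClassifier(n_estimators=100)

    # Gradient Boosting: deeper trees, all scaled by the same learning rate.
    gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)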



A Golfer Whacking A Golf Ball



Intuition Behind Gradient Boosting



Gradient Boosting
The weak learners are fit in such a way that
each new learner fits the residuals of the
previous step, so that the model improves.
residuals = observed value – predicted value

The final model adds up the result of each step,
and thus a stronger learner is eventually
achieved.



Gradient Boosting
A loss function is used to compute the residuals.

For instance, mean squared error (MSE) for a
regression task, or logarithmic loss (log loss) for
classification tasks.

Existing trees in the model do not change
when a new tree is added.
The added decision tree fits the residuals
of the current model.



Gradient Boosting: Input Requirements
1. A loss function to optimize.

2. A weak learner to make predictions (generally a decision tree).

3. An additive model to add weak learners so as to minimize the loss function (see the sketch below).
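These three ingredients map directly onto the constructor arguments of scikit-learn's implementation; a sketch with illustrative values (the loss name assumes a recent scikit-learn version):

    from sklearn.ensemble import GradientBoostingRegressor

    model = GradientBoostingRegressor(
        loss="squared_error",   # 1. the loss function to optimize
        max_depth=3,            # 2. the weak learner: shallow decision trees
        n_estimators=100,       # 3. the additive model: how many trees get added
        learning_rate=0.1,
    )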



Loss Function
The loss function basically tells how well the algorithm models the data set.

In simple terms, it is the difference between the actual values and the predicted values.

Regression loss functions: Mean Absolute Error (MAE), Mean Squared Error (MSE), Quadratic Loss.

Binary classification loss functions: Binary Cross-Entropy Loss, Hinge Loss.

A gradient descent procedure is used to minimize the loss when adding trees.
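For the squared-error loss used in regression, this connection is direct: with L(y, F(x)) = ½ (y − F(x))², the negative gradient −∂L/∂F(x) = y − F(x) is exactly the observed value minus the predicted value, i.e. the residual that each new tree is trained on.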



Weak Learner
Weak learners are models used sequentially to reduce the errors generated
by the previous models and to return a strong model in the end.

Decision trees are used as the weak learners in the gradient boosting algorithm.



Additive Model
Decision trees are added one at a time (in sequence).

Existing trees in the model are not changed.



Gradient Boosting: Algorithm
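The algorithm figure on this slide is not reproduced in this transcript; for reference, the standard statement of gradient tree boosting (Friedman) runs along these lines, with ν the learning rate:

1. Initialize the model with a constant: F_0(x) = argmin_γ Σ_i L(y_i, γ).
2. For m = 1 to M:
   a. Compute the pseudo-residuals r_im = −[∂L(y_i, F(x_i)) / ∂F(x_i)], evaluated at F = F_(m−1).
   b. Fit a regression tree to the r_im, giving terminal regions (leaves) R_jm.
   c. For each leaf, compute the output γ_jm = argmin_γ Σ_{x_i in R_jm} L(y_i, F_(m−1)(x_i) + γ).
   d. Update F_m(x) = F_(m−1)(x) + ν Σ_j γ_jm · 1(x in R_jm).
3. Output F_M(x).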



Gradient Boosting for Regression



Dataset



Gradient Boosting Step by Step
Calculate the average/mean of the target variable.



Gradient Boosting Step by Step
Build a tree based on the error from the first tree.



Gradient Boosting Step by Step
Errors made by the previous tree:

Observed value – Predicted value



Gradient Boosting Step by Step
Save the difference (Pseudo Residual) to a new column.

Observed value – Predicted value
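A small sketch of these first steps in Python, assuming pandas; the Weight values are illustrative stand-ins, not necessarily the lecture's table:

    import pandas as pd

    df = pd.DataFrame({"Weight": [88, 76, 56, 73, 77, 57]})   # illustrative target values
    initial_prediction = df["Weight"].mean()                  # the first prediction: the average
    df["PseudoResidual"] = df["Weight"] - initial_prediction  # observed value - predicted value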


Gradient Boosting Step by Step
Now, build a tree using Height, Favorite Color, and Gender to predict the residuals.



Gradient Boosting Step by Step
Here is the tree.

In this example, only 4 leaves are allowed.

This restriction ensures there are fewer
leaves than residuals.



Gradient Boosting Step by Step

Replace the residuals with their average:

(-14.2 + (-15.2)) / 2 = -14.7



Gradient Boosting Step by Step

Replace the residuals with their average:

(1.8 + 5.8) / 2 = 3.8



Gradient Boosting Step by Step
Combine the original leaf with the new tree to make a new Prediction of an individual’s
Weight from the Training Data.



Gradient Boosting Step by Step

Predicted Weight = 71.2 + 16.8 = 88

Is This Awesome???
✗ No, the model fits the training data too
well.
Gradient Boosting Step by Step

The Learning Rate is a value between 0 and 1.

In this example, it is 0.1.



Gradient Boosting Step by Step
With the Learning Rate of 0.1, the new Prediction
(71.2 + 0.1 × 16.8 ≈ 72.9) is not as good as it was before.

But it is a little better than the
original Prediction with only one
leaf (71.2).

Scaling the tree by the Learning
Rate results in a small step in the
right direction.

Lots of small steps in the right
direction result in better
Predictions.
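In numbers, using the values from the slides:

    initial_leaf = 71.2          # average Weight, the initial Prediction
    tree_output = 16.8           # this sample's leaf value in the first tree
    learning_rate = 0.1

    new_prediction = initial_leaf + learning_rate * tree_output
    print(new_prediction)        # 72.88, a small step from 71.2 towards the observed 88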



Gradient Boosting Step by Step
Now, build another tree to make another small step in the right direction.
Just like before, we calculate the Pseudo residuals.

Residuals = Observed - Predicted



Gradient Boosting Step by Step
Now, build another tree to make another small step in the right direction.
Just like before, we calculate the Pseudo residuals.

Residuals = Observed – Predicted


= 88 – Predicted
= 88 – (71.2 + 0.1 x 16.8)
= 15.1



Gradient Boosting Step by Step
Repeat for all the samples.
And save it in the column for Pseudo Residuals.



Gradient Boosting Step by Step

The new Residuals are smaller than before.

✓ We are moving in the right direction.



Gradient Boosting Step by Step
Now, build another tree to predict the new Residuals.

(The tree can be different each time.)

New Tree



Gradient Boosting Step by Step
Just like before, multiple samples ended up in these leaves.
Replace the Residuals with their averages.

The averages: -13.2 and 3.4



Gradient Boosting Step by Step
Combine the new Tree with the previous Trees, and the initial Leaf.



Gradient Boosting Step by Step
Make a new Prediction from the Training Data.



Gradient Boosting Step by Step
Just like before:

71.2 + (0.1 × 16.8) + (0.1 × 15.1) = 74.4

Another step closer.


Gradient Boosting Step by Step
Calculate all the New Residuals.



Gradient Boosting Step by Step
Single Leaf     First Tree     Second Tree

With each added Tree, the Residuals get smaller.

Gradient Boosting Step by Step
Build another Tree to predict the new Residuals.

Add it to the chain of Trees that we have already created.

Continue adding Trees until:

The maximum number specified is reached, or

Adding additional Trees does not significantly reduce the size of the Residuals.



Gradient Boosting Step by Step
Predicting a new Weight:

Start with the initial Prediction (Single Leaf) +

Add the scaled value from the first Tree +

Add the scaled value from the second Tree +

etc. …. etc. ….
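Putting the whole regression procedure together, here is a hedged from-scratch sketch; the class and parameter names are mine, not the lecture's, and scikit-learn's DecisionTreeRegressor stands in for the weak learner.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    class SimpleGradientBoostingRegressor:
        """Bare-bones gradient boosting for regression with squared-error loss."""

        def __init__(self, n_trees=100, learning_rate=0.1, max_leaf_nodes=4):
            self.n_trees = n_trees
            self.learning_rate = learning_rate
            self.max_leaf_nodes = max_leaf_nodes

        def fit(self, X, y):
            self.initial_prediction_ = float(np.mean(y))       # the single initial leaf
            prediction = np.full(len(y), self.initial_prediction_)
            self.trees_ = []
            for _ in range(self.n_trees):
                residuals = y - prediction                      # observed - predicted
                tree = DecisionTreeRegressor(max_leaf_nodes=self.max_leaf_nodes)
                tree.fit(X, residuals)                          # the next tree predicts the residuals
                prediction += self.learning_rate * tree.predict(X)
                self.trees_.append(tree)
            return self

        def predict(self, X):
            # initial leaf + the scaled contribution of every tree, as on the slide
            prediction = np.full(np.asarray(X).shape[0], self.initial_prediction_)
            for tree in self.trees_:
                prediction += self.learning_rate * tree.predict(X)
            return prediction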



Gradient Boosting for Classification



Dataset

???



Initial Prediction
The initial Prediction for every individual is the log(odds).

It can be thought of as the Logistic Regression
equivalent of the average.

The overall log(odds) that someone Loves
Troll 2 is
log(4/2) = 0.7

Initial Prediction: 0.7



Initial Prediction
How to use it for Classification?

Convert it to a Probability with a Logistic Function:

Probability of Loving Troll 2 = e^log(odds) / (1 + e^log(odds)) = e^0.6931 / (1 + e^0.6931) = 0.6667 ≈ 0.7

(Up to 4 decimal digits, this is the same as the fraction of people who Love Troll 2, 4/6 = 0.6667.)
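The same conversion in a few lines of Python (numpy assumed; values as on the slide):

    import numpy as np

    log_odds = np.log(4 / 2)                                   # 4 Love Troll 2, 2 do not
    probability = np.exp(log_odds) / (1 + np.exp(log_odds))   # the Logistic Function
    print(round(log_odds, 4), round(probability, 4))           # 0.6931 0.6667 (about 0.7)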



Initial Prediction

Probability of Loving Troll 2 = 0.7

Probability of Loving Troll 2 > 0.5, so
classify everyone in the Training Dataset
as someone who Loves Troll 2.

It is possible to use a Threshold other
than 0.5.

But there are some people who do not Love
Troll 2!!!



How Bad is the Initial Prediction?

Probability of Loving Troll 2 = 0.7 (the Predicted value)

Measure how bad it is by calculating Pseudo Residuals:

Residual = Observed – Predicted



How Bad is the Initial Prediction?

The Predicted value is 0.7 for every individual.



Residuals



Residuals for the Initial Leaf



Building the Tree
Now, build a Tree using Likes Popcorn, Age, and Favourite Color to predict the
Residuals.

The limit on the number of leaves is 3.



Output Values for Leaves
Calculate the Output Values for the leaves.

In Regression, a leaf with a single Residual had
an Output Value equal to that Residual.

In contrast, for Classification, the situation is a
little more complex.



Output Values for Leaves
Calculate the Output Values for the leaves.



Output Values for Leaves
Calculate the Output Values for the leaves.



Output Values for Leaves
Calculate the Output Values for the leaves.

The most common transformation formula:
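In the notation of the earlier slides, the standard transformation for a leaf is:

Output Value = (sum of the Residuals in the leaf) / (sum of [Previous Probability × (1 − Previous Probability)] over those same Residuals)

It is this denominator that makes the Classification case more involved than the Regression case.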



Output Values for Leaves
Previous Probability refers to the Probability from the initial Leaf, as this is the very
first tree.

Output Value = (-0.7) / (0.7 × (1 − 0.7)) = -3.3



Output Values for Leaves
For the other leaf:

Output Value = (0.3 + (-0.7)) / (0.7 × (1 − 0.7) + 0.7 × (1 − 0.7)) ≈ -1



Output Values for Leaves
Try the third leaf……..

???



Update Predictions
Update the Predictions by combining the initial leaf with the new tree (the Learning Rate here is 0.8).

log(odds) Prediction = 0.7 + (0.8 × 1.4)
                     = 1.8
Convert into A Probability
Convert the new log(odds) Prediction into a Probability:

Probability of Loving Troll 2 = e^1.8 / (1 + e^1.8) ≈ 0.9

A small step in the right direction, since this person Loves Troll 2 (the earlier Probability was 0.7).

Save Predicted Probability
Save the new Predicted Probability, repeat for all the training data.



Calculate Residuals
Just like before, calculate the Residuals.

Residual = Observed – Predicted



Calculate Residuals
Just like before, calculate the Residuals.



Build A New Tree
Build a new tree from the new Residuals.



Build A New Tree
Calculate the Output Values for the leaves.



Build A New Tree
Calculate the Output Values for the leaves.

The Output Values are 2, -2, and 0.6.



Combine
Combine everything done so far.



Repeat ……. Until

Continue adding Trees until:

The maximum number specified is reached, or
Adding additional Trees does not
significantly reduce the size of the Residuals.



Classify New Sample

Loves Troll 2? Or does not Love Troll 2?



Do The Math?
log(odds) Prediction that
someone Loves Troll 2 =
0.7 + (0.8 × 1.4) + (0.8 × 0.6)
= 2.3
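The same computation sketched in Python (values from the slides; 0.8 is the Learning Rate):

    import numpy as np

    learning_rate = 0.8
    log_odds = 0.7 + learning_rate * 1.4 + learning_rate * 0.6   # initial leaf + two scaled trees
    probability = np.exp(log_odds) / (1 + np.exp(log_odds))

    print(round(log_odds, 1))                                    # 2.3
    print("Loves Troll 2" if probability > 0.5 else "Does not Love Troll 2")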



Convert into A Probability
Convert the new log(odds) Prediction into a Probability.

log(odds) Prediction that someone Loves Troll 2 = 2.3
Probability = e^2.3 / (1 + e^2.3) ≈ 0.9

Classify people using the pre-specified threshold of 0.5.

0.9 > 0.5, so: YES!!! This person Loves Troll 2.



Gradient Boosting: Pros
Feature normalization is not required.

Feature selection is inherently performed during the learning process.

Most of the time, the predictive accuracy of the gradient boosting algorithm is on the higher side.

It provides lots of flexibility: it can optimize different loss functions and provides
several hyper-parameter tuning options that make the function fit very flexible.

Most of the time, no data pre-processing is required.

Works great with categorical and numerical data.

Handles missing data; missing value imputation is not required (see the sketch below).

Models are relatively easy to interpret.
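As one concrete illustration of the missing-data point (an assumption about tooling, not something from the lecture): scikit-learn's histogram-based gradient boosting accepts missing values directly.

    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor

    X = np.array([[1.0], [2.0], [np.nan], [4.0]])        # unscaled feature with a missing value
    y = np.array([1.1, 1.9, 3.2, 4.1])

    model = HistGradientBoostingRegressor().fit(X, y)    # no imputation or normalization needed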



Gradient Boosting: Cons
Boosting is a sequential process and is not easily parallelizable.

It keeps improving to minimize all errors. This can overemphasize outliers and
cause overfitting; cross-validation must be used to counteract this (see the sketch after this list).

Computationally very expensive: it often requires many trees (>1000), which can be
time and memory exhaustive.

The high flexibility results in many parameters that interact and heavily influence the
behavior of the approach (number of iterations, tree depth, regularization
parameters, etc.). This requires a large grid search during tuning.

Can perform poorly on high-dimensional sparse data, e.g. bag of words.
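One common way to keep the overfitting and tuning burden in check is early stopping on a held-out validation split; a hedged scikit-learn sketch with illustrative parameter values:

    from sklearn.ensemble import GradientBoostingRegressor

    model = GradientBoostingRegressor(
        n_estimators=1000,        # upper bound on the number of trees
        learning_rate=0.05,
        max_depth=3,
        validation_fraction=0.1,  # hold out 10% of the training data
        n_iter_no_change=10,      # stop when 10 consecutive iterations bring no improvement
    )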
Thanks!!!

