Ensemble Learning
Works on the principle of correcting the mistakes of the previous learner with the
next learner.
It relies on the intuition that the best possible next model, when combined with the
previous models, minimizes the overall prediction error.
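To make this concrete, here is a minimal sketch of the boosting loop on a toy regression problem (the data, tree depth, and step size are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # toy inputs
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)  # toy targets

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from the mean
trees = []
for _ in range(50):
    residuals = y - prediction          # mistakes of the previous learners
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)              # next learner targets those mistakes
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)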
AdaBoost:
- 'Shortcomings' (of previous weak learners) are identified by high-weight data points.
- The trees are usually grown as decision stumps.
- Each classifier has a different weight assigned to the final prediction, based on its performance.

Gradient Boosting:
- 'Shortcomings' (of previous weak learners) are identified by gradients.
- The trees are grown to a greater depth, usually ranging from 8 to 32 terminal nodes.
- All classifiers are weighed equally, and their predictive capacity is restricted with a learning rate to increase accuracy.
Quadratic Loss
A gradient descent procedure is used to minimize the loss when adding trees.
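For the Quadratic Loss this gradient step has a simple closed form. A sketch of the standard update, writing $F_m$ for the model after $m$ trees and $\nu$ for the learning rate:

$$L(y, F(x)) = \tfrac{1}{2}\bigl(y - F(x)\bigr)^2, \qquad -\frac{\partial L}{\partial F(x)} = y - F(x)$$

The negative gradient is exactly the Residual, so each new tree $h_m$ is fit to the Residuals and the model takes a small step:

$$F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$$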
Is This Awesome???
✗ No, the model fits the training data too
well.
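One common way to see this overfitting is to track training and validation error as trees are added. A minimal sketch, assuming scikit-learn's GradientBoostingRegressor (staged_predict yields the prediction after each boosting iteration):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.1, max_depth=3)
model.fit(X_tr, y_tr)

# Training error keeps falling; validation error flattens or rises again.
for i, (p_tr, p_val) in enumerate(
        zip(model.staged_predict(X_tr), model.staged_predict(X_val)), start=1):
    if i % 100 == 0:
        print(i, mean_squared_error(y_tr, p_tr), mean_squared_error(y_val, p_val))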
Gradient Boosting Step by Step
[Figure: worked table of Observed values, Predictions, and Residuals as Trees are added. Annotation: with each small step, the Residuals get smaller; another step closer.]
Gradient Boosting Step by Step
Build another Tree to predict the new Residuals, and repeat until adding additional Trees does not significantly reduce the size of the Residuals.
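This stopping rule can be automated. A hedged sketch, assuming scikit-learn's built-in early stopping (n_iter_no_change holds out validation_fraction of the data and stops once the score stops improving by tol):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound; early stopping usually ends sooner
    learning_rate=0.1,
    n_iter_no_change=10,      # stop after 10 non-improving iterations
    validation_fraction=0.1,  # internal hold-out used to judge improvement
    tol=1e-4,
)
model.fit(X, y)
print(model.n_estimators_)    # number of Trees actually fit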
[Figure: a Second Tree is added to the Initial Prediction. Initial Prediction: log(odds) = 0.7, giving a Predicted Probability of 0.6667; each Residual is the Observed value minus the Predicted value.]
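A small worked check of these numbers, assuming the 0.7 is the initial log(odds) of the class counts (e.g. 4 people who love the film to 2 who do not):

import numpy as np

log_odds = np.log(4 / 2)                     # ~ 0.7
prob = np.exp(log_odds) / (1 + np.exp(log_odds))
print(round(log_odds, 1), round(prob, 4))    # 0.7 0.6667

# Residual for one person with Observed = 1 (loves the film):
print(round(1 - prob, 4))                    # 0.3333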
[Annotation: a small step in the direction of the Probability that this person Loves Troll 2 (earlier Probability = 0.7).]
Save Predicted Probability
Save the new Predicted Probability, repeat for all the training data.
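This per-sample update takes only a few lines. A minimal sketch with hypothetical arrays log_odds (the current predictions) and tree_output (the new Tree's leaf output for each training sample):

import numpy as np

learning_rate = 0.1
log_odds = np.array([0.7, 0.7, 0.7])       # hypothetical current predictions
tree_output = np.array([1.4, 1.4, -2.1])   # hypothetical leaf outputs

# Take a small step, then save the new Predicted Probability for
# all the training data at once.
log_odds = log_odds + learning_rate * tree_output
prob = 1 / (1 + np.exp(-log_odds))
print(prob)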
log(odds) Prediction that someone Loves Troll 2 = 2.3.
Convert it to a Probability and classify people using the pre-specified threshold of 0.5.
YES!!!
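Checking that final classification in code (the 2.3 log(odds) and the 0.5 threshold come from the slide):

import numpy as np

log_odds = 2.3
prob = 1 / (1 + np.exp(-log_odds))           # ~ 0.91
threshold = 0.5
print("YES" if prob > threshold else "NO")   # YES, since 0.91 > 0.5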
Most of the time, the predictive accuracy of the gradient boosting algorithm is on the higher side.
It provides lots of flexibility: it can optimize different loss functions, and it offers several hyperparameter tuning options that make the function fit very flexible.
It continues improving to minimize all errors. This can overemphasize outliers and cause overfitting; cross-validation must be used to neutralize this.
It is computationally very expensive: it often requires many trees (>1000), which can be time- and memory-intensive.
The high flexibility results in many parameters that interact and heavily influence the behavior of the approach (number of iterations, tree depth, regularization parameters, etc.). This requires a large grid search during tuning, as sketched below.
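A hedged sketch of such a grid search, assuming scikit-learn's GridSearchCV (the parameter values are illustrative only):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "n_estimators": [100, 500, 1000],
    "learning_rate": [0.01, 0.1],
    "max_depth": [3, 5],   # regularization knobs (subsample, ...) could be added
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,        # cross-validation, as recommended above
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)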
It can perform poorly on high-dimensional sparse data, e.g. bag-of-words features.
Thanks!!!