Gradient Boosting
Presentation Edited
by
Muhammad Nouman
(FA20-RCS-015)
Department of Computer Science
(Date: 12/2/2022)
Outline
• Introduction
• Boosting
• Algorithm of Gradient Boosting
• Gradient Boosting working
• Gradient Boosting Algorithm with Regression example
• Gradient Boosting Algorithm with classification example
• AdaBoost Vs Gradient Boosting
• Gradient Boosting Advantages and Disadvantages
• References
Introduction
• The first successful boosting algorithm was AdaBoost, invented by Freund and Schapire in 1997.
• Breiman et al. (1999) later reformulated AdaBoost as gradient descent with a special loss function.
• Friedman (2001) generalized AdaBoost to Gradient Boosting in order to handle a variety of loss functions.
• Gradient Boosting is a supervised machine learning algorithm used for classification and regression
problems.
• It is an ensemble technique that uses multiple weak learners to produce a strong model for regression and classification, as illustrated in the sketch below.
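A minimal usage sketch of this idea, assuming scikit-learn is available; the synthetic dataset and the parameter values (n_estimators, learning_rate, max_depth) are illustrative choices, not part of the original slides:

```python
# A minimal sketch: gradient boosting for regression with scikit-learn
# (synthetic data; library and parameter choices are illustrative assumptions).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=500)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Many shallow trees (weak learners) combined into one strong model.
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))
```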
Boosting (1/2)
• Boosting is an ensemble method where the final prediction is a combination of the predictions of several predictors.
• What is different?
• It is iterative.
• Boosting: each successive classifier depends upon its predecessors.
• Previous methods: individual classifiers were “independent”.
• Training examples may have unequal weights.
• Look at the errors from the previous classifier to decide how to focus the next iteration over the data.
• Set the weights to focus more on ‘hard’ examples (the ones on which we made mistakes in the previous iterations).
Boosting (2/2)
• Initially assign uniform weights W0(x) = 1/N for all x, at step k = 0.
• At each iteration k: train a classifier on the weighted examples, measure its weighted error, and increase the weights of the examples it misclassified so that the next classifier focuses on them (see the sketch after this list).
• Common refinements of boosted models include:
• Shrinkage (a learning rate on each learner's contribution)
• Random sampling of the training data (stochastic boosting)
• Penalized learning (regularizing the individual learners)
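The sketch below illustrates the iteration just described with a simplified AdaBoost-style loop; the decision-stump learner, the weight-update rule, and the number of rounds are standard choices used here for illustration, not taken from the slides:

```python
# Simplified AdaBoost-style boosting loop (illustrative sketch, not a full implementation).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """y must be in {-1, +1}. Returns the weak learners and their weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # W0(x) = 1/N: uniform weights at k = 0
    learners, alphas = [], []
    for k in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # train on the weighted examples
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)     # weighted error of this round
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)      # weight of this classifier
        w *= np.exp(-alpha * y * pred)             # boost the weights of 'hard' examples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def predict(learners, alphas, X):
    # Final prediction: weighted vote of all weak learners.
    scores = sum(a * clf.predict(X) for a, clf in zip(alphas, learners))
    return np.sign(scores)
```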
Here y_i is the observed value and gamma is the predicted value. Plugging these into the squared-error loss L(y_i, gamma) = (1/2)(y_i - gamma)^2, the residual for each observation works out to y_i - gamma, i.e. observed - predicted.
Again, the question arises: why exactly observed - predicted? This is mathematically justified; let's see where the formula comes from. This step can be written as the negative gradient of the loss with respect to the current prediction: r_im = -[dL(y_i, F(x_i)) / dF(x_i)], evaluated at F(x) = Fm-1(x).
Here F(x_i) is the previous model's prediction and m is the index of the decision tree being built. We are simply taking the derivative of the loss function with respect to the predicted value, and we have already calculated this derivative.
Looking at the residual formula above, the derivative of the loss function is multiplied by a negative sign, so we get r_im = y_i - Fm-1(x_i), i.e. observed - predicted.
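This can be checked numerically. The toy values below are illustrative (they are not the slide's dataset) and assume the squared-error loss defined above:

```python
# Numerical check that the pseudo-residual for squared-error loss
# L(y, F) = 0.5 * (y - F)^2 is just (observed - predicted).  Toy values.
import numpy as np

y = np.array([15000.0, 13500.0, 16000.0])        # observed values (illustrative)
F_prev = np.array([14500.0, 14500.0, 14500.0])   # previous model's predictions

# dL/dF = -(y - F); the pseudo-residual is the NEGATIVE gradient.
grad = -(y - F_prev)
pseudo_residuals = -grad

print(pseudo_residuals)                            # [  500. -1000.  1500.]
print(np.allclose(pseudo_residuals, y - F_prev))   # True: observed - predicted
```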
In the next step, we build a model on these pseudo-residuals and make predictions, because minimizing these residuals will eventually improve the model's accuracy and predictive power.
Let hm(x) be the decision tree built on these residuals.
Step 4: In this step we find the output value for each leaf of our decision tree. A single leaf may receive more than one residual, so we need to compute one final output value per leaf. For squared-error loss we can simply take the average of all the numbers in a leaf, whether the leaf contains one number or many.
Here hm(x_i) is the decision tree built on the residuals and m is the tree index: when m = 1 we are talking about the first tree, and when m = M we are talking about the last one.
The output value for the leaf is the value of gamma that minimizes the Loss function.
On the left-hand side, gamma is the output value of a particular leaf. On the right-hand side, Fm-1(x_i) + gamma*hm(x_i) is similar to step 1, except that here we take the previous predictions into account, whereas earlier there was no previous prediction.
Now we need to find the value of gamma for which this function is minimum, so we take the derivative of this equation with respect to gamma and set it equal to 0. Solving gives the gamma that minimizes the leaf's loss; a numerical check is sketched below.
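For squared-error loss this minimization has a closed form: the optimal gamma for a leaf is simply the mean of the residuals in that leaf. The toy residuals below are illustrative assumptions used only to confirm the claim:

```python
# For squared-error loss, the leaf output gamma that minimizes
# sum_i 0.5 * (r_i - gamma)^2 is the mean residual in that leaf.
import numpy as np

residuals_in_leaf = np.array([500.0, -1000.0, 1500.0])   # toy pseudo-residuals

# Setting d/dgamma [ sum 0.5*(r_i - gamma)^2 ] = 0  =>  gamma = mean(r_i)
gamma_closed_form = residuals_in_leaf.mean()

# Brute-force check over a grid of candidate gammas.
candidates = np.linspace(-2000, 2000, 40001)
losses = [(0.5 * (residuals_in_leaf - g) ** 2).sum() for g in candidates]
gamma_numeric = candidates[int(np.argmin(losses))]

print(gamma_closed_form, gamma_numeric)   # both ~ 333.33
```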
Since we have just started building our model, m = 1. To make new predictions we compute F1(x) = F0(x) + nu*h1(x).
Here Fm-1(x) is the prediction of the base model (the previous prediction); since m - 1 = 0, F0 is our base model, so the previous prediction is 14500. nu is the learning rate, usually selected between 0 and 1. It reduces the effect each tree has on the final prediction, and this improves accuracy in the long run. Let's take nu = 0.1 in this example.
hm(x) is the most recent decision tree built on the residuals.
Suppose we want a prediction for our first data point, which has a car height of 48.8.
This data point goes through the decision tree; the output it gets is multiplied by the learning rate and then added to the previous prediction.
Now let's say m = 2, which means we have built 2 decision trees and want to make new predictions.
This time we add the previous prediction, F1(x), to the new decision tree built on the residuals.
We iterate through these steps again and again until the loss is negligible. A compact end-to-end sketch of the regression procedure follows below.
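The sketch below puts the regression steps together from scratch; the synthetic data, tree depth, learning rate nu = 0.1, and number of rounds are illustrative assumptions, not the slide's example:

```python
# From-scratch gradient boosting for regression with squared-error loss
# (illustrative sketch; data and hyperparameters are assumptions).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(40, 60, size=(200, 1))           # a single feature, e.g. car height
y = 300.0 * X.ravel() + rng.normal(scale=500.0, size=200)

nu, M = 0.1, 100                                  # learning rate and number of trees
F = np.full_like(y, y.mean())                     # F0: constant model = average of target
trees = []

for m in range(M):
    residuals = y - F                             # pseudo-residuals (observed - predicted)
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)                        # h_m: tree fitted on the residuals
    F = F + nu * tree.predict(X)                  # F_m = F_{m-1} + nu * h_m
    trees.append(tree)

def predict(X_new):
    # Sum of the base prediction and all scaled tree outputs.
    return y.mean() + nu * sum(t.predict(X_new) for t in trees)

print("final training MSE:", np.mean((y - F) ** 2))
```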
Our first step in the gradient boosting algorithm was to initialize the model with some constant value. For regression we used the average of the target column, but for classification we use log(odds) to get that constant value.
Now this is our loss function, and we need to minimize it. To do so, we take its derivative with respect to log(odds) and set it equal to 0.
After finding the residuals, we can build a decision tree using all the independent variables, with the residuals as the target variable.
Once we have our first decision tree, we find the final output of each leaf, because a leaf may receive more than one residual, so we need to calculate one final output value per leaf. For log-loss, the direct formula for a leaf's output is the sum of the residuals in the leaf divided by the sum of p*(1 - p) over the leaf, where p is the previously predicted probability.
Finally, we are ready to get new predictions by adding the new tree built on the residuals to our base model. A step-by-step sketch of one classification round follows below.
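For the classification case, the sketch below walks through one boosting round with log(odds) initialization; the toy data, tree depth, and learning rate are illustrative assumptions, and the leaf-output formula sum(residuals)/sum(p*(1 - p)) is the standard one for binomial log-loss:

```python
# One round of gradient boosting for binary classification with log-loss
# (toy data; leaf output = sum(residuals)/sum(p*(1-p)) for this loss).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(100, 1))
y = (X.ravel() + rng.normal(scale=2.0, size=100) > 5).astype(float)   # labels in {0, 1}

# Step 1: initialize with a constant log(odds) value.
p0 = y.mean()
log_odds = np.log(p0 / (1 - p0))
F = np.full_like(y, log_odds)

# Step 2: residuals = observed label - predicted probability.
prob = 1.0 / (1.0 + np.exp(-F))
residuals = y - prob

# Step 3: fit a regression tree to the residuals.
tree = DecisionTreeRegressor(max_depth=2)
tree.fit(X, residuals)

# Step 4: leaf output = sum(residuals in leaf) / sum(p*(1-p) in leaf).
leaf_ids = tree.apply(X)
leaf_values = {}
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    leaf_values[leaf] = residuals[in_leaf].sum() / (prob[in_leaf] * (1 - prob[in_leaf])).sum()

# Step 5: update the log-odds predictions with learning rate nu.
nu = 0.1
F = F + nu * np.array([leaf_values[l] for l in leaf_ids])
new_prob = 1.0 / (1.0 + np.exp(-F))
print("updated probabilities for first 5 points:", np.round(new_prob[:5], 3))
```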
Gradient Boost
• An additive model where shortcomings of previous models are identified by the gradient.
• The trees are grown to a greater depth, usually having 8 to 32 terminal nodes.
• All trees are weighted equally, and their individual contribution is restricted by the learning rate to increase accuracy.
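To reflect these points in practice, a typical scikit-learn configuration might look like the following; the specific parameter values are illustrative assumptions, with max_leaf_nodes covering the 8 to 32 terminal nodes mentioned above and learning_rate acting as the shrinkage factor:

```python
# Illustrative configuration matching the points above: deeper trees
# (8-32 terminal nodes) whose contribution is shrunk by a learning rate.
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=300,       # number of boosting stages (trees), all weighted equally
    learning_rate=0.05,     # shrinkage: restricts each tree's effect on the prediction
    max_leaf_nodes=16,      # terminal nodes per tree, in the 8-32 range
    subsample=0.8,          # optional row subsampling (stochastic gradient boosting)
)
```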
Disadvantages
• GBMs will continue improving to minimize all errors. This can overemphasize outliers and cause
overfitting.
• Computationally expensive: GBMs often require many trees (>1000), which can be time- and memory-intensive.
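One common way to limit both the overfitting risk and the cost of training many trees is early stopping on a validation split. The sketch below uses scikit-learn's built-in early-stopping parameters; the dataset and parameter values are illustrative assumptions:

```python
# Early stopping sketch: stop adding trees once a validation score stops improving,
# which limits both overfitting and the number of trees that must be trained.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=2000,        # generous upper bound on the number of trees
    learning_rate=0.05,
    validation_fraction=0.1,  # hold out 10% of the training data internally
    n_iter_no_change=20,      # stop if no improvement for 20 consecutive rounds
    random_state=0,
)
model.fit(X, y)
print("trees actually used:", model.n_estimators_)
```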
References
• https://slideplayer.com/slide/13619675/
• https://data-flair.training/blogs/gradient-boosting-algorithm/
• https://www.slideshare.net/prateekkrch/gradient-boosting-for-regression-problems-with-example-basics-of-regression-algorithm
• https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
• https://www.analyticsvidhya.com/blog/2021/09/gradient-boosting-algorithm-a-complete-guide-for-beginners/
• https://www.analyticsvidhya.com/blog/2020/10/adaboost-and-gradient-boost-comparitive-study-between-2-popular-ensemble-model-techniques/