
A

Seminar Report
On
Gradient Boosting

In partial fulfillment of the requirements for the degree of


Bachelor of Technology
in
Computer Engineering

Submitted By:
Abrar Nirban
Roll No: 570

Under Supervision of:


Prof. Amol Jumde

Department of Computer Engineering

Shiksha Mandal's
Bajaj Institute of Technology, Pipri, Wardha
(Affiliated to DBATU, Lonere, Raigad, Maharashtra)

2020-21

Shiksha Mandal’s

Bajaj Institute of Technology, Wardha


Department of Computer Engineering

CERTIFICATE

Date: 08/01/2021

Certified that seminar work entitled “GRADIENT BOOSTING” is a bonafide


work carried out in the Fifth semester by ABRAR NIRBAN in partial fulfilment
for the award of Bachelor of Technology in Computer Engineering from Shiksha
Mandal’s Bajaj Institute of Technology, Pipri, Wardha during the academic year

2020-2021.

Prof. Amol Jumde (Seminar Guide)

Prof. U. N. Pote (Seminar Coordinator)

Prof. P. Kulkarni
HOD (Computer Engineering Department)


INDEX

ABSTRACT

1. INTRODUCTION
   1.1 What is Boosting?
   1.2 What is Gradient Boosting?
   1.3 How Do Boosting Algorithms Work?
   1.4 Objective

2. LITERATURE SURVEY

3. THEORY BEHIND GRADIENT BOOSTING

4. IMPROVEMENTS TO THE GRADIENT BOOSTING ALGORITHM
   4.1 Tree Constraints
   4.2 Weighted Updates
   4.3 Stochastic Gradient Boosting Algorithm
   4.4 Penalized Gradient Boosting Algorithm

5. STEPS TO GRADIENT BOOSTING

6. HOW GRADIENT BOOSTING WORKS
   6.1 Loss Function
   6.2 Weak Learner
   6.3 Additive Model

7. GRADIENT BOOSTING ALGORITHM

8. INTUITION BEHIND GRADIENT BOOSTING

9. TYPES OF BOOSTING ALGORITHMS

FUTURE SCOPE

CONCLUSION

REFERENCES

LIST OF FIGURES

Figure No.  Figure Caption
1           Steps for gradient boosting
2           Sample random normally distributed residuals with mean around 0

ABSTRACT

Gradient boosting is a technique for building prediction models, used mostly in regression and
classification. The resulting prediction models are typically ensembles of decision trees. Like
other boosting methods, gradient boosting builds the model in stages, and it generalizes them by
allowing the optimization of an arbitrary differentiable loss function. The concept of gradient
boosting originated with the American statistician Leo Breiman, who observed that boosting can be
interpreted as an optimization algorithm on a suitable cost function. The method has since been
developed further to optimize cost functions by iteratively picking weak hypotheses, that is,
functions pointing in the direction of the negative gradient. Gradient boosting often provides
predictive accuracy that is hard to beat. It is also flexible: it can optimize different loss
functions and offers several hyperparameter tuning options.

Keywords – Gradient boosting, AdaBoost, XGBoost.


1. INTRODUCTION

In this chapter we will see what boosting and gradient boosting are, how they work, and what their objective is.

1.1 What is Boosting?
In machine learning, boosting is an ensemble meta-algorithm for reducing primarily bias, and also
variance, in supervised learning, and a family of machine learning algorithms that convert weak
learners to strong ones. Boosting is based on the question posed by Kearns and Valiant (1988, 1989):
"Can a set of weak learners create a single strong learner?" A weak learner is defined to be a
classifier that is only slightly correlated with the true classification (it can label examples
better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily
well-correlated with the true classification. Robert Schapire's affirmative answer in a 1990 paper
to the question of Kearns and Valiant has had significant ramifications in machine learning and
statistics, most notably leading to the development of boosting.

When first introduced, the hypothesis boosting problem simply referred to the process of turning a
weak learner into a strong learner. "Informally, [the hypothesis boosting] problem asks whether an
efficient learning algorithm that outputs a hypothesis whose performance is only slightly better
than random guessing [i.e. a weak learner] implies the existence of an efficient algorithm that
outputs a hypothesis of arbitrary accuracy [i.e. a strong learner]." Algorithms that achieve
hypothesis boosting quickly became known simply as "boosting". Freund and Schapire's arcing
(Adaptive Resampling and Combining), as a general technique, is more or less synonymous with
boosting.

1.2 What is Gradient Boosting?


Gradient boosting is a machine learning technique for regression and classification
problems, which produces a prediction model in the form of an ensemble of weak
prediction models, typically decision trees. It builds the model in a stage-wise fashion like
other boosting methods do, and it generalizes them by allowing optimization of an
arbitrary differentiable loss function.


The idea of gradient boosting originated in the observation by Leo Breiman that boosting can be
interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient
boosting algorithms were subsequently developed by Jerome H. Friedman, simultaneously with the more
general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and
Marcus Frean. The latter two papers introduced the view of boosting algorithms as iterative
functional gradient descent algorithms: algorithms that optimize a cost function over function
space by iteratively choosing a function (a weak hypothesis) that points in the negative gradient
direction. This functional gradient view of boosting has led to the development of boosting
algorithms in many areas of machine learning and statistics beyond regression and classification.

1.3 How Do Boosting Algorithms Work?


Boosting is a general ensemble method that creates a strong classifier from a number of weak
classifiers. This is done by building a model from the training data, then creating a second
model that attempts to correct the errors of the first model. Models are added until the
training set is predicted perfectly or a maximum number of models is reached.

1.4 Objective
The objective of gradient boosting classifiers is to minimize the loss, that is, the difference
between the actual class value of a training example and the predicted class value. It is not
necessary to understand the full loss-reduction procedure in order to use the method; it operates
similarly to gradient descent in a neural network.
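
The snippet below is a minimal sketch of this objective for a binary classifier, assuming scikit-learn is available; the labels and probabilities are made-up illustrative numbers, not values from any dataset. A model whose predicted probabilities sit closer to the actual class values incurs a smaller log loss, which is the quantity the classifier tries to drive down.

```python
# Illustrative only: compare the log loss of confident, mostly-correct predictions
# against uninformative 50/50 predictions. A smaller loss means the predictions
# are closer to the actual class values.
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]             # actual class values of the training examples
p_good = [0.1, 0.9, 0.8, 0.2]     # predicted probabilities close to the truth
p_poor = [0.5, 0.5, 0.5, 0.5]     # predicted probabilities that carry no information

print("loss of good predictions:", log_loss(y_true, p_good))   # small
print("loss of poor predictions:", log_loss(y_true, p_poor))   # about 0.693
```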


2. LITERATURE SURVEY

The first paper is "Energy theft detection using gradient boosting theft detector with feature
engineering-based preprocessing" by Rajiv Punmiya and Sangho Choe [1]. For smart grid energy theft
identification, this letter introduces a gradient boosting theft detector (GBTD) based on the three
latest gradient boosting classifiers (GBCs): extreme gradient boosting (XGBoost), categorical
boosting (CatBoost), and the light gradient boosting method (LightGBM). While most existing ML
algorithms focus only on fine-tuning the hyperparameters of the classifiers, the proposed GBTD
algorithm focuses on feature engineering-based preprocessing to improve detection performance as
well as time complexity. GBTD improves both the detection rate (DR) and the false positive rate
(FPR) of those GBCs by generating stochastic features such as the standard deviation, mean,
minimum, and maximum value of daily electricity usage. Additionally, the letter proposes an updated
version of the existing six theft cases to mimic real-world theft patterns and applies them to the
dataset for numerical evaluation of the proposed algorithm.

The second paper is "Pose-based Human Action Recognition with Extreme Gradient Boosting" by Vina
Ayumi [2]. This paper investigates action recognition using Extreme Gradient Boosting (XGBoost).
XGBoost is a supervised classification technique using an ensemble of decision trees. The study
also compares the performance of XGBoost with other machine learning techniques, Support Vector
Machines (SVM) and Naive Bayes (NB). The experimental study on a human action dataset shows that
XGBoost achieves better classification accuracy than SVM and NB. Although it takes more
computational time, XGBoost performs well on action recognition.

The third paper is "Short Term Power Demand Prediction Using Stochastic Gradient Boosting" by Ali
Bou Nassif [3]. Power demand prediction is vital in the power system and delivery engineering
fields. By efficiently predicting power demand, we can forecast the total energy to be consumed in
a certain city or district, so that the exact resources required to produce the demanded power can
be allocated. In this paper, a Stochastic Gradient Boosting (also known as TreeBoost) model is used
to predict the short-term power demand for the Emirate of Sharjah in the United Arab Emirates
(UAE). Results show that the proposed model gives promising results in comparison to the model used
by the Sharjah Electricity and Water Authority (SEWA).

The fourth paper is "Verifying the Value and Veracity of eXtreme Gradient Boosted Decision Trees on
a Variety of Datasets" by Aditya Gupta and Kunal Gusain [4]. Learning models are used widely both
in industry and in areas of our daily lives, and they therefore see a large amount of improvement
and research. Gradient Boosted Machines (GBM) were one approach known to give accurate solutions,
using ensembles of trees built upon weak learners to classify the data. Over time the need for a
more scalable, modifiable, and accurate system was felt, and building upon GBMs an improved variant
called eXtreme GBM (XGBoost) was proposed. XGBoost gave highly accurate results in many
international competitions and presented itself as an ideal learning model ready to be adapted for
wide usage. The authors' objective was to experimentally verify the value and veracity of this new
approach, and towards this they analyzed and compared it with traditional and benchmark algorithms
on a variety of datasets. XGBoost outperformed its counterparts, attesting to the fact that it
indeed holds promise.

The last paper is "Nonlinear Prediction of Gross Industrial Output Time Series by Gradient
Boosting" by Rui Zhang and Hong-li Wang [5]. Predicting gross industrial production is helpful for
planning in a development zone. Historical data for the Jinchuan district, Hohhot, were collected.
The BDS, Ljung-Box, Box-Pierce, White's and Teraesvirta's neural network tests and a surrogate data
test were combined to select a proper model. Based on phase space reconstruction, the function
fitting was carried out by gradient boosting. The results showed that nonlinear dependence existed
in the series. The production in 2015 was predicted to be 6937977 ten thousand Yuan.


3. THEORY BEHIND GRADIENT BOOSTING

The gradient boosting classifier depends on a loss function. A custom loss function can be used,
and many standardized loss functions are supported by gradient boosting classifiers, but the loss
function has to be differentiable. Classification algorithms frequently use logarithmic loss, while
regression algorithms can use squared errors. Gradient boosting systems do not have to derive a new
loss function every time a boosting algorithm is added; rather, any differentiable loss function
can be applied to the system.

Gradient boosting systems have two other necessary parts: a weak learner and an additive
component. Gradient boosting systems use decision trees as their weak learners. Specifically,
regression trees are used as the weak learners, and these regression trees output real values.
Because the outputs are real values, as new learners are added into the model the outputs of the
regression trees can be added together to correct for errors in the predictions. The additive
component of a gradient boosting model comes from the fact that trees are added to the model over
time, and when this occurs the existing trees are not manipulated; their values remain fixed.

A procedure similar to gradient descent is used to minimize the error. The calculated loss is used
to perform a gradient-descent step that reduces it: the parameters of the new tree are chosen so as
to reduce the residual loss. The new tree's output is then added to the output of the previous
trees used in the model. This process is repeated until a previously specified number of trees is
reached, or the loss is reduced below a certain threshold.
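
A minimal from-scratch sketch of this procedure for squared-error regression is shown below. It assumes scikit-learn and NumPy are available, and the toy data, number of trees and learning rate are illustrative choices rather than values prescribed by this report: starting from a constant prediction, each round fits a small regression tree to the current residuals (the negative gradient of the squared error) and adds its shrunken output to the running prediction, leaving earlier trees untouched.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, learning_rate = 100, 0.1
prediction = np.full_like(y, y.mean())    # initial model: just the mean target value
trees = []

for _ in range(n_trees):
    residuals = y - prediction                        # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=2)         # weak learner: a small regression tree
    tree.fit(X, residuals)                            # fit the tree to the residuals
    prediction += learning_rate * tree.predict(X)     # additive update; earlier trees unchanged
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```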


4. IMPROVEMENTS TO THE GRADIENT BOOSTING ALGORITHM

The gradient boosting algorithm is a greedy algorithm and can overfit a training dataset quickly.
It benefits from regularization methods that penalize various parts of the algorithm and generally
improve its performance by reducing overfitting. Some improvements to the gradient boosting
algorithm are listed below; a short code sketch of how they map onto library parameters is given at
the end of this section.

4.1 Tree Constraints

• It is important that the weak learners have skill but remain weak.
• There are many ways in which the trees can be constrained.
4.2 Weighted Updates
• The predictions of each tree are added together sequentially.
• The contribution of each tree to this sum is weighted to slow down the learning of the algorithm.
This weighting is referred to as shrinkage or a learning rate.
4.3 Stochastic Gradient Boosting Algorithm
• A big insight from bagging ensembles and random forests is that trees can be created from
subsamples of the training dataset.
• This same benefit can be used to reduce the correlation between the trees in the boosted sequence.
• This variation of boosting is referred to as stochastic gradient boosting.
4.4 Penalized Gradient Boosting Algorithm

• Additional constraints can be imposed on the parameterized trees.
• Classical decision trees cannot be used as weak learners here. Instead, a modified form called a
regression tree is used that has numeric values in the leaf nodes; the values in the leaves of the
trees are called weights in some literature.
As such, the leaf weight values of the trees have to be regularized. For this, we use popular
regularization functions, such as:
• L1 regularization of weights.
• L2 regularization of weights.
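
As promised above, here is a hedged sketch of how these four improvements map onto common parameters of the XGBoost library (the parameter names are XGBoost's own; other libraries such as scikit-learn or LightGBM expose similar knobs under different names):

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=300,     # number of boosted trees
    max_depth=3,          # 4.1 tree constraints: keep the weak learners shallow and weak
    learning_rate=0.05,   # 4.2 weighted updates: shrinkage applied to each tree's contribution
    subsample=0.8,        # 4.3 stochastic boosting: each tree sees a random 80% of the rows
    reg_alpha=0.1,        # 4.4 penalized boosting: L1 regularization of the leaf weights
    reg_lambda=1.0,       #     and L2 regularization of the leaf weights
)
# model.fit(X_train, y_train) would then train it like any other scikit-learn-style estimator.
```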

5. STEPS TO GRADIENT BOOSTING

In order to implement a gradient boosting classifier, we'll need to carry out a number of
different steps. We'll need to:
• Fit the model
• Tune the model's parameters and hyperparameters
• Make predictions
• Interpret the results
Fitting models with Scikit-Learn is fairly easy, as we typically just have to call the fit()
command after setting up the model. However, tuning the model's hyperparameters requires some
active decision making on our part. There are various arguments/hyperparameters we can tune to try
and get the best accuracy for the model. One of the ways we can do this is by altering the learning
rate of the model. We'll want to check the performance of the model on the training set at
different learning rates, and then use the best learning rate to make predictions. Predictions can
be made in Scikit-Learn very simply by using the predict() function after fitting the classifier.
You'll want to predict on the features of the testing dataset, and then compare the predictions to
the actual labels. The process of evaluating a classifier typically involves checking the accuracy
of the classifier and then tweaking the parameters/hyperparameters of the model until the
classifier has an accuracy that the user is satisfied with. In Fig. 1 we can see the steps for
gradient boosting; a minimal code sketch of this workflow follows the figure.


Fig.1 Steps for gradient boosting.
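
Below is a minimal sketch of this fit / tune / predict / evaluate workflow using scikit-learn's GradientBoostingClassifier. The dataset loader, the train/test split and the grid of learning rates are illustrative assumptions, not choices made in this report.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

best_rate, best_score = None, 0.0
for lr in (0.01, 0.05, 0.1, 0.25, 0.5):          # tune the learning rate
    clf = GradientBoostingClassifier(learning_rate=lr, random_state=42)
    clf.fit(X_train, y_train)                     # fit the model
    score = clf.score(X_train, y_train)           # check performance on the training set
    if score > best_score:
        best_rate, best_score = lr, score

final = GradientBoostingClassifier(learning_rate=best_rate, random_state=42)
final.fit(X_train, y_train)
predictions = final.predict(X_test)               # make predictions on the test features
print("test accuracy:", accuracy_score(y_test, predictions))  # interpret the results
```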

6. HOW GRADIENT BOOSTING WORKS

Gradient boosting involves three elements:


• A loss function to be optimized.
• A weak learner to make predictions.
• An additive model to add weak learners to minimize the loss function.

6.1 Loss Function


The loss function used depends on the type of problem being solved. It must be differentiable, but
many standard loss functions are supported and you can define your own. For example, regression may
use a squared error and classification may use logarithmic loss. A benefit of the gradient boosting
framework is that a new boosting algorithm does not have to be derived for each loss function you
may want to use; instead, the framework is generic enough that any differentiable loss function can
be used.
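
As a hedged illustration (the loss names below follow recent scikit-learn releases; older versions use different names such as 'deviance' and 'ls'), the same estimators accept different differentiable losses directly:

```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

reg = GradientBoostingRegressor(loss="squared_error")   # squared error for regression
reg_robust = GradientBoostingRegressor(loss="huber")    # a more outlier-robust regression loss
clf = GradientBoostingClassifier(loss="log_loss")       # logarithmic loss for classification
```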

6.2 Weak Learner


Decision trees are used as the weak learner in gradient boosting. Specifically, regression trees
are used that output real values for splits and whose output can be added together, allowing
subsequent models' outputs to be added to "correct" the residuals in the predictions. Trees are
constructed in a greedy manner, choosing the best split points based on purity scores like Gini or
to minimize the loss. Initially, such as in the case of AdaBoost, very short decision trees were
used that only had a single split, called a decision stump. It is common to constrain the weak
learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.
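
The snippet below is a small sketch of such a weak learner: a depth-1 regression tree (a decision stump) whose leaves hold real values, so the output of a second stump fitted to the residuals can simply be added to the first. The toy numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.2, 1.9, 3.1, 3.9])

stump = DecisionTreeRegressor(max_depth=1)              # a single split: a decision stump
first_pass = stump.fit(X, y).predict(X)                 # real-valued leaf outputs

residual_stump = DecisionTreeRegressor(max_depth=1)
corrected = first_pass + residual_stump.fit(X, y - first_pass).predict(X)  # outputs simply add up
print(corrected)                                        # closer to y than first_pass alone
```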
6.3 Additive Model
Trees are added one at a time, and existing trees in the model are not changed. A gradient descent
procedure is used to minimize the loss when adding trees. Traditionally, gradient descent is used
to minimize a set of parameters, such as the coefficients in a regression equation or the weights
in a neural network: after calculating the error or loss, the weights are updated to minimize that
error. Instead of parameters, here we have weak learner sub-models, or more specifically decision
trees.

7. GRADIENT BOOSTING ALGORITHM

Gradient boosting is a machine learning technique for regression and classification problems, which
produces a prediction model in the form of an ensemble of weak prediction models, typically
decision trees. The objective of any supervised learning algorithm is to define a loss function and
minimize it. Let us see how the maths works out for the gradient boosting algorithm. Say we take
the mean squared error (MSE) as the loss, defined as:

Loss = MSE = Σ (y_i − y_p)²

where y_i is the i-th target value, y_p is the corresponding prediction, and L(y_i, y_p) is the
loss function.

We want our predictions to be such that our loss function (MSE) is at its minimum. By using
gradient descent and updating our predictions based on a learning rate, we can find the values
where the MSE is minimum. For each prediction, the gradient descent update is

y_p := y_p − a · ∂(y_i − y_p)² / ∂y_p

which becomes

y_p := y_p + 2a · (y_i − y_p)

where a is the learning rate and (y_i − y_p) is the residual.


So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or
minimum) and predicted values are sufficiently close to actual values.
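
A small numeric check of this update rule is sketched below (an illustrative NumPy snippet, not part of the report): repeatedly nudging each prediction by the learning rate times twice the residual drives the residuals, and hence the MSE, toward zero.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.0, 7.0])      # target values y_i
y_pred = np.full_like(y, y.mean())       # start every prediction at the mean
a = 0.1                                  # learning rate

for _ in range(50):
    residuals = y - y_pred               # (y_i - y_p)
    y_pred = y_pred + 2 * a * residuals  # y_p := y_p + 2a(y_i - y_p)

print("sum of residuals:", np.sum(y - y_pred))          # close to 0
print("MSE after updates:", np.mean((y - y_pred) ** 2))  # close to 0
```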

8. INTUITION BEHIND GRADIENT BOOSTING

The logic behind gradient boosting is simple and can be understood intuitively, without
mathematical notation. It is assumed that the reader is familiar with simple linear regression
modelling. A basic assumption of linear regression is that the sum of its residuals is 0, i.e. the
residuals should be spread randomly around zero. Now think of these residuals as mistakes committed
by our predictor model. Although tree-based models are not based on such assumptions, if we think
logically (not statistically) about this assumption, we might argue that if we are able to see some
pattern in the residuals around 0, we can leverage that pattern to fit a model. So the intuition
behind the gradient boosting algorithm is to repetitively leverage the patterns in the residuals
and strengthen a model with weak predictions to make it better. Fig. 2 shows sample random normally
distributed residuals with mean around 0. Once we reach a stage where the residuals do not have any
pattern that can be modeled, we can stop modeling the residuals (otherwise it might lead to
overfitting). Algorithmically, we are minimizing our loss function such that the test loss reaches
its minimum.

Fig. 2 Sample random normally distributed residuals with mean around 0

9. TYPES OF BOOSTING ALGORITHMS

Boosting algorithms are special algorithms used to augment the existing results of a data model and
help to fix its errors. They use the concept of converting weak learners into strong learners
through weighted averages and majority votes for prediction. There are three main types of boosting
algorithms, as follows:
9.1 AdaBoost


The basis of this algorithm is the core idea of boosting: give more weight to the misclassified
observations. In particular, AdaBoost stands for Adaptive Boosting, meaning that the meta-learner
adapts based upon the results of the weak classifiers, giving more weight to the observations
misclassified by the last weak learner.
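
A minimal sketch of AdaBoost in code is shown below, using scikit-learn's AdaBoostClassifier with decision stumps as the weak learners; the synthetic dataset and settings are illustrative assumptions, and the estimator keyword follows recent scikit-learn versions (older releases call it base_estimator).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,                               # number of reweighting rounds
    random_state=0,
)
ada.fit(X, y)                                       # each round upweights misclassified points
print("training accuracy:", ada.score(X, y))
```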
9.2 Gradient Boosting
With gradient boosting we have a generalization of boosting techniques in which it is possible to
optimize the meta-learner based on an arbitrary differentiable loss function. Whereas in AdaBoost
the meta-learner is adjusted based on the observations misclassified by the weak learners, in
gradient boosting the meta-learner is optimized (with techniques such as gradient descent) based
upon a loss function.
9.3 XGBoost
Here we have one of the most widely used boosting algorithms (if not the most used). It has gained
its fame through Kaggle competitions and their winners, thanks to the fine tuning (and ensembling)
of these algorithms.
The major benefits of XGBoost are:
• Speed
• High scalability
• Parallelization
• It usually outperforms other algorithms

FUTURE SCOPE

Machine learning is a very active research area and there are already several viable alternatives
to XGBoost. Microsoft Research recently released the LightGBM framework for gradient boosting,
which shows great potential. CatBoost, developed by Yandex, has been delivering impressive
benchmarking results. It is only a matter of time before we have a framework that beats XGBoost in
terms of prediction performance, flexibility, explainability, and pragmatism. However, until a
strong challenger comes along, XGBoost will continue to reign over machine learning.

CONCLUSION

Gradient boosting models are powerful algorithms which can be used for both classification and
regression tasks. They can perform incredibly well on very complex datasets, but they are also
prone to overfitting.

GBM is not just one particular algorithm but a general technique for building model ensembles, and
the approach is versatile and extensible: a large number of models can be trained, taking into
account various loss functions and a range of weighting functions. Because gradient boosting is
prone to over-fitting, it requires careful tuning of its hyper-parameters. The gradient boosting
algorithm builds a forest of a fixed number of decision trees, called weak learners or weak
predictive models, each of fixed size or depth. Gradient boosting starts with the mean of the
target values and adds the prediction/contribution of each subsequent tree after shrinking it by
what is called the learning rate.


REFERENCES

[1] R. Punmiya and S. Choe, "Energy Theft Detection Using Gradient Boosting Theft Detector
With Feature Engineering-Based Preprocessing," in IEEE Transactions on Smart Grid, vol.
10, no. 2, pp. 2326-2329, March 2019, doi: 10.1109/TSG.2019.2892595.

[2] A. Gupta, K. Gusain and B. Popli, "Verifying the value and veracity of extreme gradient
boosted decision trees on a variety of datasets," 2016 11th International Conference on
Industrial and Information Systems (ICIIS), Roorkee, 2016, pp. 457-462,
doi:10.1109/ICIINFS.2016.8262984.

[3] V. Ayumi, "Pose-based human action recognition with Extreme Gradient Boosting," 2016
IEEE Student Conference on Research and Development (SCOReD), Kuala Lumpur, 2016,
pp. 1-5, doi: 10.1109/SCORED.2016.7810099.

[4] H. Masnadi-Shirazi and N. Vasconcelos, "Cost-Sensitive Boosting," in IEEE Transactions


on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 294-309, Feb. 2011, doi:
10.1109/TPAMI.2010.71.

[5] A. B. Nassif, "Short term power demand prediction using stochastic gradient boosting,"
2016 5th International Conference on Electronic Devices, Systems and Applications
(ICEDSA), Ras Al Khaimah, 2016, pp. 1-4, doi: 10.1109/ICEDSA.2016.7818510.

[6] S. Lu, B. Wang, H. Wang and Q. Hong, "A Hybrid Collaborative Filtering Algorithm Based
on KNN and Gradient Boosting," 2018 13th International Conference on Computer Science
& Education (ICCSE), Colombo, 2018, pp. 1-5, doi: 10.1109/ICCSE.2018.8468751.

[7] StackAbuse, "Gradient Boosting Classifiers in Python with Scikit-Learn," stackabuse.com. Date
visited: 2/11/2020.

[8] Wikipedia, "Gradient boosting," wikipedia.org. Date visited: 4/11/2020.

[9] J. D'Souza, "A Quick Guide to Boosting in ML," GreyAtom, Medium. Date visited: 10/11/2020.

[10] Machine Learning Plus, "Gradient Boosting - A Concise Introduction from Scratch,"
machinelearningplus.com. Date visited: 21/11/2020.

