Seminar Report
On
Gradient Boosting
Submitted By:
Abrar Nirban
Roll No: 570
Shiksha Mandal's
Bajaj Institute of Technology, Pipri, Wardha
(Affiliated to DBATU, Lonere, Raigad, Maharashtra)
2020-21
Shiksha Mandal’s
CERTIFICATE
Date: 08/01/2021
2020-2021.
Prof. P. Kulkarni
HOD
(Computer Engineering Department)
INDEX
ABSTRACT
1. INTRODUCTION
1.1 What is Boosting?
1.2 What is Gradient Boosting?
1.3 How Do Boosting Algorithms Work?
1.4 Objective
2. LITERATURE SURVEY
FUTURE SCOPE
CONCLUSION
REFERENCES
LIST OF FIGURES
ABSTRACT
Gradient boosting is a technique used in creating models for prediction. The technique is
mostly used in regression and classification procedures. Prediction models are often presented
as decision trees for choosing the best prediction. Gradient boosting builds the model in
stages, just like other boosting methods, while allowing the generalization and optimization
of differentiable loss functions. The concept of gradient boosting originated with the American
statistician Leo Breiman, who observed that boosting could be interpreted as an optimization
algorithm on a suitable cost function. The method has since been developed further to optimize
cost functions by iteratively picking weak hypotheses, or functions, in the direction of the negative
gradient. Gradient boosting often provides predictive accuracy that is hard to beat. It is also
highly flexible: it can optimize different loss functions and provides several hyperparameter
tuning options.
1. INTRODUCTION
In this chapter we will see what boosting and gradient boosting are, how they work, and their objective.
1.1 What is Boosting?
In machine learning, boosting is an ensemble meta-algorithm primarily for reducing bias, and
also variance, in supervised learning, and a family of machine learning algorithms that convert
weak learners into strong ones. Boosting is based on the question posed
by Kearns and Valiant (1988, 1989): "Can a set of weak learners create a single strong
learner?" A weak learner is defined to be a classifier that is only slightly correlated with the
true classification (it can label examples better than random guessing). In contrast, a strong
learner is a classifier that is arbitrarily well correlated with the true classification. Robert
Schapire's affirmative answer in a 1990 paper to the question of Kearns and Valiant has had
significant ramifications in machine learning and statistics, most notably leading to the
development of boosting.
When first introduced, the hypothesis boosting problem simply referred
to the process of turning a weak learner into a strong learner. "Informally, [the
hypothesis boosting] problem asks whether an efficient learning algorithm that
outputs a hypothesis whose performance is only slightly better than random guessing
[i.e. a weak learner] implies the existence of an efficient algorithm that outputs a
hypothesis of arbitrary accuracy [i.e. a strong learner]." Algorithms that achieve
hypothesis boosting quickly became simply known as "boosting". Freund and
Schapire's arcing (Adaptive Resampling and Combining), as a general technique, is
more or less synonymous with boosting.
1.4 Objective
The objective of gradient boosting classifiers is to minimize the loss, that is, the difference between
the actual class value of the training example and the predicted class value. It is not necessary to
understand the details of the loss-reduction process in order to use the classifier, but it operates
similarly to gradient descent in a neural network.
2. LITERATURE SURVEY
The first paper is "Energy theft detection using gradient boosting theft detector with
feature engineering-based preprocessing" by Rajiv Punmiya and Sangho Choe [1]. For smart
grid energy theft identification, this letter introduces a gradient boosting theft detector (GBTD)
based on the three latest gradient boosting classifiers (GBCs): extreme gradient boosting
(XGBoost), categorical boosting (CatBoost), and light gradient boosting method (LightGBM).
While most existing ML algorithms focus only on fine-tuning the hyperparameters of the
classifiers, the proposed GBTD focuses on feature engineering-based preprocessing
to improve detection performance as well as time complexity. GBTD improves both the detection
rate (DR) and the false positive rate (FPR) of those GBCs by generating stochastic features such as
the standard deviation, mean, minimum, and maximum value of daily electricity usage.
Additionally, the letter proposes an updated version of the existing six theft cases to mimic
real-world theft patterns and applies them to the dataset for numerical evaluation of the proposed
algorithm.
The third paper is "Short Term Power Demand Prediction Using
Stochastic Gradient Boosting" by Ali Bou Nassif [3]. Power demand prediction is vital in the
power system and delivery engineering fields. By efficiently predicting the power
demand, we can forecast the total energy to be consumed in a certain city or district.
Thus, the exact resources required to produce the demanded power can be allocated. In this
paper, a Stochastic Gradient Boosting (aka TreeBoost) model is used to predict the short-term
power demand for the Emirate of Sharjah in the United Arab Emirates (UAE).
Results show that the proposed model gives promising results in comparison to the
model used by the Sharjah Electricity and Water Authority (SEWA).
The fourth paper is "Verifying the Value and Veracity of eXtreme
Gradient Boosted Decision Trees on a Variety of Datasets" by Aditya Gupta and Kunal
Gusain [4]. Learning models are used widely both in industry and in areas of our
daily lives; they thus witness a large amount of improvement and research. Gradient
Boosted Machines (GBM) was one approach known to give accurate
solutions, using ensembles of trees built upon weak learners to classify the data.
Over time the need for a more scalable, modifiable, and accurate system was felt, and,
building upon GBMs, an improved variant called eXtreme GBM (XGBoost) was
proposed. XGBoost gave highly accurate results in many international competitions and
presented itself as an ideal learning model ready to be adapted for wide usage. The authors'
objective was to experimentally verify the value and veracity of this new approach, and
towards this they analyzed and compared it with traditional and benchmark algorithms
on a variety of datasets. XGBoost outperformed its counterparts, attesting to the fact
that it indeed holds promise.
The gradient boosting classifier depends on a loss function. A custom loss function can be
used, and many standardized loss functions are supported by gradient boosting classifiers, but
the loss function has to be differentiable. Classification algorithms frequently use logarithmic
loss, while regression algorithms can use squared error. Gradient boosting systems do not have
to derive a new loss function each time the boosting algorithm is applied; rather, any
differentiable loss function can be plugged into the system.
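As a rough illustration of choosing a differentiable loss, the sketch below uses scikit-learn's gradient boosting estimators. Note that the exact loss identifiers ("log_loss", "squared_error") are the names used in recent scikit-learn releases and are an assumption here; older versions use different strings.

# Minimal sketch: selecting a differentiable loss in scikit-learn's gradient boosting.
# The loss names below follow recent scikit-learn releases (assumption).
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Logarithmic loss for classification.
clf = GradientBoostingClassifier(loss="log_loss", n_estimators=100, learning_rate=0.1)

# Squared error for regression.
reg = GradientBoostingRegressor(loss="squared_error", n_estimators=100, learning_rate=0.1)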
The gradient boosting algorithm is a greedy algorithm and can overfit a training dataset quickly.
It can benefit from regularization methods that penalize various parts of the algorithm and
generally improve its performance by reducing overfitting. Here are some improvements for the
gradient boosting algorithm:
• L2 regularization of weights.
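As a brief, hedged sketch of that point: scikit-learn's histogram-based gradient boosting estimator exposes an L2 penalty on the leaf values through its l2_regularization parameter. The value used below is an illustrative assumption, not a recommendation.

# Sketch: L2 regularization of leaf weights to reduce overfitting (illustrative values).
from sklearn.ensemble import HistGradientBoostingClassifier

model = HistGradientBoostingClassifier(
    learning_rate=0.1,
    l2_regularization=1.0,  # penalizes large leaf values
)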
In order to implement a gradient boosting classifier, we'll need to carry out a number of
different steps. We'll need to:
• Fit the model
• Tune the model's parameters and hyperparameters
• Make predictions
• Interpret the results
Fitting models with Scikit-Learn is fairly easy, as we typically just have
to call the fit() command after setting up the model. However, tuning the model's
hyperparameters requires some active decision making on our part. There are
various arguments/hyperparameters we can tune to try to get the best accuracy for the
model. One of the ways we can do this is by altering the learning rate of the model.
We'll want to check the performance of the model on the training set at different
learning rates, and then use the best learning rate to make predictions. Predictions can be
made in Scikit-Learn very simply by using the predict() function after fitting the
classifier. You'll want to predict on the features of the testing dataset, and then compare
the predictions to the actual labels. The process of evaluating a classifier typically
involves checking the accuracy of the classifier and then tweaking the
parameters/hyperparameters of the model until the classifier has an accuracy that the
user is satisfied with. In Fig. 1 we can see the steps for gradient boosting.
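A minimal sketch of this workflow is given below, assuming scikit-learn and one of its built-in datasets (the breast cancer data); the candidate learning rates and other values are assumptions chosen purely for demonstration.

# Minimal sketch of the workflow described above: fit, tune the learning rate,
# predict, and evaluate. Dataset and parameter values are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check training-set performance at several learning rates, as described above.
best_lr, best_score = None, 0.0
for lr in (0.05, 0.1, 0.25, 0.5, 1.0):
    model = GradientBoostingClassifier(learning_rate=lr, n_estimators=100, random_state=42)
    model.fit(X_train, y_train)                 # fit the model
    score = model.score(X_train, y_train)       # training accuracy at this learning rate
    if score > best_score:
        best_lr, best_score = lr, score

# Refit with the chosen learning rate and predict on the held-out test features.
final_model = GradientBoostingClassifier(learning_rate=best_lr, n_estimators=100, random_state=42)
final_model.fit(X_train, y_train)
predictions = final_model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))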
Trees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini
or to minimize the loss. Initially, such as in the case of AdaBoost, very short decision trees were
used that only had a single split, called a decision stump. It is common to constrain the weak
learners in specific ways, such as a maximum number of layers, nodes, splits, or leaf nodes.
6.3 Additive Model
Trees are added one at a time, and existing trees in the model are not changed. A gradient
descent procedure is used to minimize the loss when adding trees. Traditionally, gradient
descent is used to minimize a set of parameters, such as the coefficients in a regression equation
or weights in a neural network. After calculating error or loss, the weights are updated to
minimize that error. Instead of parameters, we have weak learner sub-models or, more
specifically, decision trees.
Gradient boosting is a machine learning technique for regression and classification problems
which produces a prediction model in the form of an ensemble of weak prediction
models, typically decision trees. The objective of any supervised learning algorithm is to define a
loss function and minimize it. Let us see how the maths works out for the gradient boosting algorithm.
Say we have mean squared error (MSE) as the loss, defined as:

Loss = MSE = ∑_i (y_i − y_i^p)²

where y_i is the i-th target value, y_i^p is the i-th prediction, and L(y_i, y_i^p) is the loss function.

We want our predictions to be such that our loss function (MSE) is minimal. By using gradient
descent and updating our predictions based on a learning rate, we can find the values where the
MSE is minimal:

y_i^p := y_i^p − α · ∂∑_i (y_i − y_i^p)² / ∂y_i^p

which becomes

y_i^p := y_i^p + 2α (y_i − y_i^p)

where α is the learning rate and ∑_i (y_i − y_i^p) is the sum of residuals.
So, we are basically updating the predictions such that the sum of our residuals is close to 0 (or
minimum) and predicted values are sufficiently close to actual values.
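As a concrete (purely illustrative) instance of this update, suppose a target y_i = 10, a current prediction y_i^p = 7, and a learning rate α = 0.1. One update gives y_i^p := 7 + 2 × 0.1 × (10 − 7) = 7.6, nudging the prediction toward the target; repeating the update shrinks the residual further.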
The logic behind gradient boosting is simple and can be understood intuitively, without
mathematical notation. It is assumed that the reader is familiar with simple linear regression
modelling. A basic assumption of linear regression is that the sum of its residuals is 0, i.e. the
residuals should be spread randomly around zero. Now think of these residuals as mistakes
committed by our predictor model. Although tree-based models are not based on such
assumptions, if we think logically (not statistically) about this assumption, we might argue that,
if we are able to see some pattern in the residuals around 0, we can leverage that pattern to fit a
model. So, the intuition behind the gradient boosting algorithm is to repetitively leverage the
patterns in the residuals to strengthen a model with weak predictions and make it better.
Fig. 3 shows sample random normally distributed residuals with mean around 0. Once we reach
a stage where the residuals do not have any pattern that can be modelled, we can stop modelling
residuals (otherwise it might lead to overfitting). Algorithmically, we are minimizing our loss
function such that the test loss reaches its minimum.
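To make this residual-fitting intuition concrete, here is a minimal from-scratch sketch in Python: it repeatedly fits a small regression tree (a decision stump) to the current residuals and adds its shrunken prediction to the model. The synthetic dataset, number of rounds, and learning rate are assumptions chosen purely for illustration.

# Minimal from-scratch sketch of the residual-fitting idea described above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)    # noisy synthetic target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())                   # start from the mean of the targets
trees = []

for _ in range(100):                                      # boosting rounds
    residuals = y - prediction                            # "mistakes" of the current model
    stump = DecisionTreeRegressor(max_depth=1)            # constrained weak learner (decision stump)
    stump.fit(X, residuals)                               # fit the pattern left in the residuals
    prediction += learning_rate * stump.predict(X)        # shrink and add its contribution
    trees.append(stump)

print("Final training MSE:", np.mean((y - prediction) ** 2))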
Boosting algorithms are special algorithms used to augment the existing results of a data model
and help to fix its errors. They use the concept of weak-learner-to-strong-learner conversion
through weighted average values and higher vote values for prediction.
There are three types of boosting algorithms, as follows:
9.1 AdaBoost
The basis of this algorithm is the core idea of boosting: give more weight to the misclassified
observations. In particular, AdaBoost stands for Adaptive Boosting, meaning that the meta-
learner adapts based upon the results of the weak classifiers, giving more weight to the
observations misclassified by the last weak learner.
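As a small, hedged illustration, scikit-learn's AdaBoostClassifier implements this re-weighting scheme; the dataset and parameter values below are illustrative assumptions.

# Minimal sketch of AdaBoost with scikit-learn (illustrative dataset and parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)
# Each round re-weights the training observations, emphasizing the examples
# the previous weak learner misclassified.
ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X, y)
print("Training accuracy:", ada.score(X, y))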
9.2 Gradient Boosting
Gradient Boosting is a generalization of boosting techniques in which it is possible to
optimize the meta-learner based on an arbitrary differentiable loss function. What does this mean?
It changes everything! Whereas in AdaBoost the meta-learner is optimized based on the
observations misclassified by the weak learners, in Gradient Boosting the meta-learner is
optimized (with techniques like gradient descent) based upon a loss function.
9.3 XGBoost
Here we have one of the most used boosting algorithms (if not the most used). It has gained its
fame through Kaggle competitions and their winners, thanks to the fine-tuning (and ensembling)
of these algorithms.
The major benefits of XGBoost are:
• Speed
• High scalability
• Parallelization
• It usually outperforms other algorithms
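A minimal usage sketch is shown below, assuming the xgboost Python package is installed; the parameter values are illustrative assumptions (n_jobs enables parallel tree construction, and tree_method="hist" selects the fast histogram-based algorithm that helps XGBoost scale).

# Minimal sketch of XGBoost usage (assumes the xgboost package is installed;
# dataset and parameter values are illustrative assumptions).
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, random_state=0)
model = XGBClassifier(n_estimators=200, learning_rate=0.1, n_jobs=-1, tree_method="hist")
model.fit(X, y)
print("Training accuracy:", model.score(X, y))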
FUTURE SCOPE
Machine Learning is a very active research area, and there are already several viable alternatives
to XGBoost. Microsoft Research recently released the LightGBM framework for gradient
boosting, which shows great potential. CatBoost, developed by Yandex Technology, has been
delivering impressive benchmarking results. It is only a matter of time before a model
framework appears that beats XGBoost in terms of prediction performance, flexibility, explainability,
and pragmatism. However, until a strong challenger comes along, XGBoost will
continue to reign over the machine learning landscape.
CONCLUSION
Gradient boosting models are powerful algorithms which can be used for both classification
and regression tasks. They can perform incredibly well on very complex
datasets, but they are also prone to overfitting.
GBM is not just a particular algorithm but a general technique for
building model ensembles. Moreover, this approach is sufficiently versatile and extensible: a
large number of models can be trained, taking into account various loss functions with a
range of weighting functions. Gradient boosting is thus prone to overfitting and
requires careful tuning of its hyperparameters. The gradient boosting algorithm
builds a forest of a fixed number of decision trees, which are called weak
learners or weak predictive models. These decision trees are of fixed size or depth. Gradient
boosting starts with the mean of the target values and adds the prediction/contribution from
each subsequent tree, shrinking it by what is called the learning rate.
REFERENCES
[1] R. Punmiya and S. Choe, "Energy Theft Detection Using Gradient Boosting Theft Detector
With Feature Engineering-Based Preprocessing," in IEEE Transactions on Smart Grid, vol.
10, no. 2, pp. 2326-2329, March 2019, doi: 10.1109/TSG.2019.2892595.
[2] A. Gupta, K. Gusain and B. Popli, "Verifying the value and veracity of extreme gradient
boosted decision trees on a variety of datasets," 2016 11th International Conference on
Industrial and Information Systems (ICIIS), Roorkee, 2016, pp. 457-462,
doi:10.1109/ICIINFS.2016.8262984.
[3] V. Ayumi, "Pose-based human action recognition with Extreme Gradient Boosting," 2016
IEEE Student Conference on Research and Development (SCOReD), Kuala Lumpur, 2016,
pp. 1-5, doi: 10.1109/SCORED.2016.7810099.
[5] A. B. Nassif, "Short term power demand prediction using stochastic gradient boosting,"
2016 5th International Conference on Electronic Devices, Systems and Applications
(ICEDSA), Ras Al Khaimah, 2016, pp. 1-4, doi: 10.1109/ICEDSA.2016.7818510.
[6] S. Lu, B. Wang, H. Wang and Q. Hong, "A Hybrid Collaborative Filtering Algorithm Based
on KNN and Gradient Boosting," 2018 13th International Conference on Computer Science
& Education (ICCSE), Colombo, 2018, pp. 1-5, doi: 10.1109/ICCSE.2018.8468751.
[9] J. D'Souza, "A Quick Guide to Boosting in ML," GreyAtom, Medium. Date visited: 10/11/2020.
[10] "Gradient Boosting – A Concise Introduction from Scratch," Machine Learning Plus
(machinelearningplus.com). Date visited: 21/11/2020.