XGBoost
Dimensionality Reduction
This happens because the model is trying too hard to capture the noise in the training dataset.
By noise we mean data points that do not represent the true properties of your data,
but rather random chance.
Learning such data points makes models more flexible, at the risk of overfitting.
This minimum gain can be set to any value in the range (0, ∞).
The commonly used regularisation techniques are:
L1 regularisation
L2 regularisation
Dropout regularisation
Lasso Regression adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function (L).
Ridge Regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function (L).
NOTE that during regularisation the output function (y_hat) does not change. The change is only in the loss function.
Regularisation
The loss function before regularisation: prediction cost
The loss function after regularisation: prediction cost + regularization cost
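As a sketch in formulas (the symbols ℓ for the per-sample prediction cost, w for the model coefficients, and λ for the regularisation strength are notation assumptions, not taken from the slides):

L = \sum_i \ell(y_i, \hat{y}_i)                                  (prediction cost only)
L = \sum_i \ell(y_i, \hat{y}_i) + \lambda \sum_j |w_j|           (prediction cost + L1 / Lasso penalty)
L = \sum_i \ell(y_i, \hat{y}_i) + \lambda \sum_j w_j^2           (prediction cost + L2 / Ridge penalty)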
Output: y ∈ R^(768 × 1)
Whether he / she has diabetes (0 or 1).
Code
Around 30 lines (including some pre-processing).
It can be used as a basis when working with other models (such as deep learning, which you will learn later).
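A minimal sketch of what such code could look like, assuming the Pima Indians diabetes data is stored in a CSV file named diabetes.csv with the outcome in the last column (the file name and column layout are assumptions here):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load the data (file name and column layout are assumed).
data = pd.read_csv("diabetes.csv")
X = data.iloc[:, :-1].values   # clinical input features
y = data.iloc[:, -1].values    # 0 = no diabetes, 1 = diabetes

# Simple pre-processing: hold out a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit an XGBoost classifier with default hyperparameters.
model = XGBClassifier(eval_metric="logloss")
model.fit(X_train, y_train)

# Evaluate on the held-out samples.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))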
XGBoost: What Is It?
XGBoost stands for eXtreme Gradient Boosting.
There are, however, differences in the modeling details. Specifically, XGBoost uses a
more regularized model formalization to control over-fitting, which gives it better
performance.
Parallelization of tree construction using all of your CPU cores during training.
Distributed Computing for training very large models using a cluster of machines.
Out-of-Core Computing for very large datasets that don’t fit into memory.
A design goal was to make the best use of available resources to train the model.
Continued Training so that you can further boost an already fitted model on
new data.
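As a small illustration of the continued-training feature, here is a sketch using the native xgboost API (the data and the round counts are placeholders):

import numpy as np
import xgboost as xgb

# Placeholder data just to make the sketch self-contained.
X, y = np.random.rand(200, 5), np.random.rand(200)
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "reg:squarederror", "max_depth": 3}

# First round of boosting.
booster = xgb.train(params, dtrain, num_boost_round=50)

# Later: continue boosting the already fitted model (e.g. on new data)
# by passing it in via xgb_model.
booster = xgb.train(params, dtrain, num_boost_round=50, xgb_model=booster)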
Let us look at an example comparing an untuned XGBoost model with a tuned
XGBoost model based on their RMSE scores (a sketch follows below).
Later, you will learn about the description of the XGBoost hyperparameters.
It is available at:
http://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters
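A sketch of such an untuned-versus-tuned comparison (the dataset, the tuned parameter values, and the resulting scores are illustrative assumptions, not results from these slides):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Untuned model: library defaults.
untuned = XGBRegressor(random_state=0)
untuned.fit(X_train, y_train)
rmse_untuned = np.sqrt(mean_squared_error(y_test, untuned.predict(X_test)))

# "Tuned" model: example hyperparameter values, not the slides' actual settings.
tuned = XGBRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8, random_state=0,
)
tuned.fit(X_train, y_train)
rmse_tuned = np.sqrt(mean_squared_error(y_test, tuned.predict(X_test)))

print(f"RMSE untuned: {rmse_untuned:.3f}")
print(f"RMSE tuned:   {rmse_tuned:.3f}")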
Increasing this value (max_depth, the maximum depth of a tree) will make the model more complex and more likely to
overfit.
Default = 6.
This comes into play every time a new level of depth is reached in a tree.
The trees learn how to optimise for the target variable using a different set of features.
So, if you have enough data you can try tuning colsample parameters!
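A minimal sketch of how these hyperparameters can be set on an XGBoost model (the specific values are illustrative, not recommendations):

from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=6,            # deeper trees = more complex model, more likely to overfit (default 6)
    gamma=0.1,              # minimum gain required to make a further split; range [0, inf)
    colsample_bytree=0.8,   # fraction of features sampled for each tree
    colsample_bylevel=0.8,  # fraction of features sampled at every new depth level in a tree
)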
The increasing value of eta gives a better model.
What Makes XGBoost so Popular?
Speed and performance.
Core algorithm is parallelizable.
Consistently outperforms single-algorithm methods.
State-of-the-art performance in many ML tasks.
Out-of-Core computing (large datasets that do not fit in memory).
Computer vision.
When the number of training samples is significantly smaller than the number
of features.
[Figure: feature values grouped into histogram bins (33 bins shown); k = number of bins per feature]
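A short sketch, assuming this refers to XGBoost's histogram-based split finding, where max_bin controls how many bins (k) each continuous feature is bucketed into (the value below is illustrative):

from xgboost import XGBClassifier

# Histogram-based tree construction: feature values are bucketed into
# at most max_bin bins before candidate splits are evaluated.
model = XGBClassifier(tree_method="hist", max_bin=33)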
Curse of Dimensionality
The curse of dimensionality is the phenomenon
whereby an increase in the dimensionality of a data
set results in exponentially more data being required
to produce a representative sample of that data set.
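As a small illustration of this exponential growth (covering the unit cube with 10 cells per axis is an arbitrary choice for the example):

# Cells needed to cover [0, 1]^d at a resolution of 10 cells per axis:
# the requirement grows as 10^d with the dimension d.
for d in (1, 2, 3, 10):
    print(f"d = {d:2d}: {10 ** d:,} cells")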
You might ask the question, “How do you take all of the variables you have
collected and focus on only a few of them?”
In technical terms, you want to “reduce the dimension of your feature space.”
Ways:
Feature Elimination
Feature Extraction
But — and here’s the kicker — because these new independent variables are
combinations of our old ones,
it is still possible to keep the most valuable parts of our old variables,
even when one or more of these “new” variables are dropped!
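A minimal feature-extraction sketch using PCA (the choice of PCA and of keeping 2 components are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 100 samples with 5 original features, one of them redundant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)

# Each new variable (principal component) is a combination of the old ones.
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)

print("Shape after extraction:", X_new.shape)
print("Share of variance kept:", pca.explained_variance_ratio_.sum())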