Imbalanced Data: an extensive guide on how to deal with imbalanced classification problems

Lavinia Guadagnolo · Published in Eni digiTALKS · 9 min read · May 3, 2022


An in-depth analysis of data-level, algorithm-level, and hybrid approaches to face imbalanced classification problems.


Imbalanced classification problems and the accuracy paradox


Anyone who is familiar with machine learning has certainly come across the
problem of imbalanced classification. By definition, imbalanced
classification occurs when one or more classes have very low proportions in
the training data as compared to the other classes. When the distribution of
examples is uneven by a large amount (a ratio of 1:100 or worse), there is a severe imbalance.

Going into details, there are two main aspects in imbalanced classification
problems: minority interest and rare instances. Minority interest refers to
the fact that rare instances are the most interesting ones: in problems such
as fraud detection or churn prevention, the minority class is the one to be
addressed. On the other hand, rarity refers to the fact that data belonging to
a particular class are represented with low proportions with respect to the
other classes. Most imbalanced classification problems arise from a
combination of these two factors, such as predicting rare diseases. In these
situations, most common machine learning algorithms fail, since they are
designed to maximize accuracy. In imbalanced classification problems,
models can have high accuracy while performing poorly on the minority
class. So how can we face this challenge?

There are several approaches, which can be classified into three categories:

• data-level methods:
- oversampling techniques
- undersampling techniques

• algorithm-level methods

• hybrid methods.
Choosing the right approach is crucial: a wrong choice might cause
information loss or lead to overfitting.

In this article we will focus on different approaches to avoid these risks.


First, we will present SMOTE and its extensions, techniques aimed at
resampling the dataset; then we will give a brief overview of two families of
algorithms that are suitable for imbalanced datasets, and finally we will
present possible hybrid approaches combining the previous ones.

1. Data-level methods
Data-level approaches aim at rebalancing the training dataset before
applying machine learning algorithms. This can be done in two different
ways:

• Oversampling: creating new instances of the minority class

• Undersampling: deleting instances of the majority class

The two approaches are explained in Figure 1.

Figure 1 — Undersampling and oversampling effects in rebalancing.


There are several data-level methods, ranging from random oversampling /
undersampling to more complex approaches. In the following paragraphs
we will focus on SMOTE and its extensions.

1.1 SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling
technique that was first presented in 2002 (Nitesh V. Chawla, 2002). It is
based on the idea of generating new samples of the minority class by linear
interpolation between existing ones.

Let’s see this approach in detail:

• A minority class instance xᵢ is selected as the root sample for new synthetic samples

• xᵢ's k nearest neighbors are obtained

• n of the k instances are randomly chosen to compute the new instances by interpolation

• The difference between xᵢ and each selected neighbor is considered

• n synthesized samples are generated by the following formula:

x_new = xᵢ + rand(j) × (x̂ − xᵢ)

where x̂ is the selected neighbor, rand(j) is a uniformly distributed random
variable from (0,1) drawn for the j-th feature, and n is the amount of
oversampling.
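As a sketch of the steps above, here is a minimal pure-Python implementation on 2-D tuples (a toy nearest-neighbour search; the function names are ours, not from any library):

```python
import random
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote_samples(minority, k=3, n=2, seed=0):
    """Generate n synthetic samples per minority instance by
    interpolating towards one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: euclidean(x, p),
        )[:k]
        for _ in range(n):
            nb = rng.choice(neighbours)
            # per-feature interpolation: x + rand(0,1) * (neighbour - x)
            synthetic.append(tuple(
                xf + rng.random() * (nf - xf) for xf, nf in zip(x, nb)
            ))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new = smote_samples(minority, k=2, n=1)
print(len(new))  # → 4, one synthetic point per root sample
```

Because each new point is an interpolation between two existing minority points, it always falls inside the region already occupied by the minority class, which is exactly what the plots below show.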
Figure 2 — Example of imbalanced dataset before and after SMOTE is applied. Python code for
implementation source: https://machinelearningmastery.com

As it’s apparent from the plots, after SMOTE application the two classes are
more balanced, and the new generated samples are close to the original ones
from the minority class.

1.2 SMOTE Extensions


At first sight, focusing on the first plot, we can identify different areas.

Figure 3 — Example of imbalanced dataset highlighting different regions


• the green circled area, in which most of the instances belong to the
minority class

• the yellow circled area, where instances of the majority and the minority
class can be found in almost equal proportions

• the red circled area, where most of the points belong to the majority
class.

We define the second one as the borderline area. Instances located in this
area are the most difficult to predict. Therefore, some extensions to SMOTE
have been developed, based on the idea of selecting target samples, before
generating new ones.

1.2.1 Borderline SMOTE

Borderline SMOTE is one of these extensions, which focuses on the "danger"
region. The approach is as follows:

• A minority class instance xᵢ is selected

• xᵢ's k nearest neighbours (over the whole dataset) are obtained

• The number of majority samples (k′) among the nearest neighbors of xᵢ
is calculated:
- If k′ = k → the sample is considered as NOISE
- If k/2 ≤ k′ < k, the number of majority neighbors is higher than the
number of minority neighbors → the sample is considered as DANGER
- If 0 ≤ k′ < k/2, the number of majority neighbors is lower than the
number of minority neighbors → the sample is considered as SAFE.

Let pnum be the number of minority instances and dnum be the number of
instances in danger:

• For each sample in danger:

- A minority class instance xᵢ′ in the danger set is selected as root sample
for new synthetic samples
- xᵢ′'s k nearest neighbors among the minority class are obtained
- s of the k instances are randomly chosen to compute the new instances
by interpolation
- The difference difⱼ between xᵢ′ and each selected neighbor is considered
- s × dnum synthesized samples in total are generated by the following formula:

hⱼ = xᵢ′ + rⱼ × difⱼ , j = 1, …, s

where difⱼ is the difference between xᵢ′ and its j-th nearest neighbor, rⱼ is a
uniformly distributed random variable from (0,1), s is the amount of
oversampling and k is the number of nearest neighbors. The result of
borderline SMOTE is shown in the following plots:

Figure 4 — Example of imbalanced dataset before and after borderline SMOTE is applied
As in SMOTE, new samples are generated by interpolation (oversampling).
In contrast with the previous approach, however, they are placed almost
exclusively along the borderline area. In this way more samples are
generated in the most critical region, while the areas defined as "safe" or
"noise" are ignored.

1.2.2 ADASYN

However, in some cases, the noise area needs to be noted and addressed.
The ADASYN approach (Haibo He) focuses on this problem. The idea behind
this approach is to generate more minority samples (oversampling) in areas
where their density is lower. The procedure is as follows:

Let mₛ be the number of minority class examples, mₗ the number of
majority class examples, d_th the threshold for the maximum tolerated
degree of class imbalance, and β a parameter used to specify the desired
balance.

• Calculate the degree of class imbalance d = mₛ / mₗ

• If d < d_th:
- Calculate the number of synthetic data examples that need to be
generated: G = (mₗ − mₛ) × β
- For each example xᵢ in the minority class, find its K nearest neighbors and
calculate rᵢ = Δᵢ / K, i = 1, …, mₛ, where Δᵢ is the number of examples
among the K nearest neighbors that belong to the majority class
- Normalize the ratios so that r̂ᵢ = rᵢ / Σᵢ rᵢ is a density distribution
(Σᵢ r̂ᵢ = 1)
- Calculate the number of synthetic data examples that need to be
generated for each xᵢ: gᵢ = r̂ᵢ × G
- For each minority class example xᵢ, generate gᵢ synthetic samples by
interpolation as in SMOTE.


Figure 5 — Example of imbalanced dataset before and after Adasyn is applied

According to this approach, examples with the most class overlap have the
most focus. Hence, on problems where these low-density examples might be
outliers, the ADASYN approach may put too much attention on these areas
of the feature space, which may result in worse model performance. It may
help to remove outliers prior to applying the oversampling procedure, and
this might be a helpful heuristic to use more generally.
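The allocation step of ADASYN (deciding how many samples each minority point receives) can be sketched in a few lines; this is a toy illustration with names of our own choosing, not a reference implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adasyn_allocation(minority, majority, K=3, beta=1.0):
    """Return g_i, the number of synthetic samples to generate for
    each minority point, proportional to local majority density."""
    G = (len(majority) - len(minority)) * beta
    everyone = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    r = []
    for x in minority:
        neighbours = sorted(
            (pc for pc in everyone if pc[0] != x),
            key=lambda pc: euclidean(x, pc[0]),
        )[:K]
        delta = sum(c for _, c in neighbours)  # majority neighbours
        r.append(delta / K)
    total = sum(r) or 1.0                      # guard against all-zero ratios
    r_hat = [ri / total for ri in r]           # normalized density distribution
    return [round(rh * G) for rh in r_hat]

minority = [(0.0, 0.0), (3.0, 3.0)]
majority = [(0.1, 0.0), (0.2, 0.0), (3.1, 3.0),
            (3.2, 3.0), (3.3, 3.0), (3.05, 3.1)]
print(adasyn_allocation(minority, majority, K=3, beta=1.0))  # → [2, 2]
```

The point surrounded by more majority neighbours gets a larger share of G, which is exactly the "focus on low-density areas" behaviour described above.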

Finally, an interesting integration with undersampling techniques can
further improve dataset rebalancing.

1.2.3 Integration with Tomek Link

Tomek link (Luo Ruisen) is an interesting undersampling technique which
limits information loss. The approach is as follows:

• Two instances xᵢ, xⱼ belonging to different classes form a Tomek link if
there is no instance xₖ such that d(xᵢ, xₖ) < d(xᵢ, xⱼ) or d(xⱼ, xₖ) < d(xⱼ, xᵢ),
i.e. if they are each other's nearest neighbor

• For each Tomek link, the instance belonging to the majority class is
removed.

Therefore, instances belonging to the majority class, which are close to the
ones belonging to the minority class (and might be misinterpreted) are
removed.

Figure 6 — Example of Tomek link application on an imbalanced dataset

Integrating this undersampling technique with SMOTE can be beneficial in
rebalancing the dataset, since noise is reduced and new samples are created.
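A minimal sketch of Tomek link removal on 2-D tuples (our own toy helper, not a library function; real implementations index neighbours far more efficiently):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tomek_undersample(minority, majority):
    """Drop majority points that form a Tomek link with a minority
    point, i.e. opposite-class pairs of mutual nearest neighbours."""
    def nearest(x, pool):
        return min(pool, key=lambda p: euclidean(x, p))

    keep = []
    for m in majority:
        others = minority + [p for p in majority if p != m]
        nn = nearest(m, others)
        if nn in minority:
            # is m also the nearest neighbour of nn?
            pool = majority + [p for p in minority if p != nn]
            if nearest(nn, pool) == m:
                continue  # Tomek link → remove the majority member
        keep.append(m)
    return keep

minority = [(0.0, 0.0)]
majority = [(0.1, 0.0), (5.0, 5.0), (6.0, 5.0)]
print(tomek_undersample(minority, majority))  # → [(5.0, 5.0), (6.0, 5.0)]
```

The majority point sitting right next to the minority one is removed, while majority points far from the class boundary are kept, which is why the technique loses little information.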

2. Algorithm-level approaches

After having resampled the data, it is crucial to apply the right algorithm
according to the new distribution and according to the aim of the project.
The most effective types of algorithms for imbalanced classification are
ensemble techniques. The idea behind them is to train multiple models
(weak learners) and combine their results to obtain predictions. Ensemble
algorithms improve stability and accuracy of common machine learning
algorithms.

Ensemble algorithms can be grouped in two categories: bagging and
boosting techniques.

Bagging stands for Bootstrap Aggregating. In this technique, different
training datasets are generated by random sampling with replacement from
the original dataset, so some observations may be repeated across datasets.
These new sets are used to train the same learning algorithm, producing
different classifiers. The learners are trained independently and in parallel;
the predictions are then obtained by averaging (or voting over) the results
of all the learners.
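The bagging procedure can be illustrated with a deliberately simple weak learner: a one-dimensional midpoint-threshold "stump" of our own invention (real applications would typically use decision trees):

```python
import random

def bootstrap(xs, ys, rng):
    """One bootstrap replicate: sample with replacement, resampling
    until at least one example of each class is present."""
    while True:
        idx = [rng.randrange(len(xs)) for _ in xs]
        sx, sy = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(sy)) == 2:
            return sx, sy

def train_stump(xs, ys):
    """Weak learner: threshold at the midpoint between class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (m0 + m1) / 2
    return lambda x: int((x > t) == (m1 > m0))

def bagging_predict(learners, x):
    """Majority vote over the ensemble's predictions."""
    votes = sum(f(x) for f in learners)
    return int(votes * 2 >= len(learners))

rng = random.Random(42)
xs = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 2.0, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
learners = [train_stump(*bootstrap(xs, ys, rng)) for _ in range(11)]
print(bagging_predict(learners, 0.25), bagging_predict(learners, 2.15))  # → 0 1
```

Each stump sees a slightly different bootstrap sample, so its threshold shifts; averaging the votes smooths out those fluctuations, which is the variance-reduction effect discussed below.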
Similarly, in boosting, different training datasets are generated by random
sampling with replacement but, in contrast with bagging, each classifier
considers the previous classifiers’ success: samples that are misclassified get
higher weight, so the different models are trained sequentially and
adaptively. Furthermore, in boosting techniques the final vote is obtained
considering the weighted average of the estimates.

Both techniques improve stability but, while boosting mainly aims at
reducing bias, bagging mainly reduces variance, so it can mitigate
overfitting. The table below sums up strengths and weaknesses of both
approaches.

Table 1 — Advantages and disadvantages of Algorithm based techniques

3. Hybrid approaches
When there is a severe imbalance, it is advisable to apply hybrid techniques,
combining data-level and algorithm-level approaches to leverage the
potential of both the techniques. The following pictures illustrate two
possible hybrid approaches.

Figure 7 — Hybrid approach integrating oversampling, undersampling techniques and boosting algorithm

Figure 8 — Hybrid approach integrating bagging, undersampling techniques and boosting algorithm

In Figure 7 two data-level approaches are applied to rebalance the dataset:
Borderline SMOTE for oversampling and Tomek link for undersampling.
Then XGBoost is applied and scores are obtained. In Figure 8 different
models are trained in parallel on different datasets. These are obtained by
selecting all the instances belonging to the minority class and a subset of
the instances of the majority class, so that the degree of imbalance is
reduced. After that, Tomek link is applied to further balance the dataset,
removing misleading samples, and finally an XGBoost model is trained on
each subset. The probability score is obtained by averaging the results of
the different models.
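A rough sketch of the Figure 8 scheme, with two simplifications of our own: the Tomek link cleaning step is omitted, and XGBoost is replaced by a toy midpoint-threshold learner (purely to keep the example self-contained):

```python
import random

def train_stump(xs, ys):
    """Toy learner standing in for XGBoost: midpoint threshold
    between the two class means, returning a 0/1 score."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (m0 + m1) / 2
    return lambda x: float((x > t) == (m1 > m0))

def hybrid_ensemble(min_x, maj_x, n_models=3, seed=0):
    """Each model sees the full minority class plus a random majority
    subset of equal size; final scores are averaged over the models."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = rng.sample(maj_x, len(min_x))  # undersample majority
        xs = min_x + subset
        ys = [1] * len(min_x) + [0] * len(subset)
        models.append(train_stump(xs, ys))
    return lambda x: sum(f(x) for f in models) / n_models

scorer = hybrid_ensemble([2.0, 2.2, 2.1],
                         [0.1, 0.2, 0.3, 0.15, 0.25, 0.05, 0.4, 0.35])
print(scorer(2.05), scorer(0.2))  # → 1.0 0.0
```

Every model is trained on a balanced subset, so no single learner is dominated by the majority class, and the averaged score plays the role of the ensemble probability described above.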

Conclusion
So, which sampling technique should be applied? Which algorithm should
be chosen? In this article we presented different possibilities; however,
there is no single right answer to these questions. It depends on the
problem itself and on the severity of the imbalance. We suggest a careful
preliminary and exploratory analysis, verifying the data distribution and
the presence of outliers, and investigating their meaning. After restricting
the panel of possible approaches according to these elements, it is possible
to test and compare them in order to find the most suitable one for the
problem.

References
Haibo He, Y. B. (2008). ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning. IJCNN 2008 (IEEE World Congress on Computational
Intelligence).

Luo Ruisen, D. S. (n.d.). Bagging of Xgboost Classifiers with Random Under-
sampling and Tomek Link for Noisy Label-imbalanced Data. IOP Conference
Series: Materials Science and Engineering, Chengdu, China.

Nitesh V. Chawla, K. W. (2002). SMOTE: Synthetic Minority Over-sampling
Technique. Journal of Artificial Intelligence Research, 16, 321–357.
