Imbalanced Data: an extensive guide on how to deal with imbalanced classification problems

Lavinia Guadagnolo · Published in Eni digiTALKS · 9 min read · May 3, 2022


An in-depth analysis of data-level, algorithm-level, and hybrid approaches to face imbalanced classification problems.


Imbalanced classification problems and the accuracy paradox


Anyone who is familiar with machine learning has certainly come across the
problem of imbalanced classification. By definition, imbalanced
classification occurs when one or more classes have very low proportions in
the training data as compared to the other classes. When the distribution of
examples is uneven by a large amount (a ratio of 1:100 or worse), there is a severe imbalance.

Going into details, there are two main aspects in imbalanced classification
problems: minority interest and rare instances. Minority interest refers to
the fact that rare instances are the most interesting ones: in problems such
as fraud detection or churn prevention, the minority class is the one to be
addressed. On the other hand, rarity refers to the fact that data belonging to
a particular class are represented with low proportions with respect to the
other classes. Most imbalanced classification problems arise from a
combination of these two factors, such as predicting rare diseases. In these
situations, most common machine learning algorithms fail, since they are
designed to maximize accuracy. In imbalanced classification problems,
models can have high accuracy while performing poorly on the minority
class. So how can we face this challenge?

There are several approaches, which can be classified into three categories:

• data-level methods:
- oversampling techniques
- undersampling techniques

• algorithm-level methods

• hybrid methods.
Choosing the right approach is crucial: a wrong choice might cause
information loss or lead to overfitting.

In this article we will focus on different approaches to avoid these risks.


First, we will present SMOTE and its extensions, techniques aimed at
resampling the dataset; then we will give a brief overview of two families of
algorithms that are suitable for imbalanced datasets, and finally we will
present possible hybrid approaches combining the previous ones.

1. Data-level methods
Data-level approaches aim at rebalancing the training dataset before
applying machine learning algorithms. This can be done in two different
ways:

• Oversampling: creating new instances of the minority class

• Undersampling: deleting instances of the majority class

The two approaches are explained in Figure 1.

Figure 1 — Undersampling and oversampling effects in rebalancing.


There are several data-level methods, ranging from random oversampling /
undersampling to more complex approaches. In the following paragraphs
we will focus on SMOTE and its extensions.

1.1 SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling
technique that was first presented in 2002 (Nitesh V. Chawla, 2002). It is
based on the idea of generating new samples of the minority class by linear
interpolation between existing ones.

Let’s see this approach in detail:

• A minority class instance xᵢ is selected as the root sample for new synthetic samples

• xᵢ's k nearest neighbors are obtained

• n of the k instances are randomly chosen to compute the new instances by interpolation

• The difference between xᵢ and each selected neighbor is considered

• n synthesized samples are generated by the following formula:

x_new = xᵢ + rand(j) × (x̂ − xᵢ)

where x̂ is the selected neighbor, rand(j) is a uniformly distributed random
variable from (0,1) drawn for the j-th feature, and n is the amount of
oversampling.
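As a sketch of the steps above, here is a minimal pure-Python implementation on 2-D tuples (a toy nearest-neighbour search; the function names are ours, not from any library):

```python
import random
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def smote_samples(minority, k=3, n=2, seed=0):
    """Generate n synthetic samples per minority instance by
    interpolating towards one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbours of x among the other minority points
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: euclidean(x, p),
        )[:k]
        for _ in range(n):
            nb = rng.choice(neighbours)
            # per-feature interpolation: x + rand(0,1) * (neighbour - x)
            synthetic.append(tuple(
                xf + rng.random() * (nf - xf) for xf, nf in zip(x, nb)
            ))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
new = smote_samples(minority, k=2, n=1)
print(len(new))  # → 4, one synthetic point per root sample
```

Because each new point is an interpolation between two existing minority points, it always falls inside the region already occupied by the minority class, which is exactly what the plots below show.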
Figure 2 — Example of imbalanced dataset before and after SMOTE is applied. Python code for
implementation source: https://machinelearningmastery.com

As it’s apparent from the plots, after SMOTE application the two classes are
more balanced, and the new generated samples are close to the original ones
from the minority class.

1.2 SMOTE Extensions


At first sight, focusing on the first plot, we can identify different areas.

Figure 3 — Example of imbalanced dataset highlighting different regions


• the green circled area, in which most of the instances belong to the
minority class

• the yellow circled area, where instances of the majority and the minority
class can be found in almost equal proportions

• the red circled area, where most of the points belong to the majority
class.

We define the second one as the borderline area. Instances located in this
area are the most difficult to predict. Therefore, some extensions to SMOTE
have been developed, based on the idea of selecting target samples, before
generating new ones.

1.2.1 Borderline SMOTE

Borderline SMOTE is one of these extensions, which focuses on the "danger"
region. The approach is as follows:

• A minority class instance xᵢ is selected

• xᵢ's k nearest neighbours (over the whole dataset) are obtained

• The number of majority samples (k′) among the nearest neighbors of xᵢ
is calculated:
- If k′ = k → the sample is considered as NOISE
- If k/2 ≤ k′ < k, the number of majority neighbors is higher than the
number of minority neighbors → the sample is considered as DANGER
- If 0 ≤ k′ < k/2, the number of majority neighbors is lower than the
number of minority neighbors → the sample is considered as SAFE.

Let pnum be the number of minority instances and dnum be the number of
instances in danger:

• For each sample in danger:

- A minority class instance xᵢ′ in the danger set is selected as root sample
for new synthetic samples
- xᵢ′'s k nearest neighbors among the minority class are obtained
- s of the k instances are randomly chosen to compute the new instances
by interpolation
- The difference difⱼ between xᵢ′ and each selected neighbor is considered
- s × dnum synthesized samples in total are generated by the following formula:

hⱼ = xᵢ′ + rⱼ × difⱼ , j = 1, …, s

where difⱼ is the difference between xᵢ′ and its j-th nearest neighbor, rⱼ is a
uniformly distributed random variable from (0,1), s is the amount of
oversampling and k is the number of nearest neighbors. The result of
borderline SMOTE is shown in the following plots:

Figure 4 — Example of imbalanced dataset before and after borderline SMOTE is applied
As in SMOTE, new samples are generated by interpolation (oversampling).
In contrast with the previous approach, however, they are placed almost
exclusively along the borderline area. In this way more samples are
generated in the most critical region, while the areas defined as "safe" or
"noise" are ignored.

1.2.2 ADASYN

However, in some cases, the noise area needs to be noted and addressed.
The ADASYN approach (Haibo He) focuses on this problem. The idea behind
this approach is to generate more minority samples (oversampling) in areas
where their density is lower. The procedure is as follows:

Let mₛ be the number of minority class examples, mₗ the number of
majority class examples, d_th the threshold for the maximum tolerated
degree of class imbalance, and β a parameter used to specify the desired
balance.

• Calculate the degree of class imbalance d = mₛ / mₗ

• If d < d_th:
- Calculate the number of synthetic data examples that need to be
generated: G = (mₗ − mₛ) × β
- For each example xᵢ in the minority class, find its K nearest neighbors and
calculate rᵢ = Δᵢ / K, i = 1, …, mₛ, where Δᵢ is the number of examples
among the K nearest neighbors that belong to the majority class
- Normalize the ratios so that r̂ᵢ = rᵢ / Σᵢ rᵢ is a density distribution
(Σᵢ r̂ᵢ = 1)
- Calculate the number of synthetic data examples that need to be
generated for each xᵢ: gᵢ = r̂ᵢ × G
- For each minority class example xᵢ, generate gᵢ synthetic samples by
interpolation as in SMOTE.


Figure 5 — Example of imbalanced dataset before and after Adasyn is applied

According to this approach, examples with the most class overlap have the
most focus. Hence, on problems where these low-density examples might be
outliers, the ADASYN approach may put too much attention on these areas
of the feature space, which may result in worse model performance. It may
help to remove outliers prior to applying the oversampling procedure, and
this might be a helpful heuristic to use more generally.
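The allocation step of ADASYN (deciding how many samples each minority point receives) can be sketched in a few lines; this is a toy illustration with names of our own choosing, not a reference implementation:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adasyn_allocation(minority, majority, K=3, beta=1.0):
    """Return g_i, the number of synthetic samples to generate for
    each minority point, proportional to local majority density."""
    G = (len(majority) - len(minority)) * beta
    everyone = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    r = []
    for x in minority:
        neighbours = sorted(
            (pc for pc in everyone if pc[0] != x),
            key=lambda pc: euclidean(x, pc[0]),
        )[:K]
        delta = sum(c for _, c in neighbours)  # majority neighbours
        r.append(delta / K)
    total = sum(r) or 1.0                      # guard against all-zero ratios
    r_hat = [ri / total for ri in r]           # normalized density distribution
    return [round(rh * G) for rh in r_hat]

minority = [(0.0, 0.0), (3.0, 3.0)]
majority = [(0.1, 0.0), (0.2, 0.0), (3.1, 3.0),
            (3.2, 3.0), (3.3, 3.0), (3.05, 3.1)]
print(adasyn_allocation(minority, majority, K=3, beta=1.0))  # → [2, 2]
```

The point surrounded by more majority neighbours gets a larger share of G, which is exactly the "focus on low-density areas" behaviour described above.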

Finally, an interesting integration with undersampling techniques can
further improve dataset rebalancing.

1.2.3 Integration with Tomek Link

Tomek link (Luo Ruisen) is an interesting undersampling technique which
limits information loss. The approach is as follows:

• Two instances xᵢ, xⱼ belonging to different classes form a Tomek link if
there is no instance xₖ such that d(xᵢ, xₖ) < d(xᵢ, xⱼ) or d(xⱼ, xₖ) < d(xⱼ, xᵢ),
i.e. if they are each other's nearest neighbor

• For each Tomek link, the instance belonging to the majority class is
removed.

Therefore, instances belonging to the majority class, which are close to the
ones belonging to the minority class (and might be misinterpreted) are
removed.

Figure 6 — Example of Tomek link application on an imbalanced dataset

Integrating this undersampling technique with SMOTE can be beneficial in
rebalancing the dataset, since noise is reduced and new samples are created.
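A minimal sketch of Tomek link removal on 2-D tuples (our own toy helper, not a library function; real implementations index neighbours far more efficiently):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def tomek_undersample(minority, majority):
    """Drop majority points that form a Tomek link with a minority
    point, i.e. opposite-class pairs of mutual nearest neighbours."""
    def nearest(x, pool):
        return min(pool, key=lambda p: euclidean(x, p))

    keep = []
    for m in majority:
        others = minority + [p for p in majority if p != m]
        nn = nearest(m, others)
        if nn in minority:
            # is m also the nearest neighbour of nn?
            pool = majority + [p for p in minority if p != nn]
            if nearest(nn, pool) == m:
                continue  # Tomek link → remove the majority member
        keep.append(m)
    return keep

minority = [(0.0, 0.0)]
majority = [(0.1, 0.0), (5.0, 5.0), (6.0, 5.0)]
print(tomek_undersample(minority, majority))  # → [(5.0, 5.0), (6.0, 5.0)]
```

The majority point sitting right next to the minority one is removed, while majority points far from the class boundary are kept, which is why the technique loses little information.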

2. Algorithm-level approaches

After having resampled the data, it is crucial to apply the right algorithm
according to the new distribution and according to the aim of the project.
The most effective types of algorithms for imbalanced classification are
ensemble techniques. The idea behind them is to train multiple models
(weak learners) and combine their results to obtain predictions. Ensemble
algorithms improve stability and accuracy of common machine learning
algorithms.

Ensemble algorithms can be grouped in two categories: bagging and
boosting techniques.

Bagging stands for Bootstrap Aggregating. In this technique, different
training datasets are generated by random sampling with replacement from
the original dataset, so some observations may be repeated across datasets.
These new sets are used to train the same learning algorithm, producing
different classifiers. The learners are trained independently and in parallel;
the predictions are then obtained by averaging (or voting over) the results
of all the learners.
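The bagging procedure can be illustrated with a deliberately simple weak learner: a one-dimensional midpoint-threshold "stump" of our own invention (real applications would typically use decision trees):

```python
import random

def bootstrap(xs, ys, rng):
    """One bootstrap replicate: sample with replacement, resampling
    until at least one example of each class is present."""
    while True:
        idx = [rng.randrange(len(xs)) for _ in xs]
        sx, sy = [xs[i] for i in idx], [ys[i] for i in idx]
        if len(set(sy)) == 2:
            return sx, sy

def train_stump(xs, ys):
    """Weak learner: threshold at the midpoint between class means."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (m0 + m1) / 2
    return lambda x: int((x > t) == (m1 > m0))

def bagging_predict(learners, x):
    """Majority vote over the ensemble's predictions."""
    votes = sum(f(x) for f in learners)
    return int(votes * 2 >= len(learners))

rng = random.Random(42)
xs = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 2.0, 2.2]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
learners = [train_stump(*bootstrap(xs, ys, rng)) for _ in range(11)]
print(bagging_predict(learners, 0.25), bagging_predict(learners, 2.15))  # → 0 1
```

Each stump sees a slightly different bootstrap sample, so its threshold shifts; averaging the votes smooths out those fluctuations, which is the variance-reduction effect discussed below.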
Similarly, in boosting, different training datasets are generated by random
sampling with replacement but, in contrast with bagging, each classifier
considers the previous classifiers’ success: samples that are misclassified get
higher weight, so the different models are trained sequentially and
adaptively. Furthermore, in boosting techniques the final vote is obtained
considering the weighted average of the estimates.

Both techniques improve stability but, while boosting mainly aims at
reducing bias, bagging mainly reduces variance, so it can mitigate
overfitting. The table below sums up strengths and weaknesses of both
approaches.

Table 1 — Advantages and disadvantages of Algorithm based techniques

3. Hybrid approaches
When there is a severe imbalance, it is advisable to apply hybrid techniques,
combining data-level and algorithm-level approaches to leverage the
potential of both the techniques. The following pictures illustrate two
possible hybrid approaches.

Figure 7 — Hybrid approach integrating oversampling, undersampling techniques and boosting algorithm

Figure 8 — Hybrid approach integrating bagging, undersampling techniques and boosting algorithm

In Figure 7 two data-level approaches are applied to rebalance the dataset:
Borderline SMOTE for oversampling and Tomek link for undersampling.
Then XGBoost is applied and scores are obtained. In Figure 8 different
models are trained in parallel on different datasets. These are obtained by
selecting all the instances belonging to the minority class and a subset of
the instances of the majority class, so that the degree of imbalance is
reduced. After that, Tomek link is applied to further balance the dataset,
removing misleading samples, and finally an XGBoost model is trained on
each subset. The probability score is obtained by averaging the results of
the different models.
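A rough sketch of the Figure 8 scheme, with two simplifications of our own: the Tomek link cleaning step is omitted, and XGBoost is replaced by a toy midpoint-threshold learner (purely to keep the example self-contained):

```python
import random

def train_stump(xs, ys):
    """Toy learner standing in for XGBoost: midpoint threshold
    between the two class means, returning a 0/1 score."""
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    t = (m0 + m1) / 2
    return lambda x: float((x > t) == (m1 > m0))

def hybrid_ensemble(min_x, maj_x, n_models=3, seed=0):
    """Each model sees the full minority class plus a random majority
    subset of equal size; final scores are averaged over the models."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        subset = rng.sample(maj_x, len(min_x))  # undersample majority
        xs = min_x + subset
        ys = [1] * len(min_x) + [0] * len(subset)
        models.append(train_stump(xs, ys))
    return lambda x: sum(f(x) for f in models) / n_models

scorer = hybrid_ensemble([2.0, 2.2, 2.1],
                         [0.1, 0.2, 0.3, 0.15, 0.25, 0.05, 0.4, 0.35])
print(scorer(2.05), scorer(0.2))  # → 1.0 0.0
```

Every model is trained on a balanced subset, so no single learner is dominated by the majority class, and the averaged score plays the role of the ensemble probability described above.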

Conclusion
So, which sampling technique should be applied? Which algorithm should
be chosen? In this article we presented different possibilities; however,
there is no single right answer to these questions. It depends on the
problem itself and on the severity of the imbalance. We suggest a careful
preliminary and exploratory analysis, verifying the data distribution and
the presence of outliers, and investigating their meaning. After restricting
the panel of possible approaches according to these elements, it is possible
to test and compare them in order to find the most suitable one for the
problem.

References
Haibo He, Y. B. (2008). ADASYN: Adaptive Synthetic Sampling Approach for
Imbalanced Learning. IJCNN 2008 (IEEE World Congress on Computational
Intelligence).

Luo Ruisen, D. S. (n.d.). Bagging of Xgboost Classifiers with Random Under-
sampling and Tomek Link for Noisy Label-imbalanced Data. IOP Conference
Series: Materials Science and Engineering, Chengdu, China.

Nitesh V. Chawla, K. W. (2002). SMOTE: Synthetic Minority Over-sampling
Technique. Journal of Artificial Intelligence Research, 16, 321–357.
