Going into details, there are two main aspects in imbalanced classification
problems: minority interest and rare instances. Minority interest refers to
the fact that rare instances are the most interesting ones: in problems such
as fraud detection or churn prevention, the minority class is the one to be
addressed. On the other hand, rarity refers to the fact that data belonging to
a particular class are represented with low proportions with respect to the
other classes. Most imbalanced classification problems arise from a combination of these two factors, as in predicting rare diseases. In such situations, most common machine learning algorithms fail, since they are designed to maximize overall accuracy. In imbalanced classification problems, models can reach high accuracy while performing poorly on the minority class. So how can we face this challenge?
There are three main families of approaches:
• data-level methods,
• algorithm-based approaches,
• hybrid methods.
Choosing the right approach is crucial: a wrong choice might bring information loss or lead to overfitting.
1. Data-level methods
Data-level approaches aim at rebalancing the training dataset before
applying machine learning algorithms. This can be done in two different ways:
• oversampling, i.e. generating new instances of the minority class,
• undersampling, i.e. removing instances of the majority class.
1.1 SMOTE
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique first presented in 2002 (Nitesh V. Chawla, 2002). It is based on the idea of generating new samples of the minority class by linear combination of existing ones.
New samples are computed feature by feature as

x_new⁽ʲ⁾ = xᵢ⁽ʲ⁾ + u⁽ʲ⁾ · (x̃⁽ʲ⁾ − xᵢ⁽ʲ⁾)

where x̃ is one of the k nearest minority-class neighbors of xᵢ, u⁽ʲ⁾ is a uniformly distributed random variable from (0,1) for the j-th feature, and N is the amount of oversampling.
Figure 2 — Example of imbalanced dataset before and after SMOTE is applied (Python implementation from https://machinelearningmastery.com)
As it’s apparent from the plots, after SMOTE application the two classes are
more balanced, and the new generated samples are close to the original ones
from the minority class.
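The linear combination described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a purely numeric feature matrix; the function name `smote`, its parameters and the toy data are ours:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples: pick a root sample,
    pick one of its k nearest minority neighbours, and interpolate."""
    rng = np.random.default_rng(seed)
    n, p = X_min.shape
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours per sample
    synthetic = np.empty((n_new, p))
    for t in range(n_new):
        i = rng.integers(n)                    # root sample
        j = nn[i, rng.integers(k)]             # one of its neighbours
        u = rng.random(p)                      # U(0,1) draw per feature
        synthetic[t] = X_min[i] + u * (X_min[j] - X_min[i])
    return synthetic

rng = np.random.default_rng(1)
X_min = rng.normal(3.0, 0.5, (4, 2))           # a small minority class
X_syn = smote(X_min, n_new=16, k=3)
print(X_syn.shape)  # (16, 2)
```

Because every synthetic feature is an interpolation between two minority points, the new samples always stay inside the minority class's bounding box, which is exactly the "close to the original ones" behavior visible in the plots.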
Looking more closely at the feature space, two areas can be distinguished:
• the yellow circled area, where we can find instances of the majority and the minority class in almost equal proportions,
• and the red circled area, where most of the points belong to the majority class.
We define the second one as the borderline area. Instances located in this area are the most difficult to predict. Therefore, some extensions to SMOTE have been developed, based on the idea of selecting target samples before generating new ones.
1.2.1 Borderline SMOTE
Let pnum be the number of minority instances and dnum be the number of instances in danger:
- A minority-class instance xᵢ′ is selected as the root sample for new synthetic samples
- xᵢ′'s k nearest neighbors are obtained
- s × dnum of the k instances are randomly chosen to compute the new instances by interpolation
- The difference between xᵢ′ and the selected neighbors is considered
- s × dnum synthesized samples are generated by the following formula:

x_new = xᵢ′ + rⱼ · difⱼ

where difⱼ is the difference between xᵢ′ and its j-th nearest neighbor, rⱼ is a uniformly distributed random variable from (0,1), s is the amount of oversampling and k is the number of nearest neighbors. The result of borderline SMOTE is shown in the following plots:
Figure 4 — Example of imbalanced dataset before and after borderline SMOTE is applied
Similarly to SMOTE, the technique for generating new samples is a linear
combination (oversampling technique). But, in contrast with the previous
approach, new samples are generated almost only along the borderline area.
In this way more samples are generated in the most critical region, ignoring
the areas defined as “safe” or “noise”.
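The selection step can be sketched as follows, assuming minority labels are 1. Each minority instance is marked "safe", "danger" or "noise" from the share of majority points among its m nearest neighbors (the thresholds follow the usual Borderline SMOTE convention; the function name and the toy data are illustrative):

```python
import numpy as np

def label_minority(X, y, m=5):
    """Classify each minority sample (y == 1) as 'safe', 'danger' or
    'noise' from the share of majority points among its m nearest
    neighbours, as in the borderline selection step."""
    labels = {}
    for i in np.where(y == 1)[0]:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the point itself
        n_maj = int(np.sum(y[np.argsort(d)[:m]] == 0))
        if n_maj == m:
            labels[i] = "noise"            # surrounded by the majority class
        elif n_maj >= m / 2:
            labels[i] = "danger"           # borderline instance
        else:
            labels[i] = "safe"
    return labels

# toy data: a majority run, a minority run, a boundary point (index 6)
# and a minority point lost inside the majority region (index 11)
X = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [0.6, 0.0], [0.8, 0.0],
              [1.0, 0.0], [1.15, 0.0], [1.4, 0.0], [1.6, 0.0], [1.8, 0.0],
              [2.0, 0.0], [0.41, 0.1]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
labels = label_minority(X, y, m=5)
print(labels[6], labels[8], labels[11])  # danger safe noise
```

Only the instances labeled "danger" are then used as root samples for oversampling, which is why the new points concentrate along the borderline.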
1.2.2 ADASYN
However, in some cases the noise area needs to be noted and addressed. The ADASYN approach (Haibo He) focuses on this problem. The idea behind this approach is to generate more minority samples (oversampling) in areas where their density is lower. The procedure is the following:
- Calculate the degree of class imbalance d = mₛ/mₗ, where mₛ and mₗ are the numbers of minority and majority class examples; proceed if d < d_th, a preset threshold
- Calculate the total number of synthetic data examples that need to be generated: G = (mₗ − mₛ) × β, where β ∈ [0,1] specifies the desired balance level
- For each example xᵢ in the minority class, find its K nearest neighbors and calculate rᵢ = Δᵢ/K, i = 1, …, mₛ, where Δᵢ is the number of examples among the K nearest neighbors that belong to the majority class
- Normalize r̂ᵢ = rᵢ / Σᵢ rᵢ, so that r̂ᵢ is a density distribution
- Calculate the number of synthetic data examples that need to be generated for each xᵢ: gᵢ = r̂ᵢ × G
According to this approach, examples with the most class overlap have the
most focus. Hence, on problems where these low-density examples might be
outliers, the ADASYN approach may put too much attention on these areas
of the feature space, which may result in worse model performance. It may
help to remove outliers prior to applying the oversampling procedure, and
this might be a helpful heuristic to use more generally.
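As a sketch of how the ADASYN budget is distributed, the snippet below computes r̂ᵢ and the per-instance counts gᵢ on toy data. The function name and the data are ours, and real implementations (such as imbalanced-learn's ADASYN) handle edge cases this sketch ignores:

```python
import numpy as np

def adasyn_budget(X, y, beta=1.0, K=5):
    """Return g_i, the number of synthetic samples ADASYN assigns to
    each minority instance (y == 1): more where majority neighbours
    dominate, i.e. where minority density is lower."""
    idx_min = np.where(y == 1)[0]
    m_s, m_l = len(idx_min), int(np.sum(y == 0))
    G = (m_l - m_s) * beta                  # total synthetic samples
    r = np.empty(len(idx_min))
    for t, i in enumerate(idx_min):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nn = np.argsort(d)[:K]
        r[t] = np.sum(y[nn] == 0) / K       # share of majority neighbours
    r_hat = r / r.sum()                     # normalise to a density
    return np.rint(r_hat * G).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (12, 2)),   # majority class
               rng.normal(2.5, 0.6, (4, 2))])   # minority class
y = np.array([0] * 12 + [1] * 4)
g = adasyn_budget(X, y, beta=1.0, K=5)
print(g.sum())  # close to G = (12 - 4) * 1.0 = 8
```

Minority points with many majority neighbors get a larger share of the budget, which is precisely why outliers can attract too much attention.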
Undersampling techniques work in the opposite direction: instances belonging to the majority class that are close to the ones belonging to the minority class (and might be misinterpreted) are removed.
2. Algorithm-based approaches
After having resampled the data, it is crucial to apply the right algorithm
according to the new distribution and according to the aim of the project.
The most effective types of algorithms for imbalanced classification are
ensemble techniques. The idea behind them is to train multiple models (weak learners) and combine their results to obtain predictions. Ensemble algorithms improve the stability and accuracy of common machine learning algorithms.
The two main ensemble strategies are bagging, which trains learners in parallel on bootstrap samples of the data and aggregates their predictions, and boosting, which trains learners sequentially, each one focusing on the errors of the previous. Both techniques increase stability but, while boosting mainly aims at reducing bias, bagging mainly reduces variance, so it may solve the overfitting problem. The table below sums up the strengths and weaknesses of both approaches.
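To make the ensemble idea concrete, here is a minimal bagging sketch in NumPy: decision stumps (one-feature threshold classifiers) are trained on bootstrap samples and combined by majority vote. All names and the toy data are illustrative, not a reference implementation:

```python
import numpy as np

def stump_fit(X, y, rng):
    """Fit a one-feature threshold classifier on a bootstrap sample."""
    idx = rng.integers(len(X), size=len(X))      # sample with replacement
    Xb, yb = X[idx], y[idx]
    best = None
    for f in range(X.shape[1]):                  # try every feature,
        for thr in np.unique(Xb[:, f]):          # threshold and direction
            for sign in (1, -1):
                pred = (sign * (Xb[:, f] - thr) > 0).astype(int)
                acc = np.mean(pred == yb)
                if best is None or acc > best[0]:
                    best = (acc, f, thr, sign)
    return best[1:]

def stump_predict(model, X):
    f, thr, sign = model
    return (sign * (X[:, f] - thr) > 0).astype(int)

def bagging_predict(models, X):
    """Majority vote over the ensemble's predictions."""
    votes = np.mean([stump_predict(m, X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
models = [stump_fit(X, y, rng) for _ in range(25)]
acc = np.mean(bagging_predict(models, X) == y)
print(round(acc, 2))
```

Each stump alone is a weak learner; averaging 25 of them trained on different bootstrap samples is what reduces variance, which is the bagging effect described above.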
3. Hybrid approaches
When there is a severe imbalance, it is advisable to apply hybrid techniques, combining data-level and algorithm-level approaches to leverage the potential of both. The following pictures illustrate two possible hybrid approaches.
Figure 7 — Hybrid approach integrating oversampling, undersampling techniques and boosting algorithm
Figure 8 — Hybrid approach integrating bagging, undersampling techniques and boosting algorithm
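As a minimal sketch of the hybrid idea, the snippet below prepares a balanced bootstrap sample (random undersampling of the majority class, a data-level step) of the kind that could feed each learner of a bagging ensemble (an algorithm-level step). The function name and the data are illustrative:

```python
import numpy as np

def balanced_bootstrap(X, y, rng):
    """Draw a bootstrap sample in which the majority class (y == 0) is
    randomly undersampled to the size of the minority class (y == 1)."""
    idx_min = np.where(y == 1)[0]
    idx_maj = np.where(y == 0)[0]
    take_min = rng.choice(idx_min, size=len(idx_min), replace=True)
    take_maj = rng.choice(idx_maj, size=len(idx_min), replace=True)
    idx = np.concatenate([take_min, take_maj])
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)       # 9:1 imbalance
Xb, yb = balanced_bootstrap(X, y, rng)
print(np.bincount(yb))  # [10 10]
```

Repeating this draw for every ensemble member exposes each learner to a balanced view of the data while the ensemble as a whole still sees most of the majority class.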
Conclusion
So, which sampling technique should we apply? Which algorithm should we choose? In this article we presented different possibilities; however, there is no single right answer to these questions. It depends on the problem itself and on the severity of the imbalance. A careful preliminary and exploratory analysis is recommended: verify the data distribution, check for outliers and investigate their meaning. After restricting the panel of possible approaches according to these elements, it is possible to test and compare them, in order to find the most suitable one for the problem.
References
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
He, H., Bai, Y., Garcia, E. A., & Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IEEE International Joint Conference on Neural Networks (IJCNN 2008), IEEE World Congress on Computational Intelligence.