
Applied Soft Computing Journal 83 (2019) 105662


An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets

György Kovács
Analytical Minds Ltd., 4933, Beregsurány, Árpád street 5, Hungary
E-mail address: gyuriofkovacs@gmail.com

Highlights

• The best performing oversamplers are identified through empirical evaluation.
• The best performing principles are identified through empirical evaluation.
• The top performers were found to depend slightly on characteristics of datasets.

Article info

Article history:
Received 10 February 2019
Received in revised form 24 July 2019
Accepted 24 July 2019
Available online 31 July 2019

Keywords:
Imbalanced learning
SMOTE
Minority oversampling
SMOTE variants

Abstract

Learning and mining from imbalanced datasets gained increased interest in recent years. One simple but efficient way to increase the performance of standard machine learning techniques on imbalanced datasets is the synthetic generation of minority samples. In this paper, a detailed, empirical comparison of 85 variants of minority oversampling techniques is presented and discussed involving 104 imbalanced datasets for evaluation. The goal of the work is to set a new baseline in the field, determine the oversampling principles leading to the best results under general circumstances, and also give guidance to practitioners on which techniques to use with certain types of datasets.

© 2019 Elsevier B.V. All rights reserved.

1. Introduction

In recent years, the topic of imbalanced learning gained remarkable interest among theorists and practitioners of machine learning and artificial intelligence. The primary focus of imbalanced learning is improving the performance of general purpose classifiers in the presence of some significant inter-class or intra-class variation in the number and/or distribution of training data samples [1]. See Fig. 1 for an illustration of various types of imbalanced datasets. In practice, imbalanced learning problems are usually related to fields where the planned and/or cheap sampling of data is not feasible (for example, medical predictions [2,3] and failure/anomaly detection [4,5]). The reason why imbalanced problems need special treatment is that general machine learning techniques tend to overfit majority classes and densely sampled regions [1].

The literature distinguishes three main approaches to tackle the issues of imbalanced data [6]. Cost sensitive techniques [7–9] assign a higher cost to the misclassification of minority samples during the training process, in this way putting more emphasis on the generalization of minority samples (see Fig. 2(a) for an illustration). Classifier specific solutions try to tweak the operation of classifiers to take into consideration the imbalanced nature of training data. These techniques are usually reasonable but heuristic adjustments made to the decision rules in instance-based methods [6,10,11] (see Fig. 2(b) for an illustration). Oversampling techniques resolve the issue of imbalanced data by generating additional training samples for the minority classes. One of the first oversampling techniques was the Synthetic Minority Oversampling Technique (SMOTE) [12], generating minority samples by randomly sampling the line segments between neighboring minority instances (see Fig. 2(c) for an illustration). Notable methodological overviews and comparative evaluations of imbalanced learning techniques are available in [13–17]; in this study we focus on oversampling techniques.

As summarized in the recent overview [18], oversampling techniques are highly successful in various types of machine learning problems, from online learning to regression. The success of oversampling can be attributed to two main factors. On the one hand, unlike cost-sensitive and classifier-specific solutions, oversampling is a preprocessing step, thus, any pipeline of machine learning methods can be applied to the oversampled dataset. On the other hand, oversampling tries to tackle the root of imbalanced problems: the lack of data.

https://doi.org/10.1016/j.asoc.2019.105662

Fig. 1. Various types of imbalanced data: imbalance in the number of instances (a); imbalance in the number and in the density of instances (b); imbalance in the
number and locally varying imbalance in the density of samples (c).

Fig. 2. Cost sensitive techniques apply higher cost to the misclassification of minority instances — a possible cost scheme making the loss function balanced is
indicated next to the samples (a); the classifier specific technique [10] treats the minority points as objects with some volume to increase the minority region — a
possible choice of volumes is indicated by pink regions (b); oversamplers generate additional minority instances by local interpolation — the generated samples are
indicated by green color (c).

Minority oversampling can be considered as a generalization of data augmentation [19] widely used in the image domain: slight geometrical transformations are supposed to leave the meaningful content of images intact, thus, the training set can be extended by generating geometrically distorted images to improve the generalization of the learning process.

Since SMOTE was published in 2002, more than 100 variants have been proposed [18]. Recent techniques are usually compared to conventional ones like the original SMOTE, borderline-SMOTE [20] and ADASYN [21]. Unfortunately, the evaluation methodology varies, thus, an unambiguous ranking of oversamplers based on the available comparisons cannot be deduced. Although some studies [14,15,22] compare and evaluate the already mentioned conventional techniques, for dozens of oversampling approaches comparable evaluations are not available, which necessitates a thorough evaluation and comparison to facilitate applications and further research. Finally, we emphasize that the proper evaluation of oversamplers is also held back by the lack of open source implementations: the most notable packages, imblearn [23] (Python) and smotefamily (R), implement only 8 oversampling techniques together.

In this paper, we carry out a thorough comparison and evaluation of 85 variants of minority oversampling involving 104 imbalanced datasets. We emphasize that this work does not intend to be a methodological review of oversampling techniques: due to page limitations and the large number of algorithms covered, we cannot discuss all techniques and operating principles in detail. The contributions to the field and benefits for theorists and practitioners are summarized as follows:

1. We set a new baseline for the research of oversampling methods. Given a ranking based on thorough testing, it is enough to compare newly proposed oversampling methods to the best performers to provide meaningful insight into their performance.
2. The analysis of the performance of various operating principles of oversampling techniques can aid and boost the development of new techniques based on the most successful principles.
3. Scientists and engineers working with real imbalanced data gain an insight into the performance of the various techniques on datasets with various characteristics, enabling them to select the most appropriate oversampling technique for the data they are working with.
4. For reproducible science and the benefit of the community, all implementations and numerical results are published in the GitHub repository http://github.com/gykovacs/smote_variants.

The paper is organized as follows. In Section 2 the main characteristics of oversampling techniques, the databases used for testing, and the evaluation methodology are described. In Section 3 the test results are presented and evaluated, and finally, conclusions are drawn in Section 4.

2. Materials and methods

First, we introduce the notations used in the rest of the paper. The size of imbalanced datasets is denoted by N ∈ Z+, with N+, N− ∈ Z+, N+ + N− = N minority (positive) and majority (negative) samples, respectively. The terms minority (majority) points/samples/instances are used interchangeably. In order to make the discussions compatible, we use the same abbreviations as [18] to refer to oversampling techniques.

2.1. The methods involved in the study

Although SMOTE has more than 100 variants in the literature [18], some techniques have not been involved in this study. Representatives of these are developed for online learning [24], work only with Support Vector Machines [25] or with binary or discrete features [26,27], or are not stable enough to be used in an automated framework [28]; we also found that some techniques are essentially the same (like [29] and [30]), and in these cases only one of them was considered.

For the ease of discussion and comparison, we have categorized the oversamplers based on their most characteristic operating principles. Although the categories are not disjoint, the overall performance of techniques in a particular category is expected to give some insight into how efficient the operating principle represented by the category is. We note that some techniques implement such unique approaches that none of the categories we defined fits them. In the rest of the subsection, we give the definitions of the categories, while the list of all techniques and category assignments is given in Table 1.

Table 1
The oversampling methods being compared and the category assignments characterizing their main operating principles. The category columns are: Componentwise sampling, Dimension reduction, Sampling by cloning, Density estimation, Ordinary sampling, Changes majority, Uses clustering, Noise removal, Uses classifier, Density based, Application, Borderline, Memetic; × marks the categories applying to each method.

1 SMOTE (2002) [12] ×
2 SMOTE-TomekLinks (2004) [32] × × ×
3 SMOTE-ENN (2004) [32] × × ×
4 Borderline-SMOTE1 (2005) [20] × ×
5 Borderline-SMOTE2 (2005) [20] × ×
6 AHC (2006) [35] × × ×
7 LLE-SMOTE (2006) [37] ×
8 cluster-SMOTE (2006) [5] ×
9 distance-SMOTE (2007) [39] ×
10 ADASYN (2008) [21] × × ×
11 SMMO (2008) [42] × ×
12 polynom-fit-SMOTE (2008) [44]
13 Stefanowski (2008) [46] × × × ×
14 ADOMS (2008) [48] ×
15 Safe-Level-SMOTE (2009) [50] × ×
16 MSMOTE (2009) [52] × ×
17 ISOMAP-Hybrid (2009) [54] × × ×
18 DE-oversampling (2010) [56] × ×
19 CE-SMOTE (2010) [58] × × ×
20 Edge-Det-SMOTE (2010) [60] × ×
21 SMOBD (2011) [62] × × ×
22 SUNDO (2011) [64] × ×
23 MSYN (2011) [66]
24 LN-SMOTE (2011) [68] ×
25 CBSO (2011) [70] × × ×
26 E-SMOTE (2011) [72] × × ×
27 Random-SMOTE (2011) [74] ×
28 NDO-sampling (2011) [76] × ×
29 DSRBF (2011) [78] × × ×
30 SVM-balance (2012) [80] × ×
31 TRIM-SMOTE (2012) [82] ×
32 SMOTE-RSB (2012) [84] ×
33 DBSMOTE (2012) [86] × × ×
34 ASMOBD (2012) [88] × ×
35 SN-SMOTE (2012) [90] ×
36 ProWSyn (2013) [92] ×
37 SL-graph-SMOTE (2013) [94] ×
38 NRSBoundary-SMOTE (2013) [96] ×
39 LVQ-SMOTE (2013) [98] ×
40 SOI-CJ (2013) [100] × ×
41 Assembled-SMOTE (2013) [102] × × ×
42 ISMOTE (2013) [104] ×
43 ROSE (2014) [106] ×
44 SMOTE-OUT (2014) [31]
45 SMOTE-Cosine (2014) [31]
46 Selected-SMOTE (2014) [31] ×
47 MWMOTE (2014) [33] × ×
48 PDFOS (2014) [34] ×
49 IPADE-ID (2014) [36] × × ×
50 RWO-sampling (2014) [30]
51 NEATER (2014) [38] × ×
52 SDSMOTE (2014) [40] × ×
53 DSMOTE (2014) [41] ×
54 G-SMOTE (2014) [43] ×
55 NT-SMOTE (2014) [45] ×
56 SSO (2014) [47] × × ×
57 Supervised-SMOTE (2014) [49] × × ×
58 DEAGO (2015) [51] × ×
59 Gazzah (2015) [53] × ×
60 MCT (2015) [55] ×
61 ADG (2015) [57] ×
62 SMOTE-IPF (2015) [59] × ×
63 KernelADASYN (2015) [61] × ×
64 MOT2LD (2015) [63] × ×
65 V-SYNTH (2015) [65] ×
66 Lee (2015) [67] ×
67 SPY (2015) [69] ×
68 SMOTE-PSOBAT (2015) [71] × × ×
69 OUPS (2016) [73] ×
70 SMOTE-D (2016) [75]
71 MDO (2016) [77] ×
72 VIS-RST (2016) [79] × ×
73 GASMOTE (2016) [81] × ×
74 A-SUWO (2016) [83] × × ×
75 SMOTE-FRST-2T (2016) [85] × × × ×
76 AND-SMOTE (2016) [87] ×
77 SMOTE-PSO (2017) [89] × ×
78 CURE-SMOTE (2017) [91] ×
79 SOMO (2017) [93] ×
80 NRAS (2017) [95] × ×
81 Gaussian-SMOTE (2017) [97]
82 CCR (2017) [99]
83 ANS (2017) [101] × ×
84 AMSCO (2018) [103] × × ×
85 kmeans-SMOTE (2018) [105] ×

Ordinary sampling: Methods in this category implement the same sampling as SMOTE, based on the assumption that points along the line segments connecting neighboring minority instances belong to the minority class. Given two instances xi, xj ∈ Rd, new samples are generated by

xn = xi + r · (xj − xi), (1)

where r is a uniformly distributed random number from [0, 1].

Component-wise sampling: Methods in this category carry out the sampling of attributes independently. The assumption behind component-wise sampling is that the entire volume of a hypercube spanned by two neighboring minority samples belongs to the minority class.

Sampling by cloning: Oversampling can be done with no assumptions on the distribution of data by cloning minority instances. It is worth noting that cloning a minority instance n times has a similar effect as assigning the integer cost n + 1 to the misclassification of that instance. Thus, classification after cloning is closely related to using a cost-sensitive variant of the classifier with the misclassification cost N/N+ assigned to minority samples.

Borderline: These techniques implement some explicit steps to identify minority instances near the decision boundary and generate new samples in the neighborhoods of these borderline instances.

Using a sampling density: These techniques assign some importance scores to minority instances, and the normalization of these scores results in the sampling density pi, i = 1, ..., N+, driving the sampling process. Particularly, pi indicates the proportion of new instances to be generated in the neighborhood of the ith minority instance. Although ordinary SMOTE (pi = 1/N+) and borderline-SMOTE (pi = 1/Nb for borderline samples, Nb denoting their number) can be considered as special cases of using a sampling density, only those techniques are included in this category which use some advanced importance scores. For example, multiple techniques use the number of majority samples in the neighborhoods of minority samples as the importance score: a large number of majority neighbors indicates that many additional minority samples need to be generated in the neighborhood to facilitate the correct classification of the minority sample.

Using density estimation: Methods in this category implement some (kernel-)density estimation for the minority class, and sample new instances from this density. It is important to distinguish density estimation and the use of a sampling density. Density estimation based techniques sample the overall region of the minority class and can be expected to generate fewer samples near the decision boundary, as the density of the minority class decreases there. On the contrary, sampling density based techniques sample regions which are conceptually declared to be important by the importance score, and the common choice of importance scores makes them likely to generate samples near the decision boundary.

Dimensionality reduction: Techniques in this category use dimensionality reduction to map high dimensional data to lower dimensional spaces. Commonly implemented techniques are feature selection by cross-validation, principal component analysis (PCA) and t-distributed stochastic neighborhood embedding.

Use of classifiers: Some techniques use supervised classification at certain points of the oversampling process, for example, to identify noisy samples based on the predicted positive-class probabilities, or to implement memetic optimization and drive the optimization process by cross-validation scores.

Use of clustering: Techniques of this category use clustering methods to identify minority concepts, and then do the oversampling within the individual clusters independently.

Memetic: There are numerous methods using memetic algorithms to change the minority and/or majority sets, making the entire training set more suitable for supervised learning. These techniques usually use some classifier and drive the optimization process by cross-validation scores.

Application specific: Methods in this category were developed for specific applications, and evaluating their performance can give an insight into how useful application specific solutions are in general settings.

2.2. Implementation details and directives

In order to carry out the proposed evaluation and comparison of oversampling techniques, we have implemented all the algorithms listed in Table 1. During the implementation, we tried to stick to the pseudo-codes in the papers, using similar parameter and variable names to enable the easy validation of the implementations. Additionally, we followed some general directives to resolve ambiguous steps, improve comparability and increase efficiency. These directives are summarized in the rest of this subsection.

The proportion parameter: As a matter of fact, the number of minority samples generated affects the subsequent classification performance. Some techniques let the user specify the number of samples to be generated; others (just like ordinary SMOTE) provide coarse control on the number of instances to be generated (for example, generating H samples in the neighborhood of each minority instance), while memetic oversamplers usually generate an unpredictable number of samples. In order to standardize the algorithms in this respect, in most cases we managed to introduce a parameter called proportion, referring to the proportion of the difference Nd = (N− − N+) to be generated. For example, setting the proportion parameter to p = 1 means that p · Nd = Nd minority samples are generated and the dataset gets balanced. During the evaluation process, the proportion parameter is varied and the best subsequent classification results are used for comparison, eliminating the effect of the number of generated samples on the results.

Unspecified details: In some papers, minor algorithmic details or parameter settings are not discussed in detail, for example, the parameters of the bat-optimization in [71] and the autoencoder network structure in [51]. We tried to resolve all ambiguous steps in the most meaningful way.

Reasonable parameters: For each parameter of each technique, we have identified a set of reasonable values, for example, for the proportion parameter p ∈ {0.1, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0}. These sets enable the construction of meaningful parameter combinations for model selection.

Optimization: We found that some techniques have time complexity O(n^x), x ≥ 3, making the evaluation unfeasible on datasets having more than a couple of hundred elements (for example, [41]). In these cases, compromises were made and some accuracy was sacrificed for tractable time complexity.

The use of classifiers: Oversampling techniques are evaluated in subsequent classification scenarios. As mentioned before, memetic optimization based techniques [71,89] use classifiers and cross-validation scores to drive the optimization of the oversampling process. Passing the subsequent classifier to these techniques would let them optimize sample generation for the subsequent classification task. The drawback of doing so is that cross-validation driven memetic search is extremely expensive, and optimizing oversampling for a given classifier would necessitate repeating the memetic optimization whenever the parameters of the classifier are changed. In order to keep the oversampling step separated from the classification step and keep the model selection procedure tractable in terms of runtime, our implementation uses the classifiers reported in the corresponding papers as defaults, instead of optimizing sample generation for subsequent classification steps.

Reproducing machine learning techniques is always prone to errors; in spite of all efforts, it is possible that our interpretations and/or implementations are not completely aligned with the intentions of the authors. In order to prevent well-established methods from being declared worst performers due to mismatching interpretations or implementations, we report only the best performing techniques in each evaluation scenario. The major changes and adjustments applied to the algorithms are documented in the source codes at http://github.com/gykovacs/smote_variants.

2.3. Datasets used for evaluation

Although the evaluation methodology of imbalanced learning techniques has not been standardized, the datasets used for evaluation have distilled to a commonly accepted set. Recent evaluations [15,16] use datasets collected in the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [107], most of which are based on multiclass datasets available in the UCI Machine Learning Repository [108], restructured as imbalanced binary classification problems by selecting some small cardinality classes to constitute the minority class and considering the union of all other classes as the majority class. This restructuring is reflected in the names of the datasets; for example, ''ecoli-0-6-7_vs_3-5'' refers to the UCI database ''ecoli'' restructured by selecting class labels 0, 6 and 7 to be the negative (majority) class and using class labels 3 and 5 as the positive (minority) class. In addition to the KEEL datasets, we have included the imbalanced databases commonly used for the evaluation of variants of the k Nearest Neighbors (kNN) classifier [6,10,109,110]. We recognized that there is a minor overlap: the datasets vehicle and glass in [6] are the same as datasets vehicle0 and glass2 in [107]. We also recognized that there is no difference between the new-thyroid1 and new-thyroid2 datasets of [107], and used only new-thyroid1 for evaluation. Finally, we have excluded some of the datasets: ADA, HIVA, sylva, satimage, dermathology_6 and page_blocks0 were found to be too large in terms of records and/or attributes to be included in the study, as applying the oversamplers of highest time complexities to these datasets is intractable. An overview of the most characteristic properties of the remaining 104 datasets is given in Table 2, sorted by the imbalance ratio

IR = N−/N+. (2)

The datasets involved in the study show various characteristics, containing real, nominal, categorical and binary attributes as well. Although some of the techniques cover the case of categorical attributes and propose some special treatment to handle them, most of the oversampling methods are inherently related to the features of the Euclidean space and the Euclidean distance. In order to make all techniques applicable to all datasets, we have applied some feature encoding steps to make the feature representation compatible with operations in the Euclidean space: (a) attributes with 1 unique value have been removed; (b) binary attributes were kept and handled as floating point values; (c) nominal and categorical attributes were one-hot encoded when they had less than 5 distinct values, otherwise label encoding was applied to keep the number of attributes relatively low compared to the size of the datasets, thus avoiding the introduction of sparsity and changing the nature of the datasets remarkably.

2.4. Performance measures

Although binary classification has widely accepted and standard measures to characterize the performance of classifiers, many of these are not suitable in imbalanced scenarios, since the performance on the majority class will be overrepresented. In imbalanced learning, the main goal is to improve the classification of minority (positive) samples, while maintaining a reasonable performance regarding the majority ones. Accordingly, based on the previous comparative studies [14–16], 4 measures have been selected to compare classification performance. Introducing the notations TP, TN, FP and FN for the number of true positive, true negative, false positive and false negative samples, respectively, and P = TP + FN, N = TN + FP, the selected measures are defined as follows.

G-score: the geometric mean of accuracies achieved on minority and majority instances:

G = √(TP/P · TN/N). (3)

F1-score: the harmonic mean of precision (PR) and recall (RE):

F1 = 2 · (PR · RE)/(PR + RE), PR = TP/(TP + FP), RE = TP/(TP + FN), (4)

where precision measures the proportion of samples correctly classified as positive, and recall characterizes the proportion of all positive samples classified as positive.

AUC-score: (Area Under the receiver operating characteristic Curve) characterizes the area under the curve of sensitivities plotted against corresponding false positive rates

FPR = FP/(FP + TN) (5)

at various probability threshold settings. According to one of the most intuitive interpretations, AUC is the probability that the classifier will assign a higher positive-class probability to a randomly selected positive sample than to a randomly selected negative sample.

Precision at Top 20% (P20): the study [14] on imbalanced evaluation measures proposed the use of the precision at the top 20 measure, which is the percentage of truly positive samples among the 20 test records having the highest predicted probability of being positive. As some of the imbalanced datasets do not have 20 minority samples, we have changed the measure to the percentage of truly positive samples among the top 20% of test records ranked by the predicted positive-class probabilities.

2.5. Evaluation methodology

As oversampling techniques are evaluated in subsequent classification scenarios, the results highly depend on the classification technique used. For example, instance-based techniques are likely to work well with oversamplers equalizing the density of positive and negative samples near the decision boundary; on the contrary, neural networks might work better with oversamplers increasing the cardinality of the minority class to that of the majority class by sampling the entire minority manifold, since the classes will be represented equally in the gradient of the loss function during backpropagation. In order to see which oversamplers work well with various types of classifiers, we have selected k Nearest Neighbors (kNN) [111], Support Vector Machines (SVM) [111], decision trees (DT) [111] and multilayer perceptrons (MLP) [111] to be included in the study. In the rest of the subsection, the details of the evaluation methodology are described.

Parameter combinations for oversamplers: Based on the sets of reasonable parameter values identified for each oversampler, we have randomly chosen up to 35 parameter combinations for model selection (some techniques have fewer than 35 parameter combinations altogether). The maximum number of parameter combinations is an empirical compromise made for tractable runtimes.

Parameter combinations for classifiers: For each classifier we have selected 6 different parameterizations for evaluation: linear SVM with C ∈ {1, 10} (high C values improve SVM on imbalanced data [80]), L1 and L2 penalties with compatible hinge or squared hinge loss; kNN with k ∈ {3, 5, 7}, standard or distance weighted decision functions with L2 distance; DT with Gini-impurity or entropy as splitting criterion, maximum depth of 3, 5 and unbounded; MLP with 1 hidden layer, RELU or logistic activation functions and hidden units specified as 10%, 50% and 100% of the number of input features.
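To make the measures of Section 2.4 concrete, the following sketch computes the G, F1, AUC and modified P20 scores from predicted labels and positive-class probabilities (an illustrative implementation of Eqs. (3)–(5) with hypothetical names; scikit-learn is assumed only for the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def imbalanced_scores(y_true, y_pred, y_proba):
    """y_true, y_pred: binary labels (1 = minority/positive); y_proba:
    predicted positive-class probabilities."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    g = np.sqrt(tp / (tp + fn) * tn / (tn + fp))   # Eq. (3)
    pr, re = tp / (tp + fp), tp / (tp + fn)
    f1 = 2 * pr * re / (pr + re)                   # Eq. (4)
    # P20: proportion of true positives among the top 20% of records
    # ranked by the predicted positive-class probability
    top = np.argsort(y_proba)[::-1][:int(np.ceil(0.2 * len(y_true)))]
    p20 = np.mean(y_true[top] == 1)
    return {'G': g, 'F1': f1, 'AUC': roc_auc_score(y_true, y_proba), 'P20': p20}
```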

Table 2
Overview of imbalanced datasets involved in the study, d denoting the number of attributes.
name N N+ d IR name N N+ d IR
1 glass1 [107] 214 76 9 1.82 53 PC1 [6] 1109 77 21 13.40
2 ecoli-0_vs_1 [107] 220 77 7 1.86 54 shuttle-c0-vs-c4 [107] 1829 123 9 13.87
3 wisconsin [107] 683 239 9 1.86 55 yeast-1_vs_7 [107] 459 30 7 14.30
4 pima [107] 768 268 8 1.87 56 glass4 [107] 214 13 9 15.46
5 iris0 [107] 150 50 4 2.00 57 ecoli4 [107] 336 20 7 15.80
6 glass0 [107] 214 70 9 2.06 58 page-blocks-1-3_vs_4 [107] 472 28 10 15.86
7 german [6] 1000 300 29 2.33 59 abalone9-18 [107] 731 42 8 16.40
8 yeast1 [107] 1484 429 10 2.46 60 zoo-3 [107] 101 5 16 19.20
9 haberman [107] 306 81 3 2.78 61 glass-0-1-6_vs_5 [107] 184 9 9 19.44
10 vehicle2 [107] 846 218 18 2.88 62 hypothyroid [6] 3163 151 25 19.95
11 vehicle1 [107] 846 217 18 2.90 63 shuttle-c2-vs-c4 [107] 129 6 9 20.50
12 vehicle3 [107] 846 212 18 2.99 64 shuttle-6_vs_2-3 [107] 230 10 9 22.00
13 glass-0-1-2-3_vs_4-5-6 [107] 214 51 9 3.20 65 yeast-1-4-5-8_vs_7 [107] 693 30 10 22.10
14 vehicle0 [107] 846 199 18 3.25 66 glass5 [107] 214 9 9 22.78
15 ecoli1 [107] 336 77 7 3.36 67 yeast-2_vs_8 [107] 482 20 10 23.10
16 hepatitis [6] 155 32 19 3.84 68 lymphography-normal-fibrosis [107] 148 6 23 23.67
17 SPECT_F [6] 267 55 44 3.85 69 flare-F [107] 1066 43 11 23.79
18 new_thyroid1 [107] 215 35 5 5.14 70 car_good [107] 1728 69 6 24.04
19 ecoli2 [107] 336 52 7 5.46 71 car-vgood [107] 1728 65 6 25.58
20 KC1 [6] 2109 326 21 5.47 72 kr-vs-k-zero–one_vs_draw [107] 2901 105 6 26.63
21 segment0 [107] 2308 329 23 6.02 73 kr-vs-k-one_vs_fifteen [107] 2244 78 6 27.77
22 glass6 [107] 214 29 9 6.38 74 yeast4 [107] 1484 51 10 28.10
23 yeast3 [107] 1484 163 10 8.10 75 winequality-red-4 [107] 1599 53 11 29.17
24 ecoli3 [107] 336 35 7 8.60 76 poker-9_vs_7 [107] 244 8 25 29.50
25 ecoli-0-3-4_vs_5 [107] 200 20 7 9.00 77 kddcup-guess_passwd_vs_satan [107] 1642 53 38 29.98
26 yeast-2_vs_4 [107] 514 51 8 9.08 78 yeast-1-2-8-9_vs_7 [107] 947 30 10 30.57
27 ecoli-0-6-7_vs_3-5 [107] 222 22 7 9.09 79 abalone-3_vs_11 [107] 502 15 8 32.47
28 ecoli-0-2-3-4_vs_5 [107] 202 20 7 9.10 80 winequality-white-9_vs_4 [107] 168 5 11 32.60
29 glass-0-1-5_vs_2 [107] 172 17 9 9.12 81 yeast5 [107] 1484 44 10 32.73
30 yeast-0-3-5-9_vs_7-8 [107] 506 50 10 9.12 82 kr-vs-k-three_vs_eleven [107] 2935 81 6 35.23
31 yeast-0-2-5-6_vs_3-7-8-9 [107] 1004 99 10 9.14 83 winequality-red-8_vs_6 [107] 656 18 11 35.44
32 yeast-0-2-5-7-9_vs_3-6-8 [107] 1004 99 10 9.14 84 ecoli-0-1-3-7_vs_2-6 [107] 281 7 7 39.14
33 ecoli-0-4-6_vs_5 [107] 203 20 6 9.15 85 abalone_17_vs_7_8_9_10 [107] 2338 58 8 39.31
34 CM1 [6] 498 49 23 9.16 86 abalone-21_vs_8 [107] 581 14 8 40.50
35 ecoli-0-1_vs_2-3-5 [107] 244 24 7 9.17 87 yeast6 [107] 1484 35 10 41.40
36 ecoli-0-2-6-7_vs_3-5 [107] 224 22 7 9.18 88 winequality-white-3_vs_7 [107] 900 20 11 44.00
37 glass-0-4_vs_5 [107] 92 9 9 9.22 89 winequality-red-8_vs_6-7 [107] 855 18 11 46.50
38 ecoli-0-3-4-6_vs_5 [107] 205 20 7 9.25 90 kddcup-land_vs_portsweep [107] 1061 21 40 49.52
39 ecoli-0-3-4-7_vs_5-6 [107] 257 25 7 9.28 91 abalone-19_vs_10-11-12-13 [107] 1622 32 8 49.69
40 yeast-0-5-6-7-9_vs_4 [107] 528 51 10 9.35 92 kr-vs-k-zero_vs_eight [107] 1460 27 6 53.07
41 vowel0 [107] 988 90 13 9.98 93 winequality-white-3-9_vs_5 [107] 1482 25 11 58.28
42 ecoli-0-6-7_vs_5 [107] 220 20 6 10.00 94 poker-8-9_vs_6 [107] 1485 25 25 58.40
43 glass-0-1-6_vs_2 [107] 192 17 9 10.29 95 shuttle-2_vs_5 [107] 3316 49 9 66.67
44 ecoli-0-1-4-7_vs_2-3-5-6 [107] 336 29 7 10.59 96 winequality-red-3_vs_5 [107] 691 10 11 68.10
45 led7digit-0-2-4-6-7-8-9_vs_1 [107] 443 37 7 10.97 97 abalone-20_vs_8-9-10 [107] 1916 26 8 72.69
46 ecoli-0-1_vs_5 [107] 240 20 6 11.00 98 kddcup-buffer_overflow_vs_back [107] 2233 30 31 73.43
47 glass-0-6_vs_5 [107] 108 9 9 11.00 99 kddcup-land_vs_satan [107] 1610 21 30 75.67
48 glass-0-1-4-6_vs_2 [107] 205 17 9 11.06 100 kr-vs-k-zero_vs_fifteen [107] 2193 27 6 80.22
49 glass2 [107] 214 17 9 11.59 101 poker-8-9_vs_5 [107] 2075 25 25 82.00
50 ecoli-0-1-4-7_vs_5-6 [107] 332 25 6 12.28 102 poker-8_vs_6 [107] 1477 17 25 85.88
51 cleveland-0_vs_4 [107] 177 13 23 12.62 103 kddcup-rootkit-imap_vs_back [107] 2225 22 47 100.14
52 ecoli-0-1-4-6_vs_5 [107] 280 20 6 13.00 104 abalone19 [107] 4174 32 8 129.44
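The feature encoding directive of Section 2.3 can be sketched as follows (a simplified illustration using pandas; the function name and the column handling details are illustrative assumptions, not the published preprocessing code):

```python
import pandas as pd

def encode_features(df):
    """Sketch of the encoding steps of Section 2.3: (a) drop attributes with
    a single unique value; (b) keep binary attributes as 0/1 floats; (c)
    one-hot encode nominal attributes with fewer than 5 distinct values,
    label encode the rest to keep the number of attributes low."""
    columns = []
    for col in df.columns:
        n_unique = df[col].nunique()
        if n_unique == 1:
            continue                                  # (a) constant attribute removed
        if pd.api.types.is_numeric_dtype(df[col]):
            columns.append(df[col].astype(float))     # numeric attributes kept
        elif n_unique == 2:
            # (b) binary attributes handled as floating point values
            columns.append(df[col].astype('category').cat.codes.rename(col))
        elif n_unique < 5:
            columns.append(pd.get_dummies(df[col], prefix=col))  # (c) one-hot
        else:
            # label encoding avoids blowing up the number of attributes
            columns.append(df[col].astype('category').cat.codes.rename(col))
    return pd.concat(columns, axis=1).astype(float)
```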

Cross-validation: Classification performance is evaluated by repeated stratified k-fold cross-validation with 5 splits and 3 repeats. The number of splits was implied by the lowest number of minority samples in dataset zoo-3; the number of repeats is a compromise made for tractable runtimes. We note that the number of repeats is higher than the ones used in many recent studies (for example, [93] and [51] use 1 and 2 repeats, respectively), thus, more reliable results can be expected.

Evaluation: The evaluation is carried out by the usual cross-validation methodology involving all oversamplers with up to 35 random parameter combinations and all types of classifiers with 6 parameter combinations. The training set of each dataset fold is oversampled before classifier training. For comparable results, special emphasis is put on folding the datasets the same way in each test case. In each cross-validation scenario, the G, AUC, F1 and P20 scores were determined. In order to implement model selection, for each combination of dataset, oversampler type, and classifier type the highest scores were determined, and these scores are analyzed in the next subsection. In order to illustrate the volume of the study, we mention that the overall number of oversampling, training and prediction jobs is over 76 million.
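The core of the evaluation methodology can be outlined as follows: oversampling is applied strictly to the training folds, with the folding fixed across test cases. This is an illustrative sketch rather than the exact evaluation code; the oversampler interface with a sample(X, y) method mirrors that of the smote_variants package published with the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

def evaluate_oversampler(oversampler, classifier, X, y):
    """Repeated stratified k-fold evaluation (5 splits, 3 repeats); the fixed
    random_state makes the folding identical in each test case."""
    validator = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=5)
    aucs = []
    for train, test in validator.split(X, y):
        # oversample the training folds only, never the test fold
        X_samp, y_samp = oversampler.sample(X[train], y[train])
        classifier.fit(X_samp, y_samp)
        proba = classifier.predict_proba(X[test])[:, 1]
        aucs.append(roc_auc_score(y[test], proba))
    return np.mean(aucs)
```

With the published package, a call like evaluate_oversampler(sv.SMOTE(), KNeighborsClassifier(), X, y) (after import smote_variants as sv) would correspond to one cell of the evaluation grid for a single parameter combination.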

3. Results

Due to page limitations, it is infeasible to analyze and discuss all aspects of oversampling techniques. In this section, we keep focusing on the most interesting theoretical and practical questions: Which techniques give the best performance in general? How does the performance depend on the characteristics of the datasets? Which operating principles seem to be the most efficient and useful in general?

3.1. The top-performing oversamplers

In order to remove the dataset dimension of the results, the mean scores were computed over all datasets for each performance measure, oversampler, and classifier. We emphasize that the aggregation is carried out after model selection. The top-performing 10% of oversamplers regarding the various combinations of performance measures and classifiers are rendered in Table 3, with the scores of SMOTE and classification without oversampling as baseline results. The first thing to observe is that the classification performance without oversampling is significantly worse than using oversampling, which validates the use of oversamplers to improve imbalanced classification. According to the expectations, the top performers depend on the classifier and the measure. As a contribution to the field, Table 3 can be used as a reference to select oversampling techniques to be included in certain imbalanced learning pipelines.

One can also observe that the top performing advanced oversampling techniques outperform ordinary SMOTE with an improvement of about 1 percent in terms of AUC and F1, 2 percent in terms of G, and 0.25 percent in the case of P20. The advantage of using advanced oversamplers is less remarkable than expected. Previous papers have reported significant improvements compared to SMOTE, which seem to decrease by averaging over the unprecedented number of 104 imbalanced datasets. For example, the recent paper [89] uses 21 datasets for evaluation and reports an improvement of 0.04 AUC by SMOTE-PSO over SMOTE using SVM, while averaging over 104 datasets with thorough cross-validation leads to an improvement of 0.01 AUC by the best performer DBSMOTE over SMOTE. On the other hand, the high performance of some oversamplers seems to be consistent: techniques like polynom-fit-SMOTE are among the top performers in many respects.

In order to get a more explicit overall ranking, we have aggregated the results by averaging performance measures over classifiers, assigning the ranks of the average scores to each oversampler and taking the average of ranks (averaging ranks was found to be more meaningful than averaging scores of highly different meanings like F1 and AUC). The final ranking of the overall top performers is given in Table 4: based on the evaluation methodology outlined before, these 10 techniques are the best candidates to be used on arbitrary, unseen data. In the next subsection, we provide some insight into the operation of the top three performers.

3.2. Analysis of the top performers

Polynom-fit-SMOTE refers to 4 fairly different oversampling strategies controlled by the topology parameter of the technique. Checking the parameter settings providing the highest scores, we found that the various topologies are equally likely. The common behavior of the 'bus', 'star', 'mesh' and 'polynom' topologies is that each of them generates instances along line segments between relatively far samples of the minority class, thus, the generated instances are more scattered in the manifold of the minority class than when using simple SMOTE.

In Proximity Weighted Synthesis (ProWSyn) the number of instances generated in the neighborhood of a minority sample is inversely proportional to its distance to majority instances. What makes ProWSyn unique among other sampling density based techniques is that it generates new instances by sampling the line segments between minority instances having similar distances to majority instances. This property makes ProWSyn similar to polynom-fit-SMOTE in the sense that samples are generated between relatively far minority instances, which seems to be an efficient oversampling approach even though the assumption made on the distribution of the data is stronger than that of SMOTE.

Finally, SMOTE-IPF executes ordinary SMOTE to generate new minority samples, then uses a supervised classifier with cross-validation to check the consistency of new samples and remove the ones with low probabilities of belonging to the minority class. The default supervised classifier used in SMOTE-IPF is DT. We also mention that the 4th top performer, the technique by Lee et al. (Lee), implements a very similar approach using kNN for noise filtering. As a consequence, supervised classification based noise-filtering as a post-processing step seems to be another powerful operating principle for oversamplers.

The common property of the best performing oversamplers is that they are fairly simple, yet robust algorithms: if there are at least two different minority samples, these techniques are able to generate a balanced dataset without cloning. The reason why the math-intensive techniques are not topping the rankings is that they are likely to fail in edge cases, for example, when the number of minority samples is extremely low (5–10) and the number of attributes is higher than the number of instances: clustering becomes unreliable, all minority samples can be identified as noise, autoencoders do not converge, and linear algebraic problems might become ill-posed.

One can expect that the performance of oversampling techniques depends on the characteristics of the datasets. In the next subsection, we examine the results by the most characteristic properties of datasets.

3.3. The top-performing oversamplers on various types of datasets

Databases are characterized by three main attributes in this study: the imbalance rate (IR), the number of minority samples, and the number of attributes. We have introduced some thresholds and consider databases to either have a high (IR > 9) or low imbalance rate, a high (N+ > 30) or low number of minority samples, and a high (d > 10) or low number of attributes. The threshold on IR is supported by [107], which defines extreme imbalance as IR higher than 9; the other thresholds are empirical. The results are aggregated (like in Table 4) and only the final rankings are rendered in Table 5, where all techniques are listed which got into the top 10 for any type of dataset. The overall ranking is identical to that in Table 4.

The first thing to observe is that 66 techniques did not get into the top 10 performers on any type of dataset: 19 oversampling techniques provide the top 10 performers in each of the categories. One can also observe that the top 3 performers on all datasets (Table 4) perform among the top 6 for all types of datasets, suggesting that the overall ranking discussed in the previous subsection is robust and reliable, regardless of the characteristics of the datasets. The rankings on datasets with high IR and low N+ are highly similar to the overall ranking; these types of datasets give the majority of the datasets involved in the study. In the next paragraphs, we examine the outliers giving superior results on some special types of datasets.

When the imbalance ratio is relatively low, some techniques like NEATER, Supervised-SMOTE, and cluster-SMOTE give superior results. Supervised-SMOTE is highly similar to SMOTE-IPF, except SMOTE-IPF uses a single decision tree and removes noise after sample generation, while Supervised-SMOTE uses random forest by default and removes noise before sample generation. Supervised-SMOTE works properly only if the training of the random forest (which takes place before oversampling) is not skewed enormously by the imbalance: a relatively low imbalance rate is required, just as confirmed by the test results. In many senses, NEATER is also similar to SMOTE-IPF, namely, it generates samples by ordinary SMOTE and ADASYN, and uses game theoretical approaches to remove inconsistencies. The results suggest that the game theory based filtering works well when the imbalance is moderate; however, the overall rank of 41 shows that it cannot cope with highly imbalanced data. Cluster-SMOTE applies k-means clustering to the minority samples and does a SMOTE-like sampling within the clusters.
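The classification based noise-filtering principle shared by SMOTE-IPF and Lee, discussed in Section 3.2, can be sketched in a few lines (a simplified illustration of the shared principle with an illustrative threshold, not a faithful reimplementation of either technique): after sample generation, generated samples to which a cross-validated classifier assigns a low positive-class probability are discarded.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def filter_generated_samples(X, y, X_new, threshold=0.5):
    """Keep only those generated minority samples that a cross-validated
    classifier also assigns to the minority class with high probability."""
    X_all = np.vstack([X, X_new])
    y_all = np.hstack([y, np.ones(len(X_new), dtype=int)])
    clf = RandomForestClassifier(random_state=5)
    proba = cross_val_predict(clf, X_all, y_all, cv=5, method='predict_proba')[:, 1]
    keep = proba[len(X):] >= threshold  # probabilities of the generated samples
    return X_new[keep]
```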

Table 3
The top performer oversamplers ranked by the average scores on all datasets: the columns report classification techniques and the four sets of rows correspond to the four performance measures; in each cell, the oversampling techniques providing the top results are reported, ranked in descending order.
Classifier SVM DT kNN MLP
Rank Sampler AUC Sampler AUC Sampler AUC Sampler AUC
1 DBSMOTE [86] .9089 LVQ-SMOTE [98] .8885 AHC [35] .9153 CCR [99] .9182
2 CURE-SMOTE [91] .9068 ProWSyn [92] .8861 ProWSyn [92] .9139 SMOTE-IPF [59] .9174
3 G-SMOTE [43] .9062 Borderline-SMOTE2 [20] .8860 Assembled-SMOTE [102] .9095 Gaussian-SMOTE [97] .9171
4 CE-SMOTE [58] .9056 polynom-fit-SMOTE [44] .8841 SMOTE-IPF [59] .9092 DBSMOTE [86] .9169
5 MDO [77] .9042 SMOBD [62] .8835 CE-SMOTE [58] .9091 Assembled-SMOTE [102] .9168
6 CCR [99] .9037 Lee [67] .8834 Gaussian-SMOTE [97] .9090 polynom-fit-SMOTE [44] .9166
7 polynom-fit-SMOTE [44] .9037 Assembled-SMOTE [102] .8834 Lee [67] .9088 CE-SMOTE [58] .9166
8 SMOTE-ENN [32] .9029 SMOTE-IPF [59] .8828 Borderline-SMOTE1 [20] .9088 G-SMOTE [43] .9166
Baseline SMOTE .8999 SMOTE .8809 SMOTE .9082 SMOTE .9156
Baseline No sampling .8568 No sampling .8124 No sampling .8488 No sampling .8248
Rank Sampler G Sampler G Sampler G Sampler G
1 polynom-fit-SMOTE [44] .8672 DE-oversampling [56] .8658 polynom-fit-SMOTE [44] .8820 Selected-SMOTE [31] .8781
2 DBSMOTE [86] .8661 SMOTE-ENN [32] .8611 SMOTE-ENN [32] .8785 SMOBD [62] .8781
3 SMOTE-ENN [32] .8649 Lee [67] .8598 ProWSyn [92] .8780 polynom-fit-SMOTE [44] .8780
4 G-SMOTE [43] .8629 Borderline-SMOTE2 [20] .8598 SMOTE-IPF [59] .8766 SMOTE-IPF [59] .8777
5 CURE-SMOTE [91] .8625 SMOTE-IPF [59] .8590 SVM-balance [80] .8756 Lee [67] .8774
6 SMOTE-IPF [59] .8616 SMOBD [62] .8584 Lee [67] .8748 ADOMS [48] .8769
7 ProWSyn [92] .8614 LVQ-SMOTE [98] .8584 Assembled-SMOTE [102] .8743 ProWSyn [92] .8768
8 Lee [67] .8612 Assembled-SMOTE [102] .8582 SMOTE-TomekLinks [32] .8736 NDO-sampling [76] .8767
Baseline SMOTE .8595 SMOTE .8563 SMOTE .8733 SMOTE .8748
Baseline No sampling .5284 No sampling .7237 No sampling .7007 No sampling .6005
Rank Sampler F1 Sampler F1 Sampler F1 Sampler F1
1 DBSMOTE [86] .6774 ProWSyn [92] .7006 polynom-fit-SMOTE [44] .7276 Supervised-SMOTE [49] .6784
2 polynom-fit-SMOTE [44] .6763 Lee [67] .6994 Supervised-SMOTE [49] .7254 polynom-fit-SMOTE [44] .6774
3 Supervised-SMOTE [49] .6723 SMOBD [62] .6994 CCR [99] .7239 DBSMOTE [86] .6749
4 CURE-SMOTE [91] .6696 polynom-fit-SMOTE [44] .6993 ProWSyn [92] .7208 Assembled-SMOTE [102] .6745
5 SMOTE-Cosine [31] .6678 Assembled-SMOTE [102] .6969 LLE-SMOTE [37] .7208 SMOTE-IPF [59] .6744
6 CCR [99] .6677 Supervised-SMOTE [49] .6962 Gaussian-SMOTE [97] .7206 NEATER [38] .6742
7 LLE-SMOTE [37] .6677 SMOTE-IPF [59] .6961 CURE-SMOTE [91] .7201 ProWSyn [92] .6739
8 G-SMOTE [43] .6677 NRSBoundary-SMOTE [96] .6957 CE-SMOTE [58] .7200 SMOBD [62] .6737
Baseline SMOTE .6636 SMOTE .6932 SMOTE .7114 SMOTE .6706
Baseline No sampling .4692 No sampling .6092 No sampling .6351 No sampling .4308
Rank Sampler P20 Sampler P20 Sampler P20 Sampler P20
1 polynom-fit-SMOTE [44] .9918 LVQ-SMOTE [98] .9908 DE-oversampling [56] .9939 Gaussian-SMOTE [97] .9956
2 DBSMOTE [86] .9914 DE-oversampling [56] .9904 polynom-fit-SMOTE [44] .9927 DE-oversampling [56] .9953
3 LVQ-SMOTE [98] .9913 polynom-fit-SMOTE [44] .9902 ProWSyn [92] .9926 polynom-fit-SMOTE [44] .9953
4 Gaussian-SMOTE [97] .9912 E-SMOTE [72] .9893 E-SMOTE [72] .9925 DEAGO [51] .9951
5 LLE-SMOTE [37] .9911 Gaussian-SMOTE [97] .9890 SVM-balance [80] .9923 Selected-SMOTE [31] .9951
6 SMOTE-ENN [32] .9911 CCR [99] .9888 Gaussian-SMOTE [97] .9920 cluster-SMOTE [5] .9951
7 MOT2LD [63] .9910 SSO [47] .9884 NDO-sampling [76] .9919 SSO [47] .9951
8 cluster-SMOTE [5] .9910 LLE-SMOTE [37] .9882 SMOTE-ENN [32] .9919 LVQ-SMOTE [98] .9950
Baseline SMOTE .9896 SMOTE .9862 SMOTE .9913 SMOTE .9947
Baseline No sampling .3242 No sampling .3033 No sampling .3285 No sampling .2864

Table 4
The top performer oversamplers ranked by the combination of all scores. Besides the combined ranking, the aggregated values of the measures and the corresponding
ranks are also reported.
Rank Sampler Average score AUC AUC rank G G rank F1 F1 rank P20 P20 rank
1 polynom-fit-SMOTE [44] 2.50 0.9025 6 0.8708 1 0.6952 1 0.9925 2
2 ProWSyn [92] 4.50 0.9044 1 0.8684 4 0.6903 3 0.9911 10
3 SMOTE-IPF [59] 7.50 0.9026 5 0.8687 3 0.6879 9 0.9909 13
4 Lee [67] 8.00 0.9023 7 0.8683 5 0.6881 8 0.9910 12
5 SMOBD [62] 9.25 0.9022 8 0.8677 6 0.6889 4 0.9906 19
6 G-SMOTE [43] 13.50 0.9019 10 0.8651 18 0.6866 12 0.9908 14
7 CCR [99] 14.25 0.9021 9 0.8620 30 0.6879 10 0.9913 8
8 LVQ-SMOTE [98] 14.75 0.9028 3 0.8623 29 0.6836 24 0.9922 3
9 Assembled-SMOTE [102] 15.50 0.9027 4 0.8669 7 0.6886 5 0.9827 46
10 SMOTE-TomekLinks [32] 15.75 0.9010 14 0.8662 9 0.6847 20 0.9906 20
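The rank aggregation producing Table 4 can be expressed compactly. The following sketch assumes the scores after model selection are collected in a pandas DataFrame with illustrative column names; it is not the published analysis code.

```python
import pandas as pd

def overall_ranking(results):
    """results: DataFrame with columns 'sampler', 'classifier', 'AUC',
    'G', 'F1', 'P20' holding the scores after model selection."""
    # average each performance measure over the classifiers
    means = results.groupby('sampler')[['AUC', 'G', 'F1', 'P20']].mean()
    # rank the averaged scores per measure (rank 1 = best), then average the ranks
    return means.rank(ascending=False).mean(axis=1).sort_values()
```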

The good performance on low IR datasets can be attributed to two factors. On the one hand, low IR datasets in this study usually contain at least 50 minority samples (see Table 2), thus, meaningful clusters are likely to be identified; on the other hand, the low IR implies that a relatively small number of samples need to be generated to balance the dataset, thus, the clusters do not get overly exaggerated and the minority class does not fall apart to the support regions of the clusters (see Fig. 3 for an illustration of the deteriorating effect of overly exaggerated clusters). In the case of high IR, clustering techniques are likely to underperform other techniques as generating too many samples within small clusters deteriorates the generalization of the minority class by overemphasizing the clusters as distinct positive regions.

Finally, CBSO shows an increased performance when the number of attributes gets high.

Fig. 3. An imbalanced dataset with the relatively low imbalance rate IR=2 and two noisy majority samples (a). A clustering based oversampler seeking for 2
clusters can identify the top 4 and bottom 4 minority samples as clusters. Balancing the dataset by generating samples within the clusters keeps the common sense
classification boundary depicted and classifies the test point (green triangle) as a minority sample (b). If the imbalance rate is extreme, balancing the dataset overly
exaggerates the two clusters, makes classifiers learn the clusters and classify the test point as a majority sample (c).

CBSO carries out agglomerative clustering and does a SMOTE-like sampling within the clusters, but generates samples only in the neighborhoods of minority samples having majority neighbors. The increased performance can be attributed to two main factors. On the one hand, in high dimensional spaces more minority instances have majority neighbors, thus, CBSO is less likely to overemphasize small regions near the class boundary. On the other hand, the higher the dimension of the data is, the more similar the distances between points become [112], thus, agglomerative clustering is likely to identify a low number of clusters, and generating samples within the clusters leads to samples generated between relatively far instances, similarly to polynom-fit-SMOTE and ProWSyn.

3.4. The performance of oversamplers by operating principles

In this subsection, we try to gain some insight into the performance related to the various operating principles of oversampling techniques. Although an analysis like this has many pitfalls, as the effects of operating principles can hardly be separated from each other, we make an attempt to carry out a meaningful comparison. We decided to use the median scores within the categories as a measure to compare operating principles: they are not affected by extremely high and low performances, and the median score of many techniques implementing some principle can be expected to characterize the principle to some extent by smoothing other effects. The results are rendered in Table 6.

One can observe that the rankings by the AUC, G and F1 scores are highly similar, while P20 is slightly different from them: borderline techniques rank only 9th, but the ones using classifiers top the ranking. Disregarding these anomalies with P20, there seems to be an agreement on the performance of the various operating principles.

The fact that borderline and density based techniques top the rankings means that it is worth sampling near the decision boundary in general. Sampling near the class boundary naturally improves instance-based classification (like kNN and SVM), aids the proper partitioning of the space when DTs are used, and just like all oversamplers, it helps MLP as more minority samples are represented in the backpropagation gradient. From the overall top performers, ProWSyn, SMOBD, and Assembled-SMOTE implement density based oversampling.

Ordinary sampling performs better than component-wise sampling, and the principle of sampling by cloning is among the worst performers. As discussed before, component-wise sampling makes stronger, while cloning makes milder assumptions on the distribution of the data than ordinary sampling. The consequence to draw is that ordinary sampling makes the right compromise between introducing variance and staying close to the original distribution.

One interesting result is that the category application is close to the top performers, indicating that oversamplers developed for specific applications are worth considering for other applications.

The moderate performance of clustering based techniques is explained by the issues already discussed: sample generation within clusters might overemphasize clusters when the imbalance ratio is high and deteriorate the subsequent classification results. Similarly to clustering, the moderate performance of density estimation and dimension reduction based techniques indicates that in some cases they work well, but the performance drops in oversampler-specific edge cases. Consequently, the automated application of techniques like these necessitates the careful monitoring of results.

Finally, the techniques using memetic algorithms give the worst performance, and they are also responsible for the low scores of the techniques using classifiers. We emphasize that these techniques have the highest number of parameters, which would necessitate a more careful model selection than in other cases. The proper, manual tuning of parameters might yield better results; however, when machine learning needs to be done at scale to process many datasets in an automated manner, techniques requiring manual tuning might be unfavorable.

At this point, we revisit the anomalies in the results of the P20 measure: the extremely high performance of classifier based techniques and the low performance of borderline techniques need explanation. Unlike the other measures, P20 does not take into consideration the classification performance on majority samples. One can readily see that the oversamplers using classifiers to remove samples with low positive class probabilities (like SMOTE-IPF and Lee) optimize exactly the P20 measure: they keep minority samples belonging to the minority class with high probabilities. Due to the relatively large number of techniques like these having the same high P20 scores, the performance of memetic optimization based methods (also using classifiers) is hidden by the median score of the category, and consequently, the classifier based techniques top the P20 ranking. On the other hand, the low P20 performance of borderline techniques can be explained by the fact that generating too many samples near the decision boundary might turn truly majority samples into minority ones with high probability, hence reducing the P20 scores in the category. Finally, we highlight that the P20 score might be a good choice to optimize the parameters of one specific oversampler, but care must be taken when conclusions are drawn from P20 scores, as the results depend on the operating principles.

3.5. Runtimes

Runtime is another important feature of oversampling techniques, as some applications might require rapid retraining, thus, rapid oversampling. Given the large number of algorithms and parameters affecting time complexities, deriving and reporting time complexities in big-O notation is beyond the scope of this study. However, we expect that the average runtimes of oversamplers (rendered in Table 7) can still provide meaningful insight into their time efficiency; nevertheless, we emphasize that runtimes always depend on the implementations.

Table 5
Comparison of the performance of oversampling techniques on various types of datasets.
Sampler Overall rank IR > 9 IR ≤ 9 N+ > 30 N+ ≤ 30 d > 10 d ≤ 10
1 polynom-fit-SMOTE [44] 1 1 1 3 1 1 1
2 ProWSyn [92] 2 2 2 1 3 2 2
3 SMOTE-IPF [59] 3 3 4 2 6 5 3
4 Lee [67] 4 4 4 4 4 4 4
5 SMOBD [62] 5 5 8 5 7 6 8
6 G-SMOTE [43] 6 8 9 13 8 7 10
7 CCR [99] 7 6 21 14 9 20 7
8 LVQ-SMOTE [98] 8 7 16 21 2 19 5
9 Assembled-SMOTE [102] 9 9 10 16 5 9 6
10 SMOTE-TomekLinks [32] 10 12 14 10 15 12 13
11 SMOTE [12] 11 12 26 8 18 12 17
12 Random-SMOTE [74] 12 14 26 8 19 15 16
13 CE-SMOTE [58] 14 10 36 26 10 8 27
14 SMOTE-Cosine [31] 15 18 14 6 21 28 9
15 Selected-SMOTE [31] 19 21 12 8 23 10 20
16 Supervised-SMOTE [49] 22 24 6 12 28 18 21
17 CBSO [70] 26 31 28 22 33 3 34
18 cluster-SMOTE [5] 26 29 7 20 31 29 24
19 NEATER [38] 41 42 3 48 40 46 20

Table 6
Comparison of operating principles.
Rank Attribute AUC Attribute G Attribute F1 Attribute P20
1 Borderline .8992 Ordinary sampling .8626 Ordinary sampling .6799 Uses classifier .9903
2 Ordinary sampling .8985 Borderline .8617 Application .6765 Density based .9901
3 Density based .8980 Density based .8608 Uses clustering .6762 Ordinary sampling .9899
4 Application .8954 Application .8578 Density based .6759 Componentwise sampling .9896
5 Density estimation .8947 Componentwise sampling .8443 Borderline .6757 Application .9895
6 Componentwise sampling .8944 Dimension reduction .8442 Uses classifier .6701 Density estimation .9895
7 Dimension reduction .8926 Uses clustering .8415 Density estimation .6694 Dimension reduction .9874
8 Uses clustering .8913 Uses classifier .8322 Componentwise sampling .6671 Uses clustering .9845
9 Uses classifier .8898 Density estimation .8258 Dimension reduction .6562 Borderline .9652
10 Sampling by cloning .8793 Memetic .8206 Sampling by cloning .6496 Sampling by cloning .6671
11 Memetic .8605 Sampling by cloning .8101 Memetic .6435 Memetic .6286

Table 7
Average runtime of oversampling techniques in seconds.
Oversampler Time Oversampler Time Oversampler Time Oversampler Time
SPY 0.11 ANS 0.64 SMOTE_Cosine 2.01 PDFOS 15.14
OUPS 0.17 MSMOTE 0.73 kmeans_SMOTE 2.44 KernelADASYN 17.87
SMOTE_D 0.21 Safe_Level_SMOTE 0.79 MWMOTE 2.45 G_SMOTE 19.24
NT_SMOTE 0.21 SMOBD 0.80 SMOTE_FRST_2T 2.59 SVM_balance 22.97
Gazzah 0.21 CBSO 0.82 SMOTE_ENN 2.76 E_SMOTE 26.13
ROSE 0.26 Assembled_SMOTE 0.82 A_SUWO 2.81 SUNDO 26.21
Borderline_SMOTE1 0.28 SDSMOTE 0.88 RWO_sampling 2.92 GASMOTE 31.39
Borderline_SMOTE2 0.30 Edge_Det_SMOTE 0.94 SMOTE_IPF 3.82 DEAGO 38.52
ISMOTE 0.30 SMOTE_TomekLinks 0.99 ADOMS 3.89 SMOTE_PSO 45.12
SMMO 0.32 ProWSyn 1.00 Lee 4.16 NEATER 75.59
SMOTE_OUT 0.38 Stefanowski 1.04 cluster_SMOTE 4.19 IPADE_ID 90.01
SN_SMOTE 0.44 AND_SMOTE 1.14 SOMO 4.31 DSMOTE 146.73
NDO_sampling 0.45 DBSMOTE 1.18 DE_oversampling 4.68 MOT2LD 149.42
SMOTE 0.46 polynom_fit_SMOTE 1.18 CCR 4.72 Supervised_SMOTE 195.75
Selected_SMOTE 0.47 ASMOBD 1.19 SMOTE_RSB 5.12 SSO 215.28
distance_SMOTE 0.47 MDO 1.19 V_SYNTH 5.23 SMOTE_PSOBAT 331.32
Gaussian_SMOTE 0.49 SOI_CJ 1.24 NRSBoundary_SMOTE 5.27 DSRBF 383.36
MCT 0.52 LN_SMOTE 1.27 AHC 5.28 ADG 493.64
Random_SMOTE 0.58 VIS_RST 1.35 LVQ_SMOTE 7.00 AMSCO 659.02
ADASYN 0.59 TRIM_SMOTE 1.37 ISOMAP_Hybrid 7.01
SL_graph_SMOTE 0.59 NRAS 1.56 CE_SMOTE 7.45
CURE_SMOTE 0.59 LLE_SMOTE 1.63 MSYN 13.96
According to the expectations, memetic oversamplers have the longest runtimes, as they carry out an expensive optimization driven by cross-validation. All the top performers discussed in the previous subsections have moderate runtimes. From the overall top 10 performers (Table 4), the fastest technique is SMOBD, being 50% quicker than the overall top performer polynom-fit-SMOTE; however, SMOBD is still 8 times slower on average than the fastest technique, SPY. The consequence we can draw is that the best performing techniques are definitely not the computationally most intensive ones; however, if extremely fast techniques are needed, oversamplers up to 5–10 times faster than the top performers are available at the price of a decrease in quality.
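Such compromises can also be assessed on one's own hardware with a few lines of code. The sketch below assumes the interface of the smote_variants package shared with this paper (oversampler classes exposing a sample(X, y) method) and uses a synthetic toy problem as a placeholder for a real dataset:

    import time
    import smote_variants as sv
    from sklearn.datasets import make_classification

    # small imbalanced toy problem standing in for a real application
    X, y = make_classification(n_samples=1000, n_features=10,
                               weights=[0.9, 0.1], random_state=5)

    for oversampler in [sv.SPY(), sv.SMOBD(), sv.polynom_fit_SMOTE()]:
        start = time.time()
        X_samp, y_samp = oversampler.sample(X, y)  # assumed interface
        print(type(oversampler).__name__, time.time() - start)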
4. Discussion and conclusions

The primary goal of this study is to carry out a detailed comparison of 85 oversampling techniques involving 104 imbalanced datasets. Further goals are gaining insight into the performance of oversamplers on various types of datasets, and into the performance of various operating principles. The key findings and contributions of the paper can be summarized as follows:

1. We have determined the oversamplers providing the best performance with respect to 4 commonly used classification techniques and 4 performance measures of imbalanced learning. The results summarized in Table 3 can be used as a reference to select the most suitable oversamplers for existing machine learning pipelines.

2. Comparing the results of oversamplers to the baseline of no oversampling in Table 3 justifies and confirms that oversampling is a reasonable technique to improve the classification performance on imbalanced datasets.

3. We have identified the oversampling techniques giving the best performance irrespective of classifier types and performance measures. The results in Table 4 suggest that polynom-fit-SMOTE, ProWSyn and SMOTE-IPF generate high-quality samples; these techniques are recommended for application to data with unknown or continuously changing characteristics.

4. Comparing the performance of oversamplers on various types of datasets (see Table 5) shows that some relatively simple techniques provide highly reliable performance regardless of the characteristic properties of the datasets. The performance of math-intensive techniques is likely to fail in edge cases, for example, at extreme imbalance rates.
5. Comparing the performances related to the various operating principles (see Table 6) suggests that the most successful principles are ordinary sampling (sampling line segments between neighboring instances) and borderline/density based sampling (generating more instances near the class boundaries). We note that the common sampling strategy of the overall top performers (sampling along line segments between relatively far minority samples) does not fall into any of the introduced categories. These findings suggest a promising research direction: efficient general purpose oversampling techniques should sample the line segments connecting relatively far instances and generate more instances near the borderline; a minimal sketch of this idea is given after the list.
6. Although the runtimes shared in Table 7 are related to our implementation, they still provide insight into the time efficiency of the various oversamplers and enable making compromises between performance and speed in time-critical applications.
7. The improvements achieved by advanced oversamplers compared to SMOTE are less significant than the improvements achieved by any reliable oversampler compared to neglecting oversampling. The performances of the advanced oversampling techniques seem to level out when averaged over 104 datasets. We can draw the consequence that no oversampler can provide truly outstanding performance on every dataset; thus, oversamplers need to be designed to operate on datasets with certain intrinsic characteristics, and tests need to be provided to check if a dataset is suitable to be oversampled by a particular technique. For example, for cluster-based oversamplers it would be reasonable to select a lower limit on the information-theoretic gain of clustering which ensures that clusters are really present in the data and cluster-based oversampling is applicable.
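The sketch referred to in item 5 is given below; it is our illustration of the suggested direction under simplifying assumptions (a Euclidean feature space and a simple distance-based acceptance rule), not an algorithm taken from the literature:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sketch_oversample(X_min, X_maj, n_to_sample, random_state=7):
        # interpolate between arbitrary (hence often distant) minority
        # pairs, and prefer candidates whose nearest majority neighbor
        # is close, i.e., candidates near the class boundary
        rng = np.random.default_rng(random_state)
        nn_maj = NearestNeighbors(n_neighbors=1).fit(X_maj)
        samples = []
        while len(samples) < n_to_sample:
            i, j = rng.choice(len(X_min), size=2, replace=False)
            candidate = X_min[i] + rng.random() * (X_min[j] - X_min[i])
            dist, _ = nn_maj.kneighbors(candidate.reshape(1, -1))
            if rng.random() < 1.0 / (1.0 + dist[0, 0]):  # borderline bias
                samples.append(candidate)
        return np.vstack(samples)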
An interesting observation is that none of the recently published techniques have been compared to any of the top performers in the corresponding papers. The comparison of new techniques to a conventional subset of historical ones holds back development in the field, as the conventional techniques become more and more acknowledged while well-performing but less popular recent algorithms are forgotten.

The comparison of 85 algorithms is a challenging task, and the results suffer from several limitations: (a) reproducing algorithms from research papers is prone to errors; (b) finding a well-performing set of parameters for each database is challenging — improved model selection might slightly change the results, especially for oversamplers with many hyperparameters; (c) although we have involved 104 datasets in the experiments, application specific datasets might have highly different characteristics, thus, proper oversampler and model selection is still recommended. Nevertheless, we believe that all the conclusions we drew are valid in the scope of the paper, and the results might help both theorists and practitioners to develop the field.

For reproducibility, the foldings of the datasets, the implementation of the oversamplers and all the evaluation scripts have been shared at http://github.com/gykovacs/smote_variants. Additionally, all numerical results have been shared in the same repository to enable further analysis by the community.
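As a minimal usage example (assuming the interface documented in the repository, with a placeholder dataset), an oversampler can be applied as follows:

    import smote_variants as sv
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)

    # each evaluated technique is assumed to be exposed as a class with
    # a uniform sample(X, y) method returning the oversampled dataset
    oversampler = sv.ProWSyn()
    X_samp, y_samp = oversampler.sample(X, y)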
Future steps include the development of general purpose oversampling techniques based on the consequences drawn in this study, and the evaluation of classifier ensembles where the classifiers are trained on oversampled datasets obtained by different oversampling techniques.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105662.

References

[1] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[2] H. Yu, J. Ni, J. Zhao, ACOSampling: An ant colony optimization based undersampling method for classifying imbalanced DNA microarray data, Neurocomputing 101 (2013) 309–318.
[3] M. Al-Khaldy, C. Kambhampati, Resampling imbalanced class and the effectiveness of feature selection methods for heart failure dataset, Int. Robotics Autom. J. 4 (2) (2018) 1–10.
[4] Y. Wang, S. Sun, J. Zhong, An ensemble anomaly detection with imbalanced data based on robot vision, Int. J. Robot. Autom. 31 (2) (2016) 1–7.
[5] D.A. Cieslak, N.V. Chawla, A. Striegel, Combating imbalance in network intrusion datasets, in: 2006 IEEE International Conference on Granular Computing, 2006, pp. 732–737.
[6] X.J. Zhang, Z. Tari, M. Cheriet, KRNN: k rare-class nearest neighbor classification, Pattern Recognit. 62 (2) (2017) 33–44.
[7] Z. Qi, Y. Tian, Y. Shi, X. Yu, Cost-sensitive support vector machine for semi-supervised learning, Procedia Comput. Sci. 18 (2013) 1684–1689.
[8] S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms, ACM Comput. Surv. 45 (2) (2013) 16:1–16:35.
[9] M. Kukar, I. Kononenko, Cost-sensitive learning with neural networks, in: Proceedings of the 13th European Conference on Artificial Intelligence, ECAI-98, John Wiley and Sons, 1998, pp. 445–449.
[10] Y. Li, X. Zhang, Improving k nearest neighbor with exemplar generalization for imbalanced classification, in: PAKDD 2011, 2011, pp. 1–12.
[11] Z. László, L. Török, G. Kovács, Improving the performance of the k rare class nearest neighbor classifier by the ranking of point patterns, in: Proc. of Foundations of Information and Knowledge Systems, 2018, pp. 265–283.
[12] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artificial Intelligence Res. 16 (2002) 321–357.
[13] N.W. Chawla, Data mining for imbalanced datasets: an overview, in: Data Mining and Knowledge Discovery Handbook, Springer, 2010, pp. 875–886.
[14] T. Raeder, G. Forman, N.V. Chawla, Learning from imbalanced data: Evaluation matters, in: Data Mining: Foundations and Intelligent Paradigms: Volume 1: Clustering, Association and Classification, Springer Berlin Heidelberg, 2012, pp. 315–331.
[15] V. Lopez, A. Fernandez, S. Garcia, V. Palade, F. Herrera, An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[16] V. Lopez, A. Fernandez, F. Herrera, On the importance of the validation technique for classification with imbalanced datasets: Addressing covariate shift when data is skewed, Inf. Sci. 257 (2014) 1–13.
[17] T.R. Hoens, N.V. Chawla, Imbalanced datasets: From sampling to classifiers, in: Imbalanced Learning, John Wiley & Sons, Ltd, 2013, pp. 43–59, chapter 3.
[18] A. Fernandez, S. Garcia, F. Herrera, N.V. Chawla, SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary, J. Artificial Intelligence Res. 61 (2018) 863–905.
[19] D.A. van Dyk, X.-L. Meng, The art of data augmentation, J. Comput. Graph. Statist. 10 (1) (2001) 1–50.
[20] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning, in: Advances in Intelligent Computing, Springer Berlin Heidelberg, 2005, pp. 878–887.
[21] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: Proc. of IJCNN, 2008, pp. 1322–1328.
[22] A. Gosain, S. Sardana, Handling class imbalance problem using oversampling techniques: A review, in: 2017 International Conference on Advances in Computing, Communications and Informatics, ICACCI, 2017, pp. 79–85.
[23] G. Lemaitre, F. Nogueira, C.K. Aridas, Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning, J. Mach. Learn. Res. 18 (1) (2017) 1–5.
[24] Ş. Ertekin, Adaptive oversampling for imbalanced data classification, in: Information Sciences and Systems 2013, Springer International Publishing, 2013, pp. 261–269.
[25] P. Cao, X. Liu, J. Zhang, D. Zhao, M. Huang, O. Zaiane, ℓ2,1 norm regularized multi-kernel based joint nonlinear feature selection and over-sampling for imbalanced data classification, Neurocomputing 234 (2017) 38–57.
[26] M. Zieba, J.M. Tomczak, A. Gonczarek, RBM-SMOTE: Restricted Boltzmann machines for synthetic minority oversampling technique, in: Intelligent Information and Database Systems, Springer International Publishing, 2015, pp. 377–386.
[27] B. Das, N.C. Krishnan, D.J. Cook, RACOG and wRACOG: Two probabilistic oversampling techniques, IEEE Trans. Knowl. Data Eng. 27 (1) (2015) 222–234.
[28] G. Douzas, F. Bacao, Effective data generation for imbalanced learning using conditional generative adversarial networks, Expert Syst. Appl. 91 (C) (2018) 464–471.
[29] H. Zhang, Z. Wang, A normal distribution-based over-sampling approach to imbalanced data classification, in: Advanced Data Mining and Applications, Springer Berlin Heidelberg, 2011, pp. 83–96.
[30] H. Zhang, M. Li, RWO-Sampling: A random walk over-sampling approach to imbalanced data classification, Inf. Fusion 20 (2014) 99–116.
[31] F. Koto, SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: An enhancement strategy to handle imbalance in data level, in: 2014 Int. Conf. on Advanced Computer Science and Information System, 2014, pp. 280–284.
[32] G.E.A.P.A. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor. Newsl. 6 (1) (2004) 20–29.
[33] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 405–425.
[34] M. Gao, X. Hong, S. Chen, C.J. Harris, E. Khalaf, PDFOS: PDF estimation based over-sampling for imbalanced two-class problems, Neurocomputing 138 (2014) 248–259.
[35] G. Cohen, M. Hilario, H. Sax, S. Hugonnet, A. Geissbuhler, Learning from imbalanced data in surveillance of nosocomial infection, Artif. Intell. Med. 37 (1) (2006) 7–18.
[36] V. Lopez, I. Triguero, C.J. Carmona, S. Garcia, F. Herrera, Addressing imbalanced classification with instance generation techniques: IPADE-ID, Neurocomputing 126 (2014) 15–28.
[37] J. Wang, M. Xu, H. Wang, J. Zhang, Classification of imbalanced data by using the smote algorithm and locally linear embedding, in: 2006 8th International Conference on Signal Processing, vol. 3, 2006, pp. 1–9.
[38] B.A. Almogahed, I.A. Kakadiaris, NEATER: Filtering of over-sampled data using non-cooperative game theory, in: 22nd International Conference on Pattern Recognition, 2014, pp. 1371–1376.
[39] J. de la Calleja, O. Fuentes, A distance-based over-sampling method for learning from imbalanced data sets, in: Proceedings of the Twentieth International Florida Artificial Intelligence, vol. 3, 2007, pp. 634–635.
[40] K. Li, W. Zhang, Q. Lu, X. Fang, An improved SMOTE imbalanced data classification method based on support degree, in: 2014 International Conference on Identification, Information and Knowledge in the Internet of Things, 2014, pp. 34–38.
[41] S. Mahmoudi, P. Moradi, F. Akhlaghian, R. Moradi, Diversity and separable metrics in over-sampling technique for imbalanced data classification, in: 4th International Conference on Computer and Knowledge Engineering, 2014, pp. 152–158.
[42] J. de la Calleja, O. Fuentes, J. González, Selecting minority examples from misclassified data for over-sampling, in: Proc. of the 21st Int. Florida Artificial Intelligence Research Society Conference, 2008, pp. 276–281.
[43] T. Sandhan, J.Y. Choi, Handling imbalanced datasets by partially guided hybrid sampling for pattern recognition, in: 22nd International Conference on Pattern Recognition, 2014, pp. 1449–1453.
[44] S. Gazzah, N.E.B. Amara, New oversampling approaches based on polynomial fitting for imbalanced data sets, in: 2008 the Eighth IAPR International Workshop on Document Analysis Systems, 2008, pp. 677–684.
[45] Y.H. Xu, H. Li, L.P. Le, X.Y. Tian, Neighborhood triangular synthetic minority over-sampling technique for imbalanced prediction on small samples of Chinese tourism and hospitality firms, in: 7th Int. Joint Conf. on Computational Sciences and Optimization, 2014, pp. 534–538.
[46] J. Stefanowski, S. Wilk, Selective pre-processing of imbalanced data for improving classification performance, in: Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, DaWaK ’08, Springer-Verlag, 2008, pp. 283–292.
[47] T. Rong, H. Gong, W.W.Y. Ng, Stochastic sensitivity oversampling technique for imbalanced data, in: Machine Learning and Cybernetics, Springer Berlin Heidelberg, 2014, pp. 161–171.
[48] S. Tang, S. Chen, The generation mechanism of synthetic minority class examples, in: 2008 International Conference on Information Technology and Applications in Biomedicine, 2008, pp. 444–447.
[49] J. Hu, X. He, D.-J. Yu, X.-B. Yang, J.-Y. Yang, H.-B. Shen, A new supervised over-sampling algorithm with application to protein-nucleotide binding residue prediction, PLoS One 9 (9) (2014) 1–10.
[50] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-Level-SMOTE: Safe-Level-synthetic minority over-sampling technique for handling the class imbalanced problem, in: Proc. of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 2009, pp. 475–482.
[51] C. Bellinger, N. Japkowicz, C. Drummond, Synthetic oversampling for advanced radioactive threat detection, in: IEEE 14th International Conference on Machine Learning and Applications, 2015, pp. 948–953.
[52] S. Hu, Y. Liang, L. Ma, Y. He, MSMOTE: Improving classification performance when training data is imbalanced, in: Proc. of the 2nd International Workshop on Computer Science and Engineering, vol. 2, 2009, pp. 13–17.
[53] S. Gazzah, A. Hechkel, N.E.B. Amara, A hybrid sampling method for imbalanced data, in: IEEE 12th International Multi-Conference on Systems, Signals Devices, 2015, pp. 1–6.
[54] Q. Gu, Z. Cai, L. Zhu, Classification of imbalanced data sets by using the hybrid re-sampling algorithm based on Isomap, in: Proceedings of the 4th International Symposium on Advances in Computation and Intelligence, ISICA ’09, Springer-Verlag, 2009, pp. 287–296.
[55] L. Jiang, C. Qiu, C. Li, A novel minority cloning technique for cost-sensitive learning, Int. J. Pattern Recognit. Artif. Intell. 29 (2015) 1551004.
[56] L. Chen, Z. Cai, L. Chen, Q. Gu, A novel differential evolution-clustering hybrid resampling algorithm on imbalanced datasets, in: 3rd International Conference on Knowledge Discovery and Data Mining, 2010, pp. 81–85.
[57] A. Pourhabib, B.K. Mallick, Y. Ding, Absent data generating classifier for imbalanced class sizes, J. Mach. Learn. Res. 16 (2015) 2695–2724.
[58] S. Chen, G. Guo, L. Chen, A new over-sampling method based on cluster ensembles, in: 2010 IEEE 24th International Conference on Advanced Information Networking and Applications Workshops, 2010, pp. 599–604.
[59] J.A. Sáez, J. Luengo, J. Stefanowski, F. Herrera, SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Inf. Sci. 291 (2015) 184–203.
[60] Y. Kang, S. Won, Weight decision algorithm for oversampling technique on class-imbalanced learning, in: ICCAS 2010, 2010, pp. 182–186.
[61] B. Tang, H. He, KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning, in: IEEE Congress on Evolutionary Computation, 2015, pp. 664–671.
[62] Q. Cao, S. Wang, Applying over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning, in: International Conference on Information Management, Innovation Management and Industrial Engineering, vol. 2, 2011, pp. 543–548.
[63] Z. Xie, L. Jiang, T. Ye, X. Li, A synthetic minority oversampling method based on local densities in low-dimensional space for imbalanced learning, in: Database Systems for Advanced Applications, Springer International Publishing, 2015, pp. 3–18.
[64] S. Cateni, V. Colla, M. Vannucci, Novel resampling method for the classification of imbalanced datasets for industrial and other real-world problems, in: 2011 11th International Conference on Intelligent Systems Design and Applications, 2011, pp. 402–407.
[65] W.A. Young, S.L. Nykl, G.R. Weckman, D.M. Chelberg, Using Voronoi diagrams to improve classification performances when modeling imbalanced datasets, Neural Comput. Appl. 26 (5) (2015) 1041–1054.
[66] X. Fan, K. Tang, T. Weise, Margin-based over-sampling method for learning from imbalanced datasets, in: Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2011, pp. 309–320.
[67] J. Lee, N. Kim, J.-H. Lee, An over-sampling technique with rejection for imbalanced class learning, in: Proceedings of the 9th International Conference on Ubiquitous Information Management and Communication, IMCOM ’15, ACM, New York, NY, USA, 2015, pp. 102:1–102:6.
[68] T. Maciejewski, J. Stefanowski, Local neighbourhood extension of SMOTE for mining imbalanced data, in: 2011 IEEE Symposium on Computational Intelligence and Data Mining, CIDM, 2011, pp. 104–111.
[69] X.T. Dang, D.H. Tran, O. Hirose, K. Satou, SPY: A novel resampling method for improving classification performance in imbalanced data, in: 2015 Seventh International Conference on Knowledge and Systems Engineering, KSE, 2015, pp. 280–285.
[70] S. Barua, M.M. Islam, K. Murase, A novel synthetic minority oversampling technique for imbalanced data set learning, in: Neural Information Processing, Springer Berlin Heidelberg, 2011, pp. 735–744.
[71] J. Li, S. Fong, Y. Zhuang, Optimizing SMOTE by metaheuristics with neural network and decision tree, in: 2015 3rd International Symposium on Computational and Business Intelligence, ISCBI, 2015, pp. 26–32.
[72] T. Deepa, M. Punithavalli, An E-SMOTE technique for feature selection in high-dimensional imbalanced dataset, in: 2011 3rd International Conference on Electronics Computer Technology, vol. 2, 2011, pp. 322–324.
[73] W.A. Rivera, P. Xanthopoulos, A priori synthetic over-sampling methods for increasing classification sensitivity in imbalanced data sets, Expert Syst. Appl. 66 (2016) 124–135.
[74] Y. Dong, X. Wang, A new over-sampling approach: Random-SMOTE for learning from imbalanced data sets, in: Knowledge Science, Engineering and Management, Springer Berlin Heidelberg, 2011, pp. 343–352.
[75] F.R. Torres, J.A. Carrasco-Ochoa, J.F. Martínez-Trinidad, SMOTE-D a deterministic version of smote, in: MCPR2016: Pattern Recognition, 2016, pp. 177–188.
[76] L. Zhang, W. Wang, A re-sampling method for class imbalance learning with credit data, in: International Conference of Information Technology, Computer Engineering and Management Sciences, vol. 1, 2011, pp. 393–397.
[77] L. Abdi, S. Hashemi, To combat multi-class imbalanced problems by means of over-sampling techniques, IEEE Trans. Knowl. Data Eng. 28 (1) (2016) 238–251.
[78] F. Fernández-Navarro, C. Hervás-Martinez, P.A. Gutiérrez, A dynamic over-sampling procedure based on sensitivity for multi-class problems, Pattern Recognit. 44 (8) (2011) 1821–1833.
[79] K. Borowska, J. Stepaniuk, Imbalanced data classification: A novel re-sampling approach combining versatile improved SMOTE and rough sets, in: Computer Information Systems and Industrial Management, Springer Int. Publishing, 2016, pp. 31–42.
[80] M.A.H. Farquad, I. Bose, Preprocessing unbalanced data using support vector machine, Decis. Support Syst. 53 (1) (2012) 226–233.
[81] K. Jiang, J. Lu, K. Xia, A novel algorithm for imbalance data classification based on genetic algorithm improved SMOTE, Arab. J. Sci. Eng. 41 (8) (2016) 3255–3266.
[82] K. Puntumapon, K. Waiyamai, A pruning-based approach for searching precise and generalized region for synthetic minority over-sampling, in: Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2012, pp. 371–382.
[83] I. Nekooeimehr, S.K. Lai-Yuen, Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Syst. Appl. 46 (2016) 405–416.
[84] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory, Knowl. Inf. Syst. 33 (2) (2012) 245–265.
[85] E. Ramentol, I. Gondres, S. Lajes, R. Bello, Y. Caballero, C. Cornelis, F. Herrera, Fuzzy-rough imbalanced learning for the diagnosis of high voltage circuit breaker maintenance: The SMOTE-FRST-2T algorithm, Eng. Appl. Artif. Intell. 48 (2016) 134–139.
[86] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, DBSMOTE: Density-based synthetic minority over-sampling technique, Appl. Intell. 36 (3) (2012) 664–684.
[87] J. Yun, J. Ha, J.-S. Lee, Automatic determination of neighborhood size in SMOTE, in: Proc. of the 10th International Conference on Ubiquitous Information Management and Communication, 2016, pp. 100:1–100:8.
[88] S. Wang, Z. Li, W. Chao, Q. Cao, Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning, in: Int. Joint Conf. on Neural Networks, 2012, pp. 1–8.
[89] J. Cervantes, F. Garcia-Lamont, L. Rodriguez, A. López, J.R. Castilla, A. Trueba, PSO-based method for SVM classification on skewed data sets, Neurocomputing 228 (2017) 187–197.
[90] V. García, J.S. Sánchez, R. Martín-Félez, R.A. Mollineda, Surrounding neighborhood-based SMOTE for learning from imbalanced data sets, Prog. Artif. Intell. 1 (4) (2012) 347–362.
[91] L. Ma, S. Fan, CURE-SMOTE algorithm and hybrid algorithm for feature selection and parameter optimization based on random forests, BMC Bioinformatics 18 (169) (2017) 1–18.
[92] S. Barua, M.M. Islam, K. Murase, ProWSyn: Proximity weighted synthetic oversampling technique for imbalanced data set learning, in: Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, 2013, pp. 317–328.
[93] G. Douzas, F. Bacao, Self-organizing map oversampling (SOMO) for imbalanced data set learning, Expert Syst. Appl. 82 (2017) 40–52.
[94] C. Bunkhumpornpat, S. Subpaiboonkit, Safe level graph for synthetic minority over-sampling techniques, in: 13th International Symposium on Communications and Information Technologies, 2013, pp. 570–575.
[95] W.A. Rivera, Noise reduction a priori synthetic over-sampling for class imbalanced data sets, Inf. Sci. 408 (2017) 146–161.
[96] H. Feng, L. Hang, A novel boundary oversampling algorithm based on neighborhood rough set model: NRSBoundary-SMOTE, Math. Probl. Eng. (2013) 694809.
[97] H. Lee, J. Kim, S. Kim, Gaussian-based smote algorithm for solving skewed class distributions, Int. J. Fuzzy Log. Intell. Syst. 17 (2017) 229–234.
[98] M. Nakamura, Y. Kajiwara, A. Otsuka, H. Kimura, LVQ-SMOTE – learning vector quantization based synthetic minority over–sampling technique for biomedical data, BioData Min. 6 (16) (2013) 1–10.
[99] M. Koziarski, M. Wozniak, CCR: A combined cleaning and resampling algorithm for imbalanced data classification, Int. J. Appl. Math. Comput. Sci. 27 (2017) 727–736.
[100] I.A. Sanchez, E. Morales, J. Gonzalez, Synthetic oversampling of instances using clustering, Int. J. Artif. Intell. Tools 22 (2013) 1350008.
[101] W. Siriseriwan, K. Sinapiromsaran, Adaptive neighbor synthetic minority oversampling technique under 1NN outcast handling, in: Songklanakarin Journal of Science and Technology, vol. 39, 2017, pp. 565–576.
[102] B. Zhou, C. Yang, H. Guo, J. Hu, A quasi-linear SVM combined with assembled SMOTE for imbalanced data classification, in: The 2013 International Joint Conference on Neural Networks, IJCNN, 2013, pp. 1–7.
[103] J. Li, S. Fong, R.K. Wong, V.W. Chu, Adaptive multi-objective swarm fusion for imbalanced data classification, Inf. Fusion 39 (2018) 1–24.
[104] H. Li, P. Zou, X. Wang, R. Xia, A new combination sampling method for imbalanced data, in: Proc. of 2013 Chinese Intelligent Automation Conference, Springer Berlin Heidelberg, 2013, pp. 547–554.
[105] G. Douzas, F. Bacao, F. Last, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Inf. Sci. 465 (2018) 1–20.
[106] G. Menardi, N. Torelli, Training and assessing classification rules with imbalanced data, Data Min. Knowl. Discov. 28 (1) (2014) 92–122.
[107] J. Alcala-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. Garcia, L. Sanchez, F. Herrera, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, J. Mult.-Valued Logic Soft Comput. 17 (2–3) (2011) 255–287.
[108] D. Dheeru, E. Karra Taniskidou, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017, URL http://archive.ics.uci.edu/ml.
[109] X. Zhang, Y. Li, A positive-biased nearest neighbour algorithm for imbalanced classification, in: Proc. of PAKDD 2013, 2013, pp. 293–304.
[110] W. Liu, S. Chawla, Class confidence weighted kNN algorithms for imbalanced data sets, in: Proceedings of PAKDD 2011, 2011, pp. 354–356.
[111] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014, p. 410.
[112] H.-P. Kriegel, P. Kröger, A. Zimek, Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering, ACM Trans. Knowl. Discov. Data 3 (1) (2009) 1:1–1:58.