
Information Sciences 569 (2021) 508–526


Impact of resampling methods and classification models on the imbalanced credit scoring problems

Jin Xiao a, Yadong Wang a, Jing Chen b, Ling Xie c,*, Jing Huang d,*

a Business School, Sichuan University, Chengdu, PR China
b School of Management, Xi'an Jiaotong University, Xi'an, PR China
c School of Medical Information Engineering, Zunyi Medical University, Zunyi, PR China
d School of Public Administration, Sichuan University, Chengdu, PR China

Article history: Received 11 March 2020; Received in revised form 10 May 2021; Accepted 12 May 2021; Available online 15 May 2021

Keywords: Customer credit scoring; Imbalanced class distribution; Benchmark resampling methods; Benchmark classification models; Evaluation measures

Abstract: For imbalanced credit scoring, the most common solution is to balance the class distribution of the training set with a resampling method, and then train a classification model and classify the customer samples in the test set. However, it is still difficult to select the most appropriate resampling methods and classification models, and the optimal combinations of them have not been identified. Therefore, this study proposes a new benchmark models comparison framework for imbalanced credit scoring. In the framework, we introduce the index of balanced accuracy and four other evaluation measures, experimentally compare the performance of 10 benchmark resampling methods and nine benchmark classification models respectively on six credit scoring data sets, and analyze the optimal combinations of them. The experimental results show: (1) as for benchmark resampling methods, random under-sampling (a traditional resampling method) and the synthetic minority over-sampling technique combined with Wilson's edited nearest neighbor (an intelligent resampling method) present the best performance; (2) as for benchmark classification models, logistic regression (a single classification model) and adaptive boosting (an ensemble classification model) present the best performance; (3) as for optimal combinations, random under-sampling combined with random subspace (an ensemble classification model) can obtain the most satisfactory credit scoring performance.

© 2021 Elsevier Inc. All rights reserved.

1. Introduction

Credit risk is a crucial factor affecting the business performance of enterprises and banks [1,2]. More importantly,
extremely high credit risks may directly lead to the bankruptcy of enterprises or banks [3]. Hence, it is of substantial practical
significance to develop customer credit scoring models that can be used to predict possible credit default accurately and
effectively [2–4]. In essence, customer credit scoring is a binary classification problem. Specifically, according to the degree
of default risk, customers are classified into two types: customers with good credit and customers with bad credit. In recent
years, scholars have used different classification models to deal with this problem, including single classification models and
ensemble classification models. The commonly used single classification models in customer credit scoring include decision
tree (DT) [5], artificial neural network (ANN) [4], naïve Bayes (NB) [6], support vector machine (SVM) [7], logistic regression

* Corresponding authors.
E-mail addresses: xie_ling0101@126.com (L. Xie), totojh@scu.edu.cn (J. Huang).

https://doi.org/10.1016/j.ins.2021.05.029

(LR) [8] and so on. With further study, Hansen and Salamon [9] found that several base classifiers could be integrated to
reduce the classification error rate and improve model robustness. Since then, ensemble classification models have received
increasing attention. The commonly used ensemble classification models include Bagging [10], adaptive boosting (AdaBoost)
[11], random forest (RF) [12], random subspace (RSS) [13] and so on. Some studies have applied the ensemble classification
models to customer credit scoring, and achieved better performance than the single classification models [14,15]. For exam-
ple, Yu et al. [16] used the ensemble classification models to improve the customer credit scoring performance of ANN and
SVM respectively.
As research has progressed, it has become clear that customer credit scoring data sets typically have an imbalanced class distribution. Specifically, the number of customers with bad credit is far smaller than the number with good credit, which can
produce high overall classification accuracy but low classification accuracy for customers with bad
credit. In practice, misclassifying a customer with bad credit usually causes heavier economic losses [17]. To address
this problem, the most common solution is using a resampling method to balance the class distribution of the training set
and then training a classification model. Traditional resampling methods include random oversampling (ROS) and random
under-sampling (RUS). For example, Crone and Finlay [18] used ROS and RUS to balance the class distribution of training sets,
and then trained four single classification models (linear discriminant analysis (LDA), LR, DT, and ANN) respectively; finally, they made an empirical anal-
ysis on two customer credit scoring data sets and used the Gini coefficient to evaluate the performance of the models; the
experimental results showed that ROS was superior to RUS. Yu et al. [19] used ROS and RUS to balance the class distribution
of training sets respectively, and then built an ensemble classification model with SVM; finally, they made an empirical anal-
ysis on two customer credit scoring data sets, and used the true positive rate (TPR) and true negative rate (TNR) to evaluate
their performance; the experimental results showed that ROS could improve the performance of classification models more
significantly.
However, traditional resampling methods may have certain deficiencies because their sampling modes are extremely
simple. For example, RUS cannot select a sample according to its degree of importance, so the drawn sample is likely to miss
certain important information; ROS may copy the same sample repeatedly and add it to the original data set, so the classifier
may be over-fitted [20]. To improve the performance of traditional resampling methods, scholars have developed a few intel-
ligent resampling methods in recent years including cluster-based over-sampling (CBOS) [21], one-sided selection (OSS)
[22], synthetic minority over-sampling technique (SMOTE) [23], Wilson’s edited nearest neighbor [24] combined with
SMOTE (SMOTE + ENN) [25], adaptive synthetic sampling approach (ADASYN) [26], safe-level synthetic minority over-
sampling technique (SL-SMOTE) [27], majority weighted minority over-sampling technique (MWMOTE) [28] and so on.
For example, Marqués et al. [29] used nine resampling methods, such as SL-SMOTE, SMOTE, ROS, RUS, to balance the class
distribution of training sets respectively, and then trained two single classification models LR and SVM; finally, they made an
empirical analysis on five customer credit scoring data sets, and used the area under the receiver operating characteristic
curve (AUC) to evaluate the performance of models, and it showed that SMOTE was superior to RUS in most cases. García
et al. [30] used eight resampling methods, such as SMOTE + ENN, SMOTE, ROS, RUS, and OSS, to balance the class distribution
of training sets respectively, and then trained four single classification models including ANN and SVM; finally, they made an
empirical analysis on five credit scoring data sets, and used AUC to evaluate the performance of models, and it showed that
SMOTE provided the best performance.
The above works have made important contributions to the studies of customer credit scoring, but they still have the fol-
lowing limitations: (1) the general process of existing customer credit scoring is using the resampling method to balance the
customer credit scoring training set, and then training the classification model on the balanced training set. Therefore, the
choice of resampling method and classification model will affect the performance of customer credit scoring. However, in
customer credit scoring, most studies of comparing benchmark models only compare the performance of resampling meth-
ods or classification models, without considering the impact of both on customer credit scoring performance, which may
lead to biased research conclusions; (2) the existing studies rarely consider the optimal combination problem between
benchmark resampling methods and classification models, namely, which benchmark resampling methods and classification
models should be combined to obtain the best customer credit scoring performance?
In order to solve the above problems, this study proposes a new benchmark models comparison framework, which makes
a comprehensive comparison experiment on six customer credit scoring data sets to analyze the impact of benchmark
resampling methods and classification models on imbalanced customer credit scoring. First, we divide each data set into
the training set and the test set at random. Then, we use 10 benchmark resampling methods (no resampling as a control, two traditional resampling methods, and seven intelligent resampling methods) to balance the class distribution of each training set
respectively. Further, we train nine benchmark classification models (five single and four ensemble classification models) on the training set with balanced class distribution respectively and classify the customer samples in the test
set. Meanwhile, five evaluation measures including the index of balanced accuracy (IBA) [31] are introduced to evaluate the
classification results. Finally, according to the experimental results, we use non-parametric statistical analysis methods to
compare the performance of 10 benchmark resampling methods and nine benchmark classification models respectively,
and further analyze the optimal combination problem between them.
In a word, we summarize the contributions of this study as follows.
First, this study proposes a new benchmark models comparison framework for imbalanced customer credit scoring. In
this framework, we firstly compare the customer credit scoring performance of 10 benchmark resampling methods, then
compare the customer credit scoring performance of 9 benchmark classification models, and finally analyze the optimal

combinations between benchmark resampling methods and classification models. The framework is expected to obtain
more comprehensive experimental results.
Second, IBA is introduced into customer credit scoring as a performance evaluation measure. It overcomes the shortcoming that traditional evaluation measures cannot capture the impact of each class on the overall classification performance on data sets with an imbalanced class distribution, and it is combined with four traditional evaluation measures (AUC, TPR, F-measure, and G-mean) to evaluate the benchmark resampling methods and classification models for credit scoring.
Finally, the experimental results of this study are discussed in detail and some interesting conclusions are drawn. For
example, in the experiment of comparing benchmark resampling methods, we find that the traditional resampling method
RUS has better customer credit scoring performance than the intelligent resampling method SMOTE; in the experiment of
comparing benchmark classification models, we find that customer credit scoring performance of the single classification
model LR outperforms the ensemble classification model Bagging; in the experiment of analyzing optimal combinations,
we find that the traditional resampling method RUS may have the widest range of application, especially its combination
with the ensemble classification models may obtain better customer credit scoring performance than the intelligent resam-
pling methods. The conclusion of this study can provide a more comprehensive reference for other scholars or practitioners
in the field of financial risk management to choose the appropriate benchmark models when dealing with the problem of
imbalanced customer credit scoring.
The remainder of this study is organized as follows. We introduce the related preliminaries, including notations and definitions, the compared benchmark resampling methods, and the compared benchmark classification models, in Section 2. The experimental analysis frame-
work, data sets, parameter settings, and evaluation measures are described in Section 3. The detailed experimental results
and the in-depth discussions are presented in Section 4. Section 5 presents a summary of the entire study and indicates the
direction of future studies.

2. Preliminaries

2.1. Notations and definitions

To make the presentation of this study clearer, we give the mathematical notations and definitions used in the
following sections in Table 1.

Table 1
Notations and definitions.

S: the training set composed of m samples, S = {(x_i, y_i)}, i = 1, 2, ..., m
x_i: a sample in the n-dimensional feature space
y_i: the class label associated with x_i, y_i ∈ {0, 1}
S_min: the minority class sample subset, S_min ⊂ S
S_maj: the majority class sample subset, S_maj ⊂ S
N_maj: the number of samples in S_maj
N_min: the number of samples in S_min
E: the newly generated sample subset in the resampling process
E_maj: the majority class sample subset in E
E_min: the minority class sample subset in E
NN_k(x_i): the set of k-nearest neighbors of x_i
d(x_i, x_j): the Euclidean distance between x_i and x_j
x_new: a newly generated sample in the resampling process
δ: a random number
TP: the number of positive samples predicted as positive
TN: the number of negative samples predicted as negative
FP: the number of negative samples predicted as positive
FN: the number of positive samples predicted as negative
TPR: the proportion of TP in all positive samples
TNR: the proportion of TN in all negative samples
FPR: the proportion of FP in all negative samples
FNR: the proportion of FN in all positive samples
Precision: the proportion of TP in all the samples predicted as positive
Recall: equal to TPR
F-measure: the harmonic mean value of Precision and Recall
G-mean: the integrated value of TPR and TNR
AUC: the area under the receiver operating characteristic (ROC) curve
IBA_γ(G-mean): the improved (weighted) value of G-mean


2.2. Compared benchmark resampling methods

Resampling methods change the class distribution of a training set by transforming an imbalanced training set into a relatively balanced one. To find out which benchmark resampling methods are most appropriate for customer credit scoring,
we make a comprehensive experimental comparison between two traditional resampling methods (i.e., RUS and ROS)
and seven intelligent resampling methods (i.e., OSS [22], CBOS [21], SMOTE [23], SMOTE + ENN [25], ADASYN [26], SL-
SMOTE [27], and MWMOTE [28]). The detailed processes of benchmark resampling methods are described in Appendix A.
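For readers who want to experiment with this kind of preprocessing, the following sketch applies several of the compared methods with the open-source imbalanced-learn package; the toy data, the sampler settings, and the use of this particular package are illustrative assumptions, not the implementation used in this study.

```python
# A minimal sketch of the resampling step, assuming a toy feature matrix X and
# binary labels y (1 = bad-credit minority class). imbalanced-learn is used as
# one possible implementation of several of the compared methods; it is not the
# code used in this study.
import numpy as np
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler, OneSidedSelection
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN
from imblearn.combine import SMOTEENN

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                        # stand-in for a credit scoring training set
y = (rng.rand(1000) < 0.2).astype(int)         # roughly 1:4 class imbalance

samplers = {
    "RUS": RandomUnderSampler(random_state=0),
    "ROS": RandomOverSampler(random_state=0),
    "OSS": OneSidedSelection(random_state=0),
    "SMOTE": SMOTE(k_neighbors=5, random_state=0),
    "SMOTE+ENN": SMOTEENN(random_state=0),
    "ADASYN": ADASYN(n_neighbors=5, random_state=0),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)  # resampled (more balanced) training set
    print(name, Counter(y_res))
```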

2.3. Compared benchmark classification models

To find out which benchmark classification models are most appropriate for customer credit scoring, we make a compre-
hensive experimental comparison between five single classification models (i.e., NB [6], DT [5], ANN [4], SVM [7], and LR [8])
and four ensemble classification models (i.e., Adaboost [11], Bagging [10], RSS [13], and RF [12]). The detailed processes of
four ensemble classification models are described in Appendix B.
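A comparable sketch for the benchmark classification models is given below, using scikit-learn estimators as rough stand-ins for the Weka classifiers listed later in Section 3.3; this mapping is an assumption for illustration only (for example, REPTree base learners become CART trees, and the random subspace method is emulated with feature-subset bagging).

```python
# Approximate scikit-learn stand-ins for the five single and four ensemble
# benchmark classification models; parameter choices echo Table 3 only loosely.
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

single_models = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "ANN": MLPClassifier(max_iter=500),
    "SVM": SVC(kernel="poly", C=1.0),
    "LR": LogisticRegression(max_iter=1000),
}

ensemble_models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=40),
    "Bagging": BaggingClassifier(n_estimators=40),
    # Random subspace (RSS) approximated by bagging over random feature subsets.
    "RSS": BaggingClassifier(n_estimators=40, max_features=0.5, bootstrap=False),
    "RF": RandomForestClassifier(n_estimators=40),
}
```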

3. Experimental design

3.1. Experimental analysis framework

To compare the performance of various benchmark resampling methods and classification models, we propose an exper-
imental analysis framework illustrated in Fig. 1. Specifically, we use the five-fold cross-validation method to achieve more
objective comparison results in the experiment. As the figure shows, the experiment involves the following main tasks. First,
each dataset is divided equally into five subsets at random. During each experiment, one subset is selected as the test set, and
the remaining four subsets are selected as the training set. Second, we use benchmark resampling methods on Matlab 2014a to add/remove samples in each training set until a balanced class distribution is obtained (in the experiment, we retain the training set that is not processed by any resampling method, denoted as "NONE", and use it as the control group for comparison with the other benchmark resampling methods); this decision is motivated by the suggestion in [25].

Fig. 1. The benchmark models comparison framework.

Third, we import each balanced training set on Weka 3.5.8 to train nine different benchmark classification models respec-
tively, and classify the customer samples in the test set. Further, we introduce five evaluation measures IBA_γ(G-mean), TPR,
F-measure, G-mean, and AUC to evaluate the classification results. In Fig. 1, Result n-m represents the evaluation measure
value of the classification results obtained by the combination of the m-th (m = 1, 2, ..., 10) benchmark resampling method
and the n-th (n = 1, 2, ..., 9) benchmark classification model. Among them, each Result n-m contains five evaluation measure
values. For example, Result 1-1 contains Result 1-1-1 (IBA_γ(G-mean)), Result 1-1-2 (TPR), Result 1-1-3 (F-measure), Result
1-1-4 (G-mean), and Result 1-1-5 (AUC). The above process is repeated five times to ensure that each subset is used as the
test set one time, and the entire process undergoes a five-fold cross-validation. For each data set, we perform five-fold cross-
validation 10 times for each experiment. The final results under each circumstance are all average values obtained through
five-fold cross-validation 10 times. Therefore, the experiment is performed 27,000 times (10 × 9 × 6 × 5 × 10) in total, and it
is performed on the Windows 10 64-bit system with an Intel(R) Core(TM) i3 processor. Finally, we use non-parametric sta-
tistical analysis methods to compare the performance of benchmark resampling methods and benchmark classification mod-
els respectively, and find out the optimal combinations between them.
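The loop structure of this procedure can be sketched as follows, assuming hypothetical `samplers`, `models`, and `evaluate` objects (for example, those sketched in Section 2); it is a simplified reconstruction of the process described above, not the authors' code.

```python
# Structural sketch of the comparison experiment: 10 repetitions of five-fold
# cross-validation, resampling applied only to the training folds, and every
# resampler x classifier combination evaluated on the untouched test fold.
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def run_comparison(X, y, samplers, models, evaluate, n_repeats=10, n_folds=5):
    results = {}                                      # (sampler, model) -> list of metric dicts
    for repeat in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=repeat)
        for train_idx, test_idx in cv.split(X, y):
            X_tr, y_tr = X[train_idx], y[train_idx]
            X_te, y_te = X[test_idx], y[test_idx]
            for s_name, sampler in samplers.items():
                # "NONE" keeps the imbalanced training folds as the control group
                if sampler is None:
                    X_bal, y_bal = X_tr, y_tr
                else:
                    X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
                for m_name, model in models.items():
                    clf = clone(model).fit(X_bal, y_bal)
                    y_pred = clf.predict(X_te)
                    results.setdefault((s_name, m_name), []).append(evaluate(y_te, y_pred))
    return results
```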

3.2. The data sets

The benchmark models comparison experiment in this study is made on six customer credit scoring data sets. Table 2
presents the profiles of the six data sets; all of them are imbalanced. Each data set is composed of two classes: customers with good credit and customers with bad credit. Sourced from the UCI database (http://
www.ics.uci.edu/~mlearn/MLRepository.html), the German data set describes the German customer credit scoring prob-
lem and is composed of 1,000 customer samples, each of which contains 20 features, including 7 quantitative features and 13
qualitative features. Also sourced from the UCI database, the Australia data set describes the Australian customer credit scoring
problem and is composed of 690 customer samples, each of which contains 14 features, including 6 quantitative features and
8 qualitative features. The UK-Thomas data set describes the United Kingdom customer credit scoring problem [32] and is
composed of 1,225 customer samples, each of which contains 14 features, including 9 quantitative features and 5 qualitative
features. The Give-credit data set is acquired from the "Give me some credit" contest held in 2011 (www.kaggle.com). In this
study, we remove the customer samples with missing values from the Give-credit data set, and the processed data set is
composed of 120,269 customer samples, each of which contains 10 features, including 5 quantitative features and 5 quali-
tative features. The PAKDD2009 data set (abbreviated as the PAKDD) is acquired from the data mining contest of the Pacific-
Asia Conference on Knowledge Discovery and Data Mining held in 2009 [33]. The PAKDD data set describes the customer
credit scoring problem of a Brazilian chain retail shop. In this study, we only use the data subset from the PAKDD for mod-
eling. After the data subset is subjected to data cleaning, the processed data subset is composed of 49,904 customer samples,
each of which contains 20 features, including 11 quantitative features and 9 qualitative features. The IFCD data set is
acquired from an internet finance company in China. It describes the customer credit scoring problem about resident per-
sonal loans. In this study, we remove noise samples and features that contain many missing values from the IFCD data
set, then the processed data set is composed of 6458 customer samples, each of which contains 589 features, including
424 quantitative features and 165 qualitative features. Further, we select the top 10% features with recursive feature elim-
ination [34] to form the IFCD10 data set.
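A minimal sketch of this feature-selection step is shown below; the choice of base estimator for RFE is an assumption, since the paper only states that recursive feature elimination [34] is applied.

```python
# Keep roughly the top 10% of features with recursive feature elimination.
# The logistic-regression base estimator is an illustrative assumption.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def select_top_features(X, y, fraction=0.10):
    n_keep = max(1, round(X.shape[1] * fraction))   # 589 features -> about 59 kept
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_keep)
    selector.fit(X, y)
    return X[:, selector.support_], selector.support_
```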

3.3. Parameter settings

In the experiment, the main parameters of the benchmark resampling methods are set according to the related literature,
while the main parameters of the benchmark classification models are set according to the default parameters in Weka. The details are given in Table 3 (int means rounded down). In addition, we set the number of base classifiers in Adaboost, Bagging, RSS, and RF to 40 according to references [36,37], and set β in F-measure to 1 and γ in IBA_γ(G-mean) to 0.05 according to reference [38].

Table 2
Data sets for the experiment.

Data sets Classes Samples Features Ratio (positive : negative)


German 2 1000 20 1:2.3333
Australia 2 690 14 1:1.2476
UK-thomas 2 1225 14 1:2.7926
Give-credit 2 120,269 10 1:4.0628
PAKDD 2 49,904 20 1:13.3914
IFCD10 2 6458 59 1:28.0901


Table 3
Parameter settings of benchmark resampling methods and classification models in the experiment.

(a) The benchmark resampling methods
RUS: None
ROS: None
OSS: None
SMOTE: number of nearest neighbors k = 5, p = 3 [26]
CBOS: number of clusters K = 4 [21,26]
SMOTE + ENN: number of nearest neighbors k = 5 [26]
ADASYN: number of nearest neighbors k = 5, u = 1 [26]
SL-SMOTE: number of nearest neighbors k = 5 [27]
MWMOTE: k1 = 5, k2 = 3, k3 = N_min/2 [28]

(b) The benchmark classification models
NB: None
DT: classifier = J48, confidence factor used for pruning = 0.25, minimum number of instances per leaf = 2
ANN: classifier = multilayer perceptron, hidden layer size = int((number of attributes + classes)/2), learning rate = 0.3, training time = 500
SVM: classifier = sequential minimal optimization algorithm, complexity parameter C = 1, kernel = poly kernel, tolerance parameter = 0.001
LR: ridge value = 1.0E-8
Adaboost: base classifier = REPTree [35], N = 40
Bagging: base classifier = REPTree [35], N = 40
RSS: base classifier = REPTree [35], subspace size = 0.5, N = 40
RF: number of features = int(log2(predictors) + 1), where predictors is the number of features considered at a tree node; minimum number of instances = 1, N = 40
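The two derived formulas in Table 3 can be made explicit with small helpers (the printed values are only illustrative):

```python
# "int" in Table 3 means rounding down.
import math

def ann_hidden_units(n_attributes, n_classes):
    # hidden layer size = int((number of attributes + classes) / 2)
    return (n_attributes + n_classes) // 2

def rf_features_per_split(n_predictors):
    # number of features = int(log2(predictors) + 1)
    return int(math.log2(n_predictors) + 1)

print(ann_hidden_units(20, 2))      # 11, e.g. for the German data set
print(rf_features_per_split(20))    # 5
```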

3.4. Evaluation measures

At present, the performance of binary classifiers can be assessed most intuitively based on a confusion matrix [25]. Table 4
describes the confusion matrix for customer credit scoring. Through the confusion matrix, we can obtain diverse evaluation
measures:
TPR = TP / (TP + FN)    (1)

TNR = TN / (TN + FP)    (2)

FPR = FP / (TN + FP)    (3)

FNR = FN / (FN + TP)    (4)

Precision = TP / (TP + FP)    (5)
There are trade-offs among the above evaluation measures, and relying on any one of them alone can be misleading. Hence, a few composite measures have been proposed, such as F-measure, G-mean, and AUC [39].
F-measure integrates Precision and Recall:

F-measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)    (6)
where Recall is equal to TPR and β is the weight of Recall relative to Precision. In addition, G-mean integrates TPR and TNR and is now widely used to evaluate customer credit scoring models:

Table 4
Confusion matrix for customer credit scoring.

Predicted positive Predicted negative Total


Actual positive (bad credit customer) TP FN TP + FN
Actual negative (good credit customer) FP TN FP + TN
Total TP + FP FN + TN TP + FN + FP + TN


G-mean = sqrt(TPR × TNR)    (7)
The receiver operating characteristic (ROC) curve is a common evaluation measure for classification models under the
condition of imbalanced class distribution. For a binary classification problem such as customer credit scoring, the ROC curve
is an FPR-TPR diagram in which the abscissa indicates FPR and the ordinate indicates the TPR [39]. In many cases, however, it
is not convenient to directly compare the ROC curves of different models; therefore, the area under the ROC curve (namely,
AUC) is used to evaluate the model performance. AUC is widely adopted in many studies [40,41]. In a binary classification
problem, AUC can be approximately denoted as follows [42]:

AUC = (1 + TPR − FPR) / 2 = (TPR + TNR) / 2    (8)
Evidently, although AUC and G-mean consider the impact of positive class and negative class on imbalanced customer
credit scoring, they cannot distinguish the impact of a certain class in the data set with imbalanced class distribution on
the overall model performance, nor can they identify the dominant class. This implies that the values obtained by combining
different TPR and TNR may be equal [31]. To address this problem, García et al. [31] proposed the new measure
IBA_γ(G-mean), which integrates overall precision and class precision:

IBA_γ(G-mean) = (1 + γ × Dom) × G-mean    (9)

where Dom = TPR − TNR denotes dominance, γ ≥ 0 denotes the weight factor of Dom, and G-mean is an index used to measure overall precision. The magnitude of γ can adjust the impact of Dom on G-mean.
In summary, this study selects five evaluation measures including TPR, F-measure, G-mean, AUC, and IBA_γ(G-mean). The
higher the values of evaluation measures, the better the customer credit scoring performance.
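All five measures can be computed directly from the confusion matrix, as the following sketch illustrates; it assumes binary labels with 1 = bad credit (positive), non-degenerate predictions, and β = 1, γ = 0.05 as set in Section 3.3.

```python
# Direct transcription of Eqs. (1)-(9) from the confusion matrix.
import numpy as np

def credit_scoring_metrics(y_true, y_pred, beta=1.0, gamma=0.05):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))

    tpr = tp / (tp + fn)                       # Eq. (1), also Recall
    tnr = tn / (tn + fp)                       # Eq. (2)
    fpr = fp / (tn + fp)                       # Eq. (3)
    precision = tp / (tp + fp)                 # Eq. (5)

    f_measure = (1 + beta**2) * tpr * precision / (beta**2 * tpr + precision)  # Eq. (6)
    g_mean = np.sqrt(tpr * tnr)                # Eq. (7)
    auc = (1 + tpr - fpr) / 2                  # Eq. (8), equal to (TPR + TNR) / 2
    dom = tpr - tnr                            # dominance
    iba = (1 + gamma * dom) * g_mean           # Eq. (9)

    return {"TPR": tpr, "F-measure": f_measure, "G-mean": g_mean,
            "AUC": auc, "IBA_0.05(G-mean)": iba}
```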

4. Experimental results

4.1. Comparison of different benchmark resampling methods

4.1.1. Comparison of different benchmark resampling methods in terms of IBA_0.05(G-mean)


This section displays the comparison process for the benchmark resampling methods as exemplified by the evaluation
measure IBA_0.05(G-mean). We first use 10 benchmark resampling methods (including NONE without resampling) to process
the training sets of the six data sets, and then train nine benchmark classification models to classify customer samples in the
test sets and calculate the corresponding IBA_0.05(G-mean) values. The experimental results are shown in Table 5. The last
row gives the average value of the IBA_0.05(G-mean) ranking in each column; the lower the average value is, the better the
performance of the associated benchmark resampling method is. To check whether there are significant differences in per-
formance among the 10 benchmark resampling methods, we consider the use of non-parametric statistical tests recom-
mended by Demsar [43], namely, the Friedman test and Iman-Davenport test. If there are statistically significant
differences in performance, then we further use the Nemenyi method as a post hoc test, which is used to compare 10 bench-
mark resampling methods with each other. We set α = 0.05 as the significance level in all cases.
The null hypothesis of Friedman and Iman-Davenport tests is that 10 benchmark resampling methods have equivalent
performance. We employ the χ²-distribution with 9 (= 10 − 1) degrees of freedom and the F-distribution with 9 and 9 × 53 (where 53 = 6 × 9 − 1) degrees of freedom. The test results are shown in Table 6. If the test values are greater than the corre-
sponding distribution values, the null hypothesis is rejected. According to Table 6, we can conclude that there are statistically
significant differences in performance among the 10 benchmark resampling methods at the 95% confidence level. Further, to
compare the 10 benchmark resampling methods with each other, we introduce the Nemenyi test. When the number of
benchmark resampling methods is 10 and the critical value [43] is 3.164, the corresponding critical difference is CD = 3.164 × sqrt((10 × 11)/(6 × 54)) ≈ 1.84. The test results are shown in Fig. 2. If two benchmark resampling methods are con-
nected by a line segment, there is no statistically significant difference between their customer credit scoring performance.
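A sketch of this testing procedure is shown below, using scipy's Friedman test and the Nemenyi critical-difference formula; the 54 × 10 score-matrix layout is an assumption that matches the degrees of freedom stated above, and it is not the authors' code.

```python
# Friedman test over a score matrix plus the Nemenyi critical difference
# CD = q_alpha * sqrt(k (k + 1) / (6 N)), with q_alpha = 3.164 for k = 10 methods
# at alpha = 0.05 [43].
import numpy as np
from scipy import stats

def friedman_and_cd(score_matrix, q_alpha=3.164):
    # score_matrix: one row per data set/classifier combination (N = 54 rows),
    # one column per benchmark resampling method (k = 10 columns).
    n, k = score_matrix.shape
    chi2, p_value = stats.friedmanchisquare(*[score_matrix[:, j] for j in range(k)])
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    return chi2, p_value, cd

# With N = 54 and k = 10, cd = 3.164 * sqrt(110 / 324), which is approximately 1.84.
```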
Fig. 2 shows the following results: (1) in terms of IBA_0.05(G-mean), there is no statistically significant difference among
benchmark resampling methods RUS, ROS, and SMOTE + ENN at the 95% confidence level, and their performance is the best;
(2) there is no statistically significant difference among ROS, SMOTE + ENN, SMOTE, SL-SMOTE, ADASYN, and CBOS; (3) there
is no statistically significant difference among SMOTE, SL-SMOTE, ADASYN, CBOS, and MWMOTE; (4) there is no statistically
significant difference among SL-SMOTE, ADASYN, CBOS, MWMOTE, and OSS; (5) there is no statistically significant difference
between OSS and NONE.

4.1.2. Comparison of the different benchmark resampling methods in terms of other four measures
Table 7 shows the average rank of the performance of the 10 benchmark resampling methods in terms of TPR, F-measure,
G-mean, and AUC. To check whether there are significant differences in performance among the 10 benchmark resampling
methods, we still consider the use of non-parametric statistical tests similar to Section 4.1.1. Table 8 shows the results of
Friedman and Iman-Davenport tests for the 10 benchmark resampling methods in terms of four other measures. According
to the results in Table 8, there are statistically significant differences in performance among the 10 benchmark resampling

Table 5
Comparison of IBA_0.05(G-mean) for 10 benchmark resampling methods on six data sets.

Classification models NONE RUS ROS OSS CBOS SMOTE SMOTE+ENN ADASYN SL-SMOTE MWMOTE

German data set
NB 0.6363(10) 0.7027(2) 0.7085(1) 0.6766(7) 0.6913(3) 0.6745(8) 0.6845(5) 0.6473(9) 0.6857(4) 0.6800(6)
DT 0.6054(8) 0.6484(3) 0.6105(7) 0.6247(4) 0.6020(9) 0.6634(2) 0.6699(1) 0.6120(6) 0.5822(10) 0.6223(5)
ANN 0.6116(9) 0.6435(4) 0.6157(8) 0.6275(7) 0.6026(10) 0.6681(1) 0.6661(2) 0.6350(6) 0.6430(5) 0.6483(3)
SVM 0.6016(10) 0.7050(5) 0.7200(1) 0.6535(9) 0.7019(6) 0.7069(4) 0.7018(7) 0.7089(3) 0.7109(2) 0.6963(8)
LR 0.6233(10) 0.7118(2) 0.7134(1) 0.6627(9) 0.7053(5) 0.6973(7) 0.6904(8) 0.7057(4) 0.7085(3) 0.7045(6)
Bagging 0.6010(10) 0.7041(2) 0.6861(4) 0.6507(8) 0.6601(7) 0.7080(1) 0.6947(3) 0.6642(6) 0.6760(5) 0.6461(9)
Adaboost 0.6217(10) 0.7165(2) 0.7080(3) 0.6686(8) 0.6999(5) 0.7223(1) 0.7047(4) 0.6874(7) 0.6929(6) 0.6443(9)
RSS 0.4435(10) 0.7133(2) 0.6787(4) 0.5160(9) 0.6532(5) 0.7198(1) 0.6997(3) 0.6114(7) 0.6379(6) 0.5400(8)
RF 0.6066(10) 0.6873(3) 0.6524(4) 0.6452(5) 0.6400(8) 0.7241(1) 0.6943(2) 0.6432(7) 0.6438(6) 0.6343(9)

Australia data set
NB 0.7941(10) 0.8204(5) 0.8187(6) 0.8276(3) 0.8090(7) 0.8531(1) 0.8080(8) 0.8358(2) 0.8053(9) 0.8225(4)
DT 0.8600(5) 0.8644(3) 0.8600(6) 0.8684(1) 0.8578(8) 0.8650(2) 0.8581(7) 0.8615(4) 0.8529(9) 0.8429(10)
ANN 0.8274(5) 0.8220(9) 0.8230(7) 0.8277(4) 0.8303(3) 0.8227(8) 0.8502(1) 0.8072(10) 0.8248(6) 0.8372(2)
SVM 0.8653(5) 0.8653(5) 0.8653(5) 0.8653(5) 0.8652(8) 0.8665(2) 0.8672(1) 0.8653(5) 0.8508(10) 0.8557(9)
LR 0.8567(8) 0.8568(7) 0.8584(6) 0.8492(10) 0.8613(3) 0.8598(4) 0.8553(9) 0.8646(1) 0.8645(2) 0.8589(5)
Bagging 0.8697(7) 0.8743(3) 0.8753(1) 0.8745(2) 0.8723(5) 0.8730(4) 0.8685(8) 0.8711(6) 0.8570(9) 0.8290(10)
Adaboost 0.8702(1) 0.8655(4) 0.8694(2) 0.8650(5) 0.8638(7) 0.8541(9) 0.8564(8) 0.8676(3) 0.8439(10) 0.8649(6)
RSS 0.8441(10) 0.8594(8) 0.8640(2) 0.8638(3) 0.8621(7) 0.8637(4) 0.8637(5) 0.8680(1) 0.8636(6) 0.8582(9)
RF 0.8653(7) 0.8677(3) 0.8681(2) 0.8717(1) 0.8659(6) 0.8668(4) 0.8629(10) 0.8634(9) 0.8640(8) 0.8668(5)

UK-thomas data set
NB 0.5463(2) 0.5371(6) 0.5459(3) 0.5685(1) 0.5188(10) 0.5227(9) 0.5435(5) 0.5300(8) 0.5357(7) 0.5449(4)
DT 0.3171(10) 0.5634(1) 0.5260(6) 0.4262(9) 0.5515(3) 0.5533(2) 0.5435(4) 0.5024(7) 0.5366(5) 0.4290(8)
ANN 0.4389(10) 0.5651(2) 0.5595(4) 0.4404(9) 0.5542(5) 0.5628(3) 0.5777(1) 0.5530(6) 0.5505(7) 0.5106(8)
SVM 0.5558(8.5) 0.5870(4) 0.6028(1) 0.5558(8.5) 0.5634(7) 0.3296(10) 0.5866(5) 0.5988(3) 0.5996(2) 0.5862(6)
LR 0.2907(10) 0.5860(6) 0.6009(4) 0.3583(9) 0.5754(7) 0.5145(8) 0.5937(5) 0.6173(1) 0.6143(2) 0.6039(3)
Bagging 0.3749(10) 0.5863(2) 0.5564(4) 0.4452(9) 0.5538(5) 0.5937(1) 0.5693(3) 0.5308(7) 0.5427(6) 0.4827(8)
Adaboost 0.2669(10) 0.5624(5) 0.5740(2) 0.3558(9) 0.5348(7) 0.5383(6) 0.5860(1) 0.5679(3) 0.5639(4) 0.5126(8)
RSS 0.0620(10) 0.5849(2) 0.5297(4) 0.2411(9) 0.5281(5) 0.5968(1) 0.5660(3) 0.4978(7) 0.5128(6) 0.4092(8)
RF 0.4198(10) 0.5898(1) 0.5055(6) 0.4889(8) 0.4943(7) 0.5793(2) 0.5581(3) 0.5279(4) 0.5095(5) 0.4852(9)

Give-credit data set
NB 0.3732(7) 0.3540(8) 0.2052(9) 0.3763(6) 0.1712(10) 0.5105(4) 0.7056(1) 0.5704(2) 0.5124(3) 0.4314(5)
DT 0.4473(10) 0.7364(1) 0.5691(3) 0.4727(9) 0.5463(5) 0.5288(7) 0.5740(2) 0.5336(6) 0.5495(4) 0.5144(8)
ANN 0.3371(9) 0.6845(2) 0.6925(1) 0.3247(10) 0.6757(4) 0.6525(6) 0.6829(3) 0.6455(7) 0.6384(8) 0.6721(5)
SVM 0.0684(9.5) 0.6892(1) 0.6394(2) 0.0684(9.5) 0.6252(4) 0.6288(3) 0.6194(5) 0.3651(8) 0.5926(7) 0.6104(6)
LR 0.2693(10) 0.7400(1) 0.7107(4) 0.2837(9) 0.6939(6) 0.7046(5) 0.7216(2) 0.6636(8) 0.6853(7) 0.7144(3)
Bagging 0.4490(10) 0.7677(1) 0.6136(3) 0.4526(9) 0.6226(2) 0.5632(6) 0.6061(4) 0.5452(7) 0.5774(5) 0.5290(8)
Adaboost 0.4413(10) 0.7481(2) 0.7591(1) 0.4475(9) 0.7206(3) 0.6559(6) 0.6771(5) 0.6162(8) 0.6997(4) 0.6425(7)
RSS 0.1646(10) 0.7753(1) 0.5326(4) 0.1665(9) 0.5377(3) 0.5064(5) 0.5568(2) 0.4187(8) 0.4822(6) 0.4223(7)
RF 0.4635(10) 0.7696(1) 0.5307(6) 0.4702(9) 0.5120(7) 0.5395(5) 0.5786(2) 0.5673(3) 0.5033(8) 0.5506(4)

PAKDD data set
NB 0.3502(10) 0.5818(1) 0.5785(2) 0.3828(9) 0.5538(5) 0.5488(6) 0.5382(7) 0.5064(8) 0.5615(4) 0.5628(3)
DT 0.0734(10) 0.5484(1) 0.4857(4) 0.2087(9) 0.4828(5) 0.3957(7) 0.5019(2) 0.4960(3) 0.4646(6) 0.2522(8)
ANN 0.2674(10) 0.5335(3) 0.5121(7) 0.2827(9) 0.5391(2) 0.5136(6) 0.5471(1) 0.5202(4) 0.5150(5) 0.3371(8)
SVM 0.5711(7.5) 0.5852(3) 0.5867(2) 0.5711(7.5) 0.5587(9) 0.5898(1) 0.5073(10) 0.5785(5) 0.5779(6) 0.5851(4)
LR 0.0446(10) 0.5884(5) 0.5933(2) 0.0631(9) 0.5668(7) 0.5937(1) 0.5375(8) 0.5886(4) 0.5919(3) 0.5882(6)
Bagging 0.0868(10) 0.5758(1) 0.4867(3) 0.1272(9) 0.4792(4) 0.2895(7) 0.5198(2) 0.3300(6) 0.3943(5) 0.1495(8)
Adaboost 0.0138(10) 0.5903(1) 0.5903(2) 0.0272(9) 0.5668(6) 0.5408(7) 0.5893(3) 0.5720(5) 0.5846(4) 0.2229(8)
RSS 0.4320(6) 0.5934(1) 0.5595(2) 0.4320(6) 0.5409(4) 0.1685(10) 0.5464(3) 0.2112(9) 0.4040(8) 0.4320(6)
RF 0.2632(10) 0.5577(1) 0.3851(3) 0.2869(8) 0.3663(4) 0.2954(7) 0.4885(2) 0.3281(6) 0.3363(5) 0.2797(9)

IFCD10 data set
NB 0.1784(10) 0.1887(1) 0.1791(8) 0.1784(9) 0.1800(3) 0.1799(5) 0.1798(6) 0.1797(7) 0.1799(4) 0.1801(2)
DT 0.2202(2) 0.2192(3) 0.2089(6) 0.2279(1) 0.2083(7) 0.2145(5) 0.2040(9) 0.1897(10) 0.2061(8) 0.2177(4)
ANN 0.1720(10) 0.2198(1) 0.1940(4) 0.1810(9) 0.1854(8) 0.1882(7) 0.2021(2) 0.1937(5) 0.1948(3) 0.1920(6)
SVM 0.1762(7) 0.2287(2) 0.1813(3) 0.2382(1) 0.1767(5) 0.1759(9) 0.1759(8) 0.1763(6) 0.1769(4) 0.1745(10)
LR 0.2188(10) 0.2393(1) 0.2273(8) 0.2282(6) 0.2289(4) 0.2289(5) 0.2301(3) 0.2245(9) 0.2281(7) 0.2302(2)
Bagging 0.1766(7) 0.2234(1) 0.2164(2) 0.1838(4) 0.1767(6) 0.1759(8) 0.1445(10) 0.2086(3) 0.1769(5) 0.1745(9)
Adaboost 0.1691(7) 0.2130(4) 0.2435(3) 0.1208(10) 0.2017(6) 0.2098(5) 0.1321(9) 0.2443(2) 0.3024(1) 0.1641(8)
RSS 0.1768(6) 0.2231(1) 0.2146(3) 0.1838(4) 0.1767(7) 0.1759(8) 0.1434(10) 0.2178(2) 0.1769(5) 0.1745(9)
RF 0.1988(10) 0.2396(5) 0.2217(9) 0.2382(7) 0.2599(1) 0.2596(2) 0.2393(6) 0.2338(8) 0.2540(4) 0.2560(3)

Average rank 8.50 2.83 3.85 6.91 5.65 4.80 4.61 5.54 5.54 6.48

Note: The boldface in each row of the published table marks the highest IBA_0.05(G-mean) for the different benchmark resampling methods combined with a certain classification model. The numbers in parentheses are the ranks of IBA_0.05(G-mean) for the different benchmark resampling methods.

Table 6
Friedman and Iman-Davenport test results for 10 benchmark resampling methods in terms of IBA_0.05(G-mean).

Method Test value Distribution value Hypothesis


Friedman 97.778 16.919 Reject
Iman-Davenport 14.004 2.062 Reject

methods at the 95% confidence level. Further, to compare the 10 benchmark resampling methods with each other, we apply
the Nemenyi test. The test results are shown in Fig. 3.
Fig. 3a shows the following results: (1) in terms of TPR, there is no statistically significant difference among benchmark
resampling methods SMOTE, RUS, SMOTE + ENN, and CBOS at the 95% confidence level, and their performance is the best; (2)

Fig. 2. The Nemenyi post-hoc test results for 10 benchmark resampling methods in terms of IBA_0.05(G-mean).

there is no statistically significant difference among RUS, SMOTE + ENN, CBOS, and ADASYN; (3) there is no statistically sig-
nificant difference among SMOTE + ENN, CBOS, ADASYN, ROS, and SL-SMOTE; (4) there is no statistically significant differ-
ence among CBOS, ADASYN, ROS, SL-SMOTE, MWMOTE, and OSS; (5) the performance of NONE is the worst.
Fig. 3b shows the following results: (1) in terms of F-measure, there is no statistically significant difference among bench-
mark resampling methods SMOTE + ENN, RUS, ROS, SMOTE, SL-SMOTE, and ADASYN at the 95% confidence level, and their
performance is the best; (2) there is no statistically significant difference among RUS, ROS, SMOTE, SL-SMOTE, ADASYN, and
CBOS; (3) there is no statistically significant difference among SL-SMOTE, ADASYN, CBOS, MWMOTE, and OSS; (4) there is no
statistically significant difference between OSS and NONE.
Fig. 3c shows the following results: (1) in terms of G-mean, there is no statistically significant difference among bench-
mark resampling methods RUS, ROS, and SMOTE + ENN at the 95% confidence level, and their performance is the best; (2)
there is no statistically significant difference among ROS, SMOTE + ENN, CBOS, and SL-SMOTE; (3) there is no statistically
significant difference among SMOTE + ENN, CBOS, SL-SMOTE, ADASYN, SMOTE, and MWMOTE; (4) there is no statistically
significant difference among CBOS, SL-SMOTE, ADASYN, SMOTE, MWMOTE, and OSS; (5) there is no statistically significant
difference between OSS and NONE.
Fig. 3d shows the following results: (1) in terms of AUC, there is no statistically significant difference among benchmark
resampling methods RUS, OSS, NONE, ROS, SMOTE + ENN, SMOTE, and SL-SMOTE at the 95% confidence level, and their per-
formance is the best; (2) there is no statistically significant difference among OSS, NONE, ROS, SMOTE + ENN, SMOTE, SL-
SMOTE, and MWMOTE; (3) there is no statistically significant difference among NONE, ROS, SMOTE + ENN, SMOTE, SL-
SMOTE, MWMOTE, and ADASYN; (4) there is no statistically significant difference among SMOTE + ENN, SMOTE, SL-
SMOTE, MWMOTE, ADASYN, and CBOS.

4.2. Comparison of the different benchmark classification models

Table 9 shows the average rank of the performance of nine benchmark classification models in terms of IBA_0.05(G-mean),
TPR, F-measure, G-mean, and AUC. To check whether there are significant differences among the nine benchmark classifica-
tion models, we still consider the use of non-parametric statistical tests similar to Section 4.1. In this section, the null
hypothesis of Friedman and the Iman-Davenport tests is that nine benchmark classification models have equivalent classi-
fication performance.
We utilize the χ²-distribution with 8 (= 9 − 1) degrees of freedom and the F-distribution with 8 and 8 × 59 (where 59 = 6 × 10 − 1) degrees of
freedom. The test results are shown in Table 10. According to the results in Table 10, the values of Friedman and Iman-
Davenport tests are all greater than that of the corresponding distribution values in terms of the five measures. Therefore,
the null hypothesis is rejected; namely, there are statistically significant differences among the nine benchmark classifica-
tion models at the 95% confidence level. Further, to compare the nine benchmark classification models with each other, we
apply the Nemenyi test. When the number of benchmark classification models is nine and the critical value is 3.102, the corresponding CD = 3.102 × sqrt((9 × 10)/(6 × 60)) ≈ 1.55. The test results are shown in Fig. 4.

Table 7
The average rank of 10 benchmark resampling methods in terms of the other four evaluation measures.

Evaluation measures NONE RUS ROS OSS CBOS SMOTE SMOTE+ENN ADASYN SL-SMOTE MWMOTE
TPR 8.98 3.63 5.54 6.63 5.07 3.24 3.96 5.28 5.83 6.41
F-measure 8.52 4.07 4.22 6.74 5.91 4.31 3.72 5.44 5.20 6.52
G-mean 8.26 2.93 3.72 6.85 5.39 5.85 4.44 5.69 5.44 6.26
AUC 4.79 4.41 4.89 4.50 7.11 5.40 5.39 6.37 5.67 6.31

Note: The boldface in each row marks the benchmark resampling methods with the best average rank in terms of the corresponding evaluation measure.


Table 8
Friedman and Iman-Davenport test results for 10 benchmark resampling methods in terms of the other four evaluation measures.

Method TPR
Test value Distribution value Hypothesis
Friedman 102.322 16.919 Reject
Iman-Davenport 14.874 2.062 Reject
F-measure
Test value Distribution value Hypothesis
Friedman 81.047 16.919 Reject
Iman-Davenport 11.008 2.062 Reject
G-mean
Test value Distribution value Hypothesis
Friedman 94.540 16.919 Reject
Iman-Davenport 13.399 2.062 Reject
AUC
Test value Distribution value Hypothesis
Friedman 22.018 16.919 Reject
Iman-Davenport 2.530 2.062 Reject

Fig. 3. The Nemenyi post-hoc test results for 10 benchmark resampling methods in terms of the other four evaluation measures.

Fig. 4a shows the following results: (1) in terms of IBA_0.05(G-mean), there is no statistically significant difference in clas-
sification performance among LR, Adaboost, SVM, and RF at the 95% confidence level, and their performance is the best; (2)
there is no statistically significant difference among Adaboost, SVM, RF, Bagging, and ANN; (3) there is no statistically sig-
nificant difference among SVM, RF, Bagging, ANN, and NB; (4) there is no statistically significant difference among RF, Bag-
ging, ANN, NB, DT, and RSS.
Fig. 4b shows the following results: (1) in terms of TPR, there is no statistically significant difference in classification per-
formance among LR, SVM, Adaboost, and NB at the 95% confidence level, and their performance is the best; (2) there is no

Table 9
The average rank of nine benchmark classification models in terms of five evaluation measures.

Evaluation measures NB DT ANN SVM LR Bagging Adaboost RSS RF


IBA_0.05(G-mean) 5.86 6.11 5.58 4.34 3.30 4.91 4.07 6.19 4.64
TPR 4.03 6.31 5.38 3.61 3.58 5.74 3.78 6.88 5.68
F-measure 4.83 6.85 5.89 4.58 4.21 4.65 3.46 5.72 4.83
G-mean 6.19 5.88 6.18 4.38 3.40 4.45 4.50 5.67 4.36
AUC 6.00 7.73 7.11 7.53 3.22 3.71 3.09 2.43 4.18

Note: The boldface in each row marks the benchmark classification models with the best average rank in terms of the corresponding evaluation measure.

Table 10
Friedman and Iman-Davenport test results for nine benchmark classification models in terms of five evaluation measures.

Method IBA_0.05(G-mean)
Test value Distribution value Hypothesis
Friedman 48.341 15.507 Reject
Iman-Davenport 6.825 2.100 Reject
TPR
Test value Distribution value Hypothesis
Friedman 76.783 15.507 Reject
Iman-Davenport 11.929 2.100 Reject
F-measure
Test value Distribution value Hypothesis
Friedman 48.594 15.507 Reject
Iman-Davenport 6.866 2.100 Reject
G-mean
Test value Distribution value Hypothesis
Friedman 47.672 15.507 Reject
Iman-Davenport 6.716 2.100 Reject
AUC
Test value Distribution value Hypothesis
Friedman 210.202 15.507 Reject
Iman-Davenport 61.742 2.100 Reject

statistically significant difference between NB and ANN; (3) there is no statistically significant difference among ANN, RF,
Bagging, DT, and RSS.
Fig. 4c shows the following results: (1) in terms of F-measure, there is no statistically significant difference in classification
performance among Adaboost, LR, SVM, Bagging, NB, and RF at the 95% confidence level; (2) there is no statistically signif-
icant difference among LR, SVM, Bagging, NB, RF, and RSS; (3) there is no statistically significant difference among SVM, Bag-
ging, NB, RF, RSS, and ANN; (4) there is no statistically significant difference among RSS, ANN, and DT.
Fig. 4d shows the following results: (1) in terms of G-mean, there is no statistically significant difference in classification
performance among LR, RF, SVM, Bagging, and Adaboost at the 95% confidence level, and their performance is the best; (2)
there is no statistically significant difference among RF, SVM, Bagging, Adaboost, RSS, and DT; (3) there is no statistically sig-
nificant difference among Adaboost, RSS, DT, ANN, and NB.
Fig. 4e shows the following results: (1) in terms of AUC, there is no statistically significant difference in classification per-
formance among RSS, Adaboost, LR, and Bagging at the 95% confidence level, and their performance is the best; (2) there is no
statistically significant difference among Adaboost, LR, Bagging, and RF; (3) there is no statistically significant difference
among NB, ANN, and SVM; (4) there is no statistically significant difference among ANN, SVM, and DT.

4.3. Optimal combinations between benchmark resampling methods and classification models

To further analyze the performance of the combinations between the benchmark resampling methods and the classifica-
tion models, we calculate the mean value of the five evaluation measures (ME_mn) for each of the 10 benchmark resampling methods combined with each of the nine benchmark classification models on the six data sets:

ME_mn = (1/30) Σ_{i=1}^{6} Σ_{j=1}^{5} E_ij(Re_m, Ca_n),  m = 1, 2, ..., 10;  n = 1, 2, ..., 9    (10)

where Re_m (m = 1, 2, ..., 10) denotes the benchmark resampling methods, Ca_n (n = 1, 2, ..., 9) denotes the benchmark classification models, i = 1, 2, ..., 6 indexes the six data sets, j = 1, 2, ..., 5 indexes the five evaluation measures, and E_ij(Re_m, Ca_n) denotes the value of the j-th evaluation measure on the i-th data set for the combination of Re_m and Ca_n (the results are listed in Table 11).

Fig. 4. The Nemenyi post-hoc test results for nine classification models in terms of five evaluation measures.
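Eq. (10) amounts to averaging each resampler/classifier combination over the 30 data set × measure cells, as in the following sketch; the 4-D array layout is an assumption made purely for illustration.

```python
# E[i, j, m, n] = value of measure j on data set i for resampler m and classifier n.
import numpy as np

def mean_evaluation(E):
    # E has shape (6 data sets, 5 measures, 10 resamplers, 9 classifiers)
    return E.mean(axis=(0, 1))          # ME_mn, shape (10, 9): one mean per combination

E = np.random.rand(6, 5, 10, 9)         # toy values standing in for the real results
ME = mean_evaluation(E)
print(ME.shape)                         # (10, 9)
```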
Table 11 shows the following results: (1) from the perspective of the whole table, the combination of the traditional
resampling method RUS and the ensemble classification model RSS have the largest mean value, which means this combi-
nation can achieve the best customer credit scoring performance on the whole; (2) from the perspective of rows, the com-
binations of the intelligent resampling method SMOTE + ENN and single classification models NB, ANN can obtain the best
customer credit scoring performance; in addition, the combinations of intelligent resampling methods SMOTE + ENN, SMOTE
and the single classification model LR can obtain the best customer credit scoring performance. Finally, the combinations of
the traditional resampling method RUS and single classification models DT, SVM can obtain the best customer credit scoring
performance; (3) the combinations of the traditional resampling method RUS and ensemble classification models Bagging,
AdaBoost, RSS, RF can obtain the best customer credit scoring performance; (4) from the perspective of columns, the tradi-
tional resampling method RUS combined with single and ensemble classification models can all obtain good customer credit
scoring performance, which shows that RUS has the widest range of application, while the combination of the traditional resampling method ROS and the ensemble classification model AdaBoost can obtain the best customer credit scoring perfor-

Table 11
Mean value of five evaluation measures for 10 benchmark resampling methods combined with nine benchmark classification models.

NONE RUS ROS OSS CBOS SMOTE SMOTE+ENN ADASYN SL-SMOTE MWMOTE
NB 0.4701 0.5338 0.5196 0.4911 0.5304 0.5730 0.5809 0.5509 0.5492 0.5398
DT 0.4207 0.5661 0.5029 0.4563 0.5055 0.5151 0.5399 0.5055 0.5010 0.4615
ANN 0.4399 0.5536 0.5274 0.4450 0.5451 0.5572 0.5687 0.5366 0.5330 0.5104
SVM 0.3271 0.5831 0.5524 0.3452 0.5627 0.5690 0.5755 0.5351 0.5636 0.5651
LR 0.4050 0.5996 0.5794 0.4260 0.5875 0.6011 0.6011 0.5901 0.5930 0.5941
Bagging 0.4422 0.5990 0.5448 0.4660 0.5403 0.5331 0.5535 0.5125 0.5244 0.4729
Adaboost 0.4271 0.5950 0.5833 0.4446 0.5888 0.5807 0.5844 0.5695 0.5873 0.5031
RSS 0.3434 0.6038 0.5372 0.3796 0.5340 0.5154 0.5511 0.4719 0.4933 0.4249
RF 0.4651 0.5957 0.5089 0.4920 0.5058 0.5334 0.5483 0.5108 0.5035 0.5474

Note: The boldface marks the highest average value in each row. The underlining marks the largest average value in each column.

mance; (5) the combinations of intelligent resampling methods SMOTE, SMOTE + ENN, ADASYN, SL-SMOTE, MWMOTE and
the single classification model LR can obtain the best customer credit scoring performance, while the combination of the
intelligent resampling method CBOS and the ensemble classification model AdaBoost can obtain the best customer credit
scoring performance. Finally, the combination of the intelligent resampling method OSS and the ensemble classification
model RF can obtain the best customer credit scoring performance.

4.4. Discussion

Based on the experimental results in Sections 4.1, 4.2, and 4.3, we offer the following discussion.
First, according to the results in Section 4.1, we can conclude as follows: (1) the benchmark resampling methods RUS and
SMOTE + ENN show the best customer credit scoring performance in terms of five evaluation measures, followed by the
benchmark resampling methods ROS and SMOTE, because they have the best customer credit scoring performance in terms
of four evaluation measures; (2) we find that the experimental results of this study are inconsistent with those of some exist-
ing studies. Our experimental results show that the performance of the traditional resampling method RUS is better than
that of ROS, while the experimental results in [18] showed that the performance of ROS is better than that of RUS in customer credit scoring. This inconsistency may be caused by the fact that their study only trains single classification models on the balanced training set, without training ensemble classification models, and only uses the Gini coefficient to evaluate the performance of the models, which may bias the conclusion. Another reason is that ROS has the shortcoming of introducing many repeated samples into the training set, which may lead to over-fitting of classification models and may especially aggravate the over-fitting problem of ensemble classification models; (3) our experimental results show that
the performance of the traditional resampling method RUS is better than that of the intelligent resampling method SMOTE,
while the results in [29] showed that the performance of SMOTE is better than that of RUS. The reason for the inconsistency
may be that these studies only use AUC as the evaluation measure of the performance of the models, which may lead to the
results unable to fully reflect the customer credit scoring performance of the resampling methods. Another reason is that
SMOTE only focuses on the generation of minority samples and ignores the distribution characteristics of the majority sam-
ples, thus it is likely to introduce more minority samples into the majority sample area, which increases the overlapping
degree between the two class samples, and then reduces the performance of customer credit scoring [14,44].
Second, according to the results in Section 4.2, we can draw the following conclusions: (1) the benchmark classification models LR
and AdaBoost have the best customer credit scoring performance in terms of five evaluation measures, followed by the
benchmark classification models SVM and Bagging, because they have the best customer credit scoring performance in terms
of four evaluation measures; (2) we find that the experimental results of this study are inconsistent with those of some exist-
ing studies. Our experimental results show that the performance of the single classification model LR is better than that of the ensemble classification model Bagging, while the experimental results in [45] showed that the performance of Bagging is better than that of LR. The reason for the inconsistency may be that their study only uses the traditional resampling method
instead of the intelligent resampling method to balance the class distribution of the training set, and only one evaluation
measure is used to evaluate the performance of the models, which may lead to less objective performance comparison
results of classification models; (3) we show that the performance of the single classification model LR and the ensemble
classification model AdaBoost has no statistically significant difference, while the performance of single classification model
SVM and ensemble classification model Bagging has no statistically significant difference. These results confirm the conclu-
sion of article [46] that most customer credit scoring data sets are weakly nonlinear, so the commonly used single classifi-
cation models LR and SVM may achieve similar performance to the ensemble classification models; (4) the ensemble
classification model AdaBoost can achieve the best customer credit scoring performance, while Bagging is next to AdaBoost.
This result shows that the serial ensemble model may be better than the parallel ensemble model in customer credit scoring.
The main reason may be that the serial ensemble model can give higher weight to misclassified samples according to the

classification results in the process of training classifiers, which can effectively improve the performance of the classification
models; this also confirms the conclusion of [47].
Finally, according to the results in Section 4.3, we can summarize as follows: (1) the combination of the traditional resam-
pling method RUS and the ensemble classification model RSS can obtain the best customer credit scoring performance on the
whole. The main reason may be that RUS and RSS can add random disturbance to the samples and features respectively to
improve the robustness of the classification models, so as to further improve the performance of customer credit scoring; (2)
the combinations of the traditional resampling method RUS and ensemble classification models Bagging, AdaBoost, RSS, and
RF can achieve the best customer credit scoring performance, which may be because RUS can improve the performance of the ensemble classification models more effectively. Consistent with the results of reference [48], the reason may be that combining RUS with ensemble classification models helps to compensate for the sample information discarded by RUS and to
improve the diversity of ensemble classification models; (3) the combinations of the intelligent resampling method
SMOTE + ENN and the single classification models NB, ANN, LR, DT, and SVM can obtain a good performance of customer
credit scoring, which means that the combinations of SMOTE + ENN and single classification models may improve the per-
formance of customer credit scoring. In particular, among the four intelligent resampling methods based on SMOTE com-
pared in this study, SMOTE + ENN can achieve the best customer credit scoring performance, which may be due to the
fact that ENN can effectively eliminate the noise samples and the boundary samples between two classes in the training
set, thus improving the performance of the classification models [44,49]; (4) the combinations of intelligent resampling
methods SMOTE, SMOTE + ENN, ADASYN, SL-SMOTE, and MWMOTE and the single classification model LR can achieve
the best customer credit scoring performance, which may mean that generating new samples with the k-nearest neighbor
method will greatly improve the classification performance of LR; (5) RUS has good customer credit scoring performance
when combining any classification model referred in this study, which means it has the widest range of application. The rea-
son may be that RUS can expand the class borderline by removing the majority samples to effectively improve the classifi-
cation accuracy of minority samples, which also confirms the conclusion of reference [42].

5. Conclusions

At present, there are many studies comparing the performance of resampling methods or classification models for customer
credit scoring, but they still have some limitations: (1) most benchmark comparison studies in customer credit scoring compare only
the performance of resampling methods or only that of classification models, without considering the joint impact of both on customer
credit scoring performance, which may lead to biased research conclusions; (2) the existing studies rarely consider the problem of
finding the optimal combination of resampling methods and classification models.
Therefore, we propose a new benchmark models comparison framework to compare the performance of nine resampling
methods and nine classification models for imbalanced customer credit scoring. In the framework, we introduce five evaluation
measures, including IBA, and conduct an in-depth experiment on six data sets, which is expected to yield more comprehensive and
objective comparison results. First, when comparing the performance of the benchmark resampling methods, five single and four
ensemble classification models are trained on the customer credit scoring training sets with balanced class distribution, which
remedies the limitation that most existing studies train only single classification models, a practice that may bias the research
conclusions. Second, when comparing the performance of the benchmark classification models, two traditional and seven intelligent
resampling methods are used to balance the class distribution of the training sets, which remedies the less objective comparison
results of most existing studies, which use only traditional resampling methods. Finally, the optimal combination problem between
resampling methods and classification models in customer credit scoring is analyzed for the first time.
In the performance comparison of benchmark resampling methods, we find that there is no statistically significant dif-
ference between the traditional resampling method RUS and the intelligent resampling method SMOTE + ENN, and they have
the best customer credit scoring performance. In the performance comparison of benchmark classification models, we find
that there is no statistically significant difference between the single classification model LR and the ensemble classification
model AdaBoost, and they have the best customer credit scoring performance. Finally, in analyzing the optimal combinations
of them, the experimental result shows that the combination of the traditional resampling method RUS and the ensemble
classification model RSS can obtain the most satisfactory customer credit scoring performance on the whole. More interestingly,
we find that RUS combined with either single or ensemble classification models can achieve good customer credit scoring
performance, which may mean that RUS has a wide range of application, while the SMOTE-based intelligent resampling methods
combined with LR can also achieve the best customer credit scoring performance, which may mean that SMOTE-based intelligent
resampling methods can effectively improve the classification performance of LR. This study provides comprehensive guidance for
scholars on how to choose the most appropriate resampling methods and classification models when dealing with imbalanced credit
scoring problems.
In recent years, deep learning has developed rapidly in fields such as image classification and speech recognition, and it has also
exhibited excellent performance in customer credit scoring. For example, some studies [17,41] applied deep learning to customer
credit scoring, and their experimental results showed that deep learning can provide excellent performance. In the future, we plan to
compare the performance of some state-of-the-art deep learning methods on customer credit scoring with imbalanced class
distribution.

CRediT authorship contribution statement

Jin Xiao: Writing - original draft, Writing - review & editing, Formal analysis, Investigation, Methodology, Funding acqui-
sition. Yadong Wang: Writing - original draft, Writing - review & editing, Formal analysis, Investigation, Methodology, Visu-
alization, Validation. Jing Chen: Visualization. Ling Xie: Investigation, Supervision. Jing Huang: Conceptualization,
Validation.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgments

Financial supports from the National Natural Science Foundation of China (Grant No. 71974139), the Excellent Youth
Foundation of Sichuan Province (Grant No. 2020JDJQ0021), the Tianfu Ten-thousand Talents Program of Sichuan Province
(Grant No. 0082204151153), the Excellent Youth Foundation of Sichuan University (Grant No. sksyl201709), the Leading Cul-
tivation Talents Program of Sichuan University, 2018 Special Project for Cultivation and Innovation of New Academic-Qian
Platform Talent (Grant No. [2018]5772-012), the High-level Innovative Talents Project of Beijing Academy of Science and
Technology (Grant No. PXM2021-178216-000008) and the Innovation Engineering Pre-research Project of Beijing Academy
of Science and Technology (Grant No. PXM2021-178216-000002) are gratefully acknowledged.

Appendix A. The algorithm processes of nine benchmark resampling methods

The benchmark resampling methods compared in this study include two traditional resampling methods (i.e., RUS and
ROS) and seven intelligent resampling methods (i.e., OSS [22], CBOS [21], SMOTE [23], SMOTE + ENN [25], ADASYN [26],
SL-SMOTE [27], and MWMOTE [28]). The following section briefly describes the processes of nine benchmark resampling
methods (the mathematical notations and definitions are shown in Table 1).

(1) RUS and ROS

RUS and ROS are the two most commonly used resampling methods. To reduce the sample size of the majority class and
ultimately balance the class distribution of the training set, RUS randomly selects (without replacement) a certain number of samples
to remove from $S_{maj}$. To increase the sample size of the minority class and ultimately balance the class distribution of the training
set, ROS randomly selects (with replacement) and replicates a certain number of samples from $S_{min}$.
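
As a purely illustrative sketch (not the implementation used in the experiments), the following NumPy functions show the sampling logic just described for a binary training set; the function names, the `majority_label` argument, and the random seed are our own conventions.

```python
import numpy as np

def random_under_sample(X, y, majority_label, rng=np.random.default_rng(0)):
    """RUS sketch: keep all minority samples plus an equally sized
    majority subset drawn without replacement."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.concatenate([keep_maj, min_idx])
    return X[keep], y[keep]

def random_over_sample(X, y, majority_label, rng=np.random.default_rng(0)):
    """ROS sketch: replicate randomly chosen minority samples (drawn with
    replacement) until both classes have the same size."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    extra = rng.choice(min_idx, size=maj_idx.size - min_idx.size, replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]
```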

(2) OSS

To improve the effectiveness of traditional resampling methods, Kubat and Matwin [22] proposed the resampling method
OSS. The general principle of OSS is to intelligently identify and remove the majority class samples considered to be redun-
dancies or noises, thus attaining the under-sampling purpose for the majority class samples. Specifically, OSS includes three
steps: (1) removing the repeated and redundant majority class examples with the condensed nearest neighbor rule (CNN) [50]
so as to obtain the consistent subset $E$ of the training set $S$; (2) calculating the distance $d(x_i, x_j)$ between two samples
$x_i \in S_{maj}$ and $x_j \in S_{min}$ in $E$; if there is no other sample point $x_k$ such that $d(x_k, x_i) < d(x_i, x_j)$ or
$d(x_k, x_j) < d(x_i, x_j)$, then $x_i$ and $x_j$ constitute a Tomek link pair [50]; (3) finding all Tomek link pairs in $E$ and
removing the majority class samples in the pairs.
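
The following hedged sketch illustrates only the Tomek-link cleaning of steps (2)–(3); the CNN step (1) is omitted, and the function name, the use of scikit-learn's NearestNeighbors, and the mutual-nearest-neighbor formulation of a Tomek link are our own simplifications rather than the exact procedure of [22].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def remove_tomek_majority(X, y, majority_label):
    """OSS step (3) sketch: drop majority samples that form a Tomek link,
    i.e. opposite-class samples that are each other's nearest neighbours."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    neigh = nn.kneighbors(X, return_distance=False)[:, 1]  # nearest other sample
    drop = set()
    for i, j in enumerate(neigh):
        if y[i] != y[j] and neigh[j] == i:      # mutual nearest neighbours
            drop.add(i if y[i] == majority_label else j)
    keep = np.array([i for i in range(len(y)) if i not in drop])
    return X[keep], y[keep]
```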

(3) CBOS

In 2004, Jo and Japkowicz [21] proposed the resampling method CBOS. The general principle of CBOS is to deal with the
issue of imbalanced class distribution in training sets through combining a clustering algorithm and ROS. Specifically, CBOS
includes three steps: (1) using the K-means method to cluster each class of samples contained in the training set S; (2) per-
forming ROS on each cluster of the majority class so that the sample size of each cluster is equal to that of the maximum
cluster; (3) performing ROS on each cluster of the minority class so that $N_{maj} = N_{min}$.
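
A rough sketch of the per-class clustering and oversampling idea follows, assuming scikit-learn's KMeans; the function name, the `target_size` convention (the size of the largest majority cluster for majority clusters, or the size needed to reach $N_{maj} = N_{min}$ for minority clusters), and the random seeds are our own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cbos_one_class(X_cls, n_clusters, target_size, rng=np.random.default_rng(0)):
    """CBOS sketch for a single class: cluster the class with K-means and
    apply ROS inside every cluster until it reaches `target_size` samples."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_cls)
    parts = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        extra = rng.choice(members, size=max(0, target_size - members.size), replace=True)
        parts.append(X_cls[np.concatenate([members, extra])])
    return np.vstack(parts)
```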

(4) SMOTE

To address the overfitting of ROS, Chawla et al. [23] proposed the resampling method SMOTE. The general principle of
SMOTE is to perform linear interpolation among minority class samples according to spatial similarity to generate extra samples,
thus realizing the purpose of balancing the class distribution of training sets. Specifically, SMOTE includes three steps: (1) finding
the $k$ nearest neighbors of each $x_i \in S_{min}$; (2) randomly selecting a sample $\hat{x}_i$ among the $k$ neighbors and using
the following equation to synthesize a new artificial sample:

$x_{new} = x_i + (\hat{x}_i - x_i) \cdot \delta$    (11)

where $x_{new}$ denotes the newly generated sample and $\delta \in [0, 1]$ is a random number; (3) repeating the second step $p$
times to generate $p$ new samples.
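
The interpolation of Eq. (11) can be sketched as follows; the function signature, the parameter defaults, and the use of scikit-learn's NearestNeighbors are illustrative assumptions, not the exact implementation of [23].

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """SMOTE sketch: for randomly chosen minority samples, interpolate
    towards one of their k minority-class neighbours (Eq. (11))."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop the sample itself
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))            # seed sample x_i
        x_hat = X_min[rng.choice(neigh[i])]     # random neighbour among the k
        delta = rng.random()                    # delta in [0, 1]
        synth[t] = X_min[i] + (x_hat - X_min[i]) * delta
    return synth
```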

(5) SMOTE + ENN

In many cases, the resampling method SMOTE can be used to increase the sample size of the minority class, but an excess
of synthesized artificial samples may lead to the overfitting of the classification model. To address this problem, Batista et al.
[25] proposed the SMOTE + ENN method for resampling. Specifically, the method includes two steps: (1) using SMOTE to
oversample the training set S; (2) using ENN to predict the class of each sample. If the predicted class of a sample is not con-
sistent with the actual class, then the sample is removed.
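
A hedged sketch of the ENN cleaning in step (2) is shown below; it refits a leave-one-out k-NN classifier for clarity rather than efficiency, and the function name and the `k = 3` default are our own assumptions. In practice, an off-the-shelf combination such as the SMOTEENN class of the imbalanced-learn package (if available in the environment) performs both steps.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def enn_clean(X, y, k=3):
    """ENN sketch (step 2 of SMOTE + ENN): drop every sample whose class
    disagrees with the prediction of its k nearest neighbours."""
    pred = np.empty_like(y)
    for i in range(len(y)):
        knn = KNeighborsClassifier(n_neighbors=k).fit(np.delete(X, i, axis=0),
                                                      np.delete(y, i))
        pred[i] = knn.predict(X[i:i + 1])[0]
    keep = pred == y
    return X[keep], y[keep]
```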

(6) ADASYN

Some scholars argue that the resampling method SMOTE does not consider the degree of importance of samples when
new samples are synthesized, possibly causing severe overlap [26]. To address this problem, He et al. [26] proposed the
resampling method ADASYN. The general principle of ADASYN is to assign different weights to minority class samples
according to the difficulty of classification, thus synthesizing more artificial samples around those with high classification
difficulty. Specifically, ADASYN includes three steps: (1) using the equation $G = (N_{maj} - N_{min}) \times u$ to calculate
the number of samples to be synthesized, where the coefficient $u \in [0, 1]$; (2) finding the $k$ nearest neighbors of each
$x_i \in S_{min}$ and using the following equation to construct the distribution of minority class samples:

$C_i = \dfrac{D_i / k}{\sum_i D_i / k}$    (12)

where $D_i$ denotes the number of majority class samples among the $k$ nearest neighbors of $x_i$; (3) using the equation
$g_i = C_i \times G$ to calculate the number of samples to be synthesized for each $x_i$, and synthesizing the new samples
according to Eq. (11).
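
The allocation of synthetic samples (the quantities $G$, $C_i$, and $g_i$ above) can be sketched as follows for a binary problem; the function name and parameter defaults are our own assumptions, and the synthesis itself would then follow the SMOTE sketch of Eq. (11).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, minority_label, k=5, u=1.0):
    """ADASYN sketch: number of synthetic samples g_i to create around
    each minority sample x_i, following Eq. (12)."""
    min_idx = np.flatnonzero(y == minority_label)
    G = (len(y) - 2 * len(min_idx)) * u            # G = (N_maj - N_min) * u
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neigh = nn.kneighbors(X[min_idx], return_distance=False)[:, 1:]
    D = np.array([(y[row] != minority_label).sum() for row in neigh])  # majority neighbours
    C = (D / k) / (D / k).sum()                    # Eq. (12)
    return np.rint(C * G).astype(int)              # g_i = C_i * G
```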

(7) SL-SMOTE

Bunkhumpornpat et al. [27] argue that the resampling method SMOTE may generate samples blindly, thus making the
minority class sample area larger and the decision boundary more ambiguous. To address this problem, they proposed
the resampling method SL-SMOTE. The general principle of SL-SMOTE is to synthesize new samples according to the safe
level (SL) to balance the class distribution of training sets. Specifically, SL-SMOTE includes three steps: (1) calculating the
SL of any $x_i \in S_{min}$ and of the sample $\hat{x}_i$ randomly selected from $NN_k(x_i)$, denoted by $SL_{x_i}$ and
$SL_{\hat{x}_i}$ respectively; (2) calculating the safe level ratio (SLR):

$SLR = \dfrac{SL_{x_i}}{SL_{\hat{x}_i}}$    (13)

(3) using Eq. (11) to synthesize new artificial samples according to the following rules: (i) when $SLR = \infty$ and
$SL_{x_i} = 0$, $x_i$ and $\hat{x}_i$ are both noise samples; at this time, no new sample is generated; (ii) when
$SLR = \infty$ and $SL_{x_i} \neq 0$, $\hat{x}_i$ is a noise sample; at this time, $\delta = 0$, namely, only the sample $x_i$
is copied; (iii) when $SLR = 1$ (namely, $SL_{x_i} = SL_{\hat{x}_i}$), $\delta = 1$, namely, only the sample $\hat{x}_i$ is
copied; (iv) when $SLR > 1$ (namely, $SL_{x_i} > SL_{\hat{x}_i}$), the synthesized new sample is closer to $x_i$; at this time,
$\delta \in [0, 1/SLR]$; (v) when $SLR < 1$ (namely, $SL_{x_i} < SL_{\hat{x}_i}$), the synthesized new sample is closer to
$\hat{x}_i$; at this time, $\delta \in [1 - SLR, 1]$.
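
The five rules above can be summarized in a small helper that returns the coefficient $\delta$ of Eq. (11). The encoding below follows the rules exactly as stated; the function name is our own, and we assume (as in the original SL-SMOTE proposal) that the safe level of a sample is the number of minority class samples among its k nearest neighbors.

```python
import numpy as np

def safe_level_delta(sl_i, sl_hat, rng=np.random.default_rng(0)):
    """SL-SMOTE sketch: choose the interpolation coefficient delta of
    Eq. (11) from the safe levels of x_i and its selected neighbour."""
    if sl_hat == 0 and sl_i == 0:
        return None                            # (i) both noise: generate nothing
    if sl_hat == 0:
        return 0.0                             # (ii) neighbour is noise: copy x_i
    slr = sl_i / sl_hat                        # Eq. (13)
    if slr == 1:
        return 1.0                             # (iii) equal safe levels: copy the neighbour
    if slr > 1:
        return rng.uniform(0.0, 1.0 / slr)     # (iv) stay closer to x_i
    return rng.uniform(1.0 - slr, 1.0)         # (v) stay closer to the neighbour
```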

(8) MWMOTE

When the minority class sample subset has multiple clustering centers, most of the oversampling methods based on k-
nearest neighbors cannot identify the weight of a sample correctly, so incorrect artificial samples are likely to be synthesized.
To address this problem, Barua et al. [28] proposed the resampling method MWMOTE. To balance the class distribution of
training sets, the general principle of MWMOTE is to synthesize artificial samples according to the information from minority
class samples and majority class samples. Specifically, MWMOTE includes five steps: (1) using the k1 nearest neighbors to
remove the noise samples contained in the original training set S and thus obtain the sample subset E; (2) finding the k2 near-
est neighbors of each minority class sample from E, then finding the majority class samples from the neighbor samples to
constitute the set Emaj ; (3) finding the k3 nearest neighbors of each sample from Emaj , then finding the minority class samples
from the neighbor samples to constitute the set $E_{min}$; (4) according to the specified rules [28], calculating the selection
weight $W(x_i)$ and the selection probability $P(x_i) = W(x_i) / \sum_{x_i \in E_{min}} W(x_i)$ of each sample
$x_i \in E_{min}$; (5) according to $P(x_i)$, selecting a sample $x_m$ from the set $E_{min}$ and using Eq. (11) to synthesize
new artificial samples in the cluster to which $x_m$ belongs.
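
Steps (4)–(5) reduce to weighted sampling of seed points, as in the short sketch below; the function name and arguments are illustrative only, and the computation of the weights $W(x_i)$ themselves follows the rules of [28] and is not reproduced here.

```python
import numpy as np

def weighted_pick(E_min_idx, W, rng=np.random.default_rng(0)):
    """MWMOTE sketch of steps (4)-(5): turn the selection weights W(x_i)
    into probabilities P(x_i) and draw a seed sample x_m accordingly."""
    P = W / W.sum()                 # P(x_i) = W(x_i) / sum of W over E_min
    return rng.choice(E_min_idx, p=P)
```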

Appendix B. The algorithm processes of four ensemble classification models

There are four ensemble classification models compared in this study (i.e., AdaBoost [11], Bagging [10], RSS [13], and RF
[12]). The following section briefly describes the processes of four ensemble classification models.
AdaBoost [11] is the most representative serial ensemble algorithm. Specifically, the algorithm includes three steps: (1)
training a base classifier from the original training set, and then adjusting the sample distribution of training sets according
to the performance of the base classifier, and assigning higher weights to the misclassified samples; (2) training the next
base classifier based on the adjusted sample distribution; (3) repeating the above process N times and performing weighted
integration on the results of the N base classifiers.
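
A minimal sketch of the reweighting loop is given below; the ±1 label coding, the choice of decision stumps as base classifiers, and the function name are our own illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    """AdaBoost sketch (labels in {-1, +1}): reweight misclassified samples
    after each round and combine base classifiers by weighted voting."""
    w = np.full(len(y), 1.0 / len(y))
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # classifier weight
        w *= np.exp(-alpha * y * pred)                      # boost misclassified samples
        w /= w.sum()
        models.append(stump)
        alphas.append(alpha)
    return models, np.array(alphas)
```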
Bagging [10] is the most representative parallel ensemble algorithm. Specifically, the algorithm includes three steps: (1)
selecting N training subsets randomly from the original training set; (2) training N base classifiers; (3) integrating the results
of the N base classifiers. Bagging applies stochastic disturbance to the sample sets, thus effectively enhancing the general-
ization ability of the ensemble classifier.
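
A minimal bootstrap-and-train sketch, with decision trees as the base classifiers purely for illustration, might look as follows; majority voting over the returned models would give the ensemble prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=10, rng=np.random.default_rng(0)):
    """Bagging sketch: train each base classifier on a bootstrap sample
    (random disturbance of the sample set)."""
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap indices (with replacement)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models
```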
RSS [13] shares a similar ensemble principle with Bagging except that RSS applies stochastic disturbance to feature sets.
Specifically, the algorithm mainly includes three steps: (1) selecting N feature subsets randomly from the feature space; (2)
acquiring N training subsets in the training set through a mapping calculation, then training one base classifier on each train-
ing subset; (3) integrating the results of the N base classifiers.
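
The feature-level disturbance of RSS can be sketched in the same style; keeping the sampled feature indices alongside each base classifier (our own convention) allows test samples to be projected onto the same subspace before prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_subspace_fit(X, y, n_models=10, n_features=5, rng=np.random.default_rng(0)):
    """RSS sketch: each base classifier sees a random subset of the features
    (n_features <= total number of features) instead of the samples."""
    models = []
    for _ in range(n_models):
        feats = rng.choice(X.shape[1], size=n_features, replace=False)
        models.append((feats, DecisionTreeClassifier().fit(X[:, feats], y)))
    return models   # each entry: (feature indices, fitted base classifier)
```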
RF [12] is an ensemble algorithm in which decision trees are used as base classifiers. The algorithm mainly includes three
steps: (1) selecting N training subsets randomly from the original training set; (2) training N decision trees—during the train-
ing process, selecting a subset composed of k features randomly from the feature set (assuming that there are d features) of
each node of the decision trees ($k \leq d$), and selecting an optimal feature from the subset to split the node; (3)
using the majority voting method to integrate the results of the N decision trees.
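
For illustration only, scikit-learn's RandomForestClassifier realizes these steps out of the box; the parameter values below are assumptions rather than the settings used in this study.

```python
from sklearn.ensemble import RandomForestClassifier

# RF sketch: bootstrapped trees, each split chosen from a random subset of
# the features (k <= d); majority voting aggregates the trees.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
# rf.fit(X_train, y_train); rf.predict(X_test)   # assumed data arrays
```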

References

[1] D. Karlan, J. Zinman, Microcredit in theory and practice: using randomized credit scoring for impact evaluation, Science 332 (6035) (2011) 1278–1284.
[2] J. Sun, J. Lang, H. Fujita, H. Li, Imbalanced enterprise credit evaluation with DTE-SBD: decision tree ensemble based on SMOTE and bagging with
differentiated sampling rates, Inf. Sci. 425 (2018) 76–91.
[3] S. Maldonado, G. Peters, R. Weber, Credit scoring using three-way decisions with probabilistic rough sets, Inf. Sci. 507 (2020) 700–714.
[4] C. Luo, D. Wu, D. Wu, A deep learning approach for credit scoring using credit default swaps, Eng. Appl. Artif. Intell. 65 (2017) 465–470.
[5] J. Xiao, Y. Tian, L. Xie, X. Jiang, J. Huang, A hybrid classification framework based on clustering, IEEE Trans. Ind. Inf. 16 (2019) 2177–2188.
[6] A.C. Antonakis, M.E. Sfakianakis, Assessing naïve Bayes as a method for screening credit applicants, Journal of Applied Statistics 36 (5) (2009) 537–545.
[7] Y. Tian, B. Bian, X. Tang, J. Zhou, A new non-kernel quadratic surface approach for imbalanced data classification in online credit scoring, Inf. Sci. 563
(2021) 150–165.
[8] D.M. Silva, G.H. Pereira, T.M. Magalhes, A class of categorization methods for credit scoring models, Eur. J. Oper. Res. (2021), https://doi.org/10.1016/j.
ejor.2021.04.029.
[9] L.K. Hansen, P. Salamon, Neural network ensembles, IEEE Trans. Pattern Anal. Mach. Intell. 12 (10) (1990) 993–1001.
[10] P. Pławiak, M. Abdar, U.R. Acharya, Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian credit scoring, Appl.
Soft Comput. 84 (2019) 105740.
[11] Y. Xia, C. Liu, Y. Li, N. Liu, A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring, Expert Syst. Appl. 78
(2017) 225–241.
[12] N. Arora, P.D. Kaur, A Bolasso based consistent feature selection enabled random forest classification algorithm: an application to credit risk
assessment, Appl. Soft Comput. 86 (2020) 105936.
[13] A. Huang, F. Wu, Two-stage adaptive integration of multi-source heterogeneous data based on an improved random subspace and prediction of default
risk of microcredit, Neural Comput. Appl. 33 (2021) 4065–4075.
[14] K. Niu, Z. Zhang, Y. Liu, R. Li, Resampling ensemble model based on data distribution for imbalanced credit risk evaluation in P2P lending, Inf. Sci. 536
(2020) 120–134.
[15] F. Shen, X. Zhao, G. Kou, F.E. Alsaadi, A new deep learning ensemble credit risk evaluation model with an improved synthetic minority oversampling
technique, Appl. Soft Comput. 98 (2021) 106852.
[16] L. Yu, W. Yue, S. Wang, K.K. Lai, Support vector machine based multiagent ensemble learning for credit risk evaluation, Expert Syst. Appl. 37 (2) (2010)
1351–1360.
[17] J. Xiao, H. Cao, X. Jiang, X. Gu, L. Xie, GMDH-based semi-supervised feature selection for customer classification, Knowl.-Based Syst. 132 (2017) 236–
248.
[18] S.F. Crone, S. Finlay, Instance sampling in credit scoring: an empirical study of sample size and balancing, Int. J. Forecast. 28 (1) (2012) 224–238.
[19] L. Yu, R. Zhou, T. Ling, R. Chen, A DBN-based resampling SVM ensemble learning paradigm for credit classification with imbalanced data, Appl. Soft
Comput. 69 (2018) 192–202.
[20] R.C. Holte, L. Acker, B.W. Porter, Concept learning and the problem of small disjuncts, in: Proceedings of the 11th International Joint Conference on Artificial Intelligence, 1989, pp. 813–818.
[21] T. Jo, N. Japkowicz, Class imbalances versus small disjuncts, ACM Sigkdd Explorations Newsletter 6 (2004) 40–49.
[22] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, Morgan Kaufmann, 1997, pp. 179–186.
[23] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
16 (6) (2002) 321–357.
[24] D.L. Wilson, Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man, Cybernetics 2 (3) (1972) 408–421.
[25] G.E. Batista, R.C. Prati, M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD
Explorations Newsletter 6 (2004) 20–29.

[26] H. He, Y. Bai, E.A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: Proceedings of the IEEE International Joint Conference on Neural Networks, 2008, pp. 1322–1328.
[27] C. Bunkhumpornpat, K. Sinapiromsaran, C. Lursinsap, Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009, pp. 475–482.
[28] S. Barua, M.M. Islam, X. Yao, K. Murase, MWMOTE–majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans.
Knowl. Data Eng. 26 (2) (2014) 405–425.
[29] A.I. Marqués, V. García, J.S. Sánchez, On the suitability of resampling techniques for the class imbalance problem in credit scoring, Journal of the
Operational Research Society 64 (7) (2013) 1060–1070.
[30] V. García, A.I. Marqués, J.S. Sánchez, Improving risk predictions by preprocessing imbalanced credit data, in: Proceedings of the 19th International Conference on Neural Information Processing, 2012, pp. 68–75.
[31] V. García, R.A. Mollineda, J.S. Sánchez, Index of balanced accuracy: a performance measure for skewed class distributions, in: Proceedings of the Iberian Conference on Pattern Recognition and Image Analysis, 2009, pp. 441–448.
[32] L. Thomas, J. Crook, D. Edelman, Credit Scoring and Its Applications, SIAM, 2017.
[33] C. Linhart, G. Harari, S. Abramovich, A. Buchris, PAKDD data mining competition 2009: new ways of using known methods, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2009, pp. 99–105.
[34] W. You, Z. Yang, G. Ji, PLS-based recursive feature elimination for high-dimensional small sample, Knowl.-Based Syst. 55 (2014) 15–28.
[35] W. Iba, P. Langley, Induction of One-level Decision Trees, Elsevier, 1992.
[36] J. Abellán, C.J. Mantas, Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring, Expert Syst. Appl.
41 (2014) 3825–3830.
[37] B. Zhu, B. Baesens, S.K. Vanden Broucke, An empirical comparison of techniques for the class imbalance problem in churn prediction, Inf. Sci. 408
(2017) 84–99.
[38] V. García, R.A. Mollineda, J.S. Sánchez, Theoretical analysis of a performance measure for imbalanced data, in: Proceedings of the 20th International Conference on Pattern Recognition, 2010, pp. 617–620.
[39] M. Zheng, T. Li, R. Zhu, Y.H. Tang, M.J. Tang, L.L. Lin, Z.F. Ma, Conditional Wasserstein generative adversarial network-gradient penalty-based approach
to alleviating imbalanced data classification, Inf. Sci. 512 (2020) 1009–1023.
[40] F. Kamalov, Kernel density estimation based sampling for imbalanced class distribution, Inf. Sci. 512 (2020) 1192–1201.
[41] H.A. Khorshidi, U. Aickelin, Constructing classifiers for imbalanced data using diversity optimisation, Inf. Sci. 565 (2021) 1–16.
[42] O. Loyola-González, J.F. Martínez-Trinidad, J.A. Carrasco-Ochoa, M. García-Borroto, Study of the impact of resampling methods for contrast pattern
based classifiers in imbalanced databases, Neurocomputing 175 (2016) 935–947.
[43] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[44] D. Elreedy, A.F. Atiya, A comprehensive analysis of synthetic minority oversampling technique (SMOTE) for handling class imbalance, Inf. Sci. 505
(2019) 32–64.
[45] S. Finlay, Multiple classifier architectures and their application to credit risk assessment, Eur. J. Oper. Res. 210 (2) (2011) 368–378.
[46] I. Brown, C. Mues, An experimental comparison of classification algorithms for imbalanced credit scoring data sets, Expert Syst. Appl. 39 (3) (2012)
3446–3453.
[47] T.G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization,
Machine learning 40 (2000) 139–157.
[48] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and
hybrid-based approaches, IEEE Transactions on Systems, Man, Cybernetics 42 (2) (2011) 463–484.
[49] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high
imbalanced data-sets using SMOTE and rough sets theory, Knowl. Inf. Syst. 33 (2) (2012) 245–265.
[50] D. Devi, B. Purkayastha, Redundancy-driven modified Tomek-link based undersampling: a solution to class imbalance, Pattern Recogn. Lett. 93 (2017)
3–12.
