
Received: 30 December 2017 Revised: 18 September 2018 Accepted: 9 November 2018

DOI: 10.1111/exsy.12363

ORIGINAL ARTICLE

Credit scoring for a microcredit data set using the synthetic minority oversampling technique and ensemble classifiers

Adaleta Gicić1 | Abdulhamit Subasi2

1 Info Studio d.o.o, Sarajevo, Bosnia and Herzegovina
2 College of Engineering, Effat University, Jeddah, Saudi Arabia

Correspondence
Abdulhamit Subasi, Effat University, College of Engineering, Jeddah, 21478, Saudi Arabia.
Email: absubasi@effatuniversity.edu.sa

Funding information
Info Studio d.o.o, Sarajevo, Bosnia and Herzegovina, Grant/Award Number: 1/2015

Abstract
Although microfinance organizations play an important role in developing economies, decision support models for microfinance credit scoring have not been sufficiently covered in the literature, particularly for microcredit enterprises. The aim of this paper is to create a three‐class model that can improve credit risk assessment in the microfinance context. The real‐world microcredit data set used in this study includes data from retail, micro, and small enterprises. To the best of the authors' knowledge, existing research on microfinance credit scoring has been limited to regression and genetic algorithms, thereby excluding novel machine learning algorithms. The aim of this research is to close this gap. The proposed models predict default events by analysing different ensemble classification methods that empower the effects of the synthetic minority oversampling technique (SMOTE) used in the preprocessing of the imbalanced microcredit data set. Initial results have shown improvement in the prediction results for certain classes when the oversampling technique with homogeneous and heterogeneous ensemble classifier methods was applied. A prediction improvement for all classes was achieved via application of SMOTE and the Consolidated Trees Construction algorithm together with Rotation Forest. To obtain a complete view of all aspects, an additional set of metrics is used in the evaluation of performance.

KEYWORDS

credit scoring, data mining, decision support system, ensemble classifiers, microcredit data set,
synthetic minority oversampling technique (SMOTE)

1 | INTRODUCTION

1.1 | Background

Credit scoring is an important part of the credit risk management process of banks and other financial organizations because it reduces losses,
improves predictability, and accelerates borrowing decisions (Chuang & Huang, 2011; Serrano‐Cinca & Gutiérrez‐Nieto, 2016). Because customer
credit management is a crucial task for commercial banks and other financial organizations, they must be careful when dealing with customer loans
to circumvent any inappropriate decisions that can lead to the loss of opportunity or financial losses. Moreover, an incorrect assessment of
customer credibility can significantly affect the stability of the financial organization. The labour‐intensive customer creditworthiness assessment
is time and resource consuming. Furthermore, a labour‐intensive assessment mostly depends on the bank employee; hence, computer‐based auto-
mated credit assessment models that deliver loan assessments are devised and implemented to eliminate the “human factor” in this process. Such
a computer‐based automated credit assessment model can deliver recommendations to the banks and financial organizations in terms of whether
a loan must be given or not and whether the loan will be returned or not. Currently, several credit models have been implemented, but there is no perfect classifier between them because some of them give the correct outputs and some of them do not (Ala'raj & Abbod, 2016). Credit scoring is
the term used to describe the statistical methods that decide if an applicant belongs to a good or bad class (Hand & Henley, 1997). Some authors
use the third, intermediate class of “refused” (Goovaerts, 1989), and others use “good,” “poor,” and “bad” (Sarlija, 2004). In this study, we use the
terms “good,” “poor,” and “bad.” The credit applicant is classified into one of the classes based on a number of quantifiable parameters. A member
of the “bad” risk class has a high possibility of falling into default, which means not repaying the loan as promised (Dinh, 2007).
Credit scoring has become a widely investigated area in the financial industry and among scholars (Kumar & Ravi, 2007; W.‐Y. Lin, Hu, & Tsai, 2012), and numerous models have been investigated and implemented using different algorithms such as artificial neural networks, decision trees (Hung & Chen, 2009; Makowski, 1985), support vector machines (Baesens, 2003; Huang, Chen, & Wang, 2007; Schebesch & Stecking, 2005), and case‐based reasoning (Shin & Han, 2001; Wheeler & Aitken, 2000). As a consequence of the financial crises, the Basel Committee on Banking Supervision demanded that all banking and financial companies implement rigorous credit risk assessment models when giving a loan to a company or an individual client (Ala'raj & Abbod, 2016).
The employment of several methods in the implementation of credit scoring models has evolved over time. In the beginning, scientists employed each method independently; then they started to adapt the design of credit scoring models in order to eliminate the limitations of simple models (Duan & Da Xu, 2012). At this stage, they introduced complex models with innovative methods, such as ensemble and hybrid modelling, which achieved better performance than the single methods. Ensemble and hybrid methods can be employed independently or in combination. Hybrid modelling can be formed by combining clustering and classification methods (Tsai, 2014; Tsai & Chen, 2010), by cascading several methods (Lee, Lu, Chen, & Chiu, 2002; Marqués, García, & Sánchez, 2013), or by using synergetic tools to combine several approaches into a single method, such as fuzzy‐based rules (Gorzałczany & Rudziński, 2016). On the other hand, ensemble modelling combines a group of learners trained on the same data set and aggregates their outputs to achieve an efficient classification result. Although ensemble modelling has substantial computational and financial costs, it yields universal and better classification models for credit scoring (Ala'raj & Abbod, 2016). Therefore, the aim of this paper is to implement more complex ensemble classifiers in credit scoring.

1.2 | Research motivations

The research motivation of this paper is threefold: employment of microcredit data, synthetic minority oversampling technique (SMOTE) for data
preprocessing, and ensemble classifiers. There are three different risk models used for credit assessment: corporate (large enterprises), small to
medium enterprises (SMEs), and retail (consumer). The main differences between corporate and retail risk models are related to the descriptive
variables that affect the implementation of a credit scoring. Corporate risk models are based on various financial ratios, balance sheets, or mac-
roeconomic indicators. The retail credit scoring model uses data from customer demographics, application forms, and most probably transactional
data from customer history (Lessmann, Baesens, Seow, & Thomas, 2015). Financial ratios are not sufficient for prediction of business failures,
particularly for small enterprises, because more useful information is gained from nonfinancial data rather than conventional financial indicators
(Psillaki, Tsolas, & Margaritis, 2010). Generally, models described in the literature are focused on either retail or corporate data samples. Small busi-
ness loans play a significant role in the world economy, and there is still generally inadequate research on enterprise and SME credit scoring
because data mining techniques are mostly applied to individual credit scores (Edelman, 2002). Research on the application of data mining to SME credit scoring is expected to increase significantly because of SMEs' importance in economic growth (Sadatrasoul, Gholamian, Siami, & Hajimohammadi, 2013). One of the studies in this field concerns default prediction for small Italian enterprises (Ciampi, 2015), and another concerns small enterprises in the United Kingdom (S. M. Lin, 2012). Data sets of Italian SMEs have also been used for prediction of business bankruptcy, not for credit scoring (Gordini, 2014).
Microcredit scoring has not been sufficiently studied in the literature. This could be due to the lack of publicly available microcredit data sets
and some specific features of such data sets that are not suitable for existing microcredit scoring solutions. The main goal of this study is
microcredit credit scoring for both retail and microenterprises. Microenterprises play an important role in poverty reduction and economic growth.
The microcredit data set used in this study consists of small and microbusinesses, and some of them have only one or even zero employees.
Specific attributes that contribute to the complexity of decision support models are not easily confined to the existing credit scoring solutions.
Effective credit scoring models can increase the number of credit loans and reduce the losses for microfinance organizations. SMEs are generally
unstable because of dependencies on other companies and because of the risk that the owner could withdraw the finances at any time. It is recommended to also consider SMEs' counterparts, because SMEs are directly affected by their counterparts' financial conditions (Edelman, 2002). To
obtain more accurate credit loan monitoring, the use of only economic indicators specific to small enterprises is not sufficient; instead, other
attributes related to the specific company features must also be considered (Ciampi, 2015).
The imbalanced classification problem refers to the case when the instances in one or several classes known as the majority classes are
significantly greater than the instances of other classes known as minority classes. The minority classes are generally the most significant classes
in an imbalance problem. In an imbalanced problem, one ought to be particularly concerned with these minority instances. Many studies have addressed imbalanced data learning (Barandela, Sánchez, García, & Rangel, 2003; Chawla, Cieslak, Hall, & Joshi, 2008; Provost, 2000; Weiss,
McCarthy, & Zabar, 2007). Naturally, the methods for solving the imbalanced data problem are divided into two main categories: imbalanced learn-
ing algorithms and resampling methods. The resampling method is essentially a balancing process in order to balance the imbalanced data set.

Some studies (Batista, Prati, & Monard, 2004; Japkowicz & Stephen, 2002) claimed that imbalanced data sets are not essentially responsible for
the poor performance of some classifiers, though some other studies (Estabrooks, Jo, & Japkowicz, 2004; Weiss & Provost, 2003) have revealed
that balanced data sets deliver better classification accuracy than the imbalanced ones. Resampling methods are effective in case of imbalanced
data situations; they regulate only the original training data set instead of changing the learning algorithm. Therefore, this method is portable
(Drummond & Holte, 2003; Estabrooks et al., 2004), and it delivers a suitable and efficient way to deal with imbalanced learning problems
employing standard classifiers. More precisely, the resampling approach includes random undersampling, which randomly removes instances from the majority class, and random oversampling, which randomly adds replicated instances to the minority class. Moreover, the SMOTE
(Chawla, Bowyer, & Kegelmeyer, 2002) is a well‐known oversampling method in which the minority classes are oversampled by generating
synthetic instances in the feature space formed by the minority instances and their k‐nearest neighbours. Hence, one of the motivations of this
paper is to use the SMOTE in microcredit scoring.
The new research trends have been evolving towards using single machine learning methods in building ensemble models (Wang, Hao, Ma, &
Jiang, 2011). Tsai (2014) claimed that the ensemble models based on the combination of diverse classifiers achieve a better classification accuracy
by eliminating the other classifiers' errors. Actually, an ensemble classifier is a set of classifiers, in which the decisions of every classifier are com-
bined using the same approach (Nanni & Lumini, 2009). In the credit scoring studies, several ensemble classification methods are implemented to
form homogenous classifier ensembles that combine the classifiers of the same algorithm and heterogeneous classifier ensembles that combine
different algorithms (Lessmann et al., 2015; Partalas, Tsoumakas, & Vlahavas, 2010; Tsai, 2014). Ala'raj and Abbod (2016) proposed a credit scoring
architecture to combine the heterogeneous ensemble classifiers and a new combination rule termed the consensus approach, where the classifiers
work as a team to achieve a consensus on all the data points' final outputs by combining their rankings. The second and the main motivation of this
paper is to utilize the ensemble classifier in microcredit scoring.

1.3 | Contribution
Based on the above motivations, a new microcredit scoring model based on SMOTE and ensemble classifiers is proposed with the aim of increasing accuracy and achieving better classification performance. Experimentally, the SMOTE oversampling method improves the performance of the classifiers, and combining the decisions of the classifier ensembles achieves better accuracy. In the literature, few studies have employed microcredit data in their implementation; in order to fill this gap, we used a microcredit data set and the SMOTE oversampling method in addition to the ensemble classifier model.
The contribution of this study is a comparison of 57 different decision support models for a microcredit data set. To the best of the authors' knowledge, the data mining techniques applied in this study have not previously been applied to a microcredit data set. Moreover,
improvement of the prediction performance of the credit scoring model was also achieved at the data level, not only at the algorithmic level.
Furthermore, instead of the most common “yes/no” credit scoring cards, this model proposes three classes that are rarely considered in the score
cards. Another contribution of this study is the improvement of classification accuracy for the unstable, minority class of the microcredit data. Due
to the imbalanced data set, the SMOTE technique (Chawla et al., 2002) was applied for data‐level optimization. The application of data mining
techniques based on heterogeneous algorithms achieved significant improvements, but remarkable results in the prediction of all classes were
achieved by applying the Consolidated Trees Construction (CTC) decision tree algorithm.
The rest of the paper is structured as follows. In Section 2, previous studies are presented. Section 3 presents the structure and char-
acteristics of the microcredit data set used in this study and the methods applied. Section 4 describes the experimental set‐up and criteria
used for the performance evaluation. The generalization ability of the proposed framework is assessed by utilizing widely used performance
measures across real‐world microfinance credit data sets. The performance results are compared with those of other data mining techniques
by using different measures, such as Percentage Correctly Classified (PCC), F‐Measure, Matthews Correlation Coefficient (MCC), root mean
square error (RMSE), receiver operating characteristic (ROC), and the Kappa statistic. The experimental results are presented in Section 5.
The complete study compares different data mining techniques, such as single classifiers and homogeneous and heterogeneous ensembles
combined with the oversampling technique. Finally, Section 6 presents the study's conclusions and outlines recommendations for future
research.

2 | LITERATURE REVIEW

A credit scoring model that calculates customers' scores using data mining techniques requires historical data related to credit risk and demographic characteristics (Yap, 2011). In the past, various techniques have been used for credit scoring, such as probit analysis, linear and logistic regression, and nonparametric smoothing methods such as k‐nearest neighbours (k‐NN), Markov chain models, mathematical programming, artificial neural networks, recursive partitioning, expert systems, and genetic algorithms (Hand & Henley, 1997). Baesens (2003) compared 17 classifiers on eight data sets. Ensembles of classifiers also emerged as novel classification approaches and have since been improved and combined (Dahiya, Handa, & Singh, 2017). In another study, Baesens presented new benchmarking analyses of credit scoring algorithms (Lessmann et al., 2015), including novel learning methods and new data sets.

Typical classification methods assume balanced class distributions or equal misclassification costs (He & Garcia, 2009). Thus, data mining
algorithms cannot properly characterize the distributive nature of the data and deliver unsatisfactory classification rates when presented with imbalanced data sets. The key issue in the imbalanced classification problem is to significantly improve the minority‐class detection rate of typical learning algorithms. Numerous data processing methods have addressed the imbalanced learning problem. The most popular one is SMOTE, which oversamples the minority class examples to achieve balanced data sizes between the classes (Chawla et al., 2002; Napierala & Stefanowski, 2016). SMOTE is widespread among sampling methods and is
usually combined with ensemble classification methods (Díez‐Pastor, Rodríguez, García‐Osorio, & Kuncheva, 2015; Huang & Zhang, 2017;
Lee, Jun, & Lee, 2017; Sun, Lang, Fujita, & Li, 2018). This contribution suggests an efficient alternative to solve imbalanced classification
problems by combining the SMOTE algorithm (Chawla et al., 2002) and the Ensemble classifier (Gao, Hong, Chen, & Harris, 2011). More
specifically, in the first step, the SMOTE is utilized to produce synthetic instances in the minority class to balance the training data set.
In the second step, the Ensemble classifier is created by utilizing the resulting balanced data set. Hence, in the experiments, SMOTE
with different ensemble classifier methods are utilized to classify a real imbalanced microcredit data set. The experimental results show
that the utilized framework is competitive with the existing state‐of‐the‐art methods for imbalanced problems in credit scoring for
microfinance companies.
Several studies discussed the mechanism and strategies for credit scoring using data mining techniques. Lessmann et al. (2015) discussed
developing models to support decision making in credit scoring. A credit score is a model‐based estimate of the probability that the borrower will show undesirable behaviour in the future. Van Gool, Verbeke, Sercu, and Baesens (2012) stated that banks and microfinance companies need to pursue both their social and financial goals to survive. They discussed whether microfinance firms can profit from credit scoring, which has been effectively adopted in retail banking. The ultimate threat for banks and financial organizations comes from the difficulty of distinguishing financially sound applicants from those who will most likely default on repayments. The recent global financial crisis has stimulated financial organizations to consider credit prediction. Traditionally, the decision to grant credit to a customer was made by human specialists, drawing on past experience and some guiding standards. Automated credit scoring has therefore become an essential instrument for financial organizations to assess credit risk, enhance income, reduce possible risks, and make managerial decisions. Credit scoring is the set of decision models and their underlying techniques that help lenders decide whether credit ought to be granted to an applicant. The ultimate objective of credit scoring is to assess creditworthiness and make a clear distinction between “good” and “bad” debts, depending on how likely applicants are to default on their repayments (Marqués et al., 2013). In the
research study conducted by Koutanaei, Sajedi, and Khanbabaei (2015), data mining techniques were used for credit scoring of clients. They showed that using feature selection and ensemble classifiers can enhance banks' performance in credit scoring. Unlike statistical techniques, data mining techniques do not assume a certain data distribution; they extract information from training samples. Studies have shown that data mining techniques are superior to statistical techniques in dealing with credit scoring problems (Z. Huang, Chen, Hsu, Chen, & Wu, 2004).
Ensemble machine learning utilizes multiple learners by training them to solve the same problem (Duan & Binbasioglu, 2017; Polikar, 2006). Ensemble models try to combine an entire set of hypotheses, unlike other techniques where one hypothesis is learned from the training data (Zhou, 2009); the end goal is to lessen the impact of noisy data and redundant attributes and thereby improve on the performance of a single classifier. Besides the hybrid techniques, ensemble classifiers have recently been introduced for
credit scoring assessment (Li & Zhong, 2012). In the ensemble classifier, the result of many classifiers is combined to produce a decision,
and in the hybrid methods, only one classifier produces the final result, and the other classifiers' outputs are used as an input to this final clas-
sifier (Verikas, Kalsyte, Bacauskiene, & Gelzinis, 2010). Because the ensemble classifiers are able to outperform the best single classifier's per-
formance, they are utilized in credit scoring in an efficient way (Kittler, Hatef, Duin, & Matas, 1998). During the implementation of the
ensemble classifier, every single classifier should be as different as possible from the others (Nanni & Lumini, 2009). Most of the studies on
ensemble classifiers in credit scoring concentrated on homogenous ensemble classifiers by means of basic fusion methods and simple
combination rules such as weighted voting, weighted average, majority voting, stacking, fuzzy rules, and reliability‐based methods (Tsai,
2014; Tsai & Wu, 2008; Wang, Ma, Huang, & Xu, 2012; West, Dellana, & Qian, 2005; Yu, Wang, & Lai, 2008; Yu, Yue, Wang, & Lai,
2010; Yu, Wang, & Lai, 2009). Some researchers used heterogeneous ensemble classifiers but still with the above‐mentioned combination rules
(Ala'raj & Abbod, 2016; Hsieh & Hung, 2010; Lessmann et al., 2015; Tsai, 2014). In ensemble modelling, every classifier is trained individually
to yield its decisions that are afterward combined by means of an empirical algorithm to yield the final decision (Rokach, 2010; Zhang, Zhou,
Leung, & Zheng, 2010). Ala'raj and Abbod (2016) proposed an ensemble model to improve classification performance by using a hybrid
ensemble credit scoring model. Hsieh and Hung (2010) introduced a model based on class‐wise classification in the preprocessing step to
create an effective ensemble learner. The proposed ensemble learner model is created by combining numerous data mining techniques. They
used the optimal associate binning to discretize continuous values and neural network, support vector machine (SVM), and Bayesian network
to develop the ensemble classifier. Xiao, Xiao, and Wang (2016) proposed a novel ensemble classifier model that uses supervised clustering for
credit scoring. Supervised clustering is used to divide each class of data samples into different clusters. These clusters are then combined
to create a number of training subsets. A specific base classifier is created in each training subset, and then the results of these base classifiers
are combined by using weighted voting. The weight of each base classifier is determined by its classification accuracy in the neighbourhood of the sample.

3 | MATERIALS AND METHODS

3.1 | Microcredit data set

The data set used in this study is a microcredit data set of 14,387 instances with aggregated retail and corporate attributes. The number of retail
instances in the data set is 14,337, and the number of enterprise instances is 50. Because microfinance retail customers are often represented by
sole‐ownership companies, it can be argued that the retail and private customers belong to the same class.
Microcredit refers to short‐term small loans given to customers that usually do not have collateral, credit history, or steady employment and
are usually refused by banks. Credit loans are used for microbusiness and small business activities in agriculture, trade, services, and production.
They have a significant role in the development and empower women by supporting women's economic participation. Due to its specific charac-
teristics, customer behaviour is difficult to predict, and microcredit organizations compensate for it with high credit rates. The terms of microcredit
and microfinance are not identical because microcredit is one component of microfinance that provides credit to the poor. Microfinance also con-
sists of additional noncredit financial products, such as savings, insurance, pensions, and payment services (Srinivasan, 2014).
Real‐world data sets generally consist of “normal” samples with a small percentage of “abnormal” or “interesting” samples (Chawla et al., 2002).
In credit scorecards, the types of the customers are generally divided into only two classes: “good” and “bad.” “Good” customers in credit scoring
have not been reported as more than 90 days delinquent during the forecast period, and “bad” customers are those who have. In this study, three‐
class microcredit data sets were used. In the analysed data set, a “good” customer has not been reported as being more than 16 days late, a “poor”‐
class customer is between 16 and 90 days beyond their agreed payment days, and a “bad” customer is reported as more than 90 days delinquent.
The microcredit data set used in this research has two types of imbalanced data: between‐class imbalance and within‐class imbalance. “Within‐
class imbalance” is related to the number of enterprise instances compared with the number of retail instances and the number of “poor”‐class
instances within the complete data set. Another type of imbalance is the negligible percentage of paid‐off credits within the minority “poor” class:
0.060255%. Additionally, the greatest challenge is to improve the prediction of this “poor” class. Hence, the “poor” category is not only the minor-
ity class but also very unstable and difficult to predict.
Every record of the data set consists of 25 attributes that are not correlated and have discrete values, which are given in Appendix A.
The data set of 14,387 different instances has three classes: 8,642 instances of the “good” class, 520 instances of the “poor” class with late
payment, and 5,225 instances of the “bad” class that cannot recover from default. The sample consists of 6,090 active credit instances and 8,297
paid‐off instances. Among paid‐off credit samples, there are only five instances of the “poor” class (0.060255%).

3.2 | Synthetic minority oversampling technique

SMOTE is an effective oversampling technique for imbalanced data sets that solves classification problems regarding the poor prediction of the
minority class and solves the general misleading presentation of the classifier performance (Blagus, 2013). The main problem with data sets with
the minority data samples is that the model is trained on an inadequate data set because the minority class is the one with “true population values”
(Blagus, 2013). The SMOTE approach allows the decision region of the minority class to become generalized (Chawla et al., 2002).
The SMOTE technique, proposed in 2002 by Chawla (Chawla et al., 2002), is based on creating additional samples instead of perturbing the data (Ha & Bunke, 1997). SMOTE with k‐NN chooses amount_percent/100 samples out of the k nearest neighbours, depending on the percentage of oversampling needed. For each chosen neighbour, a synthetic sample is generated by taking the difference between the sample and that neighbour, multiplying this difference by a random number between 0 and 1, and adding it to the original feature vector (Chawla et al., 2002).
Although, according to Blagus (2013), SMOTE is most effective in high‐dimensional data sets, we will show that, in the analysed case, where the number of samples is much larger than the number of attributes, SMOTE considerably improves prediction performance.
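To make the generation step concrete, the following minimal Python sketch implements the core of the SMOTE idea described above. It assumes a purely numeric minority‐class matrix; the function name and parameters are illustrative and are not the Weka implementation used later in the experiments.

```python
# A minimal sketch of SMOTE-style oversampling (Chawla et al., 2002), assuming
# a numeric minority-class matrix X_min with more rows than neighbours k.
import numpy as np

def smote(X_min, amount_percent=100, k=5, seed=1):
    """Generate synthetic minority samples along lines to k nearest neighbours."""
    rng = np.random.default_rng(seed)
    n_new = int(len(X_min) * amount_percent / 100)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude each point itself
    nn = np.argsort(d, axis=1)[:, :k]            # indices of k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # pick a minority sample
        j = nn[i, rng.integers(k)]               # pick one of its k neighbours
        gap = rng.random()                       # random number in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

Each synthetic point lies on the line segment between a minority sample and one of its k nearest neighbours, which is what lets the minority decision region generalize.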

3.3 | Support vector machine

SVM is a machine learning method for solving classification problems of linear and non‐linear data (Han, Pei, & Kamber, 2011). The main principle
of the algorithm is to find an appropriate hyperplane for the class distinctions. In the case of linearly separable data, a hyperplane would be a linear
decision function (Vapnik & Cortes, 1995). Support vectors are used to determine the maximum margin between two classes, which defines the
optimal hyperplane (Vapnik & Cortes, 1995). The set of training patterns

$(y_1, x_1), \ldots, (y_l, x_l), \quad y_i \in \{-1, 1\}$ (1)

is linearly separable if vector w and scalar b exist such that the inequalities

$w \cdot x_i + b \ge 1 \quad \text{if } y_i = 1$

$w \cdot x_i + b \le -1 \quad \text{if } y_i = -1$ (2)

are valid for all elements of the training set (1). The optimal hyperplane (Vapnik & Cortes, 1995) is

$w_0 \cdot x + b_0 = 0$ (3)

For linearly nonseparable data, input data are mapped into the higher dimensional space using non‐linear mapping. The next step is to find the
linear separating hyperplane in the new space that corresponds to the non‐linear separating hypersurface in the original space (Han et al.,
2011). Even with the small number of support vectors compared with the size of a training set, an effective optimal hyperplane can be produced.
It is also effective with the infinite dimensional space (Vapnik & Cortes, 1995). In practice, SVM is used for various purposes, such as handwritten
digit recognition, object recognition, speaker identification, benchmark time‐series prediction tests, and so on (Han et al., 2011).
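As a brief illustration of Equations (1)–(3), the sketch below fits a linear SVM on a tiny separable toy set and reads back the hyperplane parameters; the data and scikit‐learn usage are illustrative only, not the set‐up of this study.

```python
# A minimal linear SVM on a separable toy set (scikit-learn), recovering the
# hyperplane parameters w0 and b0 of Equation (3); illustrative only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])                 # labels as in Equation (2)
clf = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                                  # decision rule: sign(w.x + b)
print(clf.support_vectors_)                  # the points defining the margin
```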

3.4 | Multilayer perceptron algorithm

Multilayer perceptron (MLP) is an artificial neural network model that consists of a set of nodes and connections. Connections represent weights
and output signals that are a function of the weighted sum of the inputs to the node (Gardner & Dorling, 1998). Networks with two or more layers of weights, arranged in a feed‐forward topology with either threshold or sigmoidal activation functions, are generally called multilayer perceptrons (Bishop, 1995). The MLP consists of the input layer (input vector), several hidden layers, and the output layer (output vector; Gardner
& Dorling, 1998). The input vector is non‐linearly mapped to the output vector. Single‐hidden‐layer feed‐forward networks can approximate any
measurable function well, regardless of the activation function, regardless of the dimensions of the input space, and regardless of the input space
environment (Hornik, Stinchcombe, & White, 1989). The MLP has an input layer of d units, and the values of the units are computed layer by layer
until the output units are computed (Baum, 1998).
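A minimal sketch of such a network follows, using scikit‐learn's MLPClassifier with one hidden layer of sigmoidal units; the layer size and the toy data are illustrative assumptions, not the configuration used in this paper.

```python
# A minimal MLP with one hidden layer of sigmoidal units (scikit-learn);
# the XOR-style toy data are for illustration only.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer, 10 units
                    activation="logistic",      # sigmoidal activation
                    max_iter=2000, random_state=1)
mlp.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 1, 1, 0])
print(mlp.predict([[0, 1], [1, 1]]))
```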

3.5 | Nearest neighbour

The nearest neighbour concept was proposed in the 1950s, but due to slow calculations, intensive use did not occur immediately. This approach is a very simple method in which classification is based on comparing test instances to the training set. A test instance defined by n attributes searches for the closest pattern in terms of a distance metric. Choosing the best value of k is crucial for good performance, and it is determined experimentally by minimizing the error rate. The value of k will be higher for a larger number of training instances. Problems with slow execution are solved by the introduction of a partial distance method, where further computation of the distance is stopped after reaching a certain threshold, and the process then continues with the next instance (Han et al., 2011).
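The experimental choice of k described above can be sketched as a simple search that minimizes cross‐validated error; the synthetic two‐class data here are illustrative only.

```python
# Choosing k for k-NN by minimizing 10-fold cross-validated error
# (scikit-learn); the synthetic data stand in for a real data set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(2, 1, (100, 4))])
y = np.array([0] * 100 + [1] * 100)

best_k, best_acc = 1, 0.0
for k in range(1, 16, 2):                      # odd k reduces tie votes
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                          X, y, cv=10).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, round(best_acc, 3))
```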

3.6 | Decision tree algorithms

C4.5 is a decision tree algorithm that was developed by Quinlan (1993) as an improvement of the ID3 algorithm. The algorithm is based on the
principle of information entropy that solves the problem related to training data that has features with large numbers of values. Generally, decision
trees are models where the leaf nodes represent classifications, inner nodes represent attributes, and branches are conjunctions of attributes that lead to a classification. If we have a training set $S = (s_1, s_2, \ldots, s_n)$, then each $s_i$ represents a vector $s_i = (x_1, x_2, \ldots, x_m, c_i)$, where $c_i$ is the class of the vector and $x_1, x_2, \ldots, x_m$ are predictive attributes that split the data. Making a decision within the tree is based on the importance of the attribute in splitting the set into subsets. When all the samples found in a subset are from the same class, a leaf node of the decision tree is created to choose that class (Duan & Da Xu, 2012; Zhang et al., 2010).
CTC is the algorithm created to improve the performance of C4.5, and contrary to bagging and boosting, it is formed by one tree instead of the
set of trees (Pérez, Muguerza, Arbelaitz, & Gurrutxaga, 2004). It is used to resolve classification problems related to highly imbalanced data without
losing the explaining capacity, which is an important characteristic of single decision trees and rule sets (Ibarguren, Pérez, Muguerza, Gurrutxaga, &
Arbelaitz, 2015). The CTC algorithm is based on resampling the training set and building a tree from each subsample, but a single tree is made by consensus among the trees, which is significantly different from Bagging. Each tree, built from a different subsample using the base classifier, proposes a variable to split the current node, and the split is agreed upon among the trees by voting. The process is repeated iteratively as long as the stopping condition is not fulfilled (Pérez et al., 2004). Induced from different training samples, consolidated trees
are more stable and less complex than C4.5 trees (Ibarguren et al., 2015). The CTC algorithm can use undersampling to change the class distribu-
tion without loss of information, building more accurate classifiers than C4.5 (Pérez, Muguerza, Arbelaitz, Gurrutxaga, & Martín, 2007).
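As an illustration of entropy‐based tree induction, the sketch below uses scikit‐learn's CART implementation with the entropy criterion as a stand‐in for C4.5; C4.5 and CTC themselves are available in Weka‐based tooling rather than scikit‐learn, and the toy data are illustrative.

```python
# An entropy-based decision tree as a rough stand-in for C4.5
# (scikit-learn's CART with criterion="entropy"); toy data only.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = ["bad", "good", "good", "good"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X, y)
print(export_text(tree, feature_names=["f1", "f2"]))   # inspect the splits
```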

3.7 | Bayesian networks

Bayesian networks represent the probabilistic relationships among a large number of variables and enable probabilistic inference with those variables via a directed acyclic graph (Neapolitan, 2003). A graph is the first component of a Bayesian network classifier, and the second component
is a conditional distribution of all variables given its parents in a graph (Friedman, Linial, Nachman, & Pe'er, 2000). Bayesian network classifiers
learn with a fixed structure, and the paradigmatic example is the Naïve Bayes classifier (Sebe, Cohen, Garg, & Huang, 2005; Xu & Duan, 2018).

Naïve Bayes is a probabilistic classifier based on conditional independence in which the features are assumed independent given the class
(Sebe et al., 2005).
Posterior probability is calculated as

$P(c \mid x) = P(x \mid c)\,P(c) / P(x)$ (4)

where P(c) is the prior probability of the class, P(x|c) is the probability of the predictor given a class, and P(x) is the prior probability predictor. Naïve
Bayes classifiers have been widely used in many classification problems. As it is simple and described with a small number of estimated parame-
ters, the Naïve Bayes is a successful classifier (Sebe et al., 2005).
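A small worked example of Equation (4) with made‐up probabilities shows how the posterior is formed from the likelihood, the prior, and the evidence:

```python
# A worked numeric example of Equation (4) with made-up probabilities:
# posterior = likelihood x prior / evidence.
p_x_given_c = {"good": 0.30, "poor": 0.05, "bad": 0.10}   # P(x | c), assumed
p_c = {"good": 0.60, "poor": 0.04, "bad": 0.36}           # P(c), assumed
p_x = sum(p_x_given_c[c] * p_c[c] for c in p_c)           # P(x), total probability
posterior = {c: p_x_given_c[c] * p_c[c] / p_x for c in p_c}
print(posterior)   # e.g., P(good | x) = 0.18 / 0.218 ~ 0.826
```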
Averaged n‐Dependence Estimators (AnDE) is a generalization of the Averaged One‐Dependence Estimators (AODE) algorithm, which appeared as a significant improvement of Naïve Bayes. The advantages of this algorithm are high training and classification efficiencies, together with high classification accuracy. To avoid structure learning, a one‐dependence estimator is created for each attribute, and then the predictions of all estimators are averaged (Wang, Zhou, & Guo, 2007). Weighted averaging improves the accuracy of AODE's estimates (Webb, Boughton, & Wang, 2005).
When classifying an object $x = \langle x_1, \ldots, x_n \rangle$ and applying the product rule

$P(y, x) = P(y, x_i)\,P(x \mid y, x_i)$ (5)

AODE can be expressed as

$$P(y, x) = \frac{\sum_{i:\, 1 \le i \le n \,\wedge\, F(x_i) \ge m} P(y, x_i)\, P(x \mid y, x_i)}{\left| \{ i : 1 \le i \le n \wedge F(x_i) \ge m \} \right|} \quad (6)$$

where m is a minimum frequency limit, and $F(x_i)$ is a count of the number of training examples having attribute‐value $x_i$, used to implement the limit m.
AnDE further relaxes the independence assumption by generalizing AODE to higher levels of dependence (Webb et al., 2005).

3.8 | Ensemble of classifiers

Bagging, Boosting, Wagging, and Stacking are ensembles of up to hundreds of classifiers. Among them, boosting is the most powerful ensemble
because it uses the “statistical technique of additive models” (Witten, Frank, & Hall, 2011).
In both Bagging and Boosting, the new model is formed by combining the same type of classification algorithms and voting based on the
weighted vote in classification problems and on the weighted average in the case of numerical predictions. The training data set is split into smaller
data sets of the same size. Data sets are resampled, and because data sets are derived from the same data set, they still have dependencies that
improve their prediction capabilities. Models are individually trained and constructed on each data set. The models do not make the same predictions, particularly when trained on smaller data sets and using decision trees. This difference in predictions is caused by the instability of decision tree models, where one wrong split can lead to an incorrect final estimation. The class with the largest number of votes is then chosen. The main
difference between bagging and boosting is that, in bagging, all models have the same weight, and in boosting, different weights are assigned
to the model depending on its success in predictions. Boosting differs from bagging not only in the assigning of different weights to the models
depending on the performance but also in the principle of making the model. Whereas bagging builds models separately, boosting creates them in
an iterative manner so that all previously created models have an influence on the new ones (Witten et al., 2011).
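The contrast described above can be sketched as follows: Bagging trains equal‐weight trees on bootstrap resamples in parallel, whereas AdaBoost builds weighted trees sequentially. The scikit‐learn classes and the commented fit calls (X_train and y_train are assumed to exist) are illustrative, not the Weka configuration used in the experiments.

```python
# Bagging versus AdaBoost over tree base learners (scikit-learn).
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each tree is trained on a bootstrap resample and votes equally.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10,
                            random_state=1)
# Boosting: stumps are built sequentially; misclassified instances gain weight.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=10, random_state=1)
# bagging.fit(X_train, y_train); boosting.fit(X_train, y_train)
```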
Wagging is an ensemble of classifiers that is similar to bagging with the difference that Gaussian noise is added to the weights to vary them.
Random bootstrap samples are not used; instead, weights are repeatedly perturbed. The advantage of this method is that, by increasing the stan-
dard deviation of noise, it is possible to make a trade‐off between bias and variance. However, some weights could be decreased to zero, and
those instances would be removed (Kohavi, 1998). This can be circumvented by using a continuous Poisson distribution to assign random
instance weights over continuous rather than discrete space (Webb, 2000).
Stacking is different from bagging and boosting as it allows the combination of different algorithms, which makes the performance analysis
more complicated. Making a decision using unweighted votes is acceptable as long as the classifiers have approximately similar performances,
but if two out of three algorithms have bad performance, it can severely influence the prediction results. Because it is not clear which model is
reliable via voting, a metaclassifier is introduced instead of voting. It receives inputs from the base classifiers, and the number of attributes of
metamodel training instances is equal to the number of base classifiers. The metamodel is then trained, but there is a danger of choosing classifiers
that overfit the data compared with realistic ones. To eliminate this problem, some instances of a training set are put aside for independent
evaluation. Models built from base classifiers are used for the classification of instances extracted from the training set and to create upper level
training data. This is a valid indicator of the base classifiers' performance because they were not trained on the extracted data set in the first place
(Witten et al., 2011).
Voting, unlike Stacking, does not have a metaclassifier. The decision is made based on the plurality, probabilistic, or weighted vote. The
plurality vote is the simplest because the class of the candidate with the highest number of votes is chosen (Džeroski & Ženko, 2004).
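A compact sketch of both combination schemes, with illustrative base learners, follows; scikit‐learn is used here as a stand‐in for the Weka implementations applied in Section 4.

```python
# Plurality voting versus stacking with a metaclassifier over heterogeneous
# base learners (scikit-learn); the learner choices are illustrative.
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

base = [("tree", DecisionTreeClassifier(random_state=1)), ("nb", GaussianNB())]
voter = VotingClassifier(estimators=base, voting="hard")   # plurality vote
# Stacking: the metaclassifier is trained on held-out base-learner
# predictions produced via internal cross-validation.
stacker = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression())
# voter.fit(X_train, y_train); stacker.fit(X_train, y_train)
```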
AdaBoost is another variation of the boosting algorithm. Classifiers must be of the same type, and a model is created using the weights of the instances. The C4.5 algorithm is the most appropriate base classifier because it does not require modifications to handle weights; it already uses the notion of fractional instances to resolve missing values. With different weights of instances, AdaBoost improves the
prediction capabilities because the model can focus on “important” instances that are crucial to making decisions. The error of AdaBoost is
calculated by dividing the weights of misclassified instances by the total weight of all instances (Witten et al., 2011).
MultiBoosting begins with the assignment of the same weights to all instances of the training data. When the new model is created,
prediction results are compared with the correct classes. Based on the model's classification, instances are assigned increased or decreased weights. The new set of instances is then classified again with another classifier, and some “hard” instances become even “harder”
and “light” even “lighter.” There is a danger of overfitting the data, and sometimes, the accuracy with a single classifier is better than a model
created by boosting (Webb, 2000).
Random Forest is an ensemble of independent decision trees. A model is created by voting among the trees. At each node, a random subset of attributes is chosen for splitting. The size of this subset is generally much smaller than the total number of attributes; up to $\log_2 d + 1$ attributes are chosen, where d is the number of attributes (Han et al., 2011). In some cases, Random Forests outperform AdaBoost with regard to
error rates. That depends on single decision tree performances and their correlation (Breiman, 2001). Random Forest is a type of Bagging algo-
rithm with the difference being that Random Forest makes the random selection of attributes whereas Bagging makes splits based on a complete
set of attributes (Breiman, 2001).
Rotation Forest is an ensemble of independently trained decision trees. It is more accurate than Bagging, Random Forest, and AdaBoost
(Kuncheva & Rodríguez, 2007). Rotation Forest supports the Random Forest idea because it also independently builds decision trees, but unlike
Random Forest, Rotation Forest is built on the entire data set in a rotated feature space (Kuncheva & Rodríguez, 2007). It takes k as an input parameter, randomly splits the features into k subsets, and then applies Principal Component Analysis to each subset, creating new features through axis rotations (Rodriguez, Kuncheva, & Alonso, 2006). Each decision tree is built on the rotated training data. Decision trees are a good option
because they are sensitive to the rotation of the feature axes. Good accuracy is achieved because all principal components are kept and the entire
data set is used in base classifier training (Rodriguez et al., 2006).
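The rotation step can be sketched as follows. This illustrative implementation splits the features into random groups, rotates each group with PCA, and trains each tree on the rotated data; it omits parts of the published algorithm (e.g., per‐tree bootstrapping and PCA centring) and assumes integer class labels.

```python
# A compact sketch of the Rotation Forest idea (Rodriguez et al., 2006):
# split features into random groups, rotate each group with PCA, and train
# each tree on the rotated data. Centring is omitted for brevity, and the
# predict step assumes non-negative integer class labels.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def rotation_forest_fit(X, y, n_trees=10, n_groups=3, seed=1):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        groups = np.array_split(rng.permutation(X.shape[1]), n_groups)
        R = np.zeros((X.shape[1], X.shape[1]))      # block rotation matrix
        for g in groups:
            pca = PCA().fit(X[:, g])                # keep all components
            R[np.ix_(g, g)] = pca.components_.T
        trees.append((R, DecisionTreeClassifier(random_state=seed).fit(X @ R, y)))
    return trees

def rotation_forest_predict(trees, X):
    votes = np.array([t.predict(X @ R) for R, t in trees])
    # Majority vote across the independently trained trees.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because all principal components are kept, each rotation is invertible and no information is lost; the diversity among trees comes purely from the different rotations each one sees.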

4 | EXPERIMENTAL DESIGN

The data set used in this study was obtained from a production database of a microcredit organization and thus represents real‐world data.
Data sets of both retail and enterprises are gathered and analysed as a whole, and therefore, the proposed models include both types of indicators.
From the literature, it is evident that microcredit data sets, particularly for microenterprises and small enterprises in Asia, Africa, and South America, are not publicly available and have been gathered mostly using questionnaires.

4.1 | Data set creation and preprocessing

First, data mining techniques are applied to create a structured data set by identifying target data sets, replacing missing values, extracting relevant
attributes from various database sources, removing redundant and correlated data and transforming it with the creation of common units, and
generating new fields. Hence, the data set used in this study consists only of independent variables. Data transformation was applied to most
of the attributes that are converted and mapped to numerical values. Discrete ranges of variables were mapped and reduced to a few categories
to improve the data set quality during the decision‐making processes. Data dependencies cost more than code dependencies.
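An illustrative pandas fragment of the kind of transformation described above follows; the column names are hypothetical, but the three‐class cut points mirror the good/poor/bad thresholds defined in Section 3.1.

```python
# Illustrative preprocessing: fill missing values, map categories to numeric
# codes, and reduce a numeric range to a few classes (column names assumed).
import pandas as pd

df = pd.DataFrame({
    "loan_purpose": ["trade", "agriculture", None, "services"],
    "days_late": [0, 25, 120, 7],
})
df["loan_purpose"] = (df["loan_purpose"].fillna("unknown")
                                        .astype("category").cat.codes)
# Three-class target: good (<=16 days), poor (17-90), bad (>90).
df["class"] = pd.cut(df["days_late"], bins=[-1, 16, 90, float("inf")],
                     labels=["good", "poor", "bad"])
print(df)
```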

4.2 | Experimental set‐up

In this study, microcredit scoring models were built using a real‐world training–testing data set. The publicly available, open‐source software Weka was employed (Weka, 2017). Weka is a set of state‐of‐the‐art machine learning algorithms and preprocessing tools that allows easy and efficient creation of prediction models for new data sets (Frank, 2010).
In the literature, the k‐fold cross‐validation technique has often been applied for evaluation purposes. A split‐sample set‐up is also well established in the literature and was used by Lessmann et al. (2015). By applying this approach, the data set is randomly partitioned into a training set and a hold‐out test set for model building and evaluation, respectively (Lessmann et al., 2015). During the experiment, 10‐fold cross‐validation was used, with nine subsets for training and one for testing purposes.
Various classification techniques were applied to build the credit scoring models, including single classification algorithms and homogeneous
and heterogeneous ensembles. Classification performances were calculated and compared. Algorithms were used with default parameters. For
example, for Rotation Forest, attributes such as the maximum and minimum numbers of groups were set to 3, the number of iterations of Rotation
Forest was set to 10, and percentage of instances to be removed was set to 50%. Different algorithms were applied and compared as classifiers
for Rotation Forest. The projection filters, Principal Components and Random Subset, used with Rotation Forest are also applied with the default parameters. Section 5 indicates which projection filter was used with Rotation Forest. Homogeneous ensemble algorithms, such as the
Voting algorithm, also combine various sets of single classifiers. A heterogeneous ensemble classifier, Stacking, combines different base classifiers
and metaclassifiers.
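A comparable set‐up can be sketched in Python with scikit‐learn and imbalanced‐learn instead of Weka; the synthetic three‐class data below merely stand in for the microcredit set, and the parameter values mirror the defaults described above.

```python
# A sketch of a comparable SMOTE + ensemble pipeline (scikit-learn /
# imbalanced-learn rather than Weka); synthetic imbalanced data for illustration.
import numpy as np
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = np.array([0] * 500 + [1] * 80 + [2] * 20)       # imbalanced three classes

pipe = Pipeline([
    ("smote", SMOTE(k_neighbors=5, random_state=1)),  # 5 nearest neighbours
    ("clf", RandomForestClassifier(random_state=1)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
print(cross_val_score(pipe, X, y, cv=cv, scoring="accuracy").mean())
```

Note that the imbalanced‐learn pipeline applies SMOTE only to the training folds of each cross‐validation split, so no synthetic instances leak into the test folds; it also balances classes by default rather than taking Weka's percentage parameter.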

The classification performances of the models were additionally improved using SMOTE with the default values. The random seed value was
set to 1. The class value was set to 0 to detect the nonempty minority class. The percentage of SMOTE instances to create was set to 100%. The
number of nearest neighbours was set to 5. SMOTE was applied to the data set once and then applied a second time only for the best performing
algorithms.
Feature selection, which has been covered extensively in the literature, has shown good results when applied to publicly available data (Ala'raj & Abbod, 2016) in various fields, such as medicine (Aličković & Subasi, 2015), bioinformatics, and banking. However, when applied to the analysed microcredit data set, it did not improve the prediction performance, so feature selection is not used in the experiments.

4.3 | Performance evaluation criteria

For accuracy measurement, there is a variety of available indicators (Hand & Henley, 1997). In this paper, we measured the performance of the
classifiers by PCC, F‐Measure, MCC, ROC, RMSE, and Kappa statistics. The various performance indicators measure different trade‐offs in the
predictive performance, and some methods can perform well on one metric but not optimally on others (Caruana & Niculescu‐Mizil, 2006). Mostly,
the performance estimations are based on one metric and on measures of the same type (Lessmann et al., 2015). These measures can be divided
into three different categories. The first type is the measurements related to the discriminatory ability of the scorecard based on the area under
the curve and F‐Measure. The second type of the measures is metrics that estimate the accuracy of the scorecards. Brier Score is used for this
type of measure. The third group of the measures concerns the correctness of the predictions (Lessmann et al., 2015).
PCC is sometimes called accuracy. It is a very popular performance indicator in credit scoring. The equation for PCC or total accuracy is

$PCC = (TP + TN) / (TP + TN + FP + FN)$ (7)

F‐Measure combines precision and recall and is defined by the formula

$F\text{-}Measure = 2 \times (\mathrm{Precision} \times \mathrm{Recall}) / (\mathrm{Precision} + \mathrm{Recall})$ (8)

or

$F\text{-}Measure = 2\,TP / (2\,TP + FP + FN)$ (9)

MCC is considered one of the best ways to describe the confusion matrix of true and false positives and negatives using a single number (Powers, 2011). It is generally regarded as a balanced measure that can be used even if the classes are of very different sizes. The MCC (Matthews, 1975) is calculated as

$MCC = (TP \times TN - FP \times FN) / \sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}$ (10)

An ROC graph or ROC curve is a technique for visualizing, organizing, and selecting classifiers based on their performance. ROC graphs are two‐
dimensional graphs in which the true positive rate is plotted on the Y‐axis, and the false positive rate is plotted on the X‐axis (Fawcett, 2006). ROC
graphs have been used in signal detection to depict relative trade‐offs between benefits (true positives) and costs (false positives; Egan, 1975;
Swets, Dawes, & Monahan, 2000).
RMSE is the abbreviation for root mean square error. Let us assume that we have n samples of model error E calculated as (ei, i = 1,2..., n), and
let us also assume that the error sample set E is unbiased and follows a normal distribution. Using RMSE or the standard error helps to provide a
complete picture of the error distribution (Chai & Draxler, 2014). RMSE can be calculated on the data set using this formula

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} e_i^2}$ (11)

The Kappa statistic, also known as Cohen's kappa coefficient, is a measure of the interobserver variation of two or more independent observers
evaluating the same thing. It expresses the percentage of the agreement between the observers (Viera & Garrett, 2005). The equation for κ is

$\kappa = (p_0 - p_e) / (1 - p_e)$ (12)

where $p_0$ is the relative observed agreement among observers, and $p_e$ is the hypothetical probability of a chance agreement. In the literature, there are various assessments of the goodness of classifiers based on the Kappa statistic. A standard interpretation of kappa values is as follows: 0.01–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–0.99 almost perfect agreement (Viera & Garrett, 2005).
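For the binary case, the metrics above can be computed directly from confusion‐matrix counts; the counts in the sketch below are made‐up illustrative values.

```python
# A worked sketch of Equations (7), (9), (10), and (12) from illustrative
# confusion-matrix counts (TP, TN, FP, FN are made-up example values).
import math

TP, TN, FP, FN = 80, 90, 10, 20
pcc = (TP + TN) / (TP + TN + FP + FN)                       # Equation (7)
f_measure = 2 * TP / (2 * TP + FP + FN)                     # Equation (9)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))          # Equation (10)
# Cohen's kappa, Equation (12): observed versus chance agreement.
n = TP + TN + FP + FN
p0 = pcc
pe = ((TP + FP) * (TP + FN) + (TN + FN) * (TN + FP)) / n**2
kappa = (p0 - pe) / (1 - pe)
print(pcc, round(f_measure, 3), round(mcc, 3), round(kappa, 3))
# -> 0.85, 0.842, 0.703, 0.7
```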

5 | EXPERIMENTAL RESULTS

In this research, a real microcredit data set with an imbalanced structure was used. For data‐level optimization, SMOTE was employed to balance
the minority‐class sample. In the experiments, well‐known base classifiers were applied, namely, Bayesian networks, Naïve Bayes, SVM, Random
Forest, k‐NN, MLP, and C4.5, as well as the ensemble of homogeneous and heterogeneous classifiers, such as A1DE, A2DE, CTC, Rotation Forest,
Bagging, AdaBoost, MultiBoosting, Stacking, and Voting. The ensemble of ensembles is also combined and evaluated. The CTC decision tree algo-
rithm with the homogeneous and heterogeneous ensembles was used to improve the prediction performance of the “poor” minority class.

5.1 | Predictive performance of single classifiers without SMOTE

The experimental results consist of the performance measures for the Bayesian networks, Naïve Bayes, Random Forest, Rotation Forest, k‐NN,
SVM, MLP, C4.5, ensemble of classifiers, A1DE, A2DE, CTC, Bagging, MultiBoosting, AdaBoost, Voting, and Stacking algorithm in terms of
PCC, F‐Measure, MCC, ROC, Kappa statistics, and RMSE. Analyses were made with a focus on two aspects. The first aspect was based on the
average, overall performance for all classes with no evidence of how the “poor” class performs because it is a minority class, and the second aspect
related particularly to the “poor” class's performance. The average performance results of all classes using individual classifiers are shown in Table 1. Boldface is used to indicate the best scoring algorithm per measure.
Among the individual classifiers, Random Forest had the best performance on all measures with a PCC of 0.962, F‐Measure of 0.96, MCC of
0.937, ROC of 0.997, RMSE of 0.1316, and Kappa statistic of 0.9234.
The results of the best performing single classifiers on minority‐class prediction are presented in Table 2. The same data set was used, and
10‐fold cross‐validation was applied. Classification results were low, and the best performing algorithm based on accuracy was Bayesian networks
with a PCC of 0.802. Prediction performance based on all other measures except PCC was best for Random Forest: F‐Measure of 0.553, MCC of 0.546, ROC of 0.98, RMSE of 0.1316, and Cohen's kappa of 0.924.
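For illustration, the single‐classifier protocol can be sketched as follows in Python with scikit-learn (a stand-in for the tooling actually used); load_microcredit is a hypothetical placeholder, since the real data set is not public.

```python
# Minimal sketch of the single-classifier experiment: stratified 10-fold
# cross-validation with a per-class report that exposes the minority
# "poor" class alongside the overall averages.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = load_microcredit()  # hypothetical loader; y in {"good", "poor", "bad"}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
clf = RandomForestClassifier(n_estimators=100, random_state=1)

y_pred = cross_val_predict(clf, X, y, cv=cv)
print(classification_report(y, y_pred))  # per-class precision/recall/F1
```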

5.2 | Predictive performance of ensemble classifiers without SMOTE

Table 3 shows the average prediction performance of models based on techniques belonging to the family of homogeneous ensembles. In the table, base classifiers, metaclassifiers, and lists of classifiers are given in parentheses. AdaBoost and MultiBoosting with Random Forest as the base classifier gave the best results compared with the other homogeneous ensembles, better than AnDE, Bagging, and Rotation Forest.
Compared with the best average result, achieved by Random Forest, the homogeneous ensembles did not outperform the best single classifier: its accuracy of 96.2% (PCC of 0.962) was not reached by the best performing homogeneous classifier in overall prediction. For the other measures, the results of the best performing single classifier were likewise not surpassed; only the Kappa statistic was slightly better with MultiBoosting on C4.5 (0.9235 vs. 0.9234).
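To make the construction of such homogeneous ensembles concrete, a minimal sketch using scikit-learn stand-ins is shown below; CTC, Rotation Forest, and MultiBoosting have no scikit-learn implementation, so a decision tree and Random Forest are substituted purely for illustration.

```python
# Sketch of homogeneous ensembles analogous to Table 3: bagging and boosting
# wrapped around a single type of base classifier (passed positionally).
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

# Bagging over one base learner, e.g. a tree in place of CTC.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=1)
# Boosting with Random Forest as the base classifier, as in AdaBoost (RF).
boosted_rf = AdaBoostClassifier(RandomForestClassifier(n_estimators=100),
                                n_estimators=10, random_state=1)
```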
Because the aim of this study is to improve the performance of the minority‐class prediction without deterioration of the average prediction
results, classification results for the “poor” class using homogeneous classifiers are shown in Table 4.
The best performing machine learning technique for the “poor”‐class prediction accuracy was Rotation Forest with CTC as the base classifier
with an accuracy of 94.8%. It significantly outperformed the best single classifier result of 80.2% achieved by Bayesian networks. Rotation
Forest with A2DE achieved the best performance of F‐Measure, resulting in 0.585. The MCC of 0.599 with Bagging using CTC as the base
classifier was also better than the best performing single classifier. The ROC of 0.980 achieved by Rotation Forest with Random Forest as
the base classifier was the same as with Random Forest alone. The best RMSE result of 0.1318 was achieved by AdaBoost and Random Forest
as the base classifier, which was slightly worse than the best result among single classifiers. Cohen's kappa of the best homogeneous ensemble, 0.9235 with MultiBoosting on C4.5, was slightly better than the 0.9234 of the best single

TABLE 1 Average performance results for the prediction using single classifiers

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


Random Forest 0.962 0.96 0.937 0.997 0.1316 0.9234
C4.5 0.957 0.956 0.931 0.972 0.1559 0.9149
Bayes Net 0.91 0.928 0.885 0.975 0.2358 0.8322
Naïve Bayes 0.895 0.912 0.854 0.954 0.2447 0.8046
SVM 0.948 0.941 0.908 0.959 0.2971 0.8949
k‐NN 0.942 0.942 0.906 0.953 0.1956 0.8864
MLP 0.952 0.951 0.922 0.988 0.1588 0.9045

Note. MCC: Matthews Correlation Coefficient; MLP: Multilayer perceptron; PCC: Percentage Correctly Classified; RMSE: root mean square error; ROC:
receiver operating characteristic; SVM: support vector machine.

TABLE 2 Performance results for “poor”‐class prediction using single classifiers

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


Random Forest 0.483 0.553 0.546 0.98 0.1316 0.9234
C4.5 0.506 0.541 0.527 0.81 0.1559 0.9149
Bayes Net 0.802 0.417 0.445 0.947 0.2358 0.8322
Naïve Bayes 0.704 0.385 0.400 0.914 0.2447 0.8046
SVM 0.212 0.306 0.328 0.774 0.2971 0.8949
k‐NN 0.446 0.448 0.428 0.953 0.1956 0.8864
MLP 0.492 0.516 0.500 0.941 0.1588 0.9045

Note. MCC: Matthews Correlation Coefficient; MLP: Multilayer perceptron; PCC: Percentage Correctly Classified; RMSE: root mean square error; ROC:
receiver operating characteristic; SVM: support vector machine.

TABLE 3 Average classification performance results using homogeneous classifiers

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


A1DE 0.947 0.952 0.923 0.995 0.1591 0.8969
A2DE 0.953 0.956 0.932 0.996 0.1470 0.9089
AdaBoost (Random Forest)a 0.960 0.958 0.936 0.997 0.1320 0.9209
Bagging (CTC) 0.944 0.953 0.931 0.995 0.1669 0.8937
Bagging (Random Forest) 0.961 0.958 0.935 0.997 0.1318 0.9215
Bagging (Rotation Forest (C4.5))b 0.961 0.959 0.936 0.997 0.1324 0.9225
CTC 0.937 0.948 0.925 0.993 0.1797 0.8813
MultiBoosting (Bayes Net) 0.914 0.930 0.889 0.99 0.2316 0.8283
MultiBoosting (C4.5) 0.961 0.960 0.938 0.995 0.1571 0.9235
MultiBoosting (CTC) 0.946 0.954 0.932 0.995 0.1825 0.8973
MultiBoosting (Random Forest) 0.960 0.958 0.936 0.997 0.1320 0.9209
Rotation Forest (A2DE)c 0.954 0.955 0.927 0.994 0.1561 0.9096
Rotation Forest (C4.5)b 0.960 0.959 0.936 0.996 0.1354 0.9211
Rotation Forest (CTC)b 0.928 0.942 0.912 0.993 0.1841 0.8647
Rotation Forest (Naïve Bayes)b 0.909 0.924 0.876 0.964 0.2218 0.8289
Rotation Forest (Random Forest)b 0.960 0.957 0.934 0.997 0.1348 0.9202

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic.
a Base classifier is in the brackets.
b Principal Components as the Projection Filter.
c Random Subset as the Projection Filter.

classifier. Based on the presented results, we can conclude that homogeneous ensemble classifiers have better prediction performance for the minority class than any of the single classifiers.
The results for all classes using heterogeneous classifiers are presented in Table 5. The Voting ensembles were always built with the average of probabilities because that combination rule performed best. Among the heterogeneous classifiers, the best results in terms of accuracy, RMSE, and Kappa were achieved using the Stacking technique. Stacking was implemented with the ensemble of Random Forest, CTC, and A2DE algorithms, using Rotation Forest as the metaclassifier. The average classification accuracy of 96.1% was slightly poorer than the best accuracy achieved by homogeneous classifiers. The F‐Measure of 0.961 and MCC of 0.940 obtained by Voting were better than the results obtained by single and homogeneous classifiers. The best ROC result of the heterogeneous ensembles, 0.997, was the same as the best ROC result of the single classifiers and homogeneous ensembles in overall performance. The RMSE of 0.132 was poorer than the best RMSE for the single and homogeneous ensembles of classifiers. The best Cohen's kappa of 0.9216 also did not outperform the Kappa of the best single classifier (0.924) or of the homogeneous classifiers (0.9235).
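As an illustration of how the heterogeneous ensembles in Tables 5 and 6 are assembled, the sketch below builds a soft-voting ensemble (averaging predicted probabilities, as all Voting models here do) and a stacking ensemble with a separate metaclassifier; the members are scikit-learn stand-ins, since CTC and A2DE are not available there.

```python
# Sketch of heterogeneous ensembles: voting and stacking over different
# member classifiers, combined by probability averaging or a metaclassifier.
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

members = [("rf", RandomForestClassifier(n_estimators=100)),
           ("tree", DecisionTreeClassifier()),  # stand-in for CTC
           ("nb", GaussianNB())]                # stand-in for A2DE

vote = VotingClassifier(members, voting="soft")  # average of probabilities
stack = StackingClassifier(members,
                           final_estimator=RandomForestClassifier())
```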
Analyses of the heterogeneous ensemble classifiers' performance on “poor”‐class prediction are shown in Table 6. Among the heterogeneous ensembles, the best performing algorithm was the Voting ensemble that combined the Random Forest and CTC algorithms. Its accuracy of 91.3% was lower than the result achieved by the best homogeneous classifier (94.8%) but higher than the best single classifier's result of 80.2%. The best F‐Measure of 0.607 obtained with heterogeneous ensembles surpassed the single classifiers' top result of 0.553 and was also better than the best homogeneous ensemble's result of 0.585. The MCC result of 0.607 was better than the results of the single and homogeneous classifiers. The ROC of 0.982 was slightly higher than the 0.98 achieved by the single classifiers and homogeneous ensembles. The RMSE of 0.132 for the heterogeneous ensembles was slightly higher, and thus worse, than the best single and homogeneous results. The best result

TABLE 4 Classification performance for the “poor” class using homogeneous classifiers

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


A1DE 0.765 0.562 0.563 0.975 0.1591 0.8969
A2DE 0.696 0.577 0.577 0.978 0.1470 0.9089
AdaBoost (Random Forest)a 0.462 0.535 0.527 0.981 0.1320 0.9209
Bagging (CTC)a 0.925 0.570 0.599 0.978 0.1669 0.8937
Bagging (Random Forest)a 0.435 0.528 0.527 0.982 0.1318 0.9215
Bagging (Rotation Forest (C4.5)a)b 0.479 0.559 0.554 0.979 0.1324 0.9225
CTC 0.942 0.539 0.576 0.947 0.1797 0.8813
MultiBoosting (Bayes Net)a 0.825 0.435 0.465 0.946 0.2316 0.8383
MultiBoosting (C4.5)a 0.519 0.568 0.556 0.967 0.1571 0.9235
MultiBoosting (CTC)a 0.898 0.573 0.596 0.975 0.1825 0.8973
MultiBoosting (Random Forest)a 0.462 0.535 0.527 0.981 0.1320 0.9209
Rotation Forest (A2DEa)c 0.627 0.585 0.570 0.975 0.1561 0.9096
Rotation Forest (C4.5a)b 0.481 0.545 0.535 0.976 0.1354 0.9211
Rotation Forest (CTCa)b 0.948 0.509 0.552 0.973 0.1841 0.8647
Rotation Forest (Naïve Bayesa)b 0.662 0.387 0.391 0.921 0.2218 0.8289
Rotation Forest (Random Foresta)b 0.431 0.528 0.529 0.980 0.1348 0.9202

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic.
a In the brackets is the base classifier.
b Principal Components as the Projection Filter.
c Random Subset as the Projection Filter.

TABLE 5 Average classification performance using heterogeneous classifiers


Algorithm PCC F‐Measure MCC ROC RMSE Kappa
Stacking (A1DE, A2DE; C4.5a) 0.955 0.955 0.927 0.986 0.1499 0.911
Stacking (CTC, A2DE; Random Foresta) 0.957 0.956 0.937 0.995 0.1386 0.9146
Stacking (Random Forest, CTC, A2DE; Rot (C4.5)a) 0.961 0.961 0.937 0.996 0.132 0.9216
Vote (Bagging (CTC)), Rot (C4.5), AB (Random Forest)) 0.958 0.961 0.941 0.997 0.1345 0.9176
Vote (Bagging (CTC)), Rot (C4.5), Random Forest) 0.956 0.960 0.938 0.997 0.1344 0.9146
Vote (A1DE) 0.947 0.952 0.923 0.995 0.1591 0.8969
Vote (A1DE, A2DE) 0.950 0.954 0.927 0.996 0.1519 0.9025
Vote (A1DE, A2DE, BayesNet, C4.5, Random Forest) 0.963 0.959 0.987 0.978 0.1459 0.9117
Vote (A1DE, A2DE, BayesNet, CTC, Random Forest) 0.958 0.96 0.938 0.997 0.1531 0.9003
Vote (A1DE, A2DE, C4.5) 0.958 0.959 0.936 0.996 0.142 0.9173
Vote (A1DE, BayesNet) 0.92 0.935 0.897 0.994 0.1867 0.8498
Vote (A1DE, BayesNet, CTC) 0.937 0.948 0.921 0.995 0.1754 0.8808
Vote (A1DE, Random Forest) 0.957 0.960 0.937 0.997 0.1388 0.9163
Vote (A2DE, BayesNet) 0.922 0.937 0.901 0.995 0.1779 0.8538
Vote (A2DE, BayesNet, CTC) 0.940 0.95 0.926 0.995 0.17 0.8864
Vote (A2DE, CTC, Random Forest, Rot (C4.5)) 0.957 0.961 0.940 0.997 0.1357 0.9169
Vote (Random Forest, CTC) 0.946 0.954 0.933 0.997 0.1437 0.8962
Vote (Random Forest, CTC, A1DE, A2DE) 0.955 0.959 0.938 0.997 0.1421 0.9124

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic; AB: AdaBoost; Rot: Rotation Forest.
a Metaclassifier.

for Kappa statistics for the heterogeneous ensembles was 0.9216. That was slightly less than the best homogeneous ensemble at 0.9235 and the
best single classifier at 0.924.
Analysing the single classifiers, it could be noticed that the prediction performance results were correlated across the different measures. However, homogeneous and heterogeneous ensembles behaved differently: those with good performance on one measure usually did not have the best performance on others, so it was not easy to decide which method was best. Heterogeneous ensembles had better performance

TABLE 6 Average classification performance results for the “poor” class using heterogeneous classifiers

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


Stacking (A1DE, A2DE; C4.5a) 0.537 0.558 0.543 0.930 0.1499 0.911
Stacking (CTC, A2DE; Random Foresta) 0.498 0.537 0.523 0.969 0.1386 0.9146
Stacking (Random Forest, CTC, A2DE; Rot (C4.5)a) 0.498 0.549 0.537 0.977 0.132 0.9216
Vote (Bagging (CTC)),Rot (C4.5), AB (Random Forest)) 0.763 0.607 0.603 0.983 0.1345 0.9176
Vote (Bagging (CTC)), Rot (C4.5), Random Forest) 0.754 0.598 0.594 0.983 0.1344 0.9146
Vote (A1DE) 0.765 0.562 0.563 0.975 0.1591 0.8969
Vote (A2DE, CTC, Random Forest, Rot (C4.5)) 0.744 0.597 0.607 0.982 0.1357 0.9169
Vote (Random Forest, CTC) 0.913 0.570 0.596 0.982 0.1437 0.8962
Vote (Random Forest, CTC, A1DE, A2DE) 0.788 0.595 0.596 0.981 0.1421 0.9124

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic; Rot: Rotation Forest; AB: AdaBoost.
a Metaclassifier.

for F‐Measure, MCC, and ROC and poorer PCC, RMSE, and Kappa statistics. Nevertheless, we can conclude that, generally, the homogeneous
ensemble classifiers performed better than the heterogeneous ensemble classifiers.
The four best performing algorithms, considering the results of the “poor” category prediction with regard to all measures, are shown in Table 7. To make a final decision and choose the best algorithm, we also had to make some trade‐offs, choosing algorithms with the best accuracy and lower performance on other measures. Random Forest, Bagging, and Stacking had the best results for overall performance, but their other results were not significantly better. The CTC algorithm undoubtedly improved the prediction performance of the “poor” minority class.
To make the final decision, the overall performances of the same four algorithms were analysed again. These algorithms were not the top three in overall performance, but this perspective contributed to the final decision (Table 8). A dilemma still remained: Voting had the poorest minority‐class accuracy of the top four algorithms but the best other indicators, and it achieved the best results in overall performance for all measures. Nevertheless, taking all pros and cons into account, we chose the CTC algorithm as the best algorithm for the analysed data set before applying data oversampling.

5.3 | Solving the minority class problem with SMOTE

All of the techniques showed satisfying “overall” results because misclassification of the minority class instances cannot affect them much. By “overall,” we mean the average results over all categories. Among the applied algorithms, the CTC decision tree algorithm achieved the best prediction accuracy for the “poor” class, but considering the other measures, it was not the best. To mitigate the imbalance in the microcredit data set, we applied the SMOTE technique to the best performing algorithms and to the individual classifiers. The minority class was first oversampled at 100% of its original size.
Furthermore, the SMOTE technique was applied again, oversampling the same data set at a rate of 200%, to additionally improve the prediction of the “poor” class and the “overall” performance. To that data set, we applied the same algorithms: the single classifiers and the best performing homogeneous and heterogeneous classifiers. Applying SMOTE at a rate of 200% resulted in a degradation of the Naïve Bayes and Bayes Net performances; Naïve Bayes performed better without SMOTE. However, the SMOTE technique boosted performance when used with the homogeneous and heterogeneous ensembles. The average results of the algorithms to which SMOTE was applied are shown in Table 9. The boldface is again used to indicate the best scoring algorithm per measure. The best technique in overall performance was Random Forest after applying a SMOTE rate of 200% to the data set.
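For reproducibility, the oversampling step can be sketched as follows, assuming the rates quoted here follow the usual SMOTE convention in which 100% doubles and 200% triples the minority class; the imbalanced-learn library expresses this as target counts, and the label "poor" and helper name are illustrative.

```python
# Sketch of SMOTE at a percentage rate: 100% doubles the minority class,
# 200% triples it, by generating synthetic nearest-neighbour samples.
from collections import Counter
from imblearn.over_sampling import SMOTE

def smote_percent(X, y, minority="poor", percent=200, k=5):
    n_min = Counter(y)[minority]
    target = n_min + int(n_min * percent / 100)  # e.g. 200% -> 3x original
    sampler = SMOTE(sampling_strategy={minority: target}, k_neighbors=k,
                    random_state=1)
    return sampler.fit_resample(X, y)

X_res, y_res = smote_percent(X, y, percent=200)  # X, y from the earlier sketch
```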

TABLE 7 Best performed algorithms regarding the “poor” minority class


Algorithm PCC F‐Measure MCC ROC RMSE Kappa

Vote (Random Forest, CTC) 0.913 0.570 0.596 0.982 0.1437 0.8962
CTC 0.942 0.539 0.576 0.947 0.1797 0.8813
Rotation Forest (CTC) 0.948 0.509 0.552 0.973 0.1841 0.8647
Bagging (CTC) 0.925 0.570 0.599 0.978 0.1669 0.8937

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic.

TABLE 8 Best performing algorithms

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


Vote (Random Forest, CTC) 0.946 0.954 0.933 0.997 0.1437 0.8962
CTC 0.937 0.948 0.925 0.993 0.1797 0.8813
Rotation Forest (CTC) 0.928 0.942 0.912 0.993 0.1841 0.8647
Bagging (CTC) 0.944 0.953 0.931 0.995 0.1669 0.8937

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic.

TABLE 9 Overall performance results with SMOTE applied


Algorithm PCC F‐Measure MCC ROC RMSE Kappa

Bagging (CTC) + SMOTE 100% 0.952 0.956 0.935 0.995 0.1556 0.9126
Bagging (CTC) + SMOTE 200% 0.960 0.960 0.943 0.996 0.1363 0.9313
Bayes Net + SMOTE 100% 0.921 0.927 0.889 0.986 0.2114 0.8564
Bayes Net + SMOTE 200% 0.934 0.933 0.896 0.992 0.1855 0.8846
CTC + SMOTE 100% 0.949 0.954 0.934 0.992 0.1645 0.9081
CTC + SMOTE 200% 0.957 0.958 0.939 0.990 0.1544 0.9263
Naïve Bayes + SMOTE 100% 0.858 0.868 0.781 0.936 0.2796 0.7464
Naïve Bayes + SMOTE 200% 0.851 0.858 0.819 0.934 0.2894 0.7532
Random Forest + SMOTE 100% 0.964 0.964 0.944 0.997 0.1309 0.9335
Random Forest + SMOTE 200% 0.966 0.966 0.948 0.998 0.1261 0.9414
Rotation Forest (CTC) + SMOTE 100% 0.943 0.948 0.923 0.994 0.1678 0.8958
Rotation Forest (CTC) + SMOTE 200% 0.954 0.956 0.935 0.996 0.1467 0.9225
Vote (Random Forest, CTC) + SMOTE 100% 0.953 0.957 0.938 0.997 0.1386 0.9146
Vote (Random Forest, CTC) + SMOTE 200% 0.957 0.958 0.940 0.998 0.1326 0.9276

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.

The results of the “poor” category prediction with SMOTE applied to the same data set using the same algorithms are presented in Table 10.
The best accuracy result in the “poor” class prediction of 96.1% was achieved by Rotation Forest with CTC as the base classifier and SMOTE with
the rate of 200%.
Table 11 shows the two best algorithms regarding the “overall” and the “poor” category prediction accuracy. When SMOTE at 200% was applied, the CTC performance regarding accuracy deteriorated, but all the other measures improved. The Random Forest algorithm with SMOTE at 200% achieved the best results for all measures except accuracy: an accuracy of 90.1%, F‐Measure of 0.903, MCC of 0.888, ROC of 0.995, RMSE of 0.1261, and Kappa of 0.9414.

TABLE 10 Performance of the “poor” class prediction using SMOTE


Algorithm PCC F‐Measure MCC ROC RMSE Kappa

Bagging (CTC) + SMOTE 100% 0.935 0.753 0.748 0.983 0.1556 0.9126
Bagging (CTC) + SMOTE 200% 0.940 0.877 0.860 0.994 0.1363 0.9313
Bayes Net + SMOTE 100% 0.847 0.638 0.627 0.974 0.2114 0.8564
Bayes Net + SMOTE 200% 0.798 0.838 0.816 0.988 0.1855 0.8846
CTC + SMOTE 100% 0.954 0.742 0.740 0.966 0.1645 0.9081
CTC + SMOTE 200% 0.953 0.869 0.851 0.972 0.1544 0.9263
Naïve Bayes + SMOTE 100% 0.678 0.501 0.473 0.936 0.2796 0.7464
Naïve Bayes + SMOTE 200% 0.770 0.656 0.604 0.925 0.2894 0.7532
Random Forest + SMOTE 100% 0.792 0.807 0.793 0.991 0.1309 0.9335
Random Forest + SMOTE 200% 0.901 0.903 0.888 0.995 0.1261 0.9414
Rotation Forest (CTC) + SMOTE 100% 0.951 0.721 0.720 0.984 0.1678 0.8958
Rotation Forest (CTC) + SMOTE 200% 0.961 0.865 0.848 0.993 0.1467 0.9225
Vote (Random Forest, CTC) + SMOTE 100% 0.941 0.754 0.749 0.991 0.1386 0.9146
Vote (Random Forest, CTC) + SMOTE 200% 0.950 0.870 0.853 0.996 0.1326 0.9276

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.

TABLE 11 The best algorithms regarding the “overall” and the “poor” category predictive performance

Algorithm PCC F‐Measure MCC ROC RMSE Kappa


Random Forest + SMOTE 200% overall 0.966 0.966 0.948 0.998 0.1261 0.9414
Random Forest + SMOTE 200% poor category 0.901 0.903 0.888 0.995 0.1261 0.9414
Rot Forest (CTC) + SMOTE 200% overall 0.954 0.956 0.935 0.996 0.1467 0.9225
Rot Forest (CTC) + SMOTE 200% poor category 0.961 0.865 0.848 0.993 0.1467 0.9225

Note. CTC: Consolidated Trees Construction; MCC: Matthews Correlation Coefficient; PCC: Percentage Correctly Classified; RMSE: root mean square
error; ROC: receiver operating characteristic; SMOTE: synthetic minority oversampling technique.

5.4 | Discussion

In this study, models built using state‐of‐the‐art algorithms were benchmarked against the classical techniques. The empirical results demonstrated that the homogeneous and heterogeneous ensemble classifiers achieved better prediction performance than the models proposed in previous research in the microfinance credit scoring field. However, before introducing the CTC algorithm into the ensemble classifiers, we encountered limitations in the ability to distinguish the minority class in the highly imbalanced data set. By building ensembles of ensembles with the CTC algorithm as the base classifier, high prediction accuracy for the minority class was achieved, but other measures, such as MCC and F‐Measure, indicated that our model still had prediction limitations before applying SMOTE.
The crucial point in resolving this issue was applying the SMOTE technique in the preprocessing of the imbalanced microcredit data set. SMOTE combined with ensembles of algorithms using CTC empowered the minority class prediction significantly while retaining superior results in overall prediction, taking into account all of the measures considered in this research.

5.4.1 | Theoretical implications

The aim of this paper is to find a credit scoring model, using a microfinance data set and novel algorithms, that improves credit risk assessment in the microfinance context. Moreover, this paper differs from previous studies in the high quality of the data and in the preprocessing using SMOTE, which is one of the crucial steps for future improvements in this field. Due to the risk of overfitting the model by using SMOTE, 10‐fold cross‐validation was used in this research to keep the training and testing data sets balanced. Furthermore, we took particular care that test data were not repeated in the training set, so that the system's decisions could be made and measured on unseen data instances. In addition, the ensemble classifiers are very effective in preventing overfitting because ensembling averages over models. Although prediction performance can yield different results using different indicators (Lessmann et al., 2015), the empirical evidence in this study demonstrates superior results, considering various metrics, for the assessment of the microcredit scoring model's effectiveness in overall prediction.
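A minimal sketch of this leakage-safe protocol, assuming scikit-learn and imbalanced-learn rather than the tooling actually used, places SMOTE inside a pipeline so it is fitted on the training folds only and the held-out fold never contains synthetic instances:

```python
# SMOTE applied only within training folds: the pipeline re-fits the sampler
# on each fold, so evaluation is always on untouched, unseen instances.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = Pipeline([("smote", SMOTE(random_state=1)),
                  ("clf", RandomForestClassifier(n_estimators=100))])
scores = cross_val_score(model, X, y,  # X, y as in the earlier sketches
                         cv=StratifiedKFold(n_splits=10, shuffle=True,
                                            random_state=1),
                         scoring="accuracy")
```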
Before applying the SMOTE technique, the CTC algorithm outperformed all other algorithms regarding accuracy for the “poor” category prediction. The other measures also reached a satisfactory level, so CTC can be considered the best technique for the imbalanced microcredit three‐class data set, with an accuracy of 94.8%. This is in line with earlier findings (Ibarguren et al., 2015), where CTC combined with SMOTE was found to be one of the best techniques designed to tackle class imbalance. However, the results did not differ much from the best performing algorithm's overall performance (Voting, based on the Random Forest and CTC decision tree algorithms, with a PCC of 96.1%). Complex models based on homogeneous and heterogeneous ensembles did not perform as expected before introducing CTC as the base classifier; the accuracy of the Voting algorithm was lower than the accuracy of the single best algorithm in that ensemble. In this study, CTC plays a crucial role because of the focus on improving imbalanced class prediction performance.
PCC is a very popular measure in credit scoring, but because we are dealing with an imbalanced data set, this measure alone is not sufficient and may be misleading in our experiment. Therefore, we introduced additional metrics to gain a complete view of model performance and to validate applicability. A total of six different measures were used: PCC (sometimes called accuracy), F‐Measure, MCC, ROC, RMSE, and the Kappa statistic. A credit scoring model evaluated this way supports the credibility of our claims that using SMOTE in data preprocessing, together with models built from ensembles of algorithms with CTC, is applicable in credit scoring for microfinance. MCC is a relevant measure for the data set analysed because it is least influenced by imbalance, and it is useful for interpreting the confusion matrix of an imbalanced data set. It is a correlation coefficient between the observed and predicted classifications, ranging from −1 to 1. However, MCC has so far rarely been used in credit scoring system evaluations.
The results shown in Table 4 demonstrate a significant improvement in minority class prediction, with a PCC of 94.8%, after introducing the CTC technique as the base algorithm in the homogeneous ensemble classifiers. As a comparison, the accuracy of minority class prediction was 21.2% for the SVM algorithm and 44.6% for k‐NN. Nevertheless, the MCC of 55.2% and F‐Measure of 50.9% using CTC still indicated model limitations, and the heterogeneous ensembles did not perform better than the homogeneous ones for “poor” class prediction either. We have shown how to overcome this limitation by introducing SMOTE.
Table 11 summarizes the results for the best performing algorithms after applying SMOTE, where all of the analysed measures, of differing natures, showed exceptional results. Rotation Forest with CTC as the base classifier and the SMOTE technique applied twice (a rate of 200%) on the data set in the

preprocessing had the highest accuracy, 96.1%, in minority class prediction, with an F‐Measure of 86.5%, MCC of 84.8%, ROC of 99.6%, RMSE of 0.1467, and Kappa of 92.25%.
The F‐Measure does not consider true negatives; it performs well only on balanced data and expresses the balance between precision and sensitivity. RMSE is a performance measure mostly used for regression problems. The Kappa statistic uses the probability of random agreement, and it is good for estimating algorithm performance from a different angle. Kappa is seldom comparable across studies, but it is a very reliable metric when classes are imbalanced (Feinstein & Cicchetti, 1990; Thompson & Walter, 1988).
Another argument for choosing CTC is that a model built on CTC is less complicated and faster to generate than models based on homogeneous and heterogeneous ensembles. Other algorithms, that is, homogeneous and heterogeneous ensembles such as Rotation Forest, Voting, and Bagging with CTC as the base algorithm, also show significant improvement in “poor” class prediction compared with the single classifiers.
The application of SMOTE once (100%) and twice (200%) to the minority data set improved the other measures as well. The results are unexpected in that the single classifier Random Forest remarkably improved the overall prediction on all measures. Thus, Random Forest cannot be underestimated even when used in microcredit scoring that comprises retail and business attributes. In “poor” class prediction, all measures except PCC show that it was the best algorithm, better even than any homogeneous or heterogeneous ensemble.
Similar behaviour and results were presented in studies related to retail credit scoring in banking. More recent ensembles, such as stochastic
gradient boosting or Rotation Forest, did not exhibit better performance compared with Bagging or Random Forest. This indicates that further
progress, beyond Bagging or Random Forest, is not easy to achieve (Lessmann et al., 2015). However, Rotation Forest creates more accurate
classifiers than AdaBoost regarding kappa‐error diagrams (Kuncheva & Rodríguez, 2007).
However, the highest accuracy (PCC) of 96.1% for the prediction of the minority class was achieved using Rotation Forest with CTC and the SMOTE technique at 200%, followed by the 95.3% achieved by the CTC algorithm with SMOTE at a rate of 100%. The Voting algorithm combining CTC and Random Forest, together with SMOTE, also gave good results for all measures, in which respect we agree with Ibarguren et al. (2015).

5.4.2 | Managerial implications

The important research question of this paper is the applicability of state‐of‐the‐art algorithms to microcredit scoring models. Credit scoring in microfinance is generally underestimated compared with credit scoring in banking. Previous research and empirical evidence in this field did not achieve satisfactory results in the discriminatory power of the proposed models, mostly due to the lack of good benchmarking data sets and to constraints in system modelling. However, the financial instability of clients and a higher risk of repayment failure for microcredits mean that the need for such models is even more important than in banking. Such models can significantly improve the quality of managerial decisions, although the higher risk of default events is partially compensated by interest rates that are usually three times higher than in banking. Microcredit is specific in that it is intended for a poor population that is usually refused by banks and lacks a credit history and stable income. Customers are prone to changing their business type, and failure is not necessarily linked to the customers' characteristics; it also depends on local and global economic circumstances. Thus, microcredit has a complex and dynamic nature, and this additionally complicates risk management and the prediction of repayment. Loans are usually given to small start‐ups and can be important for economic development and poverty reduction in the region. Therefore, microcredit scoring models are closer to models for small enterprises than to retail models. We have proposed a hybrid model because the data set that we created is a mixture of data instances and parameters of both retail and small enterprises. All of these facts contribute to the complexity of credit scoring systems for microfinance and complicate risk management and the prediction of repayment.
It is a managerial decision whether to invest in IT system improvements, such as business intelligence, to easily access the benchmark data needed for reliable credit scoring systems similar to the ones used in banks. Although the microfinance sector has recognized the necessity of such credit scoring models in risk management, they are mostly used only as a refinement tool and have not yet replaced traditional risk management processes. In spite of the additional aggravating circumstance that we built a three‐class instead of the common two‐class model, with a highly imbalanced data set, the empirical results confirm the applicability of the credit scoring models proposed in this paper for microcredit loans.
The remaining dilemma is which model, among the more than 50 trained and evaluated models, to select as the most adequate, because the estimation could be based either on the overall model prediction capabilities or on the performance in “poor” class prediction. Despite the fact that the “poor” class prediction was the most demanding challenge in this research, the “poor” class itself is not as relevant for managers and credit officers as the “bad” class, which can result in financial losses. The “poor” class can even float between the “good” and “bad” categories during the whole repayment period without significantly affecting the management of the microcredit institution, except for the obligation to increase the reserves for that loan under financial regulations. However, the introduction of the credit scoring models proposed in this study, based on any of the criteria mentioned above, could increase managerial effectiveness in terms of more effective measurement of the riskiness of customers.

6 | CONCLUSIONS AND RECOMMENDATIONS

In this study, a total of 16 distinct algorithms and 57 different decision‐support models were proposed for the prediction of microcredit behaviour. To the best of the authors' knowledge, such a detailed scientific study has not been presented previously in the microcredit context. There is generally an evident lack of scientific studies on SMEs in standard credit scoring, and microenterprises and small enterprises in microcredit scoring are not covered at all. A very limited number of machine learning algorithms related to retail have been applied to microcredit scoring, mostly limited to regression and genetic algorithms. Here, state‐of‐the‐art machine learning algorithms with a large number of variants have been applied to a real‐life three‐class data set. The predictive performances of single classifiers, homogeneous and heterogeneous ensembles, and ensembles of ensembles have been compared and measured by PCC, F‐Measure, MCC, ROC, RMSE, and Kappa statistics, because average accuracy can be a misleading metric on an imbalanced data set. Trained with SMOTE, the proposed models outperformed all of the existing models used and compared in previous publications related to microcredit scoring. The introduction of such models for decision making in credit risk evaluation and assessment is recommended because it improves the process, reduces errors, and can bring financial benefits through the prediction of default events and repayment failures and by cutting the time spent on decision making.
This work also closes the gap regarding the lack of unique and comprehensive microcredit data sets. Success in achieving good predictive performance is even more important if we take into account that microenterprises and small enterprises are financially weak and prone to financial fluctuations and bankruptcies. In this study, even the prediction of the unstable minority class was very successful.
One limitation of this study could be the fact that enterprise instances are a minority in the microcredit data set. Because this can pose a generalizability threat, the experiment should be repeated with a larger number of enterprise instances to validate the predictive performance. Due to the specific nature of microcredit loans and clients, retail customers and microenterprises can be treated identically, because most microenterprises have one or no employees. Nevertheless, the data set used in this study includes the relevant attributes of both retail and enterprise entities, which is another reason for the very good results in the prediction of credit behaviour in microfinance, because it is not confined to one type of indicator. A second limitation that has to be considered is the fact that real data sets usually vary in practice, and it is evidently impossible to establish and define a universal learner; accurate models have to be adjusted to the structure of the domain (Eibe Frank, 2010). Another issue is the complexity of credit scoring systems for microfinance. Choosing the most appropriate combination of machine learning algorithms for a specific data set requires a lot of effort, as does the expert selection of relevant variables for the data sets and the application of adequate preprocessing techniques and data cleaning, the most tedious task but, on the other hand, the most important one for model training.
Training some models can be time‐consuming, but the CTC algorithm on which our best performing models are built is not complicated, and it is faster than most of the other models based on homogeneous and heterogeneous ensembles that had similar performance. Additionally, some machine learning algorithms, such as Random Forest, neural networks, or kernel methods, are very difficult to interpret. There is also a general problem in credit scoring data sets regarding refused customers, who are not considered in the analysis and training because they were not given a loan, and therefore information about their behaviour is not available. The system learns from examples of the behaviour of customers who are currently paying off, or have already paid off, credits at the particular institution.
The credit scoring models in this research successfully comprehended three groups of essentially different types of factors: financial, nonfinancial, and behavioural. A suggestion for future work would be to consider additional groups of indicators in order to generalize the use of such models; it would be necessary to introduce balance sheets for small enterprises and macroeconomic indicators. Building credit scoring models that are able to learn independently as they are exposed to new data should also be considered in the future; a model built in this way using machine learning techniques would be automated and more accurate. Another important direction would be the creation of models based on the new accounting standard “IFRS 9 Financial Instruments,” which was adopted by banks in January 2018 but is still not mandatory for the microfinance sector. The probability of default, loss given default, exposure at default, and expected loss models used for credit risk could be improved by applying machine learning.

ACKNOWLEDGEMENT
This research has been supported by Info Studio d.o.o, Sarajevo, Bosnia and Herzegovina.

CONFLICT OF INTEREST


The authors declare that they have no conflicts of interest.

ORCID

Abdulhamit Subasi https://orcid.org/0000-0001-7630-4084

REFERENCES
Ala'raj, M., & Abbod, M. F. (2016). A new hybrid ensemble credit scoring model based on classifiers consensus system approach. Expert Systems with Appli-
cations, 64, 36–55. https://doi.org/10.1016/j.eswa.2016.07.017

Aličković, E., & Subasi, A. (2015). Breast cancer diagnosis using GA feature selection and Rotation Forest. Neural Computing and Applications, 1–11.
Baesens, B. V. (2003). Benchmarking state‐of‐the‐art classification algorithms for credit scoring. Journal of the Operational Research Society, 54(6), 627–635.
https://doi.org/10.1057/palgrave.jors.2601545
Barandela, R., Sánchez, J. S., Garcıa, V., & Rangel, E. (2003). Strategies for learning in class imbalance problems. Pattern Recognition, 36(3), 849–851. https://
doi.org/10.1016/S0031‐3203(02)00257‐1
Batista, G. E., Prati, R. C., & Monard, M. C. (2004). A study of the behavior of several methods for balancing machine learning training data. ACM Sigkdd
Explorations Newsletter, 6(1), 20–29. https://doi.org/10.1145/1007730.1007735
Baum, E. B. (1998). On the capabilities of multilayer perceptrons. Journal of Complexity, 4, 193–215.
Bee Wah Yap, S. H. (2011). Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Systems with Applications, 38,
13274–13283. https://doi.org/10.1016/j.eswa.2011.04.147
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford, UK: Clarendon Press.
Blagus, R. (2013). SMOTE for high‐dimensional class‐imbalanced data. BMC Bioinformatics, 14(1), 106. https://doi.org/10.1186/1471‐2105‐14‐106
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324
Caruana, R., & Niculescu‐Mizil, A. (2006). An empirical comparison of supervised learning algorithms. In W. W. Cohen, & A. Moore (Eds.), Proc. of the 23rd
Intern. Conf. on Machine Learning (pp. 161–168). ACM: Pittsburgh, Pennsylvania, USA.
Chai, T., & Draxler, R. R. (2014). Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature.
Geoscientific Model Development, 7, 1247–1250. https://doi.org/10.5194/gmd‐7‐1247‐2014
Chawla, N. V., Bowyer, K. W., & Kegelmeyer, L. O. (2002). SMOTE: Synthetic minority over‐sampling technique. The Journal of Artificial Intelligence Research,
16, 321–357. https://doi.org/10.1613/jair.953
Chawla, N. V., Cieslak, D. A., Hall, L. O., & Joshi, A. (2008). Automatically countering imbalance and its empirical relationship to cost. Data Mining and
Knowledge Discovery, 17(2), 225–252. https://doi.org/10.1007/s10618‐008‐0087‐0
Chuang, C. L., & Huang, S. T. (2011). A hybrid neural network approach for credit scoring. Expert Systems, 28(2), 185–196. https://doi.org/10.1111/j.1468‐
0394.2010.00565.x
Ciampi, F. (2015). Corporate governance characteristics and default prediction modeling for small enterprises. An empirical analysis of Italian firms. Journal
of Business Research, 68, 1012–1025. https://doi.org/10.1016/j.jbusres.2014.10.003
Dahiya, S., Handa, S. S., & Singh, N. P. (2017). A feature selection enabled hybrid‐bagging algorithm for credit risk evaluation. Expert Systems, 34. https://doi.
org/10.1111/exsy.12217
Díez‐Pastor, J. F., Rodríguez, J. J., García‐Osorio, C. I., & Kuncheva, L. I. (2015). Diversity techniques improve the performance of the best imbalance learn-
ing ensembles. Information Sciences, 325, 98–117. https://doi.org/10.1016/j.ins.2015.07.025
Dinh, T. A. (2007). A credit scoring model for Vietnam's retail banking market. International Review of Review of Finacial Analysis, 16, 471–495. https://doi.
org/10.1016/j.irfa.2007.06.001
Drummond, C., & Holte, R. C. (2003). C4.5, class imbalance, and cost sensitivity: Why under‐sampling beats over‐sampling. Presented at the Workshop on learning from imbalanced datasets II. 11. Citeseer.
Duan, L., & Binbasioglu, M. (2017). An ensemble framework for community detection. Journal of Industrial Information Integration, 5(1–5), 1–5. https://doi.
org/10.1016/j.jii.2017.01.001
Duan, L., & Da Xu, L. (2012). Business intelligence for enterprise systems: A survey. IEEE Transactions on Industrial Informatics, 8(3), 679–687. https://doi.
org/10.1109/TII.2012.2188804
Džeroski, S., & Ženko, B. (2004). Is combining classifiers with stacking better than selecting the best one? Machine Learning, 54, 255–273. https://doi.org/
10.1023/B:MACH.0000015881.36452.6e
Edelman, D. A. (2002). Credit scoring and its applications. Philadelphia, Pa: Society for Industrial Mathematics.
Egan, J. (1975). Signal detection theory and ROC analysis, series in cognition and perception. New York: Academic Press.
Eibe Frank, M. H. (2010). Weka: A machine learning workbench for data mining. In Data mining and knowledge discovery handbook (pp. 1269–1277). Berlin:
Springer.
Estabrooks, A., Jo, T., & Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence, 20(1), 18–36.
https://doi.org/10.1111/j.0824‐7935.2004.t01‐1‐00228.x
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters ‐ Special Issue: ROC Analysis in Pattern Recognition Archive, 27(8), 861–874.
https://doi.org/10.1016/j.patrec.2005.10.010
Feinstein, A. R., & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology, 43(6), 543–549.
https://doi.org/10.1016/0895‐4356(90)90158‐L
Friedman, N., Linial, M., Nachman, I., & Pe'er, D. (2000). Using Bayesian networks to analyze expression data. Journal of Computational Biology, 7(3/4),
601–620.
Gao, M., Hong, X., Chen, S., & Harris, C. J. (2011). A combined SMOTE and PSO based RBF classifier for two‐class imbalanced problems. Eurocomputing,
74(11), 3456–3466. https://doi.org/10.1016/j.neucom.2011.06.010
Gardner, M. W., & Dorling, S. R. (1998). Artificial neural networks (the multilayer perceptron)—A review of applications in the atmospheric sciences.
Atmospheric Environment, 32(14–15), 2627–2636. https://doi.org/10.1016/S1352‐2310(97)00447‐0
Goovaerts, A. S. (1989). A credit scoring model for personal loans. Insurance: Mathematics & Economics, 8(1), 31–34.
Gordini, N. (2014). A genetic algorithm approach for SMEs bankruptcy prediction: Empirical evidence from Italy. Expert Systems with Applications, 41,
6433–6445. https://doi.org/10.1016/j.eswa.2014.04.026
Gorzałczany, M. B., & Rudziński, F. (2016). A multi‐objective genetic optimization for fast, fuzzy rule‐based credit classification with balanced accuracy and
interpretability. Applied Soft Computing, 40, 206–220. https://doi.org/10.1016/j.asoc.2015.11.037

Ha, T. M., & Bunke, H. (1997). Off‐line, Handwritten Numeral Recognition by Perturbation. IEEE Transactions on Pattern Analysis and Machine Intelligence,
19(5), 535–539. https://doi.org/10.1109/34.589216
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Morgan Kaufmann, Elsevier, San Francisco, CA 94111, USA.
Hand, D. J., & Henley, W. E. (1997). Statistical Classification Methods in Consumer Credit Scoring: a Review. Journal of the Royal Statistical Society: Series A
(Statistics in Society), 160(3), 523–541.
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. https://doi.
org/10.1016/0893‐6080(89)90020‐8
Hsieh, N.‐C., & Hung, L.‐P. (2010). A data driven ensemble classifier for credit scoring analysis. Expert Systems with Applications, 37(1), 534–545. https://doi.
org/10.1016/j.eswa.2009.05.059
Huang, C.‐L., Chen, M.‐C., & Wang, C.‐J. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with
Applications, 33(4), 847–856. https://doi.org/10.1016/j.eswa.2006.07.007
Huang, Q., & Zhang, X. (2017). An improved ensemble learning method with SMOTE for protein interaction hot spots prediction. Bioinformatics and
Biomedicine (BIBM), 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
Huang, Z., Chen, H., Hsu, C. J., Chen, W. H., & Wu, S. (2004). Credit rating analysis with support vector machines and neural networks: A market
comparative study. Decision Support Systems, 37(4), 543–558. https://doi.org/10.1016/S0167‐9236(03)00086‐1
Hung, C., & Chen, J.‐H. (2009). A selective ensemble based on expected probabilities for bankruptcy prediction. Expert Systems with Applications, 36(3),
5297–5303. https://doi.org/10.1016/j.eswa.2008.06.068
Ibarguren, I., Pérez, J. M., Muguerza, J., Gurrutxaga, I., & Arbelaitz, O. (2015). Coverage‐based resampling: Building robust consolidated decision trees.
Knowledge‐Based Systems, 79, May 2015, 51–67. https://doi.org/10.1016/j.knosys.2014.12.023
Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449. https://doi.org/10.3233/
IDA‐2002‐6504
Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. IEEE transactions on pattern analysis and machine intelligence, 20(3), 226–239.
Koutanaei, F. N., Sajedi, H., & Khanbabaei, M. (2015). A hybrid data mining model of feature selection algorithms and ensemble learning classifiers for credit
scoring. Journal of Retailing and Consumer Services, 27, 11–23. https://doi.org/10.1016/j.jretconser.2015.07.003
Kumar, P., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques—A review. European Journal of Operational
Research, 180(1), 1–28. https://doi.org/10.1016/j.ejor.2006.08.043
Kuncheva, L. I., & Rodríguez, J. J. (2007). An Experimental Study on Rotation Forest Ensembles. In M. Haindl, J. Kittler, & F. Roli (Eds.), Multiple Classifier
Systems. MCS 2007. Lecture Notes in Computer Science (Vol 4472) (pp. 459–468). Berlin, Heidelberg: Springer.
Lee, T.‐S., Lu, C.‐J., Chen, I.‐F., & Chiu, C.‐C. (2002). Credit‐scoring using the hybrid neural discriminant technique. Expert Systems with Applications, 23(3),
245–254. https://doi.org/10.1016/S0957‐4174(02)00044‐1
Lee, W., Jun, C. H., & Lee, J. S. (2017). Instance categorization by support vector machines to adjust weights in AdaBoost for imbalanced data classification.
Information Sciences, 381, 92–103. https://doi.org/10.1016/j.ins.2016.11.014
Lessmann, S., Baesens, B., Seow, H.‐V., & Thomas, L. C. (2015). Benchmarking state‐of‐the‐art classification algorithms for credit scoring. European Journal
of Operational Research, 247(1), 124–136. https://doi.org/10.1016/j.ejor.2015.05.030
Li, X., & Zhong, Y. (2012). An overview of personal credit scoring: Techniques and future work. International Journal of Intelligence Science, 2, 181–189.
https://doi.org/10.4236/ijis.2012.224024
Lin, S. M. (2012). Predicting default of a small business using different definitions of financial distress. Journal of the Operational Research Society, 63,
539–548. https://doi.org/10.1057/jors.2011.65
Lin, W.‐Y., Hu, Y.‐H., & Tsai, C.‐F. (2012). Machine learning in financial crisis prediction: A survey. IEEE Transactions on Systems Man and Cybernetics Part C
(Applications and Reviews), 42(4), 421–436. https://doi.org/10.1109/TSMCC.2011.2170420
Makowski, P. (1985). Credit scoring branches out. Credit World, 75(1), 30–37.
Marqués, A. I., García, V., & Sánchez, J. S. (2013). A literature review on the application of evolutionary computing to credit scoring. Journal of the
Operational Research Society, 64(9), 1384–1399. https://doi.org/10.1057/jors.2012.145
Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, 405(2),
442–451. https://doi.org/10.1016/0005‐2795(75)90109‐9
Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with
Applications, 36(2, Part 2), 3028–3033. https://doi.org/10.1016/j.eswa.2008.01.018
Napierala, K., & Stefanowski, J. (2016). Types of minority class examples and their influence on learning classifiers from imbalanced data. Journal of
Intelligent Information Systems, 46(3), 563–597. https://doi.org/10.1007/s10844‐015‐0368‐1
Neapolitan, R. E. (2003). Learning Bayesian networks. Illinois: Prentice‐Hall, Inc. Upper Saddle River, NJ, USA ©2003.
Partalas, I., Tsoumakas, G., & Vlahavas, I. (2010). An ensemble uncertainty aware measure for directed hill climbing ensemble pruning. Machine Learning,
81(3), 257–282. https://doi.org/10.1007/s10994‐010‐5172‐0
Pérez, J. M., Muguerza, J., Arbelaitz, O., & Gurrutxaga, I. (2004). A new algorithm to build consolidated trees: Study of the error rate and steadiness. In M.A.
Kłopotek, S.T. Wierzchoń, & K. Trojanowski (Eds.), Intelligent Information Processing and Web Mining. Advances in Soft Computing (Vol 25). Berlin, Heidel-
berg: Springer.
Pérez, J. M., Muguerza, J., Arbelaitz, O., Gurrutxaga, I., & Martín, J. I. (2007). Combining multiple class distribution modified subsamples in a single tree.
Pattern Recognition Letters, 28(4), 414–422. https://doi.org/10.1016/j.patrec.2006.08.013
Polikar, R. (2006). Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3), 21–45. https://doi.org/10.1109/
MCAS.2006.1688199

Powers, D. M. (2011). Evaluation: From precision, recall and F‐measure to ROC, informedness, markedness & correlation. Journal of Machine Learning
Technologies, 2(1), 37–63.
Provost, F. (2000). Machine learning from imbalanced data sets. Presented at the Proceedings of the AAAI'2000 workshop on imbalanced data sets. 101,
1–3.
Psillaki, M., Tsolas, I. E., & Margaritis, D. (2010). Evaluation of credit risk based on firm performance. European Journal of Operational Research, 201(3),
873–881. https://doi.org/10.1016/j.ejor.2009.03.032
Quinlan, J. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Rodriguez, J. J., Kuncheva, L. I., & Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 28(10), 1619–1630. https://doi.org/10.1109/TPAMI.2006.211
Rokach, L. (2010). Ensemble‐based classifiers. Artificial Intelligence Review, 33(1–2), 1–39. https://doi.org/10.1007/s10462‐009‐9124‐7
Ron Kohavi, E. B. (1998). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 1–38.
Sadatrasoul, S. M., Gholamian, M., Siami, M., & Hajimohammadi, Z. (2013). Credit scoring in banks and financial institutions via data mining techniques: A
literature review. Journal of AI and Data Mining, 1(2), 119–129.
Sarlija, N. M. (2004). Multinomial model in consumer credit scoring. In 10th International Conference on Operational Research, Trogir, Croatia.
Schebesch, K. B., & Stecking, R. (2005). Support vector machines for classifying and describing credit applicants: Detecting typical and critical regions. Jour-
nal of the Operational Research Society, 56(9), 1082–1088. https://doi.org/10.1057/palgrave.jors.2602023
Sebe, N., Cohen, I., Garg, A., & Huang, T. S. (2005). Machine learning in computer vision (Vol. 29). Springer Science & Business Media, 3300 AA Dordrecht, The
Netherlands.
Serrano‐Cinca, C., & Gutiérrez‐Nieto, B. (2016). The use of profit scoring as an alternative to credit scoring systems in peer‐to‐peer (P2P) lending. Decision
Support Systems, 89, 113–122. https://doi.org/10.1016/j.dss.2016.06.014
Shin, K.‐s., & Han, I. (2001). A case‐based approach using inductive indexing for corporate bond rating. Decision Support Systems, 32(1), 41–52. https://doi.
org/10.1016/S0167‐9236(01)00099‐9
Srinivasan, G. (2014). Microfinance India: The social performance report 2013. SAGE Publications India.
Sun, J., Lang, J., Fujita, H., & Li, H. (2018). Imbalanced enterprise credit evaluation with DTE‐SBD: Decision tree ensemble based on SMOTE and bagging
with differentiated sampling rates. Information Sciences, 425, 76–91. https://doi.org/10.1016/j.ins.2017.10.017
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Better decisions through science. Scientific American, 283, 82–87. https://doi.org/10.1038/
scientificamerican1000‐82
Thompson, W. D., & Walter, S. D. (1988). A reappraisal of the kappa coefficient. Journal of Clinical Epidemiology, 41(10), 949–958. https://doi.org/10.1016/
0895‐4356(88)90031‐5
Tsai, C., & Wu, J. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34(4), 2639–2649.
Tsai, C.‐F. (2014). Combining cluster analysis with classifier ensembles to predict financial distress. Information Fusion, 16, 46–58. https://doi.org/10.1016/j.inffus.2011.12.001
Tsai, C.‐F., & Chen, M.‐L. (2010). Credit rating by hybrid machine learning techniques. Applied Soft Computing, 10(2), 374–380. https://doi.org/10.1016/j.
asoc.2009.08.003
Van Gool, J., Verbeke, W., Sercu, P., & Baesens, B. (2012). Credit scoring for microfinance: Is it worth it? International Journal of Finance and Economics, 17(2),
103–123. https://doi.org/10.1002/ijfe.444
Vapnik, V., & Cortes, C. (1995). Support‐vector networks. Machine Learning, 20, 273–297.
Verikas, A., Kalsyte, Z., Bacauskiene, M., & Gelzinis, A. (2010). Hybrid and ensemble‐based soft computing techniques in bankruptcy prediction: A survey.
Soft Computing, 14(9), 995–1010. https://doi.org/10.1007/s00500‐009‐0490‐5
Viera, A. J., & Garrett, J. M. (2005). Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5), 360–363.
Wang, G., Hao, J., Ma, J., & Jiang, H. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications, 38(1),
223–230. https://doi.org/10.1016/j.eswa.2010.06.048
Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge‐Based Systems, 26, 61–68.
https://doi.org/10.1016/j.knosys.2011.06.020
Wang, Q., Zhou, C.‐H., & Guo, J.‐K. (2007). Learning selective averaged one‐dependence estimators for probability estimation. In Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007).
Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196. https://doi.org/10.1023/
A:1007659514849
Webb, G. I., Boughton, J. R., & Wang, Z. (2005). Not so naive Bayes: Aggregating one‐dependence estimators. Machine Learning, 58(1), 5–24. https://doi.org/10.1007/s10994‐005‐4258‐6
Weiss, G. M., McCarthy, K., & Zabar, B. (2007). Cost‐sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs?
DMIN, 7, 35–41.
Weiss, G. M., & Provost, F. (2003). Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence
Research, 19, 315–354. https://doi.org/10.1613/jair.1199
Weka (2017). Weka 3: Data mining with open source machine learning software in Java. Retrieved January 5, 2017, from http://www.cs.waikato.ac.nz/ml/weka/
West, D., Dellana, S., & Qian, J. (2005). Neural network ensemble strategies for financial decision applications. Computers & Operations Research, 32(10), 2543–2559. https://doi.org/10.1016/j.cor.2004.03.017
Wheeler, R., & Aitken, S. (2000). Multiple algorithms for fraud detection. Knowledge‐Based Systems, 13(2–3), 93–99. https://doi.org/10.1016/S0950‐
7051(00)00050‐2
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning tools and techniques. Burlington, MA: Morgan Kaufmann.
Xiao, H., Xiao, Z., & Wang, Y. (2016). Ensemble classification based on supervised clustering for credit scoring. Applied Soft Computing, 43, 73–86. https://
doi.org/10.1016/j.asoc.2016.02.022
Xu, L. D., & Duan, L. (2018). Big data for cyber physical systems in Industry 4.0: A survey. Enterprise Information Systems, 1–22.
Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications,
34(2), 1434–1444. https://doi.org/10.1016/j.eswa.2007.01.009
Yu, L., Wang, S., & Lai, K. K. (2009). An intelligent‐agent‐based fuzzy group decision making model for financial multicriteria decision support: The case of
credit scoring. European Journal of Operational Research, 195(3), 942–959. https://doi.org/10.1016/j.ejor.2007.11.025
Yu, L., Yue, W., Wang, S., & Lai, K. K. (2010). Support vector machine based multiagent ensemble learning for credit risk evaluation. Expert Systems with
Applications, 37(2), 1351–1360. https://doi.org/10.1016/j.eswa.2009.06.083
Zhang, D., Zhou, X., Leung, S. C., & Zheng, J. (2010). Vertical bagging decision trees model for credit scoring. Expert Systems with Applications, 37(12), 7838–7843. https://doi.org/10.1016/j.eswa.2010.04.054
Zhou, Z. H. (2009). Ensemble. In Encyclopedia of database systems (pp. 988–991). Boston, MA: Springer.
AUTHOR BIOGRAPHIES
Adaleta Gicić graduated from the University of Sarajevo in 2003 and received her MSc degree from the same university in 2007, both in Electrical Engineering. From 2003 until 2005, she worked as an engineer on the development of telecom call centre exchanges, implementing automatic call distribution and signalling protocols in C++. From 2005 until 2008, she worked as a Java developer, Oracle and MySQL database developer, and designer for LRC on a World Bank‐funded project to establish a modern credit reporting agency. From 2008 to 2009, she worked as a Project Manager and Java Consultant for IT‐Softlab. From 2009 to 2018, she worked as a System Architect and Developer of core banking and data warehouse solutions for banks such as Sparkasse, Volksbank, Unicredit, BBI, and other local banks within the Info Studio d.o.o. company. She has also worked as a Business Intelligence Analyst and Data Warehouse Developer and has implemented machine learning in credit scoring. Among other achievements, she took first place in Bosnia and Herzegovina in Einstein's theory of relativity, atomic and nuclear physics, and optics, and consequently participated in the International Physics Olympiad in Ontario, Canada, in 1997. She is also Cisco, Oracle, and Java certified.

Dr. Abdulhamit Subasi graduated from Hacettepe University in 1990. He received his MSc degree from Middle East Technical University in 1993 and his PhD degree from Sakarya University in 2001, all in Electrical and Electronics Engineering. In 2006, he was a senior researcher at the Georgia Institute of Technology, School of Electrical and Computer Engineering, Georgia, USA. Since 2015, he has been working as a Professor of Information Systems at Effat University, Jeddah, Saudi Arabia. His areas of interest are data mining, machine learning, pattern recognition applications in biomedical signal/image processing, smart healthcare, IoT, big data, cybersecurity, computer networks, and security.

How to cite this article: Gicić A, Subasi A. Credit scoring for a microcredit data set using the synthetic minority oversampling technique
and ensemble classifiers. Expert Systems. 2019;36:e12363. https://doi.org/10.1111/exsy.12363

APPENDIX A

Every record of the data set consists of 25 uncorrelated attributes with discrete values, as follows:

1. Monthly available amount
2. Duration: Credit duration in months
3. Credit history: 0—no credit or all credits paid back regularly, 1—all credits at this organization paid off regularly, 2—existing credit paid back regularly until now, 3—delay in paying off in the past, 4—critical account/other credit existing (not at this organization)
4. Purpose: 1—business, 2—investment, 3—consumers, 4—everyday expenses, 5—general
5. Credit amount
6. Gender: 0—not defined, 1—female, 2—male
7. Employment status: 1—unemployed, 2—employed
8. Amount of credit: Total of all credit instalments that the person pays monthly
9. Type of customer: F—Retail, P—Company
10. Guarantors: 1—user, 2—mortgage, 3—warrant of internal payment, 4—solidarity group, 5—administrative prohibition, 6—promissory note, 7—approval of the seizure of salary, 8—cession, 0—none
11. Reprogrammed numeric: 0—no, 1—yes
12. Family size
13. Age
14. Marital status: 1—married, 2—widow, 3—nonmarried, 4—divorced
15. Housing: 1—rent, 2—owner, 3—free rent, 4—not available
16. Number of active credits in this company
17. Number of all credits in this company
18. Nationality
19. Telephone: 1—none, 2—yes
20. Credit type: 1—External, 2—Seasonal, 3—Internal, 0—Other/Not defined
21. Business type group: 1—Agriculture, 2—Production, 3—Trade, 4—Services, 0—Other/Not defined
22. Number of school children
23. Business duration
24. Number of workers
25. Credit status: A—Active, P—Paid‐off
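
To make the attribute encoding above concrete, the following minimal sketch (not the authors' code) shows how records holding these 25 discrete attributes could be assembled into a feature matrix and rebalanced with SMOTE before training an ensemble classifier, using Python with scikit‐learn and imbalanced‐learn. The synthetic records, the 80/15/5 three‐class split, and the Random Forest stand‐in for the ensembles studied in the paper are illustrative assumptions only.

# A hedged sketch, not the authors' implementation: encode 25-attribute
# records, oversample the minority classes with SMOTE, and fit an ensemble.
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE          # pip install imbalanced-learn
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for the microcredit records: 1,000 rows, each holding
# the 25 discrete attributes listed above (values here are arbitrary codes).
X = rng.integers(0, 5, size=(1000, 25)).astype(float)

# Imbalanced three-class target, mimicking the paper's three-class setting;
# the class proportions are an assumption, not the real data set's distribution.
y = rng.choice([0, 1, 2], size=1000, p=[0.80, 0.15, 0.05])
print("Before SMOTE:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between a
# record and its k nearest minority-class neighbours (k = 5 by default).
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))

# Random Forest stands in here for the CTC and Rotation Forest ensembles
# evaluated in the paper; any ensemble classifier could be trained instead.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)

As in the paper's preprocessing step, SMOTE balances the class distribution by generating synthetic minority‐class samples rather than duplicating existing records, after which the ensemble is trained on the rebalanced set.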
