International Journal of Computational Intelligence Systems, Vol. 11 (2018) 925-935
1 FinanceIT Research Group, University of New South Wales, Sydney, NSW, Australia
E-mail: Anahita.namvar@gmail.com, f.rabhi@unsw.edu.au

2 Centre for Artificial Intelligence, University of Technology Sydney, Sydney, NSW, Australia
E-mail: Mohammad.siaminamini@uts.edu.au, Mohsen.Naderpour@uts.edu.au
Abstract
Credit risk prediction is an effective way of evaluating whether a potential borrower will repay a loan, particularly
in peer-to-peer lending where class imbalance problems are prevalent. However, few credit risk prediction models for
social lending consider imbalanced data and, further, the best resampling technique to use with imbalanced data is still
controversial. In an attempt to address these problems, this paper presents an empirical comparison of various
combinations of classifiers and resampling techniques within a novel risk assessment methodology that incorporates
imbalanced data. The credit predictions from each combination are evaluated with a G-mean measure to avoid bias
towards the majority class, which has not been considered in similar studies. The results reveal that combining random
forest and random under-sampling may be an effective strategy for calculating the credit risk associated with loan
applicants in social lending markets.
Keywords: Risk prediction, peer-to-peer lending, imbalance classification, resampling.
risk of borrowers, simple statistical approaches such as linear discriminant analysis and logistic regression still remain popular because of their high accuracy and ease of implementation [6-9].

2.2. Class Imbalance Problem

Class imbalance problems are common in classification tasks. Class imbalance occurs when the number of instances in one class is vastly different from the number of instances in another. In such cases, classifiers tend to be biased towards the majority class, while the minority class is ignored [10]. Many algorithms designed to address imbalanced data classification problems have emerged over the past decade, and resampling is one of the most important strategies for solving this issue [11]. Resampling generates a balanced training dataset prior to building the classification model.

The three types of resampling techniques are over-sampling, under-sampling, and a hybrid of the two [11].

Over-sampling creates new samples in the minority class. Three prominent over-sampling methods are random over-sampling, the synthetic minority over-sampling technique (SMOTE), and adaptive synthetic sampling. Random over-sampling randomly duplicates the minority samples to balance the distribution of data [12]. SMOTE uses k-nearest neighbors to produce new instances based on the distance between the minority data and some randomly selected nearest neighbors [13]. Adaptive synthetic sampling uses the density distribution to generate synthetic samples for each minority instance [14].

Under-sampling discards samples from the majority class [15]. Random under-sampling and the instance hardness threshold are the state-of-the-art under-sampling techniques. Random under-sampling randomly eliminates examples from the majority class to balance the class distribution [12]. The instance hardness threshold method measures the probability that an instance will be misclassified and uses a constant threshold to filter instances in all iterations [16].

The third resampling approach is a hybrid method that combines over- and under-sampling. SMOTE + Tomek links (SMOTE-TOMEK) and SMOTE + edited nearest neighbor (SMOTE-ENN) are the two hybrid techniques compared in this study. SMOTE-TOMEK corrects SMOTE data by finding pairs of minimally distanced nearest neighbors of opposite classes. It then identifies and removes Tomek links to produce a balanced dataset with well-defined classes. SMOTE-ENN follows the same procedure as SMOTE-Tomek but uses edited nearest neighbors to balance the dataset.

Class imbalance is a common problem in loan default prediction. Abeysinghe, Li and He [17], Brown and Mues [18], and Sanz, Bernardo, Herrera, Bustince and Hagras [19] have all provided solutions to class imbalance problems for credit risk prediction. However, to the best of our knowledge, only a few studies have addressed the class imbalance problem in detail in the social lending context.

3. Research Methodology

To help lenders evaluate the creditworthiness of borrowers in social lending platforms, we developed a decision support system that includes a novel prediction model to reduce the risk of loan defaults.

Fig. 1. Research methodology

Figure 1 illustrates the model. The model first takes the raw data and passes it through a feature engineering step
to clean the data and select the appropriate number of features. After that, the prepared data is divided into a training set and a testing set. The challenge is that imbalanced data is a common problem in credit risk evaluation, and it can cause misclassification. Resampling approaches are therefore applied to the training set to solve the imbalance issue, and the balanced data is fed to state-of-the-art prediction models for training. The classification results are then validated, and we may feed back and revise our approach to reach better classification results.

Since this study intends to investigate the most efficient combination of resampling approaches and state-of-the-art machine learning algorithms, the model implementation is repeated for different combinations.

In the following subsections, each model component is explained in detail.

3.1. Feature Engineering

The first component in the model is a feature engineering module. Its main purpose is to enhance data reliability by cleaning the data and selecting the subset of data features with the most discriminatory power. In credit risk prediction, ignoring irrelevant features can increase classification accuracy [20] and decrease the computational costs associated with running several machine learning models [21]. Feature selection also reduces the dimensionality of the data, which helps mitigate the risk of overfitting. Our model comprises four important steps in feature engineering: data cleaning, leaky data removal, data transformation and correlation analysis, and deriving new attributes.

The data is cleaned by first removing missing and null values from the dataset [22]; then all outliers are removed according to the acceptable range defined in [23]. Eq. (1) shows the lower and upper bounds of this range.

Lower bound < Acceptable range < Upper bound   (1)
Lower bound = Q1 − 1.5 × (Q3 − Q1)
Upper bound = Q3 + 1.5 × (Q3 − Q1)

where Q3 is the third quartile and Q1 is the first quartile.

Once the data has been cleaned, the features that may cause data leakage in the model are identified. These features typically do not have values available at the time the prediction model is used, so training the model with these features may produce unrealistically good predictions. Surprisingly, the Lending Club dataset contains leakage data, but few studies relying on this dataset have taken that into consideration.

Data transformation includes converting categorical features to numeric, standardization, and log transformation. Some classification algorithms, such as logistic regression, cannot handle categorical features, so they are transformed into new forms of data. For standardization, the min-max normalization in Eq. (2) ensures that all parameters use the same scale.

X_n = (X − X_min) / (X_max − X_min)   (2)

A log transformation reduces skewness in the data distribution (it cannot be applied to zero or negative values).

Correlations for numeric and binarized nominal attributes are then computed with respect to the loan status to provide a better understanding of the data and its attributes. Finally, following Malekipirbazari and Aksakalli [2], we define variables that are simple ratios of other features. Moreover, this study introduces a new non-standard financial feature. These ratios help to capture certain borrower characteristics to make the most of the available data.

3.2. Imbalanced Learning Approaches

This study employs resampling approaches to deal with the imbalance problem. Figure 2 shows the three categories of resampling approaches: under-sampling, over-sampling, and hybrid methods. We address the state-of-the-art algorithms in each of these categories. The under-sampling approaches include the random under-sampling (RUS) and instance hardness threshold (IHT) algorithms. For the over-sampling approach, random over-sampling (ROS), the synthetic minority over-sampling technique (SMOTE), and adaptive synthetic sampling (ADASYN) are studied. Finally, SMOTE + Tomek links (SMOTE-TOMEK) and SMOTE + edited nearest neighbor (SMOTE-ENN) are the two prominent hybrid approaches considered in this research. Section 2.2 presents the complete literature on these algorithms.
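The paper does not say which implementation of these resamplers was used. As a rough illustration of the two core mechanisms, random under-sampling and SMOTE-style synthetic over-sampling (applied, as noted above, only to the training split), a minimal NumPy sketch might look like this:

```python
import numpy as np

def random_undersample(X, y, rng):
    """RUS idea: randomly drop majority-class rows until every class
    matches the minority-class count."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def smote_like(X_minority, n_new, k, rng):
    """SMOTE idea: create a synthetic point on the line segment between
    a minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))
        dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)
```

In practice the imbalanced-learn package provides tested implementations of every method listed in this section (RandomUnderSampler, InstanceHardnessThreshold, RandomOverSampler, SMOTE, ADASYN, SMOTETomek, SMOTEENN); the sketch above only illustrates the mechanics.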
post charge off collection fee, last payment date, last payment amount, and fund amount.

The variables with skewed distributions were annual income, income to payment ratio, and revolving to income ratio. Log transformations were applied to create normal distributions for these variables.

Four categorical attributes were selected as nominal attributes and transformed into binarized data: term (2 categories), home ownership (3 categories), verification status (3 categories), and purpose (12 categories). Ultimately, 2 + 3 + 3 + 12 = 20 numerical attributes replaced these four categorical attributes.

The new attributes, defined based on the monthly income, installment amount, and revolving amount features, were the income-to-payment ratio and revolving-to-payment ratio [2]. Further, we defined a new ratio that, to the best of our knowledge, has not been previously used in any study, called new debt to income (New DTI). New DTI considers the impact of the loan with respect to loan status. Table 2 shows the correlations for the top-20 attributes.

LC grade and interest rate sit at the top of the list. Lending Club assigns each loan a grade, expressed as numbers between 1 and 7 (1 = A and 7 = G). They also assign an interest rate to each loan. G-grade loans are given the highest interest rate; A-grade loans have the lowest rate. However, LC grades and interest rates are leaky data, so despite their high correlation with loan status, they should not be included in the modeling process or overly optimistic predictions may result.

The new DTI ratio sits in third place with a correlation of 0.171, which shows that it can have a positive effect on the classification results.

Table 1. Lending Club attributes

Target variable:
  Loan status – Current status of the loan.

Loan characteristics:
  Term – The number of monthly payments on the loan; either 36 or 60.
  Loan amount – The total amount of the loan.
  Purpose – A category provided by the borrower for the loan request; 12 purposes included.

Borrowers' solvency:
  New debt to income (New DTI) – The ratio of the new monthly repayment amount (if the loan is approved) to monthly income. This considers the impact of the new loan repayments.
  Income to payment ratio – (annual income / 12) / installment.
  Debt to income (DTI) – Ratio of the borrower's total monthly debt to income.
  Installment – The monthly payment owed by the borrower if the loan originates.
  Inquiries last 6 months – The number of inquiries in the past 6 months (excluding auto and mortgage inquiries).
  Credit age – How long the earliest account opened by the borrower has been open.
  Delinquencies – The number of delinquencies in the borrower's credit file for the past 2 years.
  Public record – Number of derogatory public records.
  Open account – The number of open credit lines in the borrower's credit file.
  Revolving balance – Total credit revolving balance.
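The cleaning and transformation steps of Section 3.1 (the Eq. (1) outlier range, Eq. (2) min-max scaling, the log transformation, and the binarization just described) translate directly into code; a small sketch, not tied to any particular library:

```python
import numpy as np

def iqr_bounds(x):
    # Eq. (1): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers.
    q1, q3 = np.percentile(x, [25, 75])
    return q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

def min_max(x):
    # Eq. (2): rescale a feature to the [0, 1] interval.
    return (x - x.min()) / (x.max() - x.min())

def log_transform(x):
    # Reduces right skew; valid only for strictly positive values.
    return np.log(x)

def binarize(values, categories):
    # One-hot encode a nominal attribute, e.g. purpose -> 12 columns.
    return np.array([[float(v == c) for c in categories] for v in values])
```

Binarizing term, home ownership, verification status, and purpose this way yields the 2 + 3 + 3 + 12 = 20 numerical columns mentioned in the text.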
Table 2. Correlation with respect to loan status

No.  Attribute                           Correlation
1    LC grade                            -0.242
2    Interest rate                       -0.224
3    New DTI                             -0.171
4    Income to payment ratio              0.141
5    DTI                                 -0.132
6    Home ownership: RENT                -0.101
7    Revolving utilization rate          -0.100
8    percent_bc_gt_75                    -0.096
9    Average current balance              0.095
10   Term                                 0.094
11   Home ownership: MORTGAGE             0.092
12   Total current balance                0.089
13   Verification status: verified       -0.077
14   Installment                         -0.076
15   Verification status: not verified    0.074
16   Inquiries last 6 months             -0.074
17   Loan amount                         -0.067
18   Annual income                        0.066
19   Total revenue high limit             0.064
20   Employment length                    0.043

A brief description of the final dataset appears in Table 3.

Table 3. Final dataset description

N      Features   % Default   % Fully paid   Imbalance ratio
66376  43         18.3        81.7           4.46

"N" represents the number of instances in the dataset.

Accuracy does not consider that false positives are worse than false negatives and, therefore, accuracy can be a misleading criterion that causes erroneous results [28]. As such, other performance measures are more appropriate when working in domains with class imbalance issues [29]. Receiver operating characteristics (ROC), G-mean, and F-measure (FM) are preferred, as the likelihood that these measures will be affected by imbalanced class distributions is low [11].

In a binary classification problem with good and bad class labels, the classifier's result is considered successful if both the false positive rate and the false negative rate are small [15]. Sensitivity measures the accuracy on positive samples, and specificity measures the accuracy on negative samples. Additionally, an effective performance measure in imbalanced settings will indicate the balance between classification performance in both the minority and majority class [30]. The G-mean measure considers both sensitivity and specificity for both classes in calculating its scores and is therefore an effective measure for imbalanced datasets [15]. The G-mean equation is shown in Eq. (5).

G-mean = √(Sensitivity × Specificity)   (5)

where

Sensitivity = True Positive / (True Positive + False Negative)
Specificity = True Negative / (True Negative + False Positive)
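Eq. (5) and the sensitivity and specificity definitions above translate directly into code; a small sketch computing them from raw confusion-matrix counts:

```python
import math

def sensitivity(tp, fn):
    # Accuracy on the positive class: TP / (TP + FN)
    return tp / (tp + fn)

def specificity(tn, fp):
    # Accuracy on the negative class: TN / (TN + FP)
    return tn / (tn + fp)

def g_mean(tp, fn, tn, fp):
    # Eq. (5): geometric mean of sensitivity and specificity.
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))
```

Because it multiplies the two per-class accuracies, the G-mean collapses to zero whenever a classifier ignores one class entirely, which is exactly why it is preferred over plain accuracy in imbalanced settings.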
… in accurately predicting defaulters, and its bias toward the majority class despite an over-sampling approach designed to solve class imbalance issues. Random forest showed the lowest specificity and G-mean, but …

Classification results (under-sampling approaches):

Classifier   Accuracy   AUC      Sensitivity   Specificity   FP-rate   G-mean
RF-RUS       0.692      0.69     0.717         0.582         0.42      0.65
LR-RUS       0.693      0.71     0.723         0.558         0.442     0.635
LDA-RUS      0.676      0.7034   0.695         0.589         0.42      0.64
LR-IHT       0.71       0.7      0.76          0.51          0.49      0.62
LDA-IHT      0.713      0.7      0.759         0.505         0.494     0.619
RF-IHT       0.75       0.688    0.83          0.4           0.61      0.57

Classification results (over-sampling approaches):

Classifier   Accuracy   AUC      Sensitivity   Specificity   FP-rate   G-mean
LR-ROS       0.7        0.703    0.735         0.542         0.458     0.63
LR-ADASYN    0.64       0.7      0.64          0.64          0.36      0.64
RF-ROS       0.699      0.689    0.74          0.513         0.487     0.616
RF-SMOTE     0.6814     0.658    0.725         0.486         0.513     0.594
RF-ADASYN    0.8        0.66     0.94          0.16          0.84      0.39
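Reading each results row as (accuracy, AUC, sensitivity, specificity, false-positive rate, G-mean) — the assignment most consistent with the reported numbers — the columns can be cross-checked against one another, since FP-rate should be roughly 1 − specificity and G-mean should follow Eq. (5):

```python
import math

# (sensitivity, specificity, reported FP-rate, reported G-mean)
# for three of the under-sampling rows above
rows = {
    "RF-RUS": (0.717, 0.582, 0.42, 0.65),
    "LR-RUS": (0.723, 0.558, 0.442, 0.635),
    "RF-IHT": (0.83, 0.4, 0.61, 0.57),
}
for name, (sens, spec, fp_rate, g) in rows.items():
    # FP-rate = 1 - specificity, up to rounding in the paper
    assert abs((1 - spec) - fp_rate) < 0.015, name
    # Eq. (5): G-mean = sqrt(sensitivity * specificity)
    assert abs(math.sqrt(sens * spec) - g) < 0.01, name
```

The reported metrics are mutually consistent under this reading, up to the two- or three-decimal rounding used in the tables.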
Given the performance measures in Table 7, the hybrid re-sampling methods did not produce good results. All methods had low accuracy. The models using SMOTE-ENN resulted in lower false positive rates and higher specificity, but their sensitivity scores were incredibly low. This is an indication that the model identifies most customers as defaulters, based on significantly high performance in the minority class (the defaulting customers) while showing low performance on the other class. For example, RF-SMOTE-ENN only correctly predicted 33.7% of the customers as good (class 1). Despite the poor performance of the models presented in Table 7, LR-SMOTE-Tomek, with a G-mean of 0.64, represents the best classification of the hybrid re-sampling methods.

Table 7. Classification results (hybrid approach)

Table 8. Classification results (final comparison)

Classifier            Accuracy   AUC     Sensitivity   Specificity   FP-rate   G-mean
RF-RUS                0.692      0.69    0.717         0.582         0.42      0.65
LDA-SMOTE             0.64       0.7     0.63          0.65          0.35      0.643
LR-Tomek              0.64       0.7     0.638         0.648         0.352     0.643
Logistic regression   0.8173     0.703   0.988         0.048         0.95      0.218
Random forest         0.8176     0.696   0.996         0.015         0.98      0.12

In terms of G-mean, random forest had the lowest rating (0.12) on the imbalanced dataset, while RF-RUS (i.e., random forest on a balanced dataset) had the highest G-mean of 0.65. However, random forest and logistic …
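The gap between random forest's high accuracy (0.8176) and near-zero G-mean (0.12) is the classic majority-class trap. With the dataset's 81.7/18.3 class split from Table 3, even a degenerate model that labels every loan "fully paid" looks accurate:

```python
import math

p_good, p_default = 0.817, 0.183  # class shares from Table 3

# A degenerate classifier that predicts "fully paid" for every loan:
sens = 1.0   # every good loan is caught...
spec = 0.0   # ...but no default is ever detected

accuracy = p_good * sens + p_default * spec
g_mean = math.sqrt(sens * spec)

print(accuracy, g_mean)  # 0.817 0.0
```

An accuracy of 0.817 therefore signals nothing by itself, while the G-mean of 0 immediately exposes the ignored minority class.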
… engineering process, and we introduced a non-standard financial feature to increase the reliability of the computed risk scores. Additionally, given that the Lending Club dataset contains imbalanced classes, we also compared different resampling methods to determine the best overall technique. Accordingly, the state-of-the-art classifiers – random forest, logistic regression and linear discriminant analysis – were combined with different resampling techniques and tested on Lending Club's data. Our experiments show that random forest and random under-sampling may be an efficient combination of classifier and resampling strategy to compute risk scores for loan applicants in social lending markets.

P2P lenders can take advantage of the credit risk prediction modeling discussed in this study to make smarter decisions when evaluating loan applications. Moreover, lenders might apply the attributes identified in this study to compute the creditworthiness of borrowers. Identifying default borrowers in advance can prevent financial loss. Furthermore, more accurate assessments of the probability of default may also help when developing strategies to compensate for risk, such as increased interest rates.

One area for future research may consider the support vector machine as a classifier. A support vector machine algorithm may show better performance than the algorithms assessed in this paper, even though fine-tuning the parameters is time-consuming. Additionally, Lending Club regularly publishes its historical data. Given that the underlying distribution of incoming data may change unpredictably over time, i.e., the data may contain concept drift, these changes could affect the accuracy of prediction models in the future. Therefore, another area for future work could focus on concept drift in imbalanced data streams on social lending platforms.

References

1. Guo, Y., Zhou, W., Luo, C., Liu, C., and Xiong, H.: 'Instance-based credit risk assessment for investment decisions in P2P lending', European Journal of Operational Research, 2016, 249, (2), pp. 417-426
2. Malekipirbazari, M., and Aksakalli, V.: 'Risk assessment in social lending via random forests', Expert Systems with Applications, 2015, 42, (10), pp. 4621-4631
3. Emekter, R., Tu, Y., Jirasakuldech, B., and Lu, M.: 'Evaluating credit risk and loan performance in online Peer-to-Peer (P2P) lending', Applied Economics, 2015, 47, (1), pp. 54-70
4. Xia, Y., Liu, C., and Liu, N.: 'Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending', Electronic Commerce Research and Applications, 2017, 24, pp. 30-49
5. Lin, W.-C., Tsai, C.-F., Hu, Y.-H., and Jhang, J.-S.: 'Clustering-based undersampling in class-imbalanced data', Information Sciences, 2017, 409, pp. 17-26
6. Ala'raj, M., and Abbod, M.F.: 'Classifiers consensus system approach for credit scoring', Knowledge-Based Systems, 2016, 104, pp. 89-105
7. Byanjankar, A., Heikkilä, M., and Mezei, J.: 'Predicting credit risk in peer-to-peer lending: A neural network approach' (IEEE, 2015), pp. 719-725
8. Siami, M., Gholamian, M.R., Basiri, J., and Fathian, M.: 'An application of locally linear model tree algorithm for predictive accuracy of credit scoring' (Springer, 2011), pp. 133-142
9. Siami, M., Gholamian, M.R., and Basiri, J.: 'An application of locally linear model tree algorithm with combination of feature selection in credit scoring', International Journal of Systems Science, 2014, 45, (10), pp. 2213-2222
10. Chawla, N.V., Japkowicz, N., and Kotcz, A.: 'Special issue on learning from imbalanced data sets', ACM SIGKDD Explorations Newsletter, 2004, 6, (1), pp. 1-6
11. Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., and Bing, G.: 'Learning from class-imbalanced data: review of methods and applications', Expert Systems with Applications, 2017, 73, pp. 220-239
12. Zhu, B., Baesens, B., Backiel, A., and vanden Broucke, S.K.: 'Benchmarking sampling techniques for imbalance learning in churn prediction', Journal of the Operational Research Society, 2017, pp. 1-17
13. Chawla, N.V., Bowyer, K.W., Hall, L.O., and Kegelmeyer, W.P.: 'SMOTE: synthetic minority over-sampling technique', Journal of Artificial Intelligence Research, 2002, 16, pp. 321-357
14. He, H., Bai, Y., Garcia, E.A., and Li, S.: 'ADASYN: Adaptive synthetic sampling approach for imbalanced learning' (IEEE, 2008), pp. 1322-1328
15. Gong, J., and Kim, H.: 'RHSBoost: Improving classification performance in imbalance data', Computational Statistics & Data Analysis, 2017, 111, pp. 1-13
16. Smith, M.R., Martinez, T., and Giraud-Carrier, C.: 'An instance level analysis of data complexity', Machine Learning, 2014, 95, (2), pp. 225-256
17. Abeysinghe, C., Li, J., and He, J.: 'A Classifier Hub for Imbalanced Financial Data' (Springer, 2016), pp. 476-479
18. Brown, I., and Mues, C.: 'An experimental comparison of classification algorithms for imbalanced credit scoring data sets', Expert Systems with Applications, 2012, 39, (3), pp. 3446-3453
19. Sanz, J.A., Bernardo, D., Herrera, F., Bustince, H., and Hagras, H.: 'A compact evolutionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial applications with imbalanced data', IEEE Transactions on Fuzzy Systems, 2015, 23, (4), pp. 973-990
20. Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., and Alsaadi, F.E.: 'A survey of deep neural network architectures and their applications', Neurocomputing, 2017, 234, pp. 11-26
21. Koutanaei, F.N., Sajedi, H., and Khanbabaei, M.: 'A hybrid data mining model of feature selection algorithms and ensemble learning classifiers for credit scoring', Journal of Retailing and Consumer Services, 2015, 27, pp. 11-23
22. Umarani, J., and Manikandan, S.: 'Generating Enhanced Web Log File using Advanced Data Cleansing Algorithm in Pre-Processing Phase', Indian Journal of Science and Technology, 2016, 9, (48), pp. 1-7
23. de Oliveira, E.C., de Faro Orlando, A., dos Santos Ferreira, A.L., and de Oliveira Chaves, C.E.: 'Comparison of different approaches for detection and treatment of outliers in meter proving factors determination', Flow Measurement and Instrumentation, 2016, 48, pp. 29-35
24. Xia, Y., Liu, C., Li, Y., and Liu, N.: 'A boosted decision tree approach using Bayesian hyper-parameter optimization for credit scoring', Expert Systems with Applications, 2017, 78, pp. 225-241
25. James, G., Witten, D., Hastie, T., and Tibshirani, R.: 'An introduction to statistical learning' (Springer, 2013)
26. Kumar, K., and Bhattacharya, S.: 'Artificial neural network vs linear discriminant analysis in credit ratings forecast: A comparative study of prediction performances', Review of Accounting and Finance, 2006, 5, (3), pp. 216-227
27. Khemakhem, S., and Boujelbene, Y.: 'Credit risk prediction: A comparative study between discriminant analysis and the neural network approach', Accounting and Management Information Systems, 2015, 14, (1), pp. 60
28. Abellán, J., and Castellano, J.G.: 'A comparative study on base classifiers in ensemble methods for credit scoring', Expert Systems with Applications, 2017, 73, pp. 1-10
29. Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., and Herrera, F.: 'A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches', IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012, 42, (4), pp. 463-484
30. Wang, S., Minku, L.L., and Yao, X.: 'Online class imbalance learning and its applications in fault detection', International Journal of Computational Intelligence and Applications, 2013, 12, (04), pp. 1-19