Professional Documents
Culture Documents
a r t i c l e i n f o a b s t r a c t
Article history: Traditional behavioral scoring models applying classification methods that yield a static probability of
Received 12 September 2017 default may ignore the borrowers’ dynamic characteristics because borrower repayment behavior
Received in revised form 18 December 2017 evolves dynamically. In this study, we propose a novel behavioral scoring model based on a mixture sur-
Accepted 18 December 2017
vival analysis framework to predict the dynamic probability of default over time in peer-to-peer (P2P)
Available online 21 December 2017
lending. A random forest is utilized to identify whether a borrower will default, and a random survival
forest is introduced to model the time to default. The results of an empirical analysis on a Chinese P2P
Keywords:
loan dataset show that the proposed ensemble mixture random forest (EMRF) has a better performance
Behavioral scoring
Dynamic probability
in terms of predicting the monthly dynamic probability of default, while compared with standard mix-
P2P lending ture cure model, Cox proportional hazards model and logistic regression. It is also concluded that the pro-
Random forest posed EMRF model provides a meaningful output for timely post-loan risk management.
Random survival forest Ó 2017 Elsevier B.V. All rights reserved.
Risk management
Survival analysis
https://doi.org/10.1016/j.elerap.2017.12.006
1567-4223/Ó 2017 Elsevier B.V. All rights reserved.
Z. Wang et al. / Electronic Commerce Research and Applications 27 (2018) 74–82 75
analysis method and it considers the status of each analysis object splitting attributes. Because the number of candidate split attri-
in different periods, let us define a censored indicator function di butes m, is typically smaller than the full set of M attributes
for credit assessment where di = 1 denotes that borrower i defaults (m < M), the search process of a random forest is very fast. On
within the observation period and di = 0 otherwise. Let yi denote the other hand, pruning is usually needed when growing a single
the default indicator during entire loan period, where yi = 1 tree to avoid overfitting and achieve the optimal prediction
denotes that borrower i will experience a default event during or strength with an appropriate complexity. The full random forest
after the observation period and yi = 0 denotes that borrower i will performs no pruning. Mitigating the detrimental effect of variance
never experience a default event. This condition, it will be censored through model averaging, random forest with unpruned trees
at the end of any observation period. In the definition of d and y, reduces both bias and variance (Breiman, 2001).
borrowers may be in one of three possible states (see Table 1). The latency component of MRF, which models the effects of the
To start, we provide the expression for the mixture random for- attributes vector x on the survival distribution of the uncured
est (MRF). Let PD(z)denote the PD given the vector of covariates z. subpopulation, is implemented by the random survival forest
Additionally, we define S(t|y = 1, x)as the survival PD at time t given model (Ishwaran et al., 2011). A random survival forest is an
a certain covariate vector x condition that the borrower will even- ensemble survival tree method used to analyze censored survival
tually default. Specifically, S(t|y = 1, x) = P(T > t|y = 1, x), where T data. The process of building a survival forest is similar to that of
denotes the elapsed time until borrower default from loan building random forest in that it starts by drawing multiple boot-
approval. The mixture random forest expression is strap samples. However, in contrast to a random forest, which
working with the classification and regression tree (CART) paradigm,
SðtÞ ¼ ð1 PDðzÞÞ þ PDðzÞSðtjy ¼ 1; xÞ; ð4Þ a random survival forest consists of multiple survival trees based
where S(t) denotes the probability of borrower survival after time on bootstrap samples. Then, each grown survival tree calculates a
period t. PD(z) is referred to as the incidence component and cumulative hazard function (CHF) and average to obtain the ensem-
S(t|y = 1, x) is referred to as the latency component. These two com- ble CHF.
ponents are respectively modeled by a random forest and a random As central elements of random survival forest, growing survival
survival forest. Thus, it is possible to distinguish the factors that trees and calculating the CHFs are discussed in detail. Similar to
affect the PD from factors that affect the distribution of default time. classification and regression trees (CART), the process of growing
Starting with the incidence component of MRF, the effect of the survival trees is implemented by recursively splitting the tree
attributes vector z on the PD is modeled using the random forest nodes. First, the root node is split into a left and right daughter
(Breiman, 2001). A random forest, which builds multiple classifica- node according to a predetermined survival criterion whose pur-
tion and regression trees on bootstrapped samples, can be consid- pose is to maximize the survival difference between daughters.
ered as an enhanced bagging method. In contrast to ensemble Then, each daughter node is subsequently split into child nodes.
methods such as bagging and boosting, which can generally work This process is repeated recursively for each daughter node until
with any type of base learner, random forest works with a partic- the survival tree finally reaches a saturated state, at which point
ular type of learner: the classification and regression tree. In the no new daughter nodes can be formed under the constraint condi-
context of P2P lending, it has been demonstrated that random for- tion that each node must contain at least one unique uncured
est outperforms several other state-of-the-art classification meth- observation. We define the most extreme nodes in a saturated tree
ods such as logistic regression and support vector machines as terminal nodes, TN.
(Malekipirbazari and Aksakalli, 2015). The pseudocode of the ran- Let fðT 1 h ; d1 h Þ; ðT 2 h ; d2 h Þ; . . . ; ðT n h ; dn h Þg be the survival times
dom forest algorithm is given in Fig. 3. and censored indicator values for n borrowers at a terminal node
Compared with growing a single decision tree, a random forest h (h 2 TN). Specifically, let t 1 h < t 2 h < < t NðhÞ h be the NðhÞ
has some unique efficiencies. On one hand, while selecting the distinct default times. The cumulative hazard function (CHF) for
splitting attributes, all the attributes are tested at each node in terminal node h is estimated as follows:
the growing algorithm of a single decision tree, while the random X dl h
^ h ðtÞ ¼
H ; ð5Þ
forest itself needs to test only those selected as the m candidate Y
t 6t l h
l h
Table 1
where dl h denotes the number of borrowers in default at time tl h ,
Possible states of a borrower. and Y l h denotes the number of borrowers at risk at time t l h .
To determine the cumulative hazard function (CHF) for each
d y Description
borrower i, we drop the attribute vector xi of each borrower i down
1 1 Non-censored, and the borrower is observed to default the survival tree. Because of the binary splitting mechanism, the
0 1 Censored, and the borrower will eventually default
0 0 Censored, and the borrower will never default
attribute vector xi will wind up at a unique terminal node, h. The
cumulative hazard function (CHF) for each borrower i is given by
^ h ðtÞ;
Hðtjxi Þ ¼ H ifxi 2 h: ð6Þ the unique observations of O and the rest are duplicated. The final
output of the ensemble mixture random forest (EMRF) is a two-
dimensional PD matrix in which each row represents a borrower
3.2.2. Model estimation and each column represents an observation time point. The output
Let ðti ; di ; zi ; xi Þ denote the observed information for the bor- of the EMRF, supplied by the M ensemble , is derived as follows:
rower i, i ¼ 1; 2; . . . ; n, where ti denotes the observed survival time,
di denotes the censored status, and zi and xi respectively denote the 1X s
Mensemble ¼ MMRF j ; ð11Þ
attribute vector for the random forest (incidence component) and s j¼1
the random survival forest (latency component). The complete
likelihood function of Eq. (4) can be expressed as follows: where MMRF j denotes the output PD matrix of mixture random for-
est built on the jth observation subset Oj.
Y
n
½1 PDðzi Þ1yi PDðzi Þyi hðti jy ¼ 1; xi Þdi yi Sðt i jy ¼ 1; xi Þyi ;
i¼1 4. Empirical evaluation
ð7Þ
The empirical evaluation was performed on a PC with a 3.20
where hðti jy ¼ 1; xi Þ is the hazard function corresponding to
GHz Intel Core i5-3470 CPU and 12 GB of RAM using the Windows
Sðt i jy ¼ 1; xi Þ.
7 operating system. The data mining toolkit from R version 3.31
The log complete likelihood function can be divided into two
was used for the experiment. R is a free software environment
parts: the log-likelihood function of the incidence component LI ,
for statistical computing and graphics.
and the log likelihood function of the latency component LL :
X
n
4.1. Real world dataset of P2P lending
LI ¼ ðyi log½PDðzi Þ þ ð1 yi Þlog½1 PDðzi ÞÞ; ð8Þ
i¼1
We evaluated our proposed ensemble mixture random forest
X
n (EMRF) model by using a large dataset from a major peer-to-peer
LL ¼ ðyi di log½hðt i jy ¼ 1; xi Þ þ yi log½Sðt i jy ¼ 1; xi ÞÞ: ð9Þ platform in China. The data were collected between January 2013
i¼1 and December 2015. The dataset used in evaluation consists of
52,573 approved applications that were subsequently funded for
According to the log-likelihood functions of the incidence and
12-month loans. The definition of a default is when repayment is
latency components, the expectation of yi can be written as:
at least three months overdue; otherwise, loans are non-default.
Eðyi jt i ; di ; zi ; xi Þ ¼ di þ ð1 di Þ As a result, the dataset contains 6079 defaults and 46,494 non-
PDðzi ÞSðti jy ¼ 1; xi Þ default loans in our dataset; the default rate is approximately
: ð10Þ 11.56% (i.e., 6,079/52,573). The attributes used in the analysis,
1 PDðzi Þ þ PDðzi ÞSðti jy ¼ 1; xi Þ
including the loan information, borrower information and histori-
Because yi is an unobserved variable, we use the expectation– cal information (A4–A7), are listed in Table 2.
maximization (EM) method (Sy and Taylor, 2000) to estimate the
expectation of yi . The pseudocode for the expectation maximization 4.2. Evaluation of proposed method
(EM) algorithm is given in Fig. 4.
We evaluated our proposed EMRF model in comparison with
3.3. Averaging ensemble two dynamic models, the mixture cure model (MCM) (Tong et al.,
2012) and the Cox proportional hazards (Cox PH) model, both of
After obtaining the single MRF, we improve the stability and the which predict the PD over time, as well as one typically static
performance of the MRF by mode combination. An averaging model, logistic regression (LR), which predicts only the PD over
ensemble is one of the most intuitive and simplest types of mode the entire repayment period. The MCM is a state-of-art model for
combinations and has been shown to have surprisingly good per- survival analysis that utilizes logistic regression and Cox propor-
formance (Finlay, 2011). Let O denote the dataset processed by tional hazard regression to model the incidence and latency com-
WOE transformation. The process of creating multiple subsets ponents, respectively. The Cox PH model is a classical method for
ðO1 ; O2 ; . . . ; Os Þ is implemented by sampling with replacement from modeling the time of an event. In addition, LR, which estimates
O. Specifically, each subset contains the same number of observa- the probability of a binary response using a logistic function, is
tions as O after sampling. Then, each Oj of the multiple subsets one of the most popular credit risk evaluation models, and it is
ðO1 ; O2 ; . . . ; Os Þ is expected to have a fraction ð1 1e 63:2%Þ of widely used as a benchmark model in credit risk modeling.
Table 2
Description of attributes.
We compared the performances of the proposed EMRF MCM, ence between the cumulative true positive and cumulative false
Cox PH and LR models in terms of their ability to predict the PD positive rate; and (3) the H-measure, proposed by Hand (2009),
over time for a 12-month loan. The time interval is one month who argued that the H-measure is a coherent estimator that is
because borrowers’ repayments are due on a monthly basis, lead- insensitive to the empirical score distributions of the default and
ing to multiple discrete time intervals for a 12-month loan. Note non-default groups. The H-measure should be considered the mea-
that with by our definition of default (repayment overdue for at sure of choice to compare the performance of each method.
least three months) there are 10 discrete time intervals, denoted To better reflect the dynamic repayment behavior, we also
TI1 to TI10 (where TI1 corresponds to the third month). In the EMRF, examined the calibration performance among the four models.
the size of the ensemble was 100 and the size of the forest was 500 The calibration performance refers to the accuracy of the PD esti-
trees. Variable selection for EMRF was performed according to the mates themselves over time (Liu et al., 2015). We evaluated the
GINI-index in the random forest and the C-index in the random calibration performance by comparing the expected cumulative
survival forest. Variable selection for MCM, Cox PH and LR were number of defaulters against the observed number of defaulters
performed using backward stepwise elimination at a 5% signifi- from TI1 to TI10. Specifically, the expected cumulative number of
cance level. defaulters in each time interval was calculated by the sum of the
We examined both the discrimination performance and the cal- default probabilities of a borrower defaulting before the end of
ibration performance among the four models (i.e., EMRF, MCM, Cox the corresponding time interval:
PH and LR). The discrimination performance is a measure of the
model’s ability to distinguish bad borrowers from good borrowers. X
n
In our dynamic assessment scenario, we compared the discrimina- EðDTI Þ ¼ ð1 SðTIjiÞÞ: ð12Þ
i¼1
tion performance of the four models (i.e., EMRF, MCM, Cox PH and
LR) by examining the performance for risk-ranked borrowers at During the process of estimating the performance of each model
various time intervals in terms of the following three measures: (i.e., EMRF, MCM, Cox PH and LR), we performed repeated cross
(1) the area under the receiver operating characteristic curve (AUC), validation (REPCV), specifically, 10 times for the independent 10-
which is calculated as the area under the receiver operating fold cross validations, to achieve unbiased estimates. Molinaro
characteristic curve and ranges in value from 0 to 1; (2) the et al. (2005) argued that 10 random fold splits are sufficient to real-
Kolmogorov–Smirnov (KS) statistic, which is the maximum differ- ize most of the achievable variance reduction. During each 10-fold
80 Z. Wang et al. / Electronic Commerce Research and Applications 27 (2018) 74–82
cross validation, the total samples are randomly divided into 10 fests as a straight line. Although it provides an accurate estimation
equally sized folds, 9 of which are selected to train each model indicating that the borrower will default, it cannot forecast the
and the remaining fold is used for validation. Note that the training dynamic repayment process.
sets and testing sets were kept consistent across the four models.
Table 3
AUCs of repeated 10-fold cross validations for four models, different time intervals.
Table 4
Full-pairwise comparison of the four models.
M1 vs M2 M1 vs M3 M1 vs M4 M2 vs M3 M2 vs M4 M3 vs M4
TI1 0.000 0.955 0.000 0.000 0.000 0.000
TI2 0.227 0.000 0.594 0.000 0.045 0.000
TI3 0.000 0.000 0.000 0.000 0.000 0.648
TI4 0.000 0.000 0.000 0.001 0.002 0.563
TI5 0.000 0.000 0.000 0.439 0.000 0.000
TI6 0.000 0.000 0.000 0.087 0.000 0.000
TI7 0.000 0.000 0.000 0.932 0.000 0.000
TI8 0.000 0.000 0.000 0.622 0.428 0.789
TI9 0.005 0.000 0.000 0.000 0.000 0.000
TI10 0.012 0.000 0.000 0.000 0.000 0.535
to predict the PD over any time horizon. We introduced the Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J., 2003.
Benchmarking state-of-the-art classification algorithms for credit scoring. J.
random forest and random survival forest models to model
Operat. Res. Soc. 54 (6), 627–635.
the incidence component (PD(z)) and the latency component Banasik, J., Crook, J.N., Thomas, L.C., 1999. Not if but when will borrowers default. J.
(S(t|y = 1, x)), respectively, with high accuracy and fine time gran- Operat. Res. Soc. 50 (12), 1185–1190.
ularity (monthly). An evaluation analysis of data from a major Banasik, J., 2007. Reject inference, augmentation, and sample selection. Eur. J. Oper.
Res. 183 (3), 1582–1594.
P2P institution in China showed that the performance of the EMRF Bellotti, T., Crook, J., 2009. Credit scoring with macroeconomic variables using
was significant in terms of both discrimination and calibration, survival analysis. J. Operat. Res. Soc. 60 (12), 1699–1707.
which indicates it can be used as an alternative and superior Bensic, M., Sarlija, N., Zekic-Susac, M., 2005. Modelling small-business credit scoring
by using logistic regression, neural networks and decision trees. Intell. Syst.
method for behavioral scoring. Account. Fin. Manage. 13 (3), 133–150.
Our research makes several contributions to both research and Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
practice. First, the proposed ensemble mixture random forest Chen, G.G., Åstebro, T., 2012. Bound and collapse bayesian reject inference for credit
scoring. J. Operat. Res. Soc. 63 (10), 1374–1387.
model can generate the PD over time for predicting not only Chen, D., Han, C., 2012. A comparative study of online p2p lending in the usa and
whether borrowers will default on their loan repayments but also china. J. Int. Bank. Commer. 17 (2), 1–15.
when they are likely to default. It can be used by risk management Crook, J.N., Edelman, D.B., Thomas, L.C., 2007. Recent developments in consumer
credit risk assessment. Eur. J. Oper. Res. 183 (3), 1447–1465.
practitioners and researcher to improve credit risk evaluation. It Cox, D.R., 1972. Regression models and life-tables. J. Roy. Stat. Soc. 34 (2), 187–220.
also creates opportunities for new research and applications in De Leonardis, D., Rocci, R., 2014. Default risk analysis via a discrete-time cure rate
profit scoring domain since a prediction of dynamic PD can provide model. Appl. Stochastic Models Bus. Ind. 30 (5), 529–543.
Dirick, L., Claeskens, G., Baesens, B., 2015. Time to default in credit scoring using
the ability to compute the profitability over a borrower’s lifetime
survival analysis: a benchmark study. Eur. J. Oper. Res. Soc., 1–14
and perform profit scoring. Second, we modeled the incidence Finlay, S., 2011. Multiple classifier architectures and their application to credit risk
and latency components using tree-based methods as well as assessment. Eur. J. Oper. Res. 210 (2), 368–378.
ensemble learning method, which can deal with potential non- Guo, Y., Zhou, W., Luo, C., Liu, C., Xiong, H., 2016. Instance-based credit risk
assessment for investment decisions in P2P lending. Eur. J. Oper. Res. 249 (2),
linear relationship between independent variables and the target 417–426.
variable more naturally and help to improve the performance. Hand, D.J., Kelly, M.G., 2001. Lookahead scorecards for new fixed term credit
Our experiments show that the proposed model is indeed powerful products. Eur. J. Oper. Res. Soc. 52 (9), 989–996.
Hand, D.J., 2009. Measuring classifier performance: a coherent alternative to the
in predicting the PD over time. Third, financial institutions may area under the ROC curve. Mach. Learn. 77 (1), 103–123.
also be interested in the proposed model because they can Im, J.K., Apley, D.W., Qi, C., Shan, X., 2012. A time-dependent proportional hazards
strengthen their post-loan management capabilities by re- survival model for credit risk analysis. Eur. J. Oper. Res. Soc. 63 (3), 306–321.
Ishwaran, H., Kogalur, U.B., Blackstone, E.H., Lauer, M.S., 2011. Random survival
ranking the borrowers’ creditworthiness and predicting the forests. J. Thorac. Oncol. Official Publicat. Int. Assoc. Study Lung Cancer 6 (12),
dynamic number of defaulters according to the PD at any feasible 1974.
time—a capability that cannot be achieved by traditional scoring Kao, L.J., Chiu, C.C., Chiu, F.Y., 2012. A bayesian latent variable model with
classification and regression tree approach for behavior and credit scoring.
models. Timely intervention for potential defaulters can effectively Knowl.-Based Syst. 36 (6), 245–252.
reduce loan delinquencies as well as the number of accounts that Lessmann, S., Baesens, B., Seow, H.V., Thomas, L.C., 2015. Benchmarking state-of-
become bad debts. Those reductions could thus decrease losses the-art classification algorithms for credit scoring: an update of research. Eur. J.
Oper. Res. 247 (1), 124–136.
at financial institutions.
Liu, F., Hua, Z., Lim, A., 2015. Identifying future defaulters: a hierarchical Bayesian
There are a few limitations with this work. One limitation of method. Eur. J. Oper. Res. 241 (1), 202–211.
EMRF is the weak interpretability of the predicted results (i.e., Martens, D., Baesens, B., Gestel, T.V., Vanthienen, J., 2007. Comprehensible credit
the black-box problem). It is difficult to transform the discrimina- scoring models using rule extraction from support vector machines. Eur. J. Oper.
Res. 183 (3), 1466–1476.
tion process of ensemble methods such as bagging, boosting and Molinaro, A.M., Simon, R., Pfeiffer, R.M., 2005. Prediction error estimation: A
random forest into rules that have logical relationships. Future comparison of resampling methods. Bioinformatics 21 (15), 3301–3307.
research may consider improving the interpretability of ensembles Narain, B., 1992. Survival analysis and the credit granting decision. In: Credit
Scoring and Credit Control. Oxford University Press, Oxford, pp. 109–121.
in the framework of survival analysis. Another limitation lies in the Malekipirbazari, M., Aksakalli, V., 2015. Risk assessment in social lending via
limited available data: we use only the application and history random forests. Expert Syst. Appl. 42 (10), 4621–4631.
repayment information. Further research may add the variables Moeyersoms, J., Martens, D., 2015. Including high-cardinality attributes in
predictive models: a case study in churn prediction in the energy sector.
reflecting the management interfere from P2P platform to further Decis. Support Syst. 72 (8), 72–81.
improve the dynamic assessment. In addition, we implemented Sarlija, N., Bensic, M., Zekic-Susac, M., 2009. Comparison procedure of predicting the
experiments on a dataset from only one P2P institution in China; time to default in behavioural scoring. Expert Syst. Appl. 36 (5), 8778–8788.
Savvopoulos, A., 2010. Consumer credit models: Pricing, profit and portfolios. J. Roy.
future research may collect multiple datasets from different coun- Stat. Soc. 173 (2). 468-468.
tries and sites, such as Lending Club and Proposer, to comprehen- C. Serrano-Cinca B. Gutierrez-Nieto (2016). The use of profit scoring as an
sively validate our findings. alternative to credit scoring systems in peer-to-peer lending Dec. Support
Syst. 89 C 113 122
Stepanova, M., Thomas, L., 2002. Survival analysis methods for personal loan data.
Acknowledgments Oper. Res. 50 (2), 277–289.
Sy, J.P., Taylor, J.M.G., 2000. Estimation in a Cox proportional hazards cure model.
This work was funded by the National Natural Science Founda- Biometrics 56 (1), 227–236.
Tong, E.N.C., Mues, C., Thomas, L.C., 2012. Mixture cure models in credit scoring: If
tion of China (Grant Nos. 71731005, 71571059) and the Humani- and when borrowers default. Eur. J. Oper. Res. 218 (1), 132–139.
ties and Social Sciences Fund Research Planning of the Ministry Tsai, C.F., Wu, J.W., 2008. Using neural network ensembles for bankruptcy
of Education (Grant No. 15YJA630010). prediction and credit scoring. Expert Syst. Appl. 34 (4), 2639–2649.
Twala, B., 2010. Multiple classifier application to credit risk assessment. Expert Syst.
Appl. 37 (4), 3326–3336.
References Wang, G., Ma, J., Huang, L., Xu, K., 2012. Two credit scoring models based on dual
strategy ensemble trees. Know.-Based Syst. 26, 61–68.
Alves, B.C., Dias, J., 2015. Survival mixture models in behavioral scoring. Expert Syst. Yildirim, Y., 2008. Estimating default probabilities of CMBS loans with clustering
Appl. 42 (8), 3902–3910. and heavy censoring. J. Real Estate Fin. Econom. 37 (2), 93–111.
Bachmann, A., Becker, A., Buerckner, D., Hilker, M., Kock, F., Lehmann, M., Tibertus, Zhang, J., Thomas, L.C., 2012. Comparisons of linear regression and survival analysis
P., Funk, B., 2011. Online peer-to-peer lending: A literature review. J. Int. Bank. using single and mixture distributions approaches in modelling LGD. Int. J.
Commer. 16 (2). Forecast. 28 (1), 204–215.