Professional Documents
Culture Documents
T: 267.616.1444 T: 647.271.1932
F: 1.905.761.1006 F: 905.761.1006
9 Flagstaff Place 150 Borrows Street
Philadelphia, PA Thornhill, ON
19115 USA L4J 2W8 Canada
info@bisolutions.us www.bisolutions.us
WHITE PAPER:
Credit Risk Evaluation of Online Personal Loan
Applicants: A Data Mining Approach
September 2008
Introduction
This white paper is the result of joint work between Business Intelligence Solutions (BIS) and Strategic
Link Consulting (SLC).
Business Intelligence Solutions (www.bisolutions.us) is a well established statistical/data mining/GIS
company that conducts business for banking, finance, insurance and other industries. Our specialization
is complex unstructured business problems for data rich firms. Our multidisciplinary team includes
professionals in applied statistics, data mining, optimization and simulation, GIS, and software application
development. The team members are authors of more than 100 published papers on diverse applications
of data mining and other quantitative fields.
BIS has access to the best statistical, visualization, data mining and GIS software on the world
market. The essence of our approach is to understand and analyze our client’s business problem
and corresponding data through the prism of dissimilar statistical/data mining models. As a result,
we are always able to produce the best possible model and help our clients in the most effective and
scientifically sound way.
Strategic Link Consulting (www.strategiclinkconsulting.com) represents multiple online personal loan
clients within the sub-prime lending industry. Loan amounts vary based on the customer’s income as
a primary determinant of their ability to pay. Returning customers are eligible for larger loans with more
stringent income requirements. The interest rate is a non-negotiable flat rate based on the duration of the
loan. Returning customers are offered larger loans with lower fees. Payment schedules are derived from
customer pay frequency (weekly, bi-weekly, semi-monthly or monthly). Customers may pay in full on their
due date or refinance by paying either a portion of the principle or only the fee as allowed by applicable
laws.
Customers qualify for a loan after completing a waterfall of underwriting phases which consists of internal
fraud and duplication checks, identity verification and external credit checks (not Trans Union, Equifax
or Experian). These steps produce a score, similar to a FICO score, which determines if a customer is
approved or denied based on adverse data components derived from their external data sources. The
funding/origination of a loan is based on a verbal verification process that includes several manual steps
including contacting the customer directly.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [3]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [4]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Data Structure
This study is based on the SLC sample of 5,000 customers, including 2,500 Good customers and 2,500
Bad customers, one record per customer. According to the rule of thumb (6, p. 28), one should have at
least 2,000 bad and 2,000 good accounts within a defined time frame in order to get a chance to develop a
good scorecard. Therefore, in principle, the given sample of accounts is suitable for scorecard development.
Each record can be treated as a data point in a high-dimensional space (approximately 50 dimensions).
In other words, each customer is characterized by 50 attributes (variables) that are differently scaled. The
following variable types are present in the data:
• numeric (interval scaled) variables such as age, average salary, credit score (industry specific credit
bureau), etc;
• categorical (nominal), with a small number of categories such as periodicity (reflects payroll
frequency) with just 4 categories;
• categorical, with a large number of categories (e.g., employer name, customer’s bank routing
number, e-mail domain, etc)
• date variables (application date, employment date, due date, etc)
The data also include a geographic variable (customer ZIP), and several customer identification variables
such as customer ID, user ID, application number, etc. Unfortunately, the data does not include
psychographic profiling variables.
There are several specific variables that we would like to mention:
BV Completed is a variable that answers whether the customer had a bank verification completed by the
loan processor. A value of 1 means the bank verification was completed. A missing value or 0 means it
was not. Bank verification involves a 3 way call with the customer and their bank to confirm deposits,
account status, etc.
Score is an industry specific credit bureau score.
Email Domain is a variable that reflects an ending part of the email address after the @ symbol.
The variable Monthly means monthly income in dollars.
Required Loan Amount is the principal amount of a loan at the time of origination.
Credit Model is a predictor that can take the following values:
• New customer scorecards – there are three credit bureau scorecards that exists, each with more
stringent approval criteria. The baseline scorecard has only identity verification and an OFAC check
while the tightest scorecard has a variety of criteria including inquiry restrictions, information about
prior loan payment history, and fraud prevention rules. They are limited to standard loan amounts
with standard fees, subject to meeting income requirements.
• Returning customers have minimal underwriting and are eligible for progressively larger loan
amounts with a fee below the standard fee for new customers.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [5]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Isoriginated is either 1 for originated loans or 0 for unoriginated loans. Withdrawn applications and denied
applications will have values of 0. Loans that were funded and had a payment attempt will have a value
of 1.
Loan Status is the status of the loan. Loan statuses are grouped as follows:
• D designates the class of Good Customers (a loan is paid off successfully with no return).
• P, R, B, and C designate the class of Bad Customers.
Other variable names are self-explanatory.
The available variables can be classified according to their role in the model development process. In
statistical terms, variables can be dependent or independent, but in data mining, a dependent variable is
called a target, and an independent variable is called an input (or predictor).
It makes sense to consider a two-segment analysis of risk. In two-segment analysis, the target is a binary
variable Risk (Good, Bad), or Risk Indicator (1, 0), where 1 corresponds to a Good customer (Risk =
Good) and 0 corresponds to a Bad customer (Risk = Bad).
As we mentioned before, each target is associated with a unique optimal regression type model. The
outcome of each model can be treated as the corresponding probability of target = 1 which, in turn, can
be interpreted as a credit score on a 100 point scale. In other words, the model under consideration
serves to estimate probability/credit score of being a Good customer.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [6]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Missing 542
Original Values Grouped Values
1000 4
2WAL 2 ILON 4 BSDE 56
ACTM 2 IMPL 11 CRED 66
ADVA 2 LDCL 70 CRSC 56
BSDE 56 LDCM 4 CRUE 70
CLBP 2 45 Distinct LDFN 9 CRUF 55
COLM 1 Values LDPL 79 CRVF 36
CRED 66 LDPT 128 CSUP 508
CRSC 56 LDRV 99 DCSM 44
CRTD 2 LEAD 1 FRND 86
CRUE 70 LFLA 1748 LDCL 70
CRUF 55 MKGN 35 LDPL 79
CSPD 1 MNTZ 172 LDPT 128
CSUP 508 MTDR 80 LDRV 99
CTAR 4 NETM 2 LFLA 1748
CVCL 14 PART 583 MISS 542
CWST 3 PDHO 4 MKGN 35 20 Distinct
DCSM 44 SUNS 1 MNTZ 172 Values
ECOM 11 SWIS 344 OTHR 223
EDEN 8 TFSP 2 PART 583
EFOR 9 VEND 25 SWIS 344
FRND 86 XMLL 15
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [7]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Another problem with the data is the misspelling and/or double name of one and the same category
for some categorical variables. In particular, the variable Email Domain has a lot of errors in the correct
spelling of a domain. For instance, there are 5 different spelling versions of yahoo.com:
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [8]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 1b.Distribution of the variable Score for Bad and Good customers
Risk
Bad Good
700
600
500
Frequency
400
300
200
100
0
0 200 400 600 800 1000 0 200 400 600 800 1000
Score
Construction of additional variables can dramatically improve the accuracy of risk prediction. New time
duration variables
orig_duration = Origination Date – Application Date
emp_duration = Origination Date – Employment Date
due_duration = Loan Due Date – Origination Date
serve as examples of new variable creation. For the sake of illustrating the importance of data
preprocessing, we can mention here that the latter two of these three variables were important predictors
selected by the TreeNet algorithm.
In order to better understand the relationship between several interval scaled (continuous) variables, quite
often a special visualization tool (a matrix plot) is used. The matrix plot (Graph 2) was developed for the
following four variables: Requested Loan Amount, Finance Charge, Score, and Applicant Age for both
segments: Good and Bad customers.
There is no obvious difference in the relationship of any pair of variables between two segments of
customers (Good /Bad).
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [9]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
The correlation structure of interval scaled variables can be different for different segments. In order
to check this hypothesis, let us select all interval scaled variables and estimate the non-parametric
correlation coefficient (Spearman correlation) for each pair of the following 8 variables: Required Loan
Amount, Financial Charge, Average Salary, Score, Applicant Age, and three duration variables that
are defined below. The Spearman correlation is used in the situation when a pair of variables under
consideration is not subject to bivariate normal distribution.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 10 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
There are no obvious differences in correlation structure among 8 numeric variables. In other words, the
correlation structure is similar among predictors for Good and Bad segments.
The complexity of SLC data can be characterized by:
• High dimensionality (about 50 predictors)
• Uncharacterizable non-linearities
• Presence of differently scaled predictors (numeric and categorical)
• Missing values for some predictors
• Large percentage of categorical predictors with extremely large numbers of categories and
extremely non-uniform frequency distributions
• Non-normality of numeric predictors.
Therefore, complex sophisticated methods should be employed to separate good and bad accounts in
the SLC data.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 11 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Methodology
Data and problem specificity limit the number of algorithms that can be used for SCL data analysis.
Any traditional parametric regression modeling approach (such as statistical logistic regression) and any
traditional nonparametric regression (such as Lowess, Generalized Additive Models, etc.) are inadequate
for such problems. The main reason for this is the presence of a large number of categorical variables
with huge numbers of categories. The inclusion of such categorical information in a multidimensional
dataset imposes a serious challenge to the way researchers analyze data (10).
Any approach based on linear, integer or non-linear programming (see, for example, 7, Chapter 5), is also
not the best approach for the same reasons.
Within the data mining universe, only some algorithms can be applicable to SLC data. For example, data
mining cluster analysis algorithms available in some of the best data mining software (SAS Enterprise
Miner and SPSS Clementine) are based on Euclidean distance and cannot be used for the same reasons
as above.
On the other hand, preliminary analyses and modeling that we had conducted have shown that the
accuracy of the nonlinear, nonparametric regression type models generated by the TreeNet and Random
Forest algorithms are acceptable.
We should note that the use of each of the applicable methods implies that the original data are randomly
separated into two parts: the first is for training (model development) and the second is for validation of
the model. Validation is the process of testing the developed model on unseen data.
SLC describes possible findings in the analysis by the following example: “Customers that are 29 years
old, live in Pennsylvania, and make less than $2,000 per month have an 88% chance of default.” This is
a typical representation of a CART / CHAID type regression tree algorithm findings. The set of CART /
CHAID type rules can easily be applied to unseen data, and can be embedded into the SLC loan credit
risk evaluation online system.
Unfortunately, it is quite possible that the best credit scoring model will not be a CART / CHAID type of
regression tree model. Nonparametric and nonlinear TreeNet or Random Forest type regression models
as a rule outperform CART / CHAID type of models on data similar to SLC data. If this is the case, a
simple representation of the best model as a set of simple rules as mentioned above is impossible. We
can gain the accuracy of risk prediction, but lose the simplicity of model/finding representation.
If this tradeoff between prediction accuracy and model representation simplicity is to be resolved in favor
of accuracy, then the project should include the development of a .Net component implementing the best
TreeNet or Random Forest scoring model that could be run independently on Salford Systems software
and integrated into the SLC loan credit risk evaluation online system.
In addition, standard data mining tools such as SPSS Clementine, SAS Enterprise Miner, and Salford
Systems have between 20 and 100 model parameter options that need to be specified by the researcher.
The settings that would produce the best model could only be found through extensive systematic
experimentation by a data mining expert and lead to the optimal model. For the SLC business problem,
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 12 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
this would mean a significantly reduced number of misclassified customers (i.e., customers with a
wrongly estimated credit risk level). However, the search for the optimal model is a combination of art and
science, and again, requires experience and expertise in the data mining field.
Here the first term AO is a model starting point; as a rule, it is the median of a target. The idea of the
algorithm is the following. The residuals are calculated as the difference between AO and the reality.
Then the residuals are transformed in order to reduce the impact of outliers (Huber’s adjustment for
outliers). The transformed residuals are called pseudo-residuals. The first tree T1 (X) is fitted to the
pseudo-residuals, and the coefficient B1 is determined. After that the new pseudo-residuals, the
difference between predicted values of a target (employing the model AO + B1 x T1 (X)) and reality, are
calculated, and the second tree T1 (X) is fitted to the new pseudo-residuals. This process is repeated,
and the final predicted value of the target is formed by adding the weighted contribution of each tree
with the corresponding weights B1, B2, ... BN. The TreeNet algorithm typically generates hundreds or even
thousands of small trees. This sequential error-correcting process converges to an accurate model that is
highly resistant to outliers and misclassified data points.
The TreeNet algorithm
• is relatively impervious to errors in the dependent variable (target), such as mislabeling
• is strongly resistant to overfitting (predicting noise instead of predicting signal)
• generalizes well to unseen data.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 13 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
TreeNet is the most flexible and powerful data mining tool, capable of generating extremely accurate
models for both regression and classification and can work with varying sizes of data sets (from small to
huge) while readily managing a large number of columns (http://www.salford-systems.com/treenet.php).
The algorithm can handle both continuous and categorical targets and predictors, and readily handle any
number of irrelevant predictors.
From now on major data mining software developers such as Megaputer Intelligence
http://www.megaputer.com/ and SAS (Enterprise Miner Version 5.3) http://www.sas.com/ include TreeNet
type algorithms in the suite of available tools.
TreeNet models are usually complex, consisting of hundreds (or even thousands) of trees, and require
special efforts to understand and interpret the results. The software generates a number of special
reports with visualization to extract the meaning of the model, such as a ranking of predictors according
to their importance on a 100 point scale, and graphs of the relationship between inputs and target.
In order to understand graphs of reports for binary targets that TreeNet generates, we need to remind
ourselves of the concepts of Odds and Log Odds.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 14 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
1. If the probability of an event is 0.5, then odds are equal to 1, and log odds are equal to 0.
2. If the probability of an event is 0, then odds are equal to 0 too, and log odds are equal to minus
infinity.
3. If the probability of an event is 1, then odds are plus infinity, and log odds are plus infinity as well.
Graph 4. Relationship between Odds and Probability of an event
50
45
40
35
30
Odds
25
20
15
10
0
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Probability
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 15 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
1
Log Odds
-1
-2
-3
-4
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Probability
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 16 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
TreeNet Analysis
For the purpose of our analysis we randomly selected the data sample into two subsamples. 60% of the
sample builds the first subsample, the LEARN data. It will only be used for model estimation. The second
subsample, the TEST data, will be used to estimate model quality.
The TreeNet algorithm has about 20 different options that can be controlled by a researcher. As a rule,
usage of default options does not produce the best model. Determination of the best options/optimal
model is time consuming and requires experience and expertise.
Models that are quite different (see First and Second models below) can have similar accuracy, and the
interpretability criterion should be used to select the best model.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 17 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 18 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 6. Impact of Market Source Grouped predictor on the probability of being a good customer:
Risk = Good, controlling for all other predictors.
0.05
0.04
0.03
0.02
0.01
0.00
-0.01
-0.02
-0.03
MKGN
DCSM
MNTZ
CRSC
CRED
CRUE
OTHR
CRUF
FRND
LDOL
CRVF
CSLP
LDRV
LDPT
PART
LDPL
MISS
LFLA
SMS
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 19 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
The impact of Market Source Grouped is significant and varies across different values. All values
of the Market Source Grouped variable with bars above the X-axis increase the probability of being a
good customer, and all bars that are below the X-axis decrease the probability of being a good
customer. We can say that the value of LDPT has the highest positive impact on the probability of
being a good customer, and the value of CRSC has the highest negative impact on the probability
of being a good customer.
Table 3. Frequency of Loan Status by Market Source Grouped
The left column of Table 3 depicts the values of the Loan Status predictor, and the upper row of the table
depicts the values of the Market Source Grouped predictor. Table 3 pictured the frequency of customers
that have the following values of the Market Source Grouped variable: CRUE, CRUF, LDPT, and MISS.
These values are matched to the tallest positive bars on Graph 6 (the values with the highest positive
impact on the probability of being a good customer). Since the value of D of the Loan Status predictor
designates a Good customer, and the values C and P correspond to a Bad customer (see Data Structure
section), we can infer that there is a good agreement between the model (Graph 6) and the data (Table 3).
There is a significant difference between the information presented in Table 3 and in Graph 6. If we forget
for a minute about the existence of all other predictors, and consider just two of them (Loan Status and
Market Source Groupe) using available data, then we can arrive at the conclusion that the majority of
customers with Market Source Grouped values of CRUE, CRUF, LDPT, and MISS are Good customers.
Again, we considered the join frequency distribution of only these two predictors, and disregarded the
impact of all other predictors. In other words, there is no control for other predictors at all, and it is data
induced information.
The information, represented by Graph 6, on the contrary, was produced by the developed TreeNet
model, and it is model induced information. The relationship between target (probability of being a good
customer) and Market Source Grouped was mapped, controlling for all other predictors.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 20 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 7. TreeNet Modeling: Impact of BV Completed and Market Source Grouped predictors on
Probability of Being a Good Customer (controlling for all other predictors).
Variable Dependence for Risk$; Slice Market Source GRPD$ = BSDE (1)
Risk$ = Good
0.06
0.05
Partial Dependence
0.04
0.03
0.02
0.01
Two Variable Dependence for Risk$; Slice Market Source GRPD$ = CRSC (3)
Risk$ = Good
BV Completed
0 1
0.02
Partial Dependence
0.01
-0.02
-0.03
-0.04
Graph 7 represents an example of the non-linear interaction between BV Completed and Market Source
predictors: for different values of one predictor, the impact on the probability of being a good customer
has different directions. Actually, for the value of Market Source Grouped = BSDE both values of BV
Completed predictor have a positive impact on the probability of being a good customer. On the other
hand, for the value of Market Source Grouped =CRSC, the value 0 of the BV Completed predictor
accords with negative impact, but the value 1 accords with a positive impact on the probability of being a
good customer.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 21 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 22 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 8. TreeNet Modeling: Impact of Email Domain predictor on Probability of Being a Good Customer,
controlling for all other predictors (Second Model)
One Predictor Dependence for
Risk$ = Good
Email Domains
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 23 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
The impact of Email Domain is extremely significant, but has different directions for different values. We
can mention several distinctive segments of Email Domain values with different impacts on the probability
of being a good customer:
1. Extremely positive impact
2. Modest positive impact
3. Practically no impact
4. Modest negative impact
5. Extremely negative impact
Graph 9. TreeNet Modeling: Impact of Credit Model predictor on Probability of Being a Good Customer,
controlling for all other predictors (Second Model).
Variable Dependence for Risk$; Slice Market Source GRPD$ = BSDE (1)
Risk$ = Good
0.06
0.05
0.04
0.03
Partial Dependence
0.02
0.01
0.00
-0.01
-0.02
-0.03
-0.04
-0.05
0001 0002 0003 7777 8888
Credit Models
The only value 0001 of Credit Model is associated with a strong negative impact on the Probability of
being a Good customer. The value of 0003 is associated with the strongest positive impact on Probability
of being a Good customer.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 24 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
The left column of Table 6 depicts values of the Loan Status predictor (D designates a class of Good
customers, and C and P designate a class of Bad customers), and the upper row of the table depicts the
values of the Credit Model predictor (see section Data Structure for meaning of Credit Model values). The
data supports the directions and strength (size) of impact on the probability of being a Good customer,
induced by the model (Graph 9).
Graph 10. TreeNet Modeling: Impact of Merchant Number predictor on Probability of Being a Good
Customer, controlling for all other predictors (Second Model)
Merchant Numbers
0.00
Partial Dependence
-0.01
-0.02
-0.03
-0.04
-0.05
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 25 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
The left column of Table 7 depicts values of the Loan Status predictor (symbol D corresponds to a Good
customer, and symbols C and P correspond to a Bad customer), and the upper row of the table depicts
the values of the Merchant Number predictor. Table 7 pictured the frequency of customers that have the
following values of Merchant Number variable: 57201 and 57206. We can observe that the majority of
customers with these values of Merchant Number belong to the segment of Bad customers. Again, the
model induced knowledge (Graph 10) and the data are in a good agreement.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 26 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 11. TreeNet Modeling: Impact of Score predictor on Probability of Being a Good Customer,
controlling for all other predictors (Second Model)
0.010
0.008
Partial Dependence
0.006
0.004
0.002
0.000
200 300 400 500 600 700 800 900 1000
-0.002
-0.004
-0.006 Score
Credit bureau score (Score predictor) has a binary impact on the probability of being a good customer:
if the score is 600 and higher, then the probability jumps up, and if an applicant score is less than 600,
then the probability jumps down, but this negative jump is not very large. In other words, for scores of
less than 600 the impact on the probability of being a good customer is very modest.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 27 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 12. TreeNet Modeling: Impact of Employment Duration and Score on Probability of Being a Good
Customer, controlling for all other predictors (Second Model)
0.010
0.008
Partial Dependence
0.006
0.004
0.002
0.000
1000 2000 3000 4000 5000 6000 7000 8000 9000
-0.002
-0.004
If Employment Duration is less than 1,500 days, we can say that there is no strong impact on the
probability of being a Good customer (we can treat the part of Graph 12 curve for Employment Duration
is less than 1,500 days as a noise). The real impact of Employment Duration starts from 1,500 days, and
is linear up to 5,000 days. Then the probability of being a Good customer has a diminishing returns effect
when Employment Duration becomes greater than 5000 days.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 28 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 13. TreeNet Modeling: Impact of Age on Probability of Being a Good Customer,
controlling for all other predictors (SecondModel).
0.002
0.000
Partial Dependence
20 30 40 50 60 70
-0.002
-0.004 Age
It turned out that the best probability of being a Good customer occurs with applicants between ages 38
and 42 years. If the age is less than 32, then the impact is negative, and the younger an applicant the
lower the probability of being a good customer. On the other hand, the strength of positive impact on the
probability of being a Good customer goes down when age is increasing.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 29 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Graph 14. TreeNet Modeling: Impact of the interaction of BV Completed and Credit Model predictors on
Probability of Being a Good Customer, controlling for all other predictors (Second Model)
The vertical axis maps log odds of the event Risk = Good. Two other axes are matched by values of the
BV Completed and Credit Model predictors. The combination of BV Completed =1 and Credit Model =
7777 has the strongest positive impact on the probability of being a Good customer. On the other hand,
the combination of BV Completed = 0 and Credit Model = 0001 has the strongest negative impact on
the probability at hand.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 30 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
GIS Application
In order to improve the quality of the data mining predictive models, it is useful to enrich SLC’s data with
additional region-level demographic, socioeconomic and housing variables that can be obtained from the
US Bureau of the Census. These variables include
• median household income
• education
• median gross rent
• median house value, etc.
The variables can be obtained at different geographic levels, namely at the ZIP Code and the Census
block levels.
Because the SLC data include ZIP code as one of the variables, it will be possible to merge the Zip level
Census data to the SLC data directly.
However, if customer address data are available, it will be advisable to obtain the Census data for smaller
geographic regions (namely, Census blocks). Because Census blocks are in general much smaller than
Zip codes, the Census estimates for these areas will be much more precise and much more applicable
than for their ZIP code counterparts.
Using the Geographic Information Systems software, customer addresses can be geocoded (i.e. the
latitude and longitude of the addresses can be determined, and the addresses can be mapped). Then,
it will be possible to spatially match the addresses to their respective Census blocks (and Census block
data). Demographic, socioeconomic, and housing data can then be obtained at the Census Block level.
Although geocoding is a time intensive procedure, enriching the SLC data with the Census block level
data will make the accuracy of the credit score even higher.
Employing dissimilar data mining tools, it is easy to determine which Census variables are crucial for
customer risk assessment. When corresponding data become available, maps produced by the GIS will
enable us to visually identify zip codes with many bad (high) risk customers and zip codes with many
good (low) risk customers (Graph 15 and Graph 16).
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 31 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Legend:
1 - 37
38 - 57
58 - 83
84 - 100
No Data Available
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 32 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
Legend:
0-1
2-3
4-6
7 - 11
12 - 18
No Data Available
Conclusion
Knowledge generated by the TreeNet models is in good compliance with the data and readily
interpretable. The accuracy of TreeNet models is superior. TreeNet is an appropriate tool for non-
traditional scorecard development, using SLC data. TreeNet model induced knowledge is a great asset
for traditional scorecard development.
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 33 ]
Business Intelligence Solutions – White Paper
Credit Risk Evaluation of Online Personal Loan Applicants: A Data Mining Approach
www.bisolutions.us | Data. Knowledge. Action.
References
1. J. Friedman (1999), Greedy Function Approximation: A Gradient Boosting Machine
http://www.salford-systems.com/doc/GreedyFuncApproxSS.pdf
2. J. Friedman (1999), Stochastic Gradient Boosting
http://www.salford-systems.com/doc/StochasticBoostingSS.pdf
3. TreeNet Frequently Asked Questions
http://www.salford-systems.com/doc/TreeNetFAQ.pdf
4. Dan Steinberg (2006), Overview of TreeNet Technology. Stochastic Gradient Boosting
http://perseo.dcaa.unam.mx/sistemas/doctos/TN_overview.pdf
5. Boosting Trees for Regression and Classification, StatSoft Electronic Text Book
http://www.statsoft.com/textbook/stbootres.html
6. Naeem Siddiqi (2005), Credit Risk Scorecards: Developing and Implementing Intelligent Credit
Scoring (Wiley and SAS Business Series), Wiley, 208 p.
7. Thomas, L.C., Edelman, D.B., Crook, J.N, (2002), Credit Scoring and its Applications, SIAM, 250 p.
8. Matignon, R. (2007). Data Mining Using SAS Enterprise Miner. Wiley Publishing.
9. Myatt, G. J. (2006). Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data
Mining. Wiley Publishing.
10. Seo, J. and Gordish-Dressman, H. (2007). Exploratory Data Analysis With Categorical
Variables: An Improved Rank-by-Feature Framework and a Case Study. International Journal
of Human-Computer Interaction. Available online at http://www.informaworld.com/smpp/
title~content=t775653655
Copyright © Pavel Brusilovskiy – Business Intelligence Solutions and David Johnson – Strategic Link Consulting [ 34 ]