Predictive Modeling
Date: 12.08.2016
Contents
1 Introduction
1.1 About Lending Club
1.2 About the Sample
1.3 Our Approach
2 Data Exploration
5 Conclusion
1 Introduction
• Recoded Loan_Status into a binary variable, keeping only fully paid (1)
and charged off (0) records
• Created additional division and region columns based on the given addr_state
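The two preparation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the raw status labels "Fully Paid" and "Charged Off", the helper name prepare, and the use of US Census Bureau regions and divisions are not given in the report, and the state lookup is deliberately partial.

```python
# Assumed raw status labels; the report only names the two retained classes.
STATUS_TO_BINARY = {"Fully Paid": 1, "Charged Off": 0}

# Deliberately partial lookup of (region, division) per state, following the
# US Census Bureau grouping. The report does not say which grouping was used.
STATE_TO_AREA = {
    "CA": ("West", "Pacific"),
    "NY": ("Northeast", "Middle Atlantic"),
    "TX": ("South", "West South Central"),
    "IL": ("Midwest", "East North Central"),
}


def prepare(records):
    """Keep fully paid and charged off loans and add region/division columns."""
    prepared = []
    for record in records:
        status = STATUS_TO_BINARY.get(record["Loan_Status"])
        if status is None:
            continue  # drop every other status, e.g. loans still running
        region, division = STATE_TO_AREA.get(record["addr_state"], (None, None))
        prepared.append(
            {**record, "Loan_Status": status, "region": region, "division": division}
        )
    return prepared


# Hypothetical sample records, not taken from the dataset.
sample = [
    {"Loan_Status": "Fully Paid", "addr_state": "CA"},
    {"Loan_Status": "Current", "addr_state": "NY"},
    {"Loan_Status": "Charged Off", "addr_state": "TX"},
]
cleaned = prepare(sample)
```

Records with any other status are dropped rather than mapped to a third value, which matches the statement that only the two classes remain.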
Given the vast variety of predictive modeling techniques and their
combinations, we decided to use a structured approach. This includes the
deployment of Clustering, Principal Component Analysis, Logistic Regression,
Classification and Regression Trees (CART) and Neural Networks. The
combination of all techniques is shown in figure 1.1. We started with an
initial exploration of the dataset. We did not treat missing values, because
only a small share of records (<10%) was affected. Afterwards we conducted
Project Documentation Team 10 4
Figure 2.3: Paid Back and Charged Off Rates for Specific Factors
3 Modification and Model Building
3.2 Clustering
• Cluster all variables having the training data split into 50% charged off
and 50% paid back
These five resulting datasets were the foundation for the subsequent modeling.
We used each of them to build four models.
The model comparison of all five datasets in figure 3.2 reveals that the given
models are not very meaningful. The best version, in which we used only
hierarchical clustering on all variables, led to an R-Square in the testing
set of 13.09% using neural networks and 13.08% fitting a logistic regression.
The misclassification rates for those models were 16.71% and 16.65%,
respectively. The baseline, in contrast, would have been to predict each
record as being fully paid back. This would lead to a misclassification rate
of 16.8%, which could not be
The initial outcomes of our models were not very persuasive. However, the
business context implied that losing an investment in a charged off loan is
more severe than rejecting an application from someone who might have been a
good customer. The confusion matrix of our best model, using a cutoff of 0.5,
is shown in table 4.1. We focused on the false negatives. Our model would have
predicted 2,725 customers as acceptable whose loans were in fact charged off.
With this in mind, we imposed penalties for wrong and benefits for correct
classifications. The weights are given in table 4.2.
These terms lead to the actual target function, where [j;k] is the number of
records in the given cell of the confusion matrix. Those values are multiplied
by the following parameters from the dataset: Average value of a loan that is
rejected and would otherwise be charged off (l_e). Average interest earned on
a loan that is paid back (i_e). Average value of a loan that is charged off
and where the lender loses his investment (l_l). Interest that is lost
whenever a trustworthy loan is rejected (i_l).
\[
\max \; \text{Avg. Revenue} = \frac{\sum_{i=1}^{n} \Bigl[ \bigl( [1;1]_i \cdot l_e + [0;0]_i \cdot i_e \bigr) - \bigl( [0;1]_i \cdot l_l + [1;0]_i \cdot i_l \bigr) \Bigr]}{n} \tag{4.1}
\]
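As a minimal sketch, the target function (4.1) can be evaluated from the four confusion matrix cells as shown here. The class and function names are hypothetical, each cell is read as a plain record count, and the mapping of [1;1], [0;0], [0;1] and [1;0] to the named outcomes is inferred from the parameter descriptions above rather than stated in the report. The numbers in the example are made up and are not the values from tables 4.1 to 4.3.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    rejected_charged_off: int  # [1;1]: rejected, would have been charged off
    accepted_paid: int         # [0;0]: accepted and paid back
    accepted_charged_off: int  # [0;1]: accepted but charged off
    rejected_paid: int         # [1;0]: rejected, would have been paid back


def average_revenue(counts, l_e, i_e, l_l, i_l):
    """Average revenue per loan: weighted benefits minus weighted penalties."""
    n = (
        counts.rejected_charged_off
        + counts.accepted_paid
        + counts.accepted_charged_off
        + counts.rejected_paid
    )
    if n == 0:
        raise ValueError("confusion matrix contains no records")
    benefit = counts.rejected_charged_off * l_e + counts.accepted_paid * i_e
    penalty = counts.accepted_charged_off * l_l + counts.rejected_paid * i_l
    return (benefit - penalty) / n


# Made-up illustration: 100 loans with round parameter values.
example = ConfusionCounts(
    rejected_charged_off=10, accepted_paid=70, accepted_charged_off=15, rejected_paid=5
)
result = average_revenue(example, l_e=100.0, i_e=20.0, l_l=100.0, i_l=20.0)
```

Naming the cells by outcome rather than by index keeps the sketch readable even if the index convention of the original tables differs.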
We wanted to maximize the average profit per loan. The current values with
a cutoff of 0.5 would lead to an average loss of approximately $1,193 per loan.
This is mostly caused by the high number of false negatives. Instead of using
loan-specific figures, we used the average loan amount and interest rate per
loan. Over all records, we would have 2,725 cases where the investor loses on
average approximately $14,900.
Varying the cutoff showed that the best solution would be a cutoff of 0.1. The
numbers are given in table 4.3. We could drastically reduce the number of
false negatives. However, this also meant rejecting the majority of
applications as not convincing enough. Even so, the trade-off led to an
average revenue of $1,414 per loan.
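The cutoff search can be sketched along the following lines. This is not the procedure used in the project: it assumes the model score is the predicted probability of a charge-off and that a loan is rejected when the score reaches the cutoff, and all function names, scores and parameter values are hypothetical.

```python
def confusion_counts(p_charged_off, charged_off, cutoff):
    """Count the four outcomes when loans with score >= cutoff are rejected."""
    counts = {"rejected_bad": 0, "accepted_good": 0, "accepted_bad": 0, "rejected_good": 0}
    for score, is_bad in zip(p_charged_off, charged_off):
        rejected = score >= cutoff
        if rejected:
            counts["rejected_bad" if is_bad else "rejected_good"] += 1
        else:
            counts["accepted_bad" if is_bad else "accepted_good"] += 1
    return counts


def avg_revenue(counts, l_e, i_e, l_l, i_l):
    """Average revenue per loan for one confusion matrix."""
    n = sum(counts.values())
    benefit = counts["rejected_bad"] * l_e + counts["accepted_good"] * i_e
    penalty = counts["accepted_bad"] * l_l + counts["rejected_good"] * i_l
    return (benefit - penalty) / n


def best_cutoff(p_charged_off, charged_off, cutoffs, params):
    """Return the candidate cutoff with the highest average revenue."""
    return max(
        cutoffs,
        key=lambda c: avg_revenue(confusion_counts(p_charged_off, charged_off, c), **params),
    )


# Made-up scores and outcomes for five loans (True = charged off).
scores = [0.05, 0.08, 0.2, 0.3, 0.6]
outcomes = [False, False, True, False, True]
params = {"l_e": 100.0, "i_e": 10.0, "l_l": 100.0, "i_l": 10.0}

revenue_at_half = avg_revenue(confusion_counts(scores, outcomes, 0.5), **params)
chosen = best_cutoff(scores, outcomes, [0.1, 0.5], params)
```

In this toy data the lower cutoff wins because catching one more charged off loan outweighs the interest lost on the one good loan that is rejected, which mirrors the trade-off described above.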
5 Conclusion
We have been able to show that the most important variables in the prediction
of loan payments are the borrower’s self-reported annual income and the number
of payments on the loan. Even though our model barely improved on the baseline
in terms of misclassification, reducing the cutoff was an effective way to
derive business value from it, despite the number of creditworthy customers
that had to be rejected.
This underlines that a business needs to find a balance when it comes to risk
taking. The revenue lost by rejecting good customers whom the model falsely
classifies as bad is small compared with the net loss that is avoided by
changing the cutoff. The marked improvement in profit indicates that the
trade-off was a reasonable one.
Our model provides an effective and time-saving way to assess the
creditworthiness of potential customers. Exceptional cases can still be
checked manually by a person. It is also a way to put the company's large
data asset to use, to target customers more strategically and stay ahead of
competitors.