
Project Documentation Team 10

Predictive Modeling

Predicting Loan Repayments

Members: Mohd Hasan, Michael Koelbl

Nkechi Nwudu, Twitty Skaria

Professor: Jose Cruz

Date: 12.08.2016
Contents

1 Introduction
1.1 About Lending Club
1.2 About the Sample
1.3 Our Approach

2 Data Exploration

3 Modification and Model Building
3.1 Principal Component Analysis
3.2 Clustering
3.3 Model Building

4 Model Assessment and Analysis

5 Conclusion
1 Introduction

1.1 About Lending Club

LendingClub [Len16a] is an online platform that connects borrowers and investors. The company serves people who might have been rejected for traditional bank loans and therefore have to look for other funding sources. Lenders can provide unsecured loans from $1,000 to $40,000, meaning that, unlike in traditional banking, the borrower does not have to provide any collateral. This makes the deal riskier; however, the average interest rates for investors are correspondingly higher. The standard loan period is either three or five years.
LendingClub's business model is to charge an origination fee to borrowers and service fees to investors. To minimize risk and make the platform more attractive, about 90% of all applications are denied. The company provides anonymized data on its transactions, which we use for this project.
The goal of this project is to predict whether a loan will be completely paid back or charged off.

1.2 About the Sample

We initially downloaded a dataset of 100,000 records from 2013 transactions from the platform's database [Len16b]. We manually selected a set of variables that seemed relevant; their descriptions are given in table 1.1.
Out of this dataset, we initially made the following changes:

• Recoded loan_status into a binary variable, keeping only fully paid (1) and charged off (0) records

• Took the logarithm of annual_inc in order to weaken the impact of high incomes

• Created additional division and region columns based on the given addr_state

• Excluded the raw addr_state column, because it did not seem to make a valuable contribution on its own

• Converted earliest_cr_line into a date variable

• Deleted observations with missing values
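
The report does not specify its tooling, but the preparation steps above can be sketched in pandas. The tiny inline sample, the column values, and the division mapping excerpt are illustrative assumptions; the real dataset has roughly 100,000 records:

```python
import numpy as np
import pandas as pd

# Tiny illustrative sample standing in for the 2013 LendingClub download.
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Fully Paid", "Current"],
    "annual_inc": [60000.0, 45000.0, 120000.0, 80000.0],
    "earliest_cr_line": ["Jan-1999", "Jun-2005", "Mar-1990", "Feb-2001"],
    "addr_state": ["CA", "NY", "TX", "CA"],
})

# Keep only finished loans and recode the target: Fully Paid -> 1, Charged Off -> 0.
loans = loans[loans["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
loans["loan_status"] = (loans["loan_status"] == "Fully Paid").astype(int)

# Log-transform income to weaken the impact of very high earners.
loans["log_annual_inc"] = np.log(loans["annual_inc"])

# Convert the earliest credit line into a proper date variable.
loans["earliest_cr_line"] = pd.to_datetime(loans["earliest_cr_line"], format="%b-%Y")

# Derive a census division from the state, then drop addr_state itself
# (excerpt of the mapping; the full table covers all states).
division = {"CA": "Pacific", "NY": "Middle Atlantic", "TX": "West South Central"}
loans["division"] = loans["addr_state"].map(division)
loans = loans.drop(columns=["addr_state"]).dropna()

print(loans[["loan_status", "log_annual_inc", "division"]])
```
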



Table 1.1: Description of Variables in the Dataset


Variable: Description
ID: A unique ID assigned for the loan listing
funded_amnt_inv: The total amount committed by investors for that loan at that point in time
term: The number of payments on the loan; values are in months and can be either 36 or 60
int_rate: Interest rate on the loan
installment: The monthly payment owed by the borrower if the loan originates
grade: Assigned loan grade
emp_length: Employment length in years; possible values are between 0 and 10, where 0 means less than one year and 10 means ten or more years
annual_inc: The self-reported annual income provided by the borrower during registration
purpose: A category provided by the borrower for the loan request
addr_state: The state provided by the borrower in the loan application
dti: A ratio calculated using the borrower's total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower's self-reported monthly income
earliest_cr_line: The month the borrower's earliest reported credit line was opened
inq_last_6mths: The number of inquiries in the past 6 months
revol_bal: The credit revolving balance
loan_status: Current status of the loan

1.3 Our Approach

Given the vast variety of predictive modeling techniques and their combinations, we decided on a structured approach. It includes Clustering, Principal Component Analysis, Logistic Regression, Classification and Regression Trees (CART) and Neural Networks. The combination of all techniques is shown in figure 1.1. We started with an initial exploration of the dataset. Missing values needed no treatment beyond deletion, because only a small fraction of records (<10%) was affected. Afterwards, we conducted a Principal Component Analysis (PCA) and then applied hierarchical clustering. The resulting datasets were used to make predictions with Logistic Regression, CART and Neural Networks. We compared the different outcomes, and the best-performing model was subjected to further analysis. We found that varying the classification threshold can significantly improve the business value in terms of average revenue per loan.

Figure 1.1: Structured Approach
2 Data Exploration

During this phase, we mostly used visualization and tabulation to gain a better understanding of the data. The goal was to identify trends and other interesting facts that could inform the subsequent analysis. Selected screenshots are given in figures 2.1 - 2.3. The funded amount of a loan is approximately $12,000 on average. The majority of money is borrowed for debt consolidation or paying off credit cards. Loans with a 60-month term are considerably more likely to be charged off than 36-month loans (33% vs. 13.5%). The assigned credit grade also appears to have a valuable impact: the worst grade, G, is more than eight times more likely to be charged off than the best grade, A. Interestingly, employment length does not play an important role; if anything, longtime employees even seem to be slightly riskier customers. Small business loans are very likely not to be paid back (27.8%), whereas weddings and credit card consolidations seem to be a safe bet.
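
Tabulations of this kind reduce to grouped charge-off rates. A minimal pandas sketch, using a made-up handful of records in place of the full sample (the percentages in the report come from the real data):

```python
import pandas as pd

# Tiny illustrative sample (loan_status: 1 = fully paid, 0 = charged off).
loans = pd.DataFrame({
    "loan_status": [1, 0, 1, 1, 0, 1, 0, 1],
    "term":        [36, 60, 36, 36, 60, 36, 60, 36],
    "grade":       ["A", "G", "A", "B", "G", "A", "C", "B"],
    "purpose":     ["credit_card", "small_business", "debt_consolidation",
                    "wedding", "small_business", "credit_card",
                    "debt_consolidation", "wedding"],
})

# Charged-off rate per group: since loan_status is 1 for fully paid,
# 1 - mean(loan_status) is the share of charged-off loans in each group.
rates = {col: 1 - loans.groupby(col)["loan_status"].mean()
         for col in ["term", "grade", "purpose"]}
print(rates["grade"].round(3))
```
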

Figure 2.1: Distribution of Loan Amounts

Figure 2.2: Purposes for Taking Loans



Figure 2.3: Paid Back and Charged Off Rates for Specific Factors
3 Modification and Model Building

3.1 Principal Component Analysis

To reduce correlation among the continuous variables, we conducted a PCA. Following the common rule of thumb, we extracted components with an eigenvalue greater than one, as shown in figure 3.1. This yielded four new components that together explain approximately 76.3% of the total variance. These components can still be combined with the categorical variables to improve accuracy.

Figure 3.1: Results of the Principal Component Analysis
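
The eigenvalue-greater-than-one rule (the Kaiser criterion) can be sketched with scikit-learn. The data here is synthetic, with deliberately correlated column pairs standing in for the continuous loan variables; the component count and explained variance will differ from the report's figures:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the continuous loan variables (the real ones include
# funded_amnt_inv, int_rate, installment, dti, revol_bal, log annual_inc);
# two pairs of columns are strongly correlated on purpose.
n = 1000
base = rng.normal(size=(n, 3))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.3 * rng.normal(size=n),
    base[:, 1], base[:, 1] + 0.3 * rng.normal(size=n),
    base[:, 2], rng.normal(size=n), rng.normal(size=n),
])

# Standardize, then fit a full PCA; explained_variance_ holds the eigenvalues.
Xs = StandardScaler().fit_transform(X)
pca = PCA().fit(Xs)
eigenvalues = pca.explained_variance_

# Kaiser rule of thumb: keep components whose eigenvalue exceeds one.
k = int(np.sum(eigenvalues > 1))
explained = pca.explained_variance_ratio_[:k].sum()
print(f"keep {k} of {X.shape[1]} components, explaining {explained:.1%}")

# Scores of the retained components, usable as new model inputs.
components = pca.transform(Xs)[:, :k]
```
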



3.2 Clustering

We decided against making k-means clustering a primary technique, since it would unnecessarily increase our workload without adding significant value; we retained it only for validation. Hierarchical clustering does not require specifying the number of clusters in advance, but provides a recommendation for which number fits best. We applied this technique to the existing datasets in the following ways:

• Cluster all original variables

• Cluster all categorical variables and new principal components

• Cluster principal components only

• Cluster all variables with the training data split into 50% charged off and 50% paid back

• Cluster all variables using k-means (for validation purposes)

The resulting five datasets were the foundation for the subsequent modeling. We used each of them to build four models.
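
The hierarchical step can be sketched with SciPy's agglomerative clustering. Synthetic two-group data stands in for one of the five prepared datasets; the heuristic of cutting at the largest jump in merge height is one common way to get the recommended cluster count:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)

# Synthetic stand-in for one of the prepared datasets: two well-separated
# groups of records in a 4-dimensional feature space.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),
    rng.normal(5.0, 1.0, size=(50, 4)),
])

# Ward linkage builds the full dendrogram; no cluster count is fixed up front.
Z = linkage(X, method="ward")

# A large jump between consecutive merge heights suggests a natural cut:
# cutting below merge g+1 leaves n - (g + 1) clusters.
gaps = np.diff(Z[:, 2])
k = len(X) - (int(np.argmax(gaps)) + 1)

# Assign cluster labels by cutting the dendrogram into k clusters.
labels = fcluster(Z, t=k, criterion="maxclust")
print(f"suggested number of clusters: {k}")
```
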

3.3 Model Building

The model comparison across all five datasets in figure 3.2 reveals that the resulting models are not very powerful. The best version, which used only hierarchical clustering on all variables, led to an R-square of 13.09% in the testing set using neural networks and 13.08% fitting a logistic regression. The misclassification rates for those models were 16.71% and 16.65%, respectively. The baseline, in contrast, would have been to predict every record as fully paid back. This leads to a misclassification rate of 16.8%, which our models could only barely beat.

Figure 3.2: Comparison of the Four Different Models
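
The near-baseline result is typical for imbalanced data with weak predictors. A minimal sketch with scikit-learn's logistic regression on synthetic data (the class balance mimics the report's roughly 83/17 split; the feature model is an assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Synthetic, imbalanced data with only a weak relationship between
# the features and repayment (1 = fully paid, 0 = charged off).
n = 5000
X = rng.normal(size=(n, 4))
p = 1.0 / (1.0 + np.exp(-(1.6 + 0.5 * X[:, 0] - 0.4 * X[:, 1])))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Misclassification at the default 0.5 cutoff, compared with the naive
# baseline that predicts every loan as fully paid back.
miss_model = float(np.mean(model.predict(X_te) != y_te))
miss_base = float(np.mean(y_te != 1))
print(f"model: {miss_model:.3f}  baseline: {miss_base:.3f}")
```

At the default cutoff, such a model predicts almost every loan as paid back, so its error rate lands very close to the baseline's, which is why the cutoff analysis in the next chapter matters.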


4 Model Assessment and Analysis

The initial outcomes of our models were not very persuasive. However, the business context implies that losing an investment in a charged-off loan is more severe than rejecting an applicant who might have been a good customer. The confusion matrix of our best model, with a cutoff of 0.5, is shown in table 4.1. We focused on the false negatives: our model would have accepted 2,725 customers whose loans were in fact charged off. With this in mind, we imposed penalties for wrong and benefits for correct classifications. The weights are given in table 4.2.

Table 4.1: Confusion Matrix With 0.5 Cutoff


Actual loan_status    Predicted 0    Predicted 1
0                     68             2,725
1                     53             13,780

Table 4.2: Imposed Penalties and Benefits


Actual loan_status    Predicted 0                  Predicted 1
0                     + Avg. loan charged off      - (Avg. loan charged off + Avg. interest earned)
1                     - Avg. interest earned       + Avg. interest earned

Table 4.3: Confusion Matrix With 0.1 Cutoff


Actual loan_status    Predicted 0    Predicted 1
0                     2,494          299
1                     9,250          4,583

These terms lead to the target function below, where [j;k] is the number of records in the corresponding cell of the confusion matrix (actual class j, predicted class k). Those counts are multiplied by the following parameters derived from the dataset: the average value of a loan that is rejected and would otherwise be charged off (le), the average interest earned on a loan that is paid back (ie), the average value of a loan that is charged off and where the lender loses the investment (ll), and the interest that is lost whenever a trustworthy loan is rejected (il).

Max! Avg.Revenue = ([0;0] * le + [1;1] * ie - [0;1] * ll - [1;0] * il) / n    (4.1)
We wanted to maximize the average profit per loan. The current values with a cutoff of 0.5 would lead to an average loss of approximately $1,193 per loan, mostly caused by the high number of false negatives. Rather than using loan-specific figures, we worked with the average loan amount and interest per loan. Over all records, we would have 2,725 cases in which the investor loses approximately $14,900 on average.
Varying the cutoff showed that the best choice is a cutoff of 0.1; the numbers are given in table 4.3. This drastically reduces the number of false negatives. However, it also rejects the majority of applications, including many that would have been paid back. Nevertheless, this trade-off leads to an average revenue of $1,414 per loan.
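
The cutoff sweep behind this analysis can be sketched directly from the target function (4.1). All parameter values, the synthetic labels, and the score model below are illustrative assumptions, not the report's exact figures; the score direction (higher = more likely repaid) is also an assumption, since the report thresholds in the opposite direction:

```python
import numpy as np

# Illustrative parameters (assumptions, not the report's exact figures).
le = 12000.0  # avg. loan value saved when a would-be charge-off is rejected
ie = 2000.0   # avg. interest earned on a loan that is paid back
ll = 14900.0  # avg. investment lost when a funded loan charges off
il = 2000.0   # avg. interest forgone when a trustworthy loan is rejected

rng = np.random.default_rng(3)
n = 10000
y = rng.binomial(1, 0.83, size=n)  # 1 = paid back, 0 = charged off
# Weak model scores with only mild separation between the classes.
score = np.clip(0.83 + 0.10 * (y - 0.83) + rng.normal(0.0, 0.08, n), 0.0, 1.0)

def avg_revenue(cutoff):
    """Average revenue per loan for a given funding cutoff (equation 4.1)."""
    pred = (score >= cutoff).astype(int)
    tn = np.sum((y == 0) & (pred == 0))  # rejected would-be charge-offs
    tp = np.sum((y == 1) & (pred == 1))  # funded loans that are repaid
    fn = np.sum((y == 0) & (pred == 1))  # funded loans that charge off
    fp = np.sum((y == 1) & (pred == 0))  # rejected good loans
    return float(tn * le + tp * ie - fn * ll - fp * il) / n

# Sweep the cutoff and pick the one with the highest average revenue.
cutoffs = np.round(np.arange(0.05, 1.0, 0.05), 2)
best = max(cutoffs, key=avg_revenue)
print(f"best cutoff: {best:.2f}, avg. revenue: {avg_revenue(best):,.0f}")
```

Even with a weak model, shifting the cutoff turns a per-loan loss at 0.5 into a profit, because avoiding one charged-off loan is worth far more than the interest forgone on several rejected good loans.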
5 Conclusion

We have shown that the most important variables in predicting loan repayment are the borrower's self-reported annual income and the number of payments on the loan. Even though our model only barely outperformed the baseline in terms of R-square and misclassification rate, lowering the cutoff proved to be an effective way to derive business value from it, despite the number of creditworthy customers that had to be rejected.
This emphasizes the need to find a balance in business when it comes to risk taking. The moderate revenue lost by rejecting good customers that the model falsely classifies as bad is small compared to the losses avoided by funding fewer charged-off loans. The dramatic improvement in average revenue shows that the trade-off was a reasonable one.
Our model provides an effective and time-saving way to assess the creditworthiness of potential customers, with manual checks reserved for exceptional cases. It is also a way to leverage the company's large data asset to target customers more strategically and stay ahead of competitors.
Bibliography

[Day16] The Daily Beast. Why Is Larry Summers Signing Up With Lending Club. Internet: http://www.thedailybeast.com/articles/2012/12/14/why-is-larry-summers-signing-up-with-lending-club.html, retrieved on 12/02/16.

[Len16a] LendingClub. Internet: https://www.lendingclub.com/, retrieved on 12/02/16.

[Len16b] LendingClub Statistics. Internet: https://www.lendingclub.com/info/download-data.action, retrieved on 12/02/16.
