
CMPEDA Project: Lending Club Loan Data Prediction
A Financial Predictive Model based on DNN and XGBoost

Manuel Luci, Andrea Dell’Abate

1 Introduction
LendingClub is a US peer-to-peer lending company which specialises in offering various
types of loans to urban customers.
When the company receives a loan application, it has to make a decision based on the
applicant’s profile.
Two types of risk are associated with this decision:
— if the applicant is likely to repay the loan, then not approving it results in a loss
of business for the company;
— if the applicant is not likely to repay the loan, i.e. he/she is likely to default, then
approving it may lead to a financial loss for the company.
Our aim is to develop the best supervised machine learning model capable of identifying
which borrowers will pay off their loans.
We tackle this binary classification task with a Deep Neural Network and an XGBoost
model.
Dataset link: https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/data?select=lending_club_loan_two.csv

2 Preprocessing (module: data_analysis)


We divided the preprocessing into two phases:
— In the first phase we carried out the statistical analysis of the dataset, studying how
the features were distributed and how they were correlated with each other.
— In the second phase we modified the dataset for our purposes: missing values were
treated (eliminated or replaced as appropriate), we removed the variables that were
highly correlated with each other, and we kept those most useful for our machine
learning task.
Finally, we replaced all categorical variables with numeric ones and performed
one-hot encoding (dummy variables); a minimal sketch of this step is shown below.
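
As an illustration of this last step, the sketch below uses pandas; the file name, the median imputation of mort_acc and the loan_repaid label are illustrative assumptions rather than an exact transcript of our pipeline:

import pandas as pd

# Load the Lending Club dataset (file name assumed)
df = pd.read_csv("lending_club_loan_two.csv")

# Encode the target: 1 = Fully Paid, 0 = Charged Off
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})
df = df.drop(columns=["loan_status"])

# Treat missing values: fill mort_acc with the column median (illustrative
# choice) and drop the remaining rows containing missing entries
df["mort_acc"] = df["mort_acc"].fillna(df["mort_acc"].median())
df = df.dropna()

# One-hot encode the remaining categorical variables (dummy variables)
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)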

Below we report some useful graphs for our analysis:


Figure 1 – loan_status distribution; we can see that our dataset is strongly unbalanced

[Heatmap of the pairwise correlations between the numerical features loan_amnt, int_rate, installment, annual_inc, dti, open_acc, pub_rec, revol_bal, revol_util, total_acc, mort_acc and pub_rec_bankruptcies; the strongest correlations are between loan_amnt and installment (0.95), open_acc and total_acc (0.68), and pub_rec and pub_rec_bankruptcies (0.70).]

Figure 2 – Correlation matrix

3 Modelling [module: machine_learning]


We used two different models: a gradient-boosting algorithm, XGBoost, and a neural
network model.
Regarding the optimization of these models, we first observed that the class to be
predicted was highly unbalanced, with a ratio of about 4:1.
We therefore decided to rebalance the training data with a technique called SMOTE
(Synthetic Minority Oversampling Technique).
This technique creates new synthetic instances of the minority class by interpolating
along the lines connecting each minority sample to its k nearest minority neighbours.
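
A minimal sketch of this resampling step, assuming the imbalanced-learn library and that X and y are the feature matrix and label vector produced by the preprocessing module:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training data, so the test set keeps
# the original ~4:1 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE synthesizes minority-class samples by interpolating between each
# minority sample and its k nearest minority neighbours (k = 5 by default)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)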


Figure 3 – Bar plot of the verification_status distribution


Figure 4 – Bar plot of the emp_length distribution

3.1 XGBoost


As a first step, we built a preliminary model. Instead of determining the best number
of trees with cross-validation, we implemented an ’Early Stopping’ callback to stop
training when the model no longer improves.
To evaluate the XGBoost model we used the AUC value. Once this preliminary model
was trained, we verified how well it works by plotting a confusion matrix: it predicts
the majority class very well but the minority class considerably worse.
We then optimized the hyperparameters with a cross-validated Grid Search.
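
A minimal sketch of these two steps, assuming the xgboost and scikit-learn libraries and the resampled training data from the previous step; the grid values are illustrative assumptions:

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# Hold out a validation set used only for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_res, y_train_res, test_size=0.2, random_state=42
)

# Preliminary model: many trees, training stopped when the validation AUC
# stops improving for 20 consecutive rounds
xgb = XGBClassifier(n_estimators=1000, eval_metric="auc", early_stopping_rounds=20)
xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Test AUC:", roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]))

# Hyperparameter optimization with a cross-validated Grid Search
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}
grid = GridSearchCV(
    XGBClassifier(n_estimators=300, eval_metric="auc"),
    param_grid, scoring="roc_auc", cv=3,
)
grid.fit(X_train_res, y_train_res)
print("Best parameters:", grid.best_params_)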

3.2 Deep Neural Network


For this model we proceeded in several steps.
We first built an “inverted pyramid” model; here too, to avoid overfitting, we used a
validation set and implemented an ’Early Stopping’ callback.
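
A minimal sketch of this architecture with Keras/TensorFlow; the layer widths are illustrative assumptions, chosen only to show the shrinking “inverted pyramid” shape:

from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train_res.shape[1]

# "Inverted pyramid": the layer width shrinks from the input towards the output
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(78, activation="relu"),
    layers.Dense(39, activation="relu"),
    layers.Dense(19, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "Fully Paid"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early Stopping on a validation split to avoid overfitting
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(
    X_train_res, y_train_res,
    validation_split=0.2, epochs=100, batch_size=256,
    callbacks=[early_stop],
)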

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 10403, 11224; Fully Paid: 1732, 87828]

Figure 5 – Confusion matrix, XGBoost without hyperparameter tuning

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 10644, 10983; Fully Paid: 1865, 87695]

Figure 6 – Confusion matrix, XGBoost with tuned hyperparameters

We then added Dropout regularization to the hidden layers and compared the two models.
The regularized model gave the best results, so we kept Dropout in all subsequent
experiments (a sketch of the modified architecture is shown below).
After this we trained another model, called “diamond”, and compared it to the previous
ones.
All the models we trained and tested gave very similar results, but in the end the best
one was the “inverted pyramid” model with Dropout regularization.
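
The sketch below shows how Dropout layers can be interleaved with the dense layers of the inverted-pyramid model; the dropout rate of 0.2 is an illustrative assumption:

from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train_res.shape[1]  # as in the previous sketch

# Inverted pyramid with a Dropout layer after each hidden layer
model_dropout = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(78, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(39, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(19, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model_dropout.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])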

Also in this case we optimized the hyperparameters via cross-validation, using a
Random Search.
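
A minimal sketch of this search, assuming the scikeras wrapper to expose the Keras model to scikit-learn’s RandomizedSearchCV; the builder function build_model and the search space are illustrative assumptions:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from scikeras.wrappers import KerasClassifier
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features=78, hidden_units=78, dropout_rate=0.2):
    # Dropout inverted pyramid, parametrized for the search
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(hidden_units // 2, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

clf = KerasClassifier(
    model=build_model,
    model__n_features=X_train_res.shape[1],
    epochs=30, batch_size=256, verbose=0,
)

param_distributions = {
    "model__hidden_units": randint(32, 128),
    "model__dropout_rate": uniform(0.1, 0.4),   # rates between 0.1 and 0.5
    "batch_size": [128, 256, 512],
}
search = RandomizedSearchCV(clf, param_distributions, n_iter=10, cv=3, scoring="roc_auc")
search.fit(X_train_res, y_train_res)
print("Best parameters:", search.best_params_)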

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6927, 7598; Fully Paid: 790, 58810]

Figure 7 – Confusion matrix DNN without dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6285, 8240; Fully Paid: 102, 59498]

Figure 8 – Confusion matrix DNN with dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6581, 7944; Fully Paid: 327, 59273]

Figure 9 – Confusion matrix diamond DNN with dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6863, 7662; Fully Paid: 643, 58957]

Figure 10 – Confusion matrix DNN with hyperparameters

4 Conclusion
In the following table we summarize the results of our models:

model                     Accuracy  Precision  Recall  F1 Score
Basic XGBoost             0.8835    0.8867     0.9807  0.9313
Hyperparameters XGBoost   0.8844    0.8887     0.9790  0.9317
Basic DNN                 0.8543    0.9086     0.9104  0.9095
Dropout DNN               0.8874    0.8783     0.9982  0.9344
Diamond DNN               0.8544    0.9086     0.9104  0.9095
Hyperparameter DNN        0.8880    0.8850     0.9892  0.9342

As we can see, we obtained good values for every metric; however, these aggregate
scores are driven mainly by the majority class.
From the confusion matrices we see that every model predicts the majority class very
well (∼ 90 %), but the minority class considerably worse (∼ 54 %). This suboptimal
classification is due to the original imbalance of our dataset.
Overall, the DNN predicts slightly better than XGBoost.
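
For reference, the metrics reported in the table can be computed from the test-set predictions as in the sketch below, assuming scikit-learn and that "Fully Paid" is encoded as the positive class 1 (model, X_test and y_test as in the previous sketches):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Threshold the predicted probabilities of the DNN at 0.5
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))   # w.r.t. the positive ("Fully Paid") class
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))                 # rows: true label, columns: prediction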
