
CMPEDA Project: Lending Club Loan Data Prediction
A Financial Predictive Model based on DNN and XGBoost

Manuel Luci, Andrea Dell’Abate

1 Introduction
LendingClub is a US peer-to-peer lending company which specialises in offering various
types of loans to urban customers.
When the company receives a loan application, it has to make a decision based on the
applicant’s profile.
Two types of risk are associated with this decision:
— if the applicant is likely to repay the loan, then not approving it results in a loss
of business for the company;
— if the applicant is not likely to repay the loan, i.e. he/she is likely to default, then
approving it may lead to a financial loss for the company.
Our aim is to develop the best supervised machine learning model capable of identifying
which borrowers will pay off their loans.
We tackle this binary classification task with a Deep Neural Network and an XGBoost
model.
Dataset link: https://www.kaggle.com/code/faressayah/lending-club-loan-defaulters-prediction/data?select=lending_club_loan_two.csv

2 Preprocessing (module: data_analysis)


We divided the preprocessing into two phases:
— In the first phase we carried out the statistical analysis of the dataset, studying how
the features were distributed and how they were correlated with each other.
— In the second phase we modified the dataset for our purposes: missing values were
treated (eliminated or replaced as appropriate), we removed the variables that were
highly correlated with each other, and we kept those most useful for our machine
learning task.
Finally, we replaced all categorical variables with numeric ones and performed
one-hot encoding (dummy variables); a minimal sketch of this step is shown below.
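
As an illustration of this last step, the sketch below uses pandas; the file name, the median imputation of mort_acc and the loan_repaid label are illustrative assumptions rather than an exact transcript of our pipeline:

import pandas as pd

# Load the Lending Club dataset (file name assumed)
df = pd.read_csv("lending_club_loan_two.csv")

# Encode the target: 1 = Fully Paid, 0 = Charged Off
df["loan_repaid"] = df["loan_status"].map({"Fully Paid": 1, "Charged Off": 0})
df = df.drop(columns=["loan_status"])

# Treat missing values: fill mort_acc with the column median (illustrative
# choice) and drop the remaining rows containing missing entries
df["mort_acc"] = df["mort_acc"].fillna(df["mort_acc"].median())
df = df.dropna()

# One-hot encode the remaining categorical variables (dummy variables)
categorical_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)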

Below we report some useful graphs for our analysis:


Figure 1 – loan_status distribution; we can see that our dataset is strongly unbalanced

[Heatmap of the pairwise correlations between the numerical features loan_amnt, int_rate, installment, annual_inc, dti, open_acc, pub_rec, revol_bal, revol_util, total_acc, mort_acc and pub_rec_bankruptcies; the strongest correlations are between loan_amnt and installment (0.95), open_acc and total_acc (0.68), and pub_rec and pub_rec_bankruptcies (0.70).]

Figure 2 – Correlation matrix

3 Modelling [module: machine_learning]


We used two different models: a gradient-boosting algorithm, XGBoost, and a neural
network model.
Regarding the optimization of these models, we first observed that the class to be
predicted was highly unbalanced, with a ratio of about 4:1.
We therefore decided to rebalance the training data with a technique called SMOTE
(Synthetic Minority Oversampling Technique).
This technique creates new synthetic instances of the minority class by interpolating
along the lines connecting each minority sample to its k nearest minority neighbours.
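
A minimal sketch of this resampling step, assuming the imbalanced-learn library and that X and y are the feature matrix and label vector produced by the preprocessing module:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Split first, then oversample only the training data, so the test set keeps
# the original ~4:1 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# SMOTE synthesizes minority-class samples by interpolating between each
# minority sample and its k nearest minority neighbours (k = 5 by default)
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)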


Figure 3 – Bar plot of the verification_status distribution


Figure 4 – Bar plot of the emp_length distribution

3.1 XGBoost


As a first step, we built a preliminary model. Instead of determining the best number
of trees with cross-validation, we implemented an ’Early Stopping’ callback to stop
training when the model no longer improves.
To evaluate the XGBoost model we used the AUC value. Once this preliminary model
was trained, we verified how well it works by plotting a confusion matrix: it predicts
the majority class very well but the minority class considerably worse.
We then optimized the hyperparameters with a cross-validated Grid Search.
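
A minimal sketch of these two steps, assuming the xgboost and scikit-learn libraries and the resampled training data from the previous step; the grid values are illustrative assumptions:

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score

# Hold out a validation set used only for early stopping
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_res, y_train_res, test_size=0.2, random_state=42
)

# Preliminary model: many trees, training stopped when the validation AUC
# stops improving for 20 consecutive rounds
xgb = XGBClassifier(n_estimators=1000, eval_metric="auc", early_stopping_rounds=20)
xgb.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
print("Test AUC:", roc_auc_score(y_test, xgb.predict_proba(X_test)[:, 1]))

# Hyperparameter optimization with a cross-validated Grid Search
param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}
grid = GridSearchCV(
    XGBClassifier(n_estimators=300, eval_metric="auc"),
    param_grid, scoring="roc_auc", cv=3,
)
grid.fit(X_train_res, y_train_res)
print("Best parameters:", grid.best_params_)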

3.2 Deep Neural Network


For this model we proceeded in several steps.
We first built an “inverted pyramid” model; here too, to avoid overfitting, we used a
validation set and implemented an ’Early Stopping’ callback.
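
A minimal sketch of this architecture with Keras/TensorFlow; the layer widths are illustrative assumptions, chosen only to show the shrinking “inverted pyramid” shape:

from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train_res.shape[1]

# "Inverted pyramid": the layer width shrinks from the input towards the output
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(78, activation="relu"),
    layers.Dense(39, activation="relu"),
    layers.Dense(19, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability of "Fully Paid"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early Stopping on a validation split to avoid overfitting
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True
)
model.fit(
    X_train_res, y_train_res,
    validation_split=0.2, epochs=100, batch_size=256,
    callbacks=[early_stop],
)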

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 10403, 11224; Fully Paid: 1732, 87828]

Figure 5 – Confusion matrix, XGBoost without hyperparameter tuning

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 10644, 10983; Fully Paid: 1865, 87695]

Figure 6 – Confusion matrix, XGBoost with tuned hyperparameters

We then added Dropout regularization to the hidden layers and compared the two models.
The regularized model gave the best results, so we kept Dropout in all subsequent
experiments (a sketch of the modified architecture is shown below).
After this we trained another model, called “diamond”, and compared it to the previous
ones.
All the models we trained and tested gave very similar results, but in the end the best
one was the “inverted pyramid” model with Dropout regularization.
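
The sketch below shows how Dropout layers can be interleaved with the dense layers of the inverted-pyramid model; the dropout rate of 0.2 is an illustrative assumption:

from tensorflow import keras
from tensorflow.keras import layers

n_features = X_train_res.shape[1]  # as in the previous sketch

# Inverted pyramid with a Dropout layer after each hidden layer
model_dropout = keras.Sequential([
    keras.Input(shape=(n_features,)),
    layers.Dense(78, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(39, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(19, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),
])
model_dropout.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])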

Also in this case we optimized the hyperparameters via cross-validation, using a
Random Search.
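
A minimal sketch of this search, assuming the scikeras wrapper to expose the Keras model to scikit-learn’s RandomizedSearchCV; the builder function build_model and the search space are illustrative assumptions:

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from scikeras.wrappers import KerasClassifier
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_features=78, hidden_units=78, dropout_rate=0.2):
    # Dropout inverted pyramid, parametrized for the search
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(hidden_units // 2, activation="relu"),
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

clf = KerasClassifier(
    model=build_model,
    model__n_features=X_train_res.shape[1],
    epochs=30, batch_size=256, verbose=0,
)

param_distributions = {
    "model__hidden_units": randint(32, 128),
    "model__dropout_rate": uniform(0.1, 0.4),   # rates between 0.1 and 0.5
    "batch_size": [128, 256, 512],
}
search = RandomizedSearchCV(clf, param_distributions, n_iter=10, cv=3, scoring="roc_auc")
search.fit(X_train_res, y_train_res)
print("Best parameters:", search.best_params_)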

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6927, 7598; Fully Paid: 790, 58810]

Figure 7 – Confusion matrix DNN without dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6285, 8240; Fully Paid: 102, 59498]

Figure 8 – Confusion matrix DNN with dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6581, 7944; Fully Paid: 327, 59273]

Figure 9 – Confusion matrix diamond DNN with dropout

[Confusion matrix values, rows = true label, columns = predicted (Charged Off, Fully Paid): Charged Off: 6863, 7662; Fully Paid: 643, 58957]

Figure 10 – Confusion matrix DNN with hyperparameters

4 Conclusion
In the following table we summarize the results of our models:

model                     Accuracy  Precision  Recall  F1 Score
Basic XGBoost             0.8835    0.8867     0.9807  0.9313
Hyperparameters XGBoost   0.8844    0.8887     0.9790  0.9317
Basic DNN                 0.8543    0.9086     0.9104  0.9095
Dropout DNN               0.8874    0.8783     0.9982  0.9344
Diamond DNN               0.8544    0.9086     0.9104  0.9095
Hyperparameter DNN        0.8880    0.8850     0.9892  0.9342

As we can see, we obtained good values for every metric; however, these aggregate
scores are driven mainly by the majority class.
From the confusion matrices we see that every model predicts the majority class very
well (∼ 90 %), but the minority class considerably worse (∼ 54 %). This suboptimal
classification is due to the original imbalance of our dataset.
Overall, the DNN predicts slightly better than XGBoost.
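
For reference, the metrics reported in the table can be computed from the test-set predictions as in the sketch below, assuming scikit-learn and that "Fully Paid" is encoded as the positive class 1 (model, X_test and y_test as in the previous sketches):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Threshold the predicted probabilities of the DNN at 0.5
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))   # w.r.t. the positive ("Fully Paid") class
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))                 # rows: true label, columns: prediction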
