
Bank Loan Default Prediction Model

Capstone Project (Note-1)


Submitted By:

Brijendra Yadav


This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00

https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
Contents
1. Introduction
2. Data Report
3. Exploratory Data Analysis
3.1 Univariate Analysis
3.2 Bi-Variate Analysis
3.3 Unwanted Variable
3.4 Missing Value Identification
3.5 Outlier Identification
3.6 Variable Transformation
3.7 Addition of New Variable
4. Insights from EDA
1. Introduction

A) Defining problem statement

Retail lending is often considered risk-free because the bank will normally have completed the necessary
due diligence, including security/collateral requirements, credit score checks and a review of the
applicant’s repayment track record. Nevertheless, there have been instances where customers defaulted
on a loan even though the necessary procedures were followed, creating large non-performing assets
(NPAs) for banks.
Hence, there is a need to build a model that predicts loan default, which will help the bank
identify the associated risk before lending.

B) Need of the study/project

Retail lending contributes more than half of a bank’s income in the form of interest income. In some
cases, higher delinquency has even led to a bank run and the closure of a bank. PSU banks are in the
worst situation and are unable to generate profits because of huge NPAs and loan defaults. Hence, there
is a need to understand the problem and build a model that can predict whether a loan is secure or not.

C) Understanding business/social opportunity
If the bank can effectively predict the chance of loan default before disbursement, future defaults can
be reduced significantly, thereby increasing the bank’s income. This will help the bank maintain good
profitability and also enable it to identify the customer and product segments that have lower
delinquency and higher profitability.


2. Data Report
A) Understanding how data was collected in terms of time, frequency and methodology

The data contains the details of loans issued between June 2007 and December 2015.
We can assume that the data was collected after January 2016, since January 2016 is the last payment
date for the largest number of loans. The loan issue dates indicate a monthly frequency of data collection.

B) Visual inspection of data (rows, columns, descriptive details)

There are 226,786 rows and 41 columns, of which 25 are numeric columns, 11 are character
columns and 5 are date columns. The last variable, ‘loan_status’, is the dependent variable.

C) Understanding of attributes (variable info, renaming if required)

We renamed the column ‘earliest_cr_line’ to ‘earliest_cr_line_mnth’, as it shows the month in which
the borrower's earliest reported credit line was opened.

3. Exploratory Data Analysis
3.1 Univariate Analysis
We have stored the data in the data frame ‘loanData’ and analyzed all 41 variables in the given data
set, of which ‘loan_status’ is the dependent variable.
Univariate analysis shows the following:

• A few variables, such as revol_bal, out_prncp, out_prncp_inv and total_rec_late_fee, are
concentrated in a particular range of values. Hence there is a difference between their mean and
median.
• Approximately 80% of customers have zero incidences of 30+ days past-due delinquency in their
credit file over the past 2 years, which confirms that these customers have a good track record.
• Most customers have taken loans of between 5,000 and 15,000.
• Around 80% of customers who have availed a loan have an annual income of less than 100,000.
• A large part of the customers have been lent to at a DTI within 30%. This gives comfort about the
loan portfolio.
• The summary and box plots show outliers in most of the continuous variables, but the values are
found to be acceptable.
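
The first observation, that values concentrated in a narrow range pull the mean away from the median, can be illustrated with a small sketch. The report itself appears to rely on R functions; the Python below is only an illustration, and the late-fee values and the helper name `mean_median_gap` are hypothetical, not taken from the data set.

```python
from statistics import mean, median

# Hypothetical late-fee values: most loans carry no late fee, a few carry a large one.
total_rec_late_fee = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 15.0, 85.0]


def mean_median_gap(values):
    """Return (mean, median, mean - median) for a list of numbers."""
    m = mean(values)
    med = median(values)
    return m, med, m - med


m, med, gap = mean_median_gap(total_rec_late_fee)
print(f"mean={m:.1f} median={med:.1f} gap={gap:.1f}")
```

A large gap relative to the median suggests a skewed variable, which is the pattern described for revol_bal, out_prncp, out_prncp_inv and total_rec_late_fee.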

Figure: Univariate Analysis
3.2 Bi-Variate Analysis
We analyze loan_status against the independent variables in the data set ‘loanData’. Most of the
variables do not appear to have much effect on whether a loan will default. Defaults are concentrated
in grades E, F and G, where defaulted loans are high relative to fully paid loans in the same grade.
Grade B has the highest number of fully paid loans.
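
The grade comparison above amounts to a cross-tabulation of grade against loan_status. A minimal sketch, assuming loan_status is already coded as 1 for Default and 0 for Fully Paid; the sample rows and the function name `default_rate_by_grade` are hypothetical.

```python
from collections import Counter

# Hypothetical (grade, loan_status) pairs; 1 = Default, 0 = Fully Paid.
loans = [
    ("B", 0), ("B", 0), ("B", 0), ("B", 1),
    ("E", 1), ("E", 0),
    ("F", 1), ("F", 1),
    ("G", 1), ("G", 0),
]


def default_rate_by_grade(rows):
    """Return {grade: share of loans in that grade that defaulted}."""
    totals = Counter(grade for grade, _ in rows)
    defaults = Counter(grade for grade, status in rows if status == 1)
    return {grade: defaults[grade] / totals[grade] for grade in sorted(totals)}


rates = default_rate_by_grade(loans)
for grade, rate in rates.items():
    print(grade, f"{rate:.0%}")
```

Comparing the rate per grade, rather than raw counts, avoids being misled by grades that simply contain more loans.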

Note: Credit card, debt consolidation, home improvement, house and small business are the most
common purposes for which higher loan amounts are availed.

3.3 Unwanted Variable



The unwanted variables are listed below:

Variable                   Reason
recoveries                 All values are 0.
collection_recovery_fee    All values are 0.
pymnt_plan                 Only 6 rows (4 Default, 2 Fully Paid) have the value ‘y’.
desc                       Information not required.
mths_since_last_delinq     Nearly 54% of values are missing.

3.4 Missing Value Identification
We used the ‘is.na’ function to check for missing values and found them in the variables listed below.

Variable                 Missing Count   Treatment
next_pymnt_d             207723          Verified that the values are missing where loans are Fully Paid.
mths_since_last_delinq   124638          Approx. 54% of the data is missing. Variable removed from the analysis.
revol_util               164             Imputed with the mean.
last_pymnt_d             341             Missing where there has been no repayment of the loan since inception.
last_credit_pull_d       16              Removed.
desc                     152077          Dropped, as desc holds redundant information.
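
The two steps used here, counting missing values per variable and imputing revol_util with its mean, can be sketched as follows. This is an illustrative Python version rather than the R code the report appears to have used; the sample rows and the helper names `missing_counts` and `impute_mean` are hypothetical.

```python
from statistics import mean

# Hypothetical rows; None marks a missing value.
rows = [
    {"revol_util": 45.0, "next_pymnt_d": None},
    {"revol_util": None, "next_pymnt_d": "Feb-2016"},
    {"revol_util": 55.0, "next_pymnt_d": None},
]


def missing_counts(rows):
    """Count missing (None) values per column."""
    counts = {}
    for row in rows:
        for col, value in row.items():
            counts[col] = counts.get(col, 0) + (1 if value is None else 0)
    return counts


def impute_mean(rows, col):
    """Return new rows in which missing values of `col` are replaced by the column mean."""
    observed = [row[col] for row in rows if row[col] is not None]
    fill = mean(observed)
    return [{**row, col: fill if row[col] is None else row[col]} for row in rows]


counts = missing_counts(rows)
imputed = impute_mean(rows, "revol_util")
print(counts)
print([row["revol_util"] for row in imputed])
```

Mean imputation is reasonable for revol_util because only 164 of 226,786 values are missing; for a variable such as mths_since_last_delinq, with roughly half its values missing, dropping the variable is the treatment chosen above.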

3.5 Outlier Identification



There are outliers in most of the variables, as is evident from the box plots and the summary,
but the values are acceptable. Hence we did not treat the outliers.
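
A box plot conventionally flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to flag, but not remove, outliers, in line with the decision above. The loan amounts and the function name `iqr_outliers` are hypothetical.

```python
from statistics import quantiles

# Hypothetical loan amounts; one value is far larger than the rest.
loan_amounts = [5000, 6000, 8000, 9000, 10000, 12000, 14000, 15000, 90000]


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the box plot rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


lower, upper, outliers = iqr_outliers(loan_amounts)
print(lower, upper, outliers)
```

Whether a flagged value is an error or a genuine large loan is a business judgment, which is why the values here were reviewed and kept.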

3.6 Variable Transformation / Feature Creation



The summary of the data shows that a few of the variables are of character type, so we use
‘as.factor’ to convert them. We also convert the date variables to numbers for ease of analysis.
The dependent variable ‘loan_status’ is converted to 0 and 1:
Fully Paid: 0, Default: 1
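
The three conversions described above can be sketched in Python as follows. This is an illustration only: the report names R's ‘as.factor’, the ‘Mon-YYYY’ date format is an assumption about how the dates are stored, and the helper names are hypothetical.

```python
STATUS_CODES = {"Fully Paid": 0, "Default": 1}
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def encode_status(label):
    """Map the dependent variable to 0 (Fully Paid) or 1 (Default)."""
    return STATUS_CODES[label]


def encode_categories(values):
    """Map each distinct label to an integer code, similar in spirit to factor levels."""
    levels = sorted(set(values))
    lookup = {level: code for code, level in enumerate(levels)}
    return [lookup[v] for v in values], levels


def date_to_number(text):
    """Convert an assumed 'Mon-YYYY' string to a month index (year * 12 + month)."""
    month, year = text.split("-")
    return int(year) * 12 + MONTHS.index(month) + 1


codes, levels = encode_categories(["B", "A", "B", "C"])
print(encode_status("Default"), codes, levels)
print(date_to_number("Dec-2015") - date_to_number("Jun-2007"))  # months between two issue dates
```

Expressing dates as a month index keeps their ordering and makes differences between two dates directly usable as numeric features.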

Correlation/Multicollinearity

Based on the correlation plot, we can say that a few of the variables are highly correlated. Hence we
identify the list of highly correlated independent variables.
As a further check, we also compute the VIF value (multicollinearity). For these variables the VIF is
more than 10, and hence we drop them.
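
The VIF check can be sketched without any external library by using the identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The sample columns below are hypothetical values, chosen so that out_prncp and out_prncp_inv are nearly identical, and the helper names are assumptions; the threshold of 10 is the one used in this report.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Return {name: VIF} from the diagonal of the inverse correlation matrix."""
    names = list(columns)
    corr = [[1.0 if a == b else pearson(columns[a], columns[b]) for b in names]
            for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical predictor values, not taken from the data set.
predictors = {
    "out_prncp": [0, 100, 200, 300, 400, 500],
    "out_prncp_inv": [0, 101, 199, 302, 398, 500],
    "dti": [12, 30, 8, 25, 18, 10],
}
vifs = vif(predictors)
to_drop = [name for name, value in vifs.items() if value > 10]
print(vifs, to_drop)
```

In practice one variable of a collinear pair is usually dropped and the VIF recomputed, since removing one often brings the other back below the threshold.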

3.7 Addition of New Variable



All the key variables required for building the model, such as the debt-to-income ratio (or fixed
obligations to income ratio), revolving balance utilization rate, outstanding principal and past
delinquency, are available in the given data set.

The one important variable that is not available in the data set is the value of the collateral the bank
might have taken for the loan. If collateral worth more than 100% of the loan amount is available, it
gives the bank comfort in granting specific loans. Since the key variables are already present, no new
variable needs to be created.
4. Insights from EDA

The data is highly imbalanced. So, while building the model, we can either undersample the majority
class or oversample the minority class, for example by using SMOTE.
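
The two balancing options can be sketched as below. Note that this uses plain random resampling, not SMOTE: SMOTE generates synthetic minority rows by interpolating between neighbouring minority observations, whereas this sketch only drops or repeats existing rows. The sample data and function names are hypothetical.

```python
import random


def split_by_class(rows, label_key):
    """Group rows by the value of their class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups


def random_undersample(rows, label_key="loan_status", seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def random_oversample(rows, label_key="loan_status", seed=0):
    """Repeat random rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    groups = split_by_class(rows, label_key)
    largest = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


# Hypothetical imbalanced sample: 8 fully paid loans (0) and 2 defaults (1).
sample = [{"id": i, "loan_status": 0} for i in range(8)]
sample += [{"id": 100 + i, "loan_status": 1} for i in range(2)]

under = random_undersample(sample)
over = random_oversample(sample)
print(len(under), len(over))
```

Resampling should be applied to the training data only, so that the test set keeps the original class proportions.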

Significant variables: The highly significant variables are member_id, terms, installment, grade,
emp_length, dti, issue_d, revol_bal, revol_util, total_rec_int, total_rec_late_fee, last_pymnt_d,
last_pymnt_amnt and last_credit_pull_d.