Professional Documents
Culture Documents
m
er as
Brijendra Yadav
co
eH w
o.
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
is
Th
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
Contents
1. Introduction
2. Data Report
3. Exploratory Data Analysis
3.1 Univariate Analysis
3.2 Bi-Variate Analysis
3.3 Unwanted Variable
3.4 Missing Value Identification
m
3.5 Outlier Identification
er as
3.6 Variable Transformation
co
eH w
3.7 Addition of New Variable
o.
4. Insights from EDA
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
is
Th
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
1. Introduction
Retail lending is considered risk-free as the bank might have done necessary due diligence including
security/collateral requirement and credit score checks along with repayment track of applicant. There
were instances where customers have defaulted the loan despite following necessary procedures
resulting in creation of huge NPA for Banks.
Hence, there is a need to build a model to predict default loan which will help the bank to
identify any risk associated before lending.
Retail Lending contributes more than half of bank’s income as Interest Income. In some cases, higher
m
delinquency has even led to bank run and closure of a bank. PSU banks are in worst situation and are
er as
unable to generate profits because of huge NPA and loan defaults. Hence there is need to understand
co
eH w
and build a model which can predict whether the loan is secure or not.
o.
C) Understanding business/social opportunity
rs e
ou urc
If the bank is able to effectively predict the chance of loan default before the disbursement, then the
future default can be reduced significantly and thus increasing Income for the Bank. This will help the
o
bank to maintain good profitability and also enable the bank to identify the customer and product
aC s
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
2. Data Report
A) Understanding how data was collected in terms of time, frequency and methodology
The data contains the details of Loans which have been issued between June 2007 and
December 2015 period.
We can consider data is collected post January 2016 as last payment date for maximum loans is
January 2016. Based on the loan issue date, it shows Monthly frequency of data collection.
There are 226,786 rows and 41 columns. Out of which 25 are numeric columns, 11 character
columns and 5 date columns.The last variable ‘loan_status’ is the dependent variable.
m
er as
Renamed the column ‘earliest_cr_line’ to ‘earliest_cr_line_mnth’ as it shows the month a
co
eH w
borrower's earliest reported credit line was opened.
o.
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
is
Th
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
3. Exploratory Data Analysis
3.1 Univariate Analysis
We have stored in the data frame ‘loanData’ and analyzing all 41 independent variable from data
set given. The ‘loan_status’ variable is the dependent variable.
Univariate analysis shows.
m
er as
• Large part of customer has been lent within 30% of DTI. This gives comfort about the loan
co
portfolio.
eH w
• The summary and box plot shows there is an outlier in most of the continuous variables but
o.
are found to be acceptable values.
rs e
ou urc
o
aC s
vi y re
ed d
ar stu
is
Th
Univariate Analysis
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
3.2 Bi-Variate Analysis
We will analyze loan_status with the independent variables from data set ‘loanData’. Most of the
variables do not seem to have much effect on loan will default. Customers who have mostly
defaulted belong to E, F and G grade compared to fully paid in the same grade. Customers with loan
Grade B have fully paid the loan maximum time.
m
er as
co
eH w
o.
Note: Credit card, debt consolidation, home improvement, house and small business are the top
rs e
reasons to avail higher loan amount.
ou urc
o
aC s
vi y re
ed d
ar stu
is
Variable Reason
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
3.4 Missing Value Identification
There are missing values and used ‘is na’ function to check if there are any missing values.
m
er as
co
eH w
Variable Missing Count Treatment
o.
next_pymnt_d 207723 Verified that the values are missing where loans are
rs e Fully paid.
ou urc
mths_since_last_delinq 124638 Approx. 54% of data is missing. Variable removed from
the analysis.
o
last_credit_pull_d 16 Removed
Desc 152077 we have dropped desc as it will be redundant information.
ed d
ar stu
is
There are outliers in most of the variables. It is evident from the box plot and summary as well
but the values are acceptable. Hence we did not treat the outliers.
In Summary of data we have seen that few of the variables are character. So we use ‘ as.factor
‘ to convert them. We also convert the Date variables to number for ease in analysis.
The Dependent variable ‘loan_status’ is converted in terms of 0 and 1
Fully Paid: 0 Default: 1
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
Correlation/Multicollinearity
m
er as
co
eH w
o.
Based on the above plot we can say few of the variables are highly co-related. Hence we find the
rs e
list of highly correlated independent variables.
ou urc
For further calibration we check the VIF value (multicollinearity) as well. For the above variables, the VIF is
more than 10 and hence we drop them
o
All the key variables required for building the model like Debt to Income ratio or Fixed obligations to
income ratio, revolving balance utilization rate, outstanding principal, past delinquency etc. are available
in the given data set.
ed d
ar stu
Important variable not available in the dataset is the collateral value which bank might have taken for the
loan. If the collateral is available at >100% of the loan amount, it gives comfort to the bank to give specific
loans. Hence, no new variable is required to be created
is
Th
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
4. Insights from EDA
The data is highly imbalanced. So while building the model and performing SMOTE Analysis we can
either choose to undersample the minority class or oversample the majority class.
m
er as
co
eH w
Significant Variable: The list of highly significant variables is: member_id, terms, installment, grade,
o.
emp_length, dti, issue_d, revol_bal, revol_util, total_rec_int, total_rec_late_fee, last_pymnt_d,
rs e
last_pymnt_amnt, last_credit_pull_d.
ou urc
o
aC s
vi y re
ed d
ar stu
is
Th
sh
This study source was downloaded by 100000818305880 from CourseHero.com on 10-01-2021 03:20:03 GMT -05:00
https://www.coursehero.com/file/64657053/Capstone-Loan-Default-Note-1-Brijendrapdf/
Powered by TCPDF (www.tcpdf.org)