
Credit EDA Case Study

Upgrad Assignment

Deepika Sampangi & Manjot Singh


Problem Statement
Given a loan application, a company needs to decide whether to approve or reject it based on the associated risk.
Using the given datasets, analyse which factors can be considered for risk analysis and determine which
categories of loan applications can be considered the safer option.

Dataset
Name                      Description                                Contains
application_data.csv      Client information at time of application  Client payment difficulty information
previous_application.csv  Client's previous loan data                Status of previous loan application
columns_descrption.csv    Data dictionary                            Meaning of columns in the dataset

Analysis Approach
S.No  Step                                 Reason
1     Data Exploration                     To understand the datasets
2     Identify Missing Values              To drop the columns with a high share of missing values
3     Checking for Outliers                To identify which columns can impact the analysis
4     Checking for Datatypes               To convert the data to the correct datatypes
5     Binning Continuous Variables         To create range columns for aggregating the data
6     Selection of Columns for Analysis    To pick the subset of columns to be considered for analysis
7     Data Imbalance - Target Variable     To identify the share of data for Defaulters and Non Defaulters
8     Top 10 Correlations                  To identify the relations between the columns in the dataset
9     Univariate Analysis                  To analyse the dataset one specific column at a time
10    Bivariate and Multivariate Analysis  To analyse the dataset with more than one column at a time
11    Merging Previous Loan Data           To identify how the previous loan data can be used in the analysis
12    Visualisation of Relations           To plot graphs for the merged dataset and draw inferences
13    Drawing Inferences                   To list the inferences drawn from the merged dataset
14    Conclusion                           To conclude, based on the inferences, how to handle the risk of loan approval
Step 1: Data Exploration
Application Dataset

1] The Application Dataset has 307,511 records and 122 attributes

2] There are 65 columns of type float64, 41 columns of type int64 and 16 columns of type object

3] The dataset seems to have many null values

Previous Application Dataset

1] The Previous Application Dataset has 1,670,214 records and 37 attributes

2] There are 2 columns of type float64, 6 columns of type int64 and 15 columns of type object

3] The dataset seems to have many null values
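
The exploration step above can be sketched with pandas. This is a minimal illustration only: the helper name `overview` and the toy values are assumptions, and a small in-memory frame stands in for application_data.csv, which would normally be loaded with `pd.read_csv`.

```python
import pandas as pd

def overview(df: pd.DataFrame) -> dict:
    """Summarise a frame: record count, attribute count, dtype counts, columns with nulls."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
        "columns_with_nulls": int(df.isna().any().sum()),
    }

# Toy frame standing in for application_data.csv (values are made up).
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [100000.0, 250000.0, None],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
})

summary = overview(toy)
print(summary)
```

On the real files the same helper would report the record, attribute and datatype counts listed on this slide.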


Step 2: Identify Missing Values
Application Dataset

A total of 57 columns have more than 13% missing values and can be dropped

Application Dataset - possible imputation

A few columns have only a small share of missing values, which can be imputed

Column Name      Missing Value Percent  Imputation Method  Reason
AMT_ANNUITY      0.004%                 Median             Due to outliers
AMT_GOODS_PRICE  0.09%                  Median             Due to outliers
NAME_TYPE_SUITE  0.42%                  Mode               Categorical data
CNT_FAM_MEMBERS  0.0007%                Mode               Cardinal categorical data
EXT_SOURCE_2     0.22%                  Mean               No outliers
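
A hedged sketch of how this step could look in pandas: compute the missing-value percentage per column, drop columns above the 13% threshold, then impute the rest with the median, mode or mean. The helper names and toy values are assumptions for illustration, not the authors' notebook code.

```python
import pandas as pd

def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return df.isna().mean().mul(100).sort_values(ascending=False)

def drop_sparse_columns(df: pd.DataFrame, threshold_pct: float = 13.0) -> pd.DataFrame:
    """Drop every column whose missing share exceeds threshold_pct."""
    pct = missing_percent(df)
    return df.drop(columns=pct[pct > threshold_pct].index)

def impute(df, median_cols=(), mode_cols=(), mean_cols=()):
    """Fill missing values: median for skewed numerics, mode for categoricals, mean otherwise."""
    out = df.copy()
    for col in median_cols:
        out[col] = out[col].fillna(out[col].median())
    for col in mode_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in mean_cols:
        out[col] = out[col].fillna(out[col].mean())
    return out

# Toy data (made up): OWN_CAR_AGE is 75% missing, the other two 12.5% each.
toy = pd.DataFrame({
    "AMT_ANNUITY": [10000.0, 20000.0, None, 90000.0, 30000.0, 25000.0, 15000.0, 22000.0],
    "NAME_TYPE_SUITE": ["Unaccompanied", "Family", "Unaccompanied", None,
                        "Unaccompanied", "Family", "Unaccompanied", "Unaccompanied"],
    "OWN_CAR_AGE": [None, None, None, 5.0, None, None, 12.0, None],
})

cleaned = impute(drop_sparse_columns(toy),
                 median_cols=["AMT_ANNUITY"], mode_cols=["NAME_TYPE_SUITE"])
print(cleaned)
```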
Step 3: Identify Outliers
On Application Dataset using IQR Values

Column Name       IQR Value  Outlier Treatment
AMT_INCOME_TOTAL  90000.0    Cap the outliers
AMT_CREDIT        538650.0   Cap the outliers
AMT_ANNUITY       18072.0    Cap the outliers
AMT_GOODS_PRICE   441000.0   Cap the outliers
DAYS_BIRTH        7269.0     Binning
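
Capping with the IQR could be sketched as below. The slide does not state the fence multiplier, so the usual 1.5 x IQR rule is assumed here; the function name and the sample incomes are also illustrative.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed convention."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up incomes with one extreme value at the end.
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 202500, 1170000],
    name="AMT_INCOME_TOTAL", dtype=float,
)
capped = cap_outliers_iqr(income)
print(capped)
```

Only the extreme value is pulled down to the upper fence; the remaining values are unchanged.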


Step 4: Checking for Datatypes

On Application Dataset

There are almost 20 categorical columns, but only a few of them are of object type

Columns holding DAYS-related information contain negative values, so their absolute values are used instead
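
A small sketch of both fixes, assuming the relevant columns share the `DAYS_` prefix and that categorical columns are converted with `astype("category")`; the helper name and toy rows are illustrative.

```python
import pandas as pd

def fix_days_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every DAYS_* column holds absolute values."""
    out = df.copy()
    days_cols = [c for c in out.columns if c.startswith("DAYS_")]
    out[days_cols] = out[days_cols].abs()
    return out

# Toy rows (made up); DAYS_* values count backwards from the application date.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, -1188],
    "CODE_GENDER": ["M", "F"],
})

fixed = fix_days_columns(toy)
# Store a categorical column with an explicit category dtype.
fixed["CODE_GENDER"] = fixed["CODE_GENDER"].astype("category")
print(fixed.dtypes)
```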
Step 5: Binning Continuous Variables
On Application Dataset

The variables that can be binned and are most useful for visual analysis are listed below

Column Name       Bin Labels
AMT_INCOME_TOTAL  <100K, 100K-150k, 150k-200k, 200k-500k, 500k+
AMT_CREDIT        <0.3M, 0.3M-0.5M, 0.5M-1.0M, 1.0M-1.5M, 1.5M-45M
AMT_GOODS_PRICE   <0.3M, 0.3M-0.5M, 0.5M-1.0M, 1.0M-1.5M, 1.5M-45M
DAYS_BIRTH        20-30 yrs, 31-40 yrs, 41-50 yrs, 51-60 yrs, 60+ yrs
DAYS_EMPLOYED     0-5 yrs, 6-10 yrs, 11-15 yrs, 16-20 yrs, 21-25 yrs, 25+ yrs
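
Binning of this kind is commonly done with `pd.cut`. The sketch below uses the labels from the table, but the exact bin edges, the edge inclusion rules and the days-to-years conversion (integer division by 365) are assumptions.

```python
import numpy as np
import pandas as pd

# Income bins: left-closed intervals, so 100000 falls into "100K-150k".
income = pd.Series([45000, 100000, 135000, 202500, 650000])
income_bin = pd.cut(
    income,
    bins=[0, 100_000, 150_000, 200_000, 500_000, np.inf],
    labels=["<100K", "100K-150k", "150k-200k", "200k-500k", "500k+"],
    right=False,
)

# Age bins: DAYS_BIRTH (already made positive) converted to whole years first.
days_birth = pd.Series([9461, 16765, 23000])
age_years = days_birth // 365
age_bin = pd.cut(
    age_years,
    bins=[20, 30, 40, 50, 60, np.inf],
    labels=["20-30 yrs", "31-40 yrs", "41-50 yrs", "51-60 yrs", "60+ yrs"],
    include_lowest=True,
)

print(income_bin.tolist())
print(age_bin.tolist())
```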
Step 6: Selection of Columns for Analysis
From Application Dataset

A few of the specific variables selected for analysis are:

Column Name          Data Type
NAME_CONTRACT_TYPE   Categorical
AMT_INCOME_TOTAL     Continuous Numerical
AMT_CREDIT           Continuous Numerical
AMT_GOODS_PRICE      Continuous Numerical
NAME_INCOME_TYPE     Categorical
NAME_EDUCATION_TYPE  Categorical

Apart from these, a few more columns are selected that give information on amounts, binned columns, TARGET,
CODE_GENDER and so on

An additional column, NUM_OF_INSTALLMENTS, is derived as AMT_CREDIT / AMT_ANNUITY and will
be used to gain some useful insights
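
The derived column can be sketched as a simple ratio. Rounding to whole instalments is an assumption, and the amounts are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
    "AMT_ANNUITY": [24700.5, 35698.5, 6750.0],
})

# Approximate number of instalments: total credit divided by the periodic annuity.
toy["NUM_OF_INSTALLMENTS"] = (toy["AMT_CREDIT"] / toy["AMT_ANNUITY"]).round()
print(toy)
```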
Step 7: Data Imbalance - Target Variable

From the TARGET variable we have:

Type            Percent  Meaning
Non Defaulters  92%      Most applicants are considered to have no issue with the payment
Defaulters      8%       8% of applicants seem to have an issue with loan payment
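
The imbalance can be checked with `value_counts(normalize=True)`. In this sketch TARGET = 1 is assumed to mark applicants with payment difficulties (Defaulters) and 0 everyone else; the toy series is built to mirror the 92% / 8% split.

```python
import pandas as pd

# 23 non-defaulters and 2 defaulters: a made-up sample with a 92/8 split.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

share = target.value_counts(normalize=True).mul(100).round(1)
imbalance_ratio = (target == 0).sum() / (target == 1).sum()

print(share)
print(f"Non Defaulters per Defaulter: {imbalance_ratio:.1f}")
```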
Analysis Goals
1] Identify whether there are any patterns of risk among the 92% Non Defaulters

2] Identify who among the 8% Defaulters can be considered for a loan without any risk

3] Use the Previous Application Data to add more insights based on past data

4] Identify patterns in loan applicants

Conclusions drawn after Analysis

Safe
1] Revolving Loan with credit other than 0.5M to 1M and income range 200k-500k
2] Students and Businessman in all income ranges, Pensioners with income 200k+

Risky
1] Consumer Loans with income range 100k-500k and higher credit may default
2] Credit amount around 15% more or less than the application amount may or may not be accepted

Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education, Secondary/Secondary Special for higher credit
2] Co-op Apartment living, living with parents and Municipal apartments with income range of 500k+ and higher credit

Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k

* The following slides contain the detailed analysis from which these conclusions are drawn
Step 8: Top 10 Correlations
Top 10 Correlated Columns

Complete Application Dataset

Var 1                     Var 2                       ABS_CORRELATION
OBS_60_CNT_SOCIAL_CIRCLE  OBS_30_CNT_SOCIAL_CIRCLE    0.998490
AMT_GOODS_PRICE           AMT_CREDIT                  0.986968
AMT_GOODS_PRICE           AMT_ANNUITY                 0.775109
AMT_ANNUITY               AMT_CREDIT                  0.770138
DAYS_EMPLOYED             DAYS_BIRTH                  0.623941
DAYS_REGISTRATION         DAYS_BIRTH                  0.331912
DAYS_ID_PUBLISH           DAYS_EMPLOYED               0.274842
DAYS_ID_PUBLISH           DAYS_BIRTH                  0.272691
DAYS_REGISTRATION         DAYS_EMPLOYED               0.214573
EXT_SOURCE_2              REGION_POPULATION_RELATIVE  0.198924

Subset of Data for Analysis (columns chosen for analysis)

Var 1                Var 2                       ABS_CORRELATION
AMT_GOODS_PRICE      AMT_CREDIT                  0.986968
AMT_GOODS_PRICE      AMT_ANNUITY                 0.775109
AMT_ANNUITY          AMT_CREDIT                  0.770138
NUM_OF_INSTALLMENTS  AMT_CREDIT                  0.661503
NUM_OF_INSTALLMENTS  AMT_GOODS_PRICE             0.634587
DAYS_EMPLOYED        DAYS_BIRTH                  0.623941
DAYS_ID_PUBLISH      DAYS_EMPLOYED               0.274842
DAYS_ID_PUBLISH      DAYS_BIRTH                  0.272691
EXT_SOURCE_2         REGION_POPULATION_RELATIVE  0.198924
AMT_ANNUITY          AMT_INCOME_TOTAL            0.191657
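
A ranking like the ones above can be produced from the absolute correlation matrix by keeping each pair once and sorting. The function name, the upper-triangle approach and the toy numbers are assumptions; the column headers follow the slide.

```python
import numpy as np
import pandas as pd

def top_correlations(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n column pairs with the highest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair appears once and self-pairs are dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().sort_values(ascending=False).head(n)
    out = pairs.reset_index()
    out.columns = ["Var 1", "Var 2", "ABS_CORRELATION"]
    return out

# Toy data (made up): goods price is exactly 0.9 x credit, so that pair ranks first.
toy = pd.DataFrame({
    "AMT_CREDIT":      [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 270.0, 360.0, 450.0],
    "AMT_ANNUITY":     [10.0, 25.0, 20.0, 45.0, 40.0],
})

result = top_correlations(toy)
print(result)
```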


Step 9: Univariate Analysis
Application Dataset Univariate Analysis

Column Name              Variable Type  Inferences
NAME_CONTRACT_TYPE       Categorical    Defaulters can be considered for Revolving loans
NUM_OF_INSTALLMENTS_BIN  Categorical    Most Non Defaulters opt for 16-24 instalments
NAME_EDUCATION_TYPE      Categorical    Most applicants have Higher Education or Secondary/Secondary Special and are Non Defaulters
DAYS_EMPLOYED_BIN        Categorical    Most Non Defaulters have work experience of 0-10 yrs and 30+ yrs
AMT_CREDIT               Numerical      0.5M to 1M has the most Defaulters; other ranges are mostly Non Defaulters
AMT_GOODS_PRICE          Numerical      Higher than 0.75M are mostly Non Defaulters


Application Dataset Univariate Analysis Visualisations
Previous Application Dataset Univariate Analysis Categorical Columns

• Chances of refusal for a Consumer loan are much lower than for Cash Loans or Revolving loans

• The count of Cash Loans getting cancelled is higher than the count getting refused

• Applicants tend to cancel a loan if the AMT_CREDIT is less than or equal to AMT_APPLICATION, showing dissatisfaction with the loan approved

• There is almost a 50-50 chance of loan approval when the applicant walks in for a loan

• High chances of loan approval if the NAME_PRODUCT_TYPE is x-sell

• Repeater applicants have a higher chance of loan refusal than New applicants

• New applicants generally don't cancel the loan application

• Higher loan amounts have higher chances of loan refusal (in our case >200K)
Previous Application Dataset UNIVARIATE ANALYSIS – CONTINUOUS VARIABLE

• Most of the loans that are credited have an AMT_ANNUITY < 10000

• Most approved loans have a term period of 10-20 months

• The density curve has a similar shape for both Approved and Refused loans

• Refused loans don't seem to follow a pattern based on CNT_PAYMENT

Step 10.1: Bivariate Analysis


Application Dataset Bivariate Analysis

Column 1             Column 2              Type                     Inferences
AMT_CREDIT_BIN       AMT_INCOME_TOTAL_BIN  Categorical-Categorical  Most Non Defaulters are in income range 200k-500k and opt for 0.5M to 1.5M credit
NAME_INCOME_TYPE     AMT_INCOME_TOTAL_BIN  Categorical-Categorical  Most Non Defaulters have income type working, commercial associate or pensioner, with income range of 100k-500k
NAME_EDUCATION_TYPE  AMT_CREDIT            Categorical-Continuous   Most Non Defaulters with Higher Education, Secondary/Secondary Special prefer higher credit
AMT_ANNUITY          AMT_CREDIT            Continuous-Continuous    With increase in AMT_CREDIT, AMT_ANNUITY increases

* Here we make use of the correlated columns, from which we can draw the required inferences
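
For the categorical-categorical pairs, one way to quantify such inferences is a pivot of the mean TARGET (the default rate) per bin combination, next to a plain count crosstab. This is a sketch on made-up rows; the bin labels are illustrative and TARGET = 1 is assumed to mean Defaulter.

```python
import pandas as pd

toy = pd.DataFrame({
    "AMT_INCOME_TOTAL_BIN": ["100k-200k", "100k-200k", "200k-500k",
                             "200k-500k", "200k-500k", "100k-200k"],
    "AMT_CREDIT_BIN": ["0.5M-1.0M", "0.5M-1.0M", "0.5M-1.0M",
                       "<0.3M", "<0.3M", "<0.3M"],
    "TARGET": [1, 0, 0, 0, 0, 0],
})

# Share of Defaulters in every income-bin / credit-bin cell.
default_rate = toy.pivot_table(
    index="AMT_INCOME_TOTAL_BIN", columns="AMT_CREDIT_BIN",
    values="TARGET", aggfunc="mean",
)

# Number of applicants in every cell, to judge how reliable each rate is.
counts = pd.crosstab(toy["AMT_INCOME_TOTAL_BIN"], toy["AMT_CREDIT_BIN"])

print(default_rate)
print(counts)
```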
Application Dataset Bivariate Analysis Visualisations
Previous Application Dataset BIVARIATE ANALYSIS – (CATEGORICAL – CATEGORICAL VARIABLE)

• Loans from a Repeater with an application amount > 200K have the highest number of refusals

• 75K-200K is the most popular loan amount among Repeater and Refreshed applicants. New applicants generally opt for 20K-75K

• The greatest number of cancellations comes from Repeater applicants when the client is acquired from Credit and Cash Offices

• Loans in the range 20K-75K have a high acceptance rate when the amount credited is less than or equal to the amount applied for

• Unused Offers are clearly prevalent when the amount credited is less than or equal to the amount applied for

• The most approved loans are those that have high group interest and an annuity amount less than 7.5K

• Refusals are higher for low/normal group interest with an annuity amount greater than 200K
Previous Application Dataset BIVARIATE ANALYSIS – (CATEGORICAL – CONTINUOUS VARIABLE)

• The mean amount credited for an approved loan is lower for both x-sell and walk-in

• Applicants walking in for a low amount to be credited generally tend to get their loan approved

• The mean value of the terms of payment for Cash is the same for Refused and Cancelled loans. This value is on the lower side for Approved loans

• The terms of payment for POS are observed to be low for all application statuses

• Cancelled applications that have a higher amount generally have a Low Action group interest rate

• Generally, the mean value of the application amount for refused loans looks nearly the same as the 1st quartile value of Cancelled loans for all Name yield groups

• Applicants that are Repeating or have Refreshed generally tend to take less time to apply for the next loan than New applicants

• Despite getting the loan approved, a New client can take anything between 1000-1500 days to apply for another loan
BIVARIATE ANALYSIS – (CONTINUOUS – CONTINUOUS VARIABLE)

• CNT_PAYMENT of more than 80 has a higher refusal rate than approval rate

• The approval rate is higher for CNT_PAYMENT in the range 0-20 and a lesser credit amount

• AMT_CREDIT and AMT_ANNUITY have a positive linear relation

• But we can infer that even when AMT_CREDIT is high, the AMT_ANNUITY opted for still lies in the range 0-200k

• Cancelled loans fall into the AMT_ANNUITY range 0-100k irrespective of AMT_CREDIT

• AMT_CREDIT and AMT_APPLICATION have a positive linear relation

• For a higher AMT_APPLICATION, a lesser AMT_CREDIT is approved more often

• AMT_CREDIT and AMT_GOODS_PRICE have a positive linear relation

• Higher AMT_GOODS_PRICE with higher AMT_CREDIT has fewer refusals
Step 10.2: Segmented Analysis

Column 1                 Column 2     Inferences
AMT_INCOME_TOTAL_BIN     CODE_GENDER  Most Non Defaulter Females earn in the range 100k to 500k, whereas Males mostly earn in the range 200k-500k
AMT_CREDIT_BIN           CODE_GENDER  Most Non Defaulter Females opt for AMT_CREDIT of < 0.3M or in the range 0.5M-1.0M
NAME_EDUCATION_TYPE      CODE_GENDER  Most Non Defaulter Females have either Higher Education or Secondary/Secondary Special, whereas Males mostly have Secondary/Secondary Special
NUM_OF_INSTALLMENTS_BIN  CODE_GENDER  Most Non Defaulter Females and Males opt for the instalment bin 16-24
Segmented Analysis Visualisations
Step 10.3: Multivariate Analysis
Column 1              Column 2    Hue                   Inferences
AMT_INCOME_TOTAL_BIN  AMT_CREDIT  NAME_FAMILY_STATUS    Widows with income of 500k+ are all Non Defaulters
AMT_INCOME_TOTAL_BIN  AMT_CREDIT  NAME_EDUCATION_TYPE   Applicants with Academic Degree, other than the ones in income category 200k-500k, are all Non Defaulters
AMT_INCOME_TOTAL_BIN  AMT_CREDIT  NAME_HOUSING_TYPE     Most of the Co-op Apartment living applicants are Non Defaulters
                                                        Income 500k+ living with parents and in Municipal apartments are mostly Non Defaulters
NAME_INCOME_TYPE      AMT_CREDIT  AMT_INCOME_TOTAL_BIN  Students and Businessman can be considered without any risk
                                                        Unemployed with income greater than 100k can be considered
                                                        Pensioners with income 200k+ can be considered
                                                        Maternity Leave category are mostly Defaulters
Multivariate Analysis Visualisations
Step 11: Merging Previous Loan Data
• Previous Application Data has 1,670,214 entries and 37 columns

• It has 14 columns with more than 20% missing values

• The AMT_CREDIT column has 1 missing value, which can be filled with the median of the column

• PRODUCT_COMBINATION has 0.02% missing values, which can be filled with the mode value

• Application Dataset and Previous Application Dataset can be merged on “SK_ID_CURR”

• Post merger we need to select the columns required for further analysis

• Using the NAME_CONTRACT_STATUS variable for drawing inferences, the chart below shows the data imbalance
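
The merge on SK_ID_CURR could be sketched as follows. The inner join is an assumption (the slide does not name the join type) and the rows are made up. Columns that exist in both frames receive pandas' default `_x` / `_y` suffixes, which is consistent with the NAME_CONTRACT_TYPE_y column used later in the analysis.

```python
import pandas as pd

# Toy stand-ins for application_data.csv and previous_application.csv.
application = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans"],
})
previous = pd.DataFrame({
    "SK_ID_PREV": [10, 11, 12, 13],
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_TYPE": ["Consumer loans", "Cash loans", "Consumer loans", "Cash loans"],
    "NAME_CONTRACT_STATUS": ["Approved", "Canceled", "Approved", "Refused"],
})

# One output row per previous application whose client also appears in the application data.
merged = application.merge(previous, on="SK_ID_CURR", how="inner")

# Share of each previous-application status in the merged frame.
status_share = merged["NAME_CONTRACT_STATUS"].value_counts(normalize=True)

print(merged)
print(status_share)
```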

Step 12: Visualisation of Relations
Correlation of Columns chosen for Analysis on merged Data frame
Analysis on Merged Data Frame
Column                  Inferences
NAME_CONTRACT_STATUS    1] A lot of applicants considered Non Defaulters had their application cancelled previously
                        2] A lot of applicants considered Defaulters had their application approved previously
                        3] Non Defaulters who were refused previously still need to be analysed
NAME_CASH_LOAN_PURPOSE  Most loans have the purpose of Repairs, Others, Urgent Needs, Building a House or an Annex,
                        Everyday Expenses, Medicines
Analysis on NAME_CONTRACT_STATUS
Type         Column                                         Inferences
ND-Canceled  NAME_CLIENT_TYPE                               Most of them are Repeaters
ND-Canceled  NAME_EDUCATION_TYPE v/s DAYS_EMPLOYED_BIN      Most applicants have Secondary/Secondary Special education with 0-5 and 25+ yrs experience
ND-Canceled  AMT_CREDIT_BIN v/s AMT_INCOME_TOTAL_BIN        Most of them have an Academic Degree with income range 500k+
D-Approved   NAME_CONTRACT_TYPE_y                           Most are Consumer Loans
D-Approved   AMT_INCOME_TOTAL_BIN v/s AMT_CREDIT_BIN        Most belong to credit bin 0.5M-1.0M and have income range of 100k-500k
ND-Refused   Credit_Application_diff                        We observe that the data is normalized and most of the cases of refusal are due to
                                                            allotting either more or less credit than the application amount

* Here the data originally part of the application dataset (e.g. Education_type, Income_Range, …)
has been assigned to each previous application entry with the same SK_ID_CURR
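
The ND-Refused observation compares the credited amount with the applied amount. A hedged sketch of such a difference column follows. Its definition as a percentage of AMT_APPLICATION, the column names `CREDIT_APPLICATION_DIFF_PCT` and `WITHIN_15_PCT`, and the toy amounts are all assumptions. The 15% tolerance is the figure quoted in the deck's inferences.

```python
import pandas as pd

prev = pd.DataFrame({
    "AMT_APPLICATION": [100000.0, 100000.0, 200000.0, 50000.0],
    "AMT_CREDIT": [100000.0, 120000.0, 160000.0, 55000.0],
})

# Signed difference between credited and applied amount, as a percentage of the applied amount.
# Rows with AMT_APPLICATION == 0 would need separate handling (division by zero).
prev["CREDIT_APPLICATION_DIFF_PCT"] = (
    (prev["AMT_CREDIT"] - prev["AMT_APPLICATION"]) / prev["AMT_APPLICATION"] * 100
)

# Flag applications whose credited amount stays within 15% of the applied amount.
prev["WITHIN_15_PCT"] = prev["CREDIT_APPLICATION_DIFF_PCT"].abs() <= 15

print(prev)
```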
Visualisation of Analysis on NAME_CONTRACT_STATUS
A Few Other Analyses

Column                                           Inferences
NAME_CASH_LOAN_PURPOSE v/s TARGET                Most Non Defaulters have the purpose Repairs, Others, Urgent Needs, Building a House or
                                                 an Annex, Everyday Expenses, Medicine
NAME_CASH_LOAN_PURPOSE v/s NAME_CONTRACT_STATUS  Most loans with the purpose Other, Repairs, Urgent Needs, Building a House or an Annex have
                                                 been refused rather than accepted
NAME_GOODS_CATEGORY v/s NAME_CONTRACT_STATUS     There are no cancelled loans with the purpose Other, Vehicles, Gardening, Clothing and
                                                 Accessories, Office Appliances, Medicines

Step 13: Drawing Inferences
Dataset              Overall Inferences

Application Dataset  1] Income type working, commercial associate or pensioner with income range 100k-500k are Non Defaulters
                     2] Education type Higher Education, Secondary/Secondary Special can be considered for higher credit
                     3] Defaulters opting for a Revolving Loan with credit other than the 0.5M to 1M range and with income range
                        200k-500k can be considered safe
                     4] Education type Academic Degree with income range of 200k-500k opting for higher credit can be considered
                     5] Income type Students and Businessman, and housing type Co-op Apartment living, living with parents
                        and Municipal apartments with income range of 500k+, can safely be considered as Non Defaulters
                     6] Maternity Leave applicants have income <100k and are mostly Defaulters

Merged Dataset       1] Credit bin 0.5M-1.0M with income range of 100k-200k should be avoided
                     2] The credit amount should be within 15% of the application amount, else it gets refused
                     3] Purpose Vehicles, Gardening, Clothing and Accessories, Office Appliances, Medicines
                        can be considered safe
                     4] Consumer Loans with credit bin 0.5M-1.0M and income range of 100k-500k tend to fall on the
                        Defaulters side
Step 14: Conclusion

Safe
1] Revolving Loan with credit other than 0.5M to 1M and income range 200k-500k
2] Students and Businessman in all income ranges, Pensioners with income 200k+

Risky
1] Consumer Loans with income range 100k-500k and higher credit may default
2] Credit amount around 15% more or less than the application amount may or may not be accepted

Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education, Secondary/Secondary Special for higher credit
2] Co-op Apartment living, living with parents and Municipal apartments with income range of 500k+ and higher credit

Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k
Thank you.
