Credit EDA Case Study: Upgrad Assignment

Deepika Sampangi & Manjot Singh

Problem Statement
Given a loan application, a company needs to decide whether to approve or reject it based on the associated risk.
Using the given datasets, analyse which factors can be considered for risk analysis and determine which
categories of loan applications can be considered the safer option.
Dataset
Name | Description | Contains
application_data.csv | Client information at time of application | Client payment difficulty information
previous_application.csv | Client's previous loan data | Status of previous loan application
columns_descrption.csv | Data dictionary | Meaning of columns in the dataset

Analysis Approach
S.No | Step | Reason
1 | Data Exploration | To understand the datasets
2 | Identify Missing Values | To ignore the columns with a high share of missing values
3 | Checking for Outliers | To identify which columns can distort the analysis
4 | Checking for Datatypes | To convert the data to the correct datatypes
5 | Binning Continuous Variables | To create range columns for aggregating the data
6 | Selection of Columns for Analysis | Picking the subset of columns to be considered for analysis
7 | Data Imbalance - Target Variable | Identifying the share of data for Defaulters and Non Defaulters
8 | Top 10 Correlations | Identifying the relations between the columns in the dataset
9 | Univariate Analysis | Analysing the dataset with one specific column at a time
10 | Bivariate and Multivariate Analysis | Analysing the dataset with more than one column
11 | Merging Previous Loan Data | To identify how the previous loan data can be used in the analysis
12 | Visualisation of Relations | Plotting graphs for the merged dataset to draw inferences
13 | Drawing Inferences | Listing the inferences drawn from the merged dataset
14 | Conclusion | Based on the inferences, concluding how to handle the risk of loan approval
Step 1: Data Exploration
Application Dataset
1] Application Dataset has 307,511 records and 122 attributes
2] There are 65 columns of type float64, 41 columns of type int64, 16 columns of type object
3] Dataset seems to have many null values
Previous Application Dataset
1] Previous Application Dataset has 1,670,214 records and 37 attributes
2] There are 2 columns of type float64, 6 columns of type int64, 15 columns of type object
3] Dataset seems to have many null values
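The exploration checks above (record count, column dtypes, null counts) can be sketched in pandas as follows. This is a minimal illustration on a tiny synthetic frame, not the authors' notebook; the three rows and their values are made up, and in practice the frame would presumably come from reading application_data.csv.

```python
import pandas as pd

# Tiny synthetic stand-in for application_data.csv (values are made up).
app = pd.DataFrame({
    "SK_ID_CURR": [100001, 100002, 100003],
    "AMT_CREDIT": [406597.5, 1293502.5, None],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans"],
})

# Number of records and attributes.
n_rows, n_cols = app.shape

# How many columns there are of each dtype.
dtype_counts = app.dtypes.astype(str).value_counts()

# Null values per column.
null_counts = app.isnull().sum()

print(n_rows, n_cols)
print(dtype_counts)
print(null_counts)
```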

Step 2: Identify Missing Values
Application Dataset
In total, 57 columns have more than 13% missing values and can be dropped
Application Dataset - possible imputation
A few columns have a small share of missing values, which can be imputed
Column Name | Missing Value Percent | Imputation Method | Reason
AMT_ANNUITY | 0.004% | Median | Due to outliers
AMT_GOODS_PRICE | 0.09% | Median | Due to outliers
NAME_TYPE_SUITE | 0.42% | Mode | Categorical data
CNT_FAM_MEMBERS | 0.0007% | Mode | Cardinal categorical data
EXT_SOURCE_2 | 0.22% | Mean | No outliers
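A minimal sketch of the drop-then-impute approach in the table above, assuming a pandas DataFrame. The ten synthetic rows are made up, and COMMONAREA_AVG is used only as an example of a heavily missing column; the 13% threshold and the median/mode/mean choices follow the slide.

```python
import pandas as pd

# Synthetic data: one sparse column and three columns with a single gap each.
app = pd.DataFrame({
    "AMT_ANNUITY": [10000.0, 12000.0, 14000.0, 16000.0, 18000.0,
                    20000.0, 22000.0, 24000.0, None, 900000.0],
    "NAME_TYPE_SUITE": ["Unaccompanied"] * 6 + ["Family"] * 3 + [None],
    "EXT_SOURCE_2": [0.2, 0.4, 0.6, 0.8, 0.5, 0.5, 0.3, 0.7, 0.5, None],
    "COMMONAREA_AVG": [None] * 8 + [0.01, 0.02],
})

# Percentage of missing values per column.
missing_pct = app.isnull().mean() * 100

# Drop columns above the 13% threshold used on the slide.
DROP_THRESHOLD = 13.0
to_drop = missing_pct[missing_pct > DROP_THRESHOLD].index
app = app.drop(columns=to_drop)

# Median for a numeric column with outliers, mode for a categorical column,
# mean for a numeric column without outliers.
app["AMT_ANNUITY"] = app["AMT_ANNUITY"].fillna(app["AMT_ANNUITY"].median())
app["NAME_TYPE_SUITE"] = app["NAME_TYPE_SUITE"].fillna(app["NAME_TYPE_SUITE"].mode()[0])
app["EXT_SOURCE_2"] = app["EXT_SOURCE_2"].fillna(app["EXT_SOURCE_2"].mean())
```

The median is preferred for AMT_ANNUITY here because a single extreme value (900000 in the synthetic data) would pull the mean far away from a typical annuity.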
Step 3: Identify Outliers
On Application Dataset using IQR values
Column Name | IQR Value | Outlier Treatment
AMT_INCOME_TOTAL | 90000.0 | Cap the outliers
AMT_CREDIT | 538650.0 | Cap the outliers
AMT_ANNUITY | 18072.0 | Cap the outliers
AMT_GOODS_PRICE | 441000.0 | Cap the outliers
DAYS_BIRTH | 7269.0 | Binning
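One common way to cap outliers with the IQR is sketched below. The slide does not state the fence width, so the 1.5 x IQR rule and the helper name cap_outliers_iqr are assumptions for illustration; the income values are synthetic.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is an assumed default)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Synthetic incomes with one extreme value.
income = pd.Series([90000.0, 112500.0, 135000.0, 157500.0,
                    180000.0, 202500.0, 117000000.0])

capped = cap_outliers_iqr(income)
print(capped.max())
```

Capping keeps every record in the dataset, which matters here because dropping rows would also drop their TARGET information.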

Step 4: Checking for Datatypes
On Application Dataset
There are almost 20 categorical columns, but only a few of them are of object type
Columns holding DAYS-related information have negative values; we need to take their absolute values
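A short sketch of the absolute-value conversion, assuming the affected columns share the DAYS_ prefix. The rows are synthetic, and AGE_YEARS is a hypothetical helper column added only to show why positive day counts are easier to work with.

```python
import pandas as pd

# Synthetic rows; DAYS_* columns are stored as negative offsets from the application date.
app = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_EMPLOYED": [-637, -1188, -225],
    "AMT_CREDIT": [406597.5, 1293502.5, 135000.0],
})

# Convert every DAYS_* column to its absolute value.
days_cols = [c for c in app.columns if c.startswith("DAYS_")]
app[days_cols] = app[days_cols].abs()

# Hypothetical derived column: age in whole years (365-day years assumed).
app["AGE_YEARS"] = app["DAYS_BIRTH"] // 365
```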
Step 5: Binning Continuous Variables
On Application Dataset
The variables that can be binned and are most useful for visual analysis are listed below
Column Name | Bin Labels
AMT_INCOME_TOTAL | <100k, 100k-150k, 150k-200k, 200k-500k, 500k+
AMT_CREDIT | <0.3M, 0.3M-0.5M, 0.5M-1.0M, 1.0M-1.5M, 1.5M-45M
AMT_GOODS_PRICE | <0.3M, 0.3M-0.5M, 0.5M-1.0M, 1.0M-1.5M, 1.5M-45M
DAYS_BIRTH | 20-30 yrs, 31-40 yrs, 41-50 yrs, 51-60 yrs, 60+ yrs
DAYS_EMPLOYED | 0-5 yrs, 6-10 yrs, 11-15 yrs, 16-20 yrs, 21-25 yrs, 25+ yrs
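The income bins above can be produced with pandas.cut, as in this sketch. The labels come from the table; the exact bin edges and the choice of left-closed intervals (so that 100,000 falls into 100k-150k) are assumptions, since the slide lists labels only. The income values are synthetic.

```python
import pandas as pd

# Synthetic incomes covering every bin.
income = pd.Series([67500.0, 100000.0, 135000.0, 180000.0, 450000.0, 1000000.0])

# Assumed edges matching the labels on the slide.
bins = [0, 100_000, 150_000, 200_000, 500_000, float("inf")]
labels = ["<100k", "100k-150k", "150k-200k", "200k-500k", "500k+"]

# right=False makes each interval include its lower edge and exclude its upper edge.
income_bin = pd.cut(income, bins=bins, labels=labels, right=False)
print(income_bin.tolist())
```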
Step 6: Selection of Columns for Analysis
From Application Dataset
A few of the specific variables selected for analysis are:
Column Name | Data Type
NAME_CONTRACT_TYPE | Categorical
AMT_INCOME_TOTAL | Continuous Numerical
AMT_CREDIT | Continuous Numerical
AMT_GOODS_PRICE | Continuous Numerical
NAME_INCOME_TYPE | Categorical
NAME_EDUCATION_TYPE | Categorical
Apart from these, a few more columns are selected which give information on amounts, binned columns, TARGET,
CODE_GENDER and so on
An additional derived column, NUM_OF_INSTALLMENTS, is computed as AMT_CREDIT / AMT_ANNUITY and will
be used to gain some useful insights
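The derived column can be sketched as below. Rounding to a whole number of instalments is an assumption (the slide only gives the ratio), and the amounts are synthetic.

```python
import pandas as pd

# Synthetic credit and annuity amounts.
app = pd.DataFrame({
    "AMT_CREDIT": [240000.0, 450000.0, 135000.0],
    "AMT_ANNUITY": [12000.0, 18000.0, 13500.0],
})

# Approximate number of instalments: total credit divided by the periodic annuity.
# Rounding is an assumption made for readability of the resulting bins.
app["NUM_OF_INSTALLMENTS"] = (app["AMT_CREDIT"] / app["AMT_ANNUITY"]).round()
print(app["NUM_OF_INSTALLMENTS"].tolist())
```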
Step 7: Data Imbalance - Target Variable
From the target variable we have:
Type | Percent | Meaning
Non Defaulters | 92% | Most applicants are considered to have no issue with the payment
Defaulters | 8% | 8% of applicants seem to have issues with loan payment
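The class shares above can be computed from the target column as in this sketch. It assumes the column is named TARGET with 1 marking payment difficulty and 0 otherwise; the 25 synthetic values are chosen only to reproduce a 92/8 split.

```python
import pandas as pd

# Synthetic target: 23 non defaulters (0) and 2 defaulters (1).
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Percentage share of each class.
share = target.value_counts(normalize=True) * 100

# How many non defaulters there are per defaulter.
imbalance_ratio = share[0] / share[1]
print(share)
print(imbalance_ratio)
```

With an imbalance of this size, raw counts in later plots are dominated by Non Defaulters, so comparing percentages within each class is usually more informative.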
Analysis Goals
1] Identify whether there are any patterns of risk among the 92% Non Defaulters
2] Identify who among the 8% Defaulters can be considered for a loan without any risk
3] Use the Previous Application Data to add more insights based on past data
4] Identify patterns in loan applicants
Conclusions drawn after Analysis
Safe
1] Revolving loans with credit outside the 0.5M-1M range and income range 200k-500k
2] Students and Businessman in all income ranges; Pensioners with income 200k+
Risky
1] Consumer loans with income range 100k-500k; higher credit may be defaulted
2] A credit amount around 15% more or less than the application amount may or may not be accepted
Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education or Secondary/Secondary Special, for higher credit
2] Co-op apartment living, living with parents and Municipal apartments with income range of 500k+, for higher credit
Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k
* The following slides contain the detailed analysis from which these conclusions are drawn
Step 8: Top 10 Correlations
Top 10 Correlated Columns

For Complete Application Dataset
Var 1 | Var 2 | ABS_CORRELATION
OBS_60_CNT_SOCIAL_CIRCLE | OBS_30_CNT_SOCIAL_CIRCLE | 0.998490
AMT_GOODS_PRICE | AMT_CREDIT | 0.986968
AMT_GOODS_PRICE | AMT_ANNUITY | 0.775109
AMT_ANNUITY | AMT_CREDIT | 0.770138
DAYS_EMPLOYED | DAYS_BIRTH | 0.623941
DAYS_REGISTRATION | DAYS_BIRTH | 0.331912
DAYS_ID_PUBLISH | DAYS_EMPLOYED | 0.274842
DAYS_ID_PUBLISH | DAYS_BIRTH | 0.272691
DAYS_REGISTRATION | DAYS_EMPLOYED | 0.214573
EXT_SOURCE_2 | REGION_POPULATION_RELATIVE | 0.198924

Columns Chosen for Analysis (Subset of Data for Analysis)
Var 1 | Var 2 | ABS_CORRELATION
AMT_GOODS_PRICE | AMT_CREDIT | 0.986968
AMT_GOODS_PRICE | AMT_ANNUITY | 0.775109
AMT_ANNUITY | AMT_CREDIT | 0.770138
NUM_OF_INSTALLMENTS | AMT_CREDIT | 0.661503
NUM_OF_INSTALLMENTS | AMT_GOODS_PRICE | 0.634587
DAYS_EMPLOYED | DAYS_BIRTH | 0.623941
DAYS_ID_PUBLISH | DAYS_EMPLOYED | 0.274842
DAYS_ID_PUBLISH | DAYS_BIRTH | 0.272691
EXT_SOURCE_2 | REGION_POPULATION_RELATIVE | 0.198924
AMT_ANNUITY | AMT_INCOME_TOTAL | 0.191657
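A table of this shape (Var 1, Var 2, ABS_CORRELATION) can be built as sketched below. The helper name top_abs_correlations is hypothetical and the data is synthetic; the idea is to take the absolute correlation matrix, keep only the upper triangle so each pair appears once, and sort.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep the strict upper triangle: drops self-pairs and mirrored duplicates.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    out = pairs.sort_values(ascending=False).head(n).reset_index()
    out.columns = ["Var 1", "Var 2", "ABS_CORRELATION"]
    return out

# Synthetic data: AMT_GOODS_PRICE is an exact multiple of AMT_CREDIT.
df = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0, 500.0],
    "AMT_GOODS_PRICE": [90.0, 180.0, 270.0, 360.0, 450.0],
    "DAYS_BIRTH": [5.0, 3.0, 4.0, 1.0, 2.0],
})

top = top_abs_correlations(df)
print(top)
```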

Step 9: Univariate Analysis
Application Dataset Univariate Analysis
Column Name | Variable Type | Inferences
NAME_CONTRACT_TYPE | Categorical | Defaulters can be considered for Revolving loans
NUM_OF_INSTALLMENTS_BIN | Categorical | Most Non Defaulters opt for 16-24 instalments
NAME_EDUCATION_TYPE | Categorical | Most applicants have Higher Education or Secondary/Secondary Special and are Non Defaulters
DAYS_EMPLOYED_BIN | Categorical | Most Non Defaulters have work experience of 0-10 yrs and 30+ yrs
AMT_CREDIT | Numerical | The 0.5M to 1M range has the most Defaulters; other ranges are mostly Non Defaulters
AMT_GOODS_PRICE | Numerical | Amounts higher than 0.75M are mostly Non Defaulters
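Inferences of this kind on a categorical column can be checked numerically by computing the share of defaulters within each category, as in this sketch. It assumes TARGET is 1 for defaulters and 0 otherwise; the eight rows are synthetic and do not reflect the real proportions.

```python
import pandas as pd

# Synthetic sample: 4 cash loans (2 defaulted) and 4 revolving loans (1 defaulted).
app = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans"] * 4 + ["Revolving loans"] * 4,
    "TARGET": [1, 0, 0, 1, 0, 0, 0, 1],
})

# Because TARGET is 0/1, its mean per group is the default rate of that group.
default_rate = app.groupby("NAME_CONTRACT_TYPE")["TARGET"].mean() * 100
print(default_rate)
```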

Application Dataset Univariate Analysis Visualisations
Previous Application Dataset Univariate Analysis Categorical Columns
• Chances of refusal for a Consumer loan are much lower than for Cash loans or Revolving loans
• There is almost a 50-50 chance of loan approval when the applicant walks in for a loan
• The count of Cash loans getting cancelled is higher than the count getting refused
• There are high chances of loan approval if the NAME_PRODUCT_TYPE is x-sell
• Repeater applicants have a higher chance of loan refusal than New applicants
• Applicants tend to cancel a loan if the AMT_CREDIT is less than or equal to AMT_APPLICATION, showing
dissatisfaction with the loan approved
• Higher loan amounts have higher chances of loan refusal (in our case >200K)
• New applicants generally don't cancel the loan application
Previous Application Dataset UNIVARIATE ANALYSIS – CONTINUOUS VARIABLE
• Most of the loans that are credited have an AMT_ANNUITY < 10000
• Most approved loans have a term of 10-20 months
• The density curve has a similar shape for both Approved and Refused loans
• Refused loans don't seem to follow a pattern based on CNT_PAYMENT
Step 10.1: Bivariate Analysis

Application Dataset Bivariate Analysis
Column 1 | Column 2 | Type | Inferences
AMT_CREDIT_BIN | AMT_INCOME_TOTAL_BIN | Categorical-Categorical | Most Non Defaulters are in income range 200k-500k and opt for 0.5M to 1.5M credit
NAME_INCOME_TYPE | AMT_INCOME_TOTAL_BIN | Categorical-Categorical | Most Non Defaulters have income type working, commercial associate or pensioner with income range of 100k-500k
NAME_EDUCATION_TYPE | AMT_CREDIT | Categorical-Continuous | Most Non Defaulters with Higher Education or Secondary/Secondary Special prefer higher credit
AMT_ANNUITY | AMT_CREDIT | Continuous-Continuous | With increase in AMT_CREDIT, AMT_ANNUITY increases
* Here we make use of the correlated columns from which we can draw the required inferences
Application Dataset Bivariate Analysis Visualisations
Previous Application Dataset BIVARIATE ANALYSIS – (CATEGORICAL – CATEGORICAL VARIABLE)
• Loans from a Repeater with an application amount > 200K have the highest number of refusals
• Loans in the range 20K-75K have a high acceptance rate when the amount credited is less than or equal to the amount applied
• 75K-200K is the most popular loan amount range among Repeater and Refreshed applicants. New applicants generally opt for 20K-75K
• Unused Offers are clearly prevalent when the amount credited is less than or equal to the amount applied
• The highest number of approved loans are those that have high group interest and an annuity amount less than 7.5K
• The greatest number of cancellations come from Repeater applicants when the client is acquired from Credit and Cash Offices
• Refusals are higher for low/normal group interest with an annuity amount greater than 200K
Previous Application Dataset BIVARIATE ANALYSIS – (CATEGORICAL – CONTINUOUS VARIABLE)
• The mean amount credited for an approved loan is lower for both x-sell and walk-in
• Cancelled applications that have a higher amount generally have a Low Action group interest rate
• Applicants walking in for a low amount to be credited generally tend to get their loan approved
• Generally, the mean value of the application amount for refused loans looks nearly the same as the 1st quartile value of
Cancelled loans for all name yield groups
• The mean value of the terms of payment for Cash is the same for Refused and Cancelled loans. This value is on the lower side
for Approved loans
• Applicants that are Repeaters or Refreshed generally tend to take less time to apply for the next loan when compared to
New applicants
• The terms of payment for POS are observed to be low for all application statuses
• Despite getting the loan approved, a New client can take anything between 1000-1500 days to apply for another loan
BIVARIATE ANALYSIS – (CONTINUOUS – CONTINUOUS VARIABLE)
• CNT_PAYMENT of more than 80 has a higher refusal rate than approval rate
• AMT_CREDIT and AMT_APPLICATION have a positive linear relation
• The approval rate is higher for CNT_PAYMENT in the range of 0-20 and a lower credit amount
• For a higher AMT_APPLICATION, a lower AMT_CREDIT is approved more often
• AMT_CREDIT and AMT_ANNUITY have a positive linear relation
• AMT_CREDIT and AMT_GOODS_PRICE have a positive linear relation
• But we can infer that even though AMT_CREDIT is high, the AMT_ANNUITY opted for still stays in a range of 0-200k
• Higher AMT_GOODS_PRICE with higher AMT_CREDIT has fewer refusals
• Cancelled loans fall into the AMT_ANNUITY range of 0-100k irrespective of AMT_CREDIT
Step 10.2: Segmented Analysis
Column 1 | Column 2 | Inferences
AMT_INCOME_TOTAL_BIN | CODE_GENDER | Most Non Defaulter females earn in the range of 100k to 500k, whereas males mostly earn in the range of 200k-500k
AMT_CREDIT_BIN | CODE_GENDER | Most Non Defaulter females opt for AMT_CREDIT of <0.3M or in the range of 0.5M-1.0M
NAME_EDUCATION_TYPE | CODE_GENDER | Most Non Defaulter females have either Higher Education or Secondary/Secondary Special, whereas males mostly have Secondary/Secondary Special
NUM_OF_INSTALLMENTS_BIN | CODE_GENDER | Most Non Defaulter females and males opt for the instalment bin of 16-24
Segmented Analysis Visualisations
Step 10.3: Multivariate Analysis
Column 1 | Column 2 | Hue | Inferences
AMT_INCOME_TOTAL_BIN | AMT_CREDIT | NAME_FAMILY_STATUS | Widows in the 500k+ income range are all Non Defaulters
AMT_INCOME_TOTAL_BIN | AMT_CREDIT | NAME_EDUCATION_TYPE | Applicants with an Academic Degree, other than those in the income category of 200k-500k, are all Non Defaulters
AMT_INCOME_TOTAL_BIN | AMT_CREDIT | NAME_HOUSING_TYPE | Most of the Co-op Apartment applicants are Non Defaulters; applicants with income 500k+ living with parents or in Municipal apartments are mostly Non Defaulters
NAME_INCOME_TYPE | AMT_CREDIT | AMT_INCOME_TOTAL_BIN | Students and Businessman can be considered without any risk; Unemployed with income greater than 100k and Pensioners with income 200k+ can be considered; most in the Maternity Leave category are Defaulters
Multivariate Analysis Visualisations
Step 11: Merging Previous Loan Data
• Previous Application Data has 1,670,214 entries and 37 columns
• It has 14 columns with more than 20% missing values
• The AMT_CREDIT column has 1 missing value, which can be filled with the median of the column
• PRODUCT_COMBINATION has 0.02% missing values, which can be filled with the mode value
• Application Dataset and Previous Application Dataset can be merged on "SK_ID_CURR"
• After the merge we need to select the columns required for further analysis
• The NAME_CONTRACT_STATUS variable is used for drawing inferences; the chart below shows the data imbalance
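The merge described above can be sketched with pandas as follows. The frames are tiny synthetic stand-ins, and the inner join is an assumption (the slide only names the key). Columns present in both frames receive the default _x and _y suffixes, which is presumably where a name such as NAME_CONTRACT_TYPE_y comes from.

```python
import pandas as pd

# Synthetic current applications.
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "TARGET": [0, 1, 0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans"],
})

# Synthetic previous applications; a client can have several, or none.
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 4],
    "NAME_CONTRACT_TYPE": ["Consumer loans", "Cash loans", "Consumer loans", "Cash loans"],
    "NAME_CONTRACT_STATUS": ["Approved", "Canceled", "Approved", "Refused"],
})

# One output row per previous application of a client present in both frames.
merged = app.merge(prev, on="SK_ID_CURR", how="inner")

# Previous contract status counts split by current TARGET.
status_by_target = pd.crosstab(merged["TARGET"], merged["NAME_CONTRACT_STATUS"])
print(merged.columns.tolist())
print(status_by_target)
```

Because the merge is one-to-many, each applicant's attributes are repeated on every one of their previous applications, so counts after the merge are counts of previous applications rather than of applicants.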
Step 12: Visualisation of Relations
Correlation of Columns chosen for Analysis on merged Data frame
Analysis on Merged Data Frame
Column | Inferences
NAME_CONTRACT_STATUS | 1] Many applicants considered Non Defaulters had their application cancelled previously
NAME_CONTRACT_STATUS | 2] Many applicants considered Defaulters had their application approved previously
NAME_CONTRACT_STATUS | 3] Non Defaulters who were refused previously need to be examined
NAME_CASH_LOAN_PURPOSE | Most loans have the purpose of Repairs, Others, Urgent Needs, Building a House or an Annex, Everyday Expenses, Medicines
Analysis on NAME_CONTRACT_STATUS
Type | Column | Inferences
ND-Canceled | NAME_CLIENT_TYPE | Most of them are Repeaters
ND-Canceled | NAME_EDUCATION_TYPE v/s DAYS_EMPLOYED_BIN | Most applicants have Secondary/Secondary Special education with 0-5 and 25+ yrs experience
ND-Canceled | AMT_CREDIT_BIN v/s AMT_INCOME_TOTAL_BIN | Most of them have an Academic Degree with income range 500k+
D-Approved | NAME_CONTRACT_TYPE_y | Most are Consumer Loans
D-Approved | AMT_INCOME_TOTAL_BIN v/s AMT_CREDIT_BIN | Most belong to credit bin 0.5M-1.0M and have an income range of 100k-500k
ND-Refused | Credit_Application_diff | We observe that the data is normalized and most of the refusals are due to allotting either more or less credit than the application amount
* Here the data originally part of the application dataset (e.g. Education_type, Income_Range, …)
has been assigned to each previous application entry with the same SK_ID_CURR
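The slides do not define how the credit-versus-application difference is computed. One plausible reading, sketched here purely as an assumption, is the percentage difference between AMT_CREDIT and AMT_APPLICATION; the column names CREDIT_APPLICATION_DIFF_PCT and WITHIN_15_PCT are hypothetical and the amounts are synthetic.

```python
import pandas as pd

# Synthetic previous applications.
prev = pd.DataFrame({
    "AMT_APPLICATION": [100000.0, 100000.0, 200000.0, 50000.0],
    "AMT_CREDIT": [110000.0, 130000.0, 160000.0, 50000.0],
})

# Percentage by which the credited amount differs from the amount applied for.
# Rows with AMT_APPLICATION == 0 would need separate handling (division by zero).
diff_pct = (prev["AMT_CREDIT"] - prev["AMT_APPLICATION"]) / prev["AMT_APPLICATION"] * 100
prev["CREDIT_APPLICATION_DIFF_PCT"] = diff_pct

# Flag applications whose credited amount is within 15% of the applied amount.
prev["WITHIN_15_PCT"] = diff_pct.abs() <= 15
print(prev)
```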
Visualisation of Analysis on NAME_CONTRACT_STATUS
A Few Other Analyses
Column | Inferences
NAME_CASH_LOAN_PURPOSE v/s TARGET | Most Non Defaulters have the purpose Repairs, Others, Urgent Needs, Building a House or an Annex, Everyday Expenses, Medicine
NAME_CASH_LOAN_PURPOSE v/s NAME_CONTRACT_STATUS | Most loans with the purpose Other, Repairs, Urgent Needs, Building a House or an Annex have been refused rather than accepted
NAME_GOODS_CATEGORY v/s NAME_CONTRACT_STATUS | There are no cancelled loans with the purpose Other, Vehicles, Gardening, Clothing and Accessories, Office Appliances, Medicines
* Here the data originally part of the application dataset (e.g. Education_type, Income_Range, …)
has been assigned to each previous application entry with the same SK_ID_CURR
Step 13: Drawing Inferences
Overall Inferences by Dataset
Application Dataset
1] Income types working, commercial associate or pensioner with income range 100k-500k are Non Defaulters
2] Education types Higher Education and Secondary/Secondary Special can be considered for higher credit
3] Defaulters opting for a Revolving loan with credit outside the 0.5M to 1M range and with income range
200k-500k can be considered safe
4] Education type Academic Degree with income range of 200k-500k opting for higher credit can be considered
5] Income types Student and Businessman, and housing types Co-op apartment, living with parents and
Municipal apartment with income range of 500k+, can safely be considered Non Defaulters
6] Maternity Leave applicants have income <100k and are mostly Defaulters
Merged Dataset
1] Credit bin 0.5M-1.0M with income range of 100k-200k needs to be ignored
2] The credit amount should be only 15% more or less than the application amount, else it gets refused
3] Purposes Vehicles, Gardening, Clothing and Accessories, Office Appliances, Medicines can be considered safe
4] Consumer loans in credit bin 0.5M-1.0M with income range of 100k-500k tend to fall on the
Defaulters side
Step 14: Conclusion
Safe
1] Revolving loans with credit outside the 0.5M-1M range and income range 200k-500k
2] Students and Businessman in all income ranges; Pensioners with income 200k+
Risky
1] Consumer loans with income range 100k-500k; higher credit may be defaulted
2] A credit amount around 15% more or less than the application amount may or may not be accepted
Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education or Secondary/Secondary Special, for higher credit
2] Co-op apartment living, living with parents and Municipal apartments with income range of 500k+, for higher credit
Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k
Thank you.
