Credit EDA Case Study: Upgrad Assignment
Dataset
Name Description Contains
2] There are 65 columns of type float64, 41 columns of type int64, 16 columns of type object
2] There are 2 columns of type float64, 6 columns of type int64, 15 columns of type object
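A minimal sketch of how per-dtype column counts like these can be obtained with pandas. The small frame below is made-up toy data standing in for the real dataset, not the assignment's actual file.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
df = pd.DataFrame({
    "AMT_CREDIT": [406597.5, 1293502.5],
    "CNT_CHILDREN": [0, 1],
    "CODE_GENDER": ["M", "F"],
})

# Count how many columns exist for each dtype.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)
```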
A total of 57 columns have more than 13% missing values and can be dropped
A few columns have only a small share of missing values, which can be imputed
Cardinal
CNT_FAM_MEMBERS 0.0007% Mode
Categorical Data
EXT_SOURCE_2 0.22% Mean No Outliers
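The drop-and-impute step can be sketched as below, assuming a pandas workflow. The frame is made-up toy data; OWN_CAR_AGE is only a hypothetical example of a sparse column. The 13% threshold and the Mode/Mean choices follow the slide.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one sparse column and two columns with few gaps.
df = pd.DataFrame({
    "CNT_FAM_MEMBERS": [2.0, 1.0, 2.0, np.nan, 2.0, 3.0, 2.0, 4.0],
    "EXT_SOURCE_2": [0.26, 0.62, np.nan, 0.65, 0.32, 0.35, 0.72, 0.71],
    "OWN_CAR_AGE": [np.nan, np.nan, 26.0, np.nan, np.nan, np.nan, 17.0, 8.0],
})

# Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# Drop columns above the 13% threshold.
to_drop = missing_pct[missing_pct > 13].index
df = df.drop(columns=to_drop)

# Impute the remaining gaps: mode for the count column, mean for the score.
df["CNT_FAM_MEMBERS"] = df["CNT_FAM_MEMBERS"].fillna(df["CNT_FAM_MEMBERS"].mode()[0])
df["EXT_SOURCE_2"] = df["EXT_SOURCE_2"].fillna(df["EXT_SOURCE_2"].mean())
```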
Step 3: Identify Outliers
On Application Dataset using IQR Values
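One common way to flag outliers from IQR values is the 1.5 x IQR fence. The slides do not state which multiplier was used, so the 1.5 factor here is an assumption, and the income values are made up.

```python
import pandas as pd

# Toy income values (made up); the last one is deliberately extreme.
income = pd.Series(
    [90000, 112500, 135000, 157500, 180000, 202500, 1170000],
    name="AMT_INCOME_TOTAL",
)

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (assumed multiplier).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]
print(outliers)
```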
On Application Dataset
There are almost 20 categorical columns, but only a few of them are of object type
Columns holding DAYS-related information contain negative values, so we need to take their absolute values
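A short sketch of the absolute-value fix, assuming the affected columns can be picked out by the DAYS prefix. The values are made up.

```python
import pandas as pd

# Toy rows (made up) with negative day offsets.
df = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765],
    "DAYS_EMPLOYED": [-637, -1188],
    "AMT_CREDIT": [406597.5, 1293502.5],
})

# Select every column whose name starts with DAYS and flip the sign.
days_cols = [c for c in df.columns if c.startswith("DAYS")]
df[days_cols] = df[days_cols].abs()
```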
Step 5: Binning Continuous Variables
On Application Dataset
Variables that can be binned and are most useful for visual analysis are listed below
DAYS_BIRTH 20-30 yrs, 31-40 yrs, 41-50 yrs, 51-60 yrs, 60+ yrs
DAYS_EMPLOYED 0-5 yrs, 6-10 yrs, 11-15 yrs, 16-20 yrs, 21-25 yrs, 25+ yrs
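The age bins above can be produced with `pd.cut`. This is a sketch only: the exact bin edges, the 365-day year, and the name `age_bin` are assumptions, and the day counts are made up (already converted to positive values).

```python
import pandas as pd

# Toy DAYS_BIRTH values (made up), already made positive.
days_birth = pd.Series([9461, 16765, 19046, 23000, 12000], name="DAYS_BIRTH")

# Convert days to years (365-day year is an approximation).
age_years = days_birth / 365

# Right-closed bins: (20, 30], (30, 40], ..., (60, inf).
edges = [20, 30, 40, 50, 60, float("inf")]
labels = ["20-30 yrs", "31-40 yrs", "41-50 yrs", "51-60 yrs", "60+ yrs"]
age_bin = pd.cut(age_years, bins=edges, labels=labels)
print(age_bin.value_counts())
```

DAYS_EMPLOYED can be binned the same way with its own edges and labels.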
Step 6: Selection of Columns for Analysis
From Application Dataset
Apart from these, a few more columns are selected that give information on Amounts, Binned Columns, Target, Code_Gender and so on
An additional column, NUM_OF_INSTALLMENTS, is derived from the CREDIT/ANNUITY value and will be used to gain some useful insights
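A sketch of the derived column, assuming it is the ratio AMT_CREDIT / AMT_ANNUITY. Rounding to a whole number of installments is an assumption, and the amounts are made up.

```python
import pandas as pd

# Toy amounts (made up).
df = pd.DataFrame({
    "AMT_CREDIT": [400000.0, 1200000.0, 135000.0],
    "AMT_ANNUITY": [25000.0, 40000.0, 6750.0],
})

# Approximate number of installments: total credit divided by one annuity.
df["NUM_OF_INSTALLMENTS"] = (df["AMT_CREDIT"] / df["AMT_ANNUITY"]).round()
```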
Step 7: Data Imbalance - Target Variable
2] Identify, among the 8% Defaulters, those who can be considered for a loan without any risk
3] Use the Previous Application data to add more insights based on past data
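The imbalance in the target can be checked as sketched below. The series is made-up toy data built to give an 8% share, and treating TARGET = 1 as a Defaulter is an assumption about the encoding.

```python
import pandas as pd

# Toy target (made up): 23 non-defaulters and 2 defaulters.
target = pd.Series([0] * 23 + [1] * 2, name="TARGET")

# Share of each class in percent.
share = target.value_counts(normalize=True) * 100

# Ratio of non-defaulters to defaulters.
ratio = (target == 0).sum() / (target == 1).sum()
print(share)
print(ratio)
```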
Safe
1] Revolving Loan with Credit other than 0.5M to 1M, income range 200k-500k
2] Students and Businessman in all income ranges, Pensioners with income 200k+
Risky
1] Consumer Loans, income range 100k-500k, higher credit may be defaulted
2] Credit amount around 15% more or less than the application amount may or may not be accepted
Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education, Secondary/Secondary Special for Higher Credit
2] Co-op Apartment living, living with parents and Municipal apartments with income range of 500k+, Higher Credit
Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k
* Following slides have the detailed analysis performed from which these conclusions are attained
Step 8: Top 10 Correlations
For Complete Application Dataset
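One way to list the strongest pairwise correlations is sketched below. The slides do not show how the top 10 was computed, so this approach is an assumption, and the frame is made-up toy data with only four columns (hence six pairs).

```python
import numpy as np
import pandas as pd

# Toy numeric columns (made up).
df = pd.DataFrame({
    "AMT_CREDIT": [100000, 200000, 300000, 400000, 500000],
    "AMT_ANNUITY": [10000, 15000, 28000, 30000, 50000],
    "AMT_GOODS_PRICE": [90000, 180000, 270000, 360000, 450000],
    "DAYS_BIRTH": [9000, 15000, 11000, 20000, 13000],
})

# Absolute correlation matrix.
corr = df.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top10 = pairs.head(10)
print(top10)
```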
Columns chosen for Analysis
DAYS_EMPLOYED_BIN Categorical Most Non Defaulters have work experience of 0-10 yrs or 30+ yrs
AMT_CREDIT Numerical The 0.5M to 1M range has the most Defaulters; other ranges are mostly Non Defaulters
• Most of the loans that are credited have an AMT_ANNUITY < 10000
• The density curve has a similar shape for both Approved and Refused loans
• Most approved loans have a term period of 10-20 months
• Refused loans don't seem to follow a pattern based on CNT_PAYMENT
AMT_CREDIT_BIN AMT_INCOME_TOTAL_BIN Categorical-Categorical Most Non Defaulters are in the income range 200k-500k and opt for 0.5M to 1.5M credit
AMT_ANNUITY AMT_CREDIT Continuous-Continuous With an increase in AMT_CREDIT, AMT_ANNUITY increases
*Here we make use of the correlated Columns from which we can draw the inferences required
Application Dataset Bivariate Analysis Visualisations
Previous Application Dataset BIVARIATE ANALYSIS – (CATEGORICAL – CATEGORICAL VARIABLE)
• Loans from a Repeater having an application amount > 200K have the highest number of refusals
• Loans in the range 20K – 75K have a high acceptance rate when the amount credited is less than or equal to the amount applied
• 75K – 200K is the most popular loan amount among Repeater and Refreshed applicants. New applicants generally opt for 20K-75K
• Unused Offers are clearly prevalent when the amount credited is less than or equal to the amount applied
• The greatest number of cancellations come from a Repeater applicant when the client is acquired from Credit and Cash Offices
• The highest approved loans are those that have high group interest and have an annuity amount less than 7.5K
• Mean amount credited for an approved loan is less for both x-sell and walk-in
• Cancelled applications that have a higher amount generally have a Low Action group interest rate
• Applicants walking in for a low amount to be credited generally tend to get their loan approved
• Generally, the mean value of the application amount for refused loans looks nearly the same as the 1st Quartile value of Cancelled loans for all Name yield groups
• Mean value of the terms of payment for Cash is the same for Refused and Cancelled loans. This value is on the lower side for Approved loans
• Applicants that are Repeating or have Refreshed generally tend to take less time to apply for the next loan when compared to New applicants
• The terms of payment for POS are observed to be low for all application statuses
• Despite getting the loan approved, a New client can take anything between 1000-1500 days to apply for another loan
BIVARIATE ANALYSIS – (CONTINUOUS – CONTINUOUS VARIABLE)
• CNT_PAYMENT of more than 80 has a higher refusal rate than approval rate
• The approval rate is higher for CNT_PAYMENT in the range of 0-20 and a lesser credit amount
• But we can infer that even though AMT_CREDIT is high, the AMT_ANNUITY is still opted in a range of 0-200k
• AMT_CREDIT and AMT_APPLICATION have a positive linear relation
• For a higher AMT_APPLICATION, a lesser AMT_CREDIT is approved more often
• Higher AMT_GOODS_PRICE and higher AMT_CREDIT have fewer refusals
AMT_INCOME_TOTAL_BIN CODE_GENDER Most Non Defaulter Females are earning in the range of 100k to 500k, whereas Males are mostly earning in the range of 200k-500k
AMT_CREDIT_BIN CODE_GENDER Most Non Defaulter Females opt for AMT_CREDIT of < 0.3M or in the range of 0.5M-1.0M
NUM_OF_INSTALLMENTS_BIN CODE_GENDER Most Non Defaulter Females and Males opt for the Instalment bin range of 16-24
Segmented Analysis Visualisations
Step 10.3: Multivariate Analysis
Column 1 Column 2 Hue Inferences
• AMT_CREDIT Column has 1 Missing value which can be filled with Median of the Column
• PRODUCT_COMBINATION has 0.02% Missing values which can be filled with mode value
• Post merger we need to select the columns required for further analysis
• Using the NAME_CONTRACT_STATUS variable for drawing inferences, below chart shows the Data imbalance
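The points above can be sketched as follows. Both frames are made-up toy data. Merging on SK_ID_CURR matches the key named later in the deck, but the inner join and the variable names `app`, `prev` and `merged` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (made up) for the application and previous application data.
app = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})
prev = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2, 2],
    "AMT_CREDIT": [50000.0, np.nan, 150000.0, 100000.0],
    "PRODUCT_COMBINATION": ["Cash", "Cash", None, "POS mobile with interest"],
    "NAME_CONTRACT_STATUS": ["Approved", "Refused", "Approved", "Canceled"],
})

# Fill the numeric gap with the median and the categorical gap with the mode.
prev["AMT_CREDIT"] = prev["AMT_CREDIT"].fillna(prev["AMT_CREDIT"].median())
prev["PRODUCT_COMBINATION"] = prev["PRODUCT_COMBINATION"].fillna(
    prev["PRODUCT_COMBINATION"].mode()[0]
)

# Attach each previous application to its current application (inner join assumed).
merged = app.merge(prev, on="SK_ID_CURR", how="inner")

# Share of each contract status in percent, to show the imbalance.
status_share = merged["NAME_CONTRACT_STATUS"].value_counts(normalize=True) * 100
print(status_share)
```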
Step 12: Visualisation of Relations
Correlation of Columns chosen for Analysis on merged Data frame
Analysis on Merged Data Frame
Column Inferences
NAME_CONTRACT_STATUS
1] Many applicants considered Non Defaulters had their application Cancelled previously
2] Many applicants considered Defaulters had their application Approved previously
NAME_CASH_LOAN_PURPOSE Repairs, Others, Urgent Needs, Building a House or an Annex, Everyday Expenses, Medicines
Analysis on NAME_CONTRACT_STATUS
Type Column Inferences
ND-Canceled NAME_CLIENT_TYPE Most of them are Repeaters
ND-Canceled NAME_EDUCATION_TYPE v/s DAYS_EMPLOYED_BIN Most applicants are Secondary/Secondary Special with 0-5 and 25+ yrs experience
ND-Canceled AMT_CREDIT_BIN v/s AMT_INCOME_TOTAL_BIN Most of them are Academic Degree with income range 500k+
D-Approved AMT_INCOME_TOTAL_BIN v/s AMT_CREDIT_BIN Most belong to Credit bin 0.5M-1.0M and have an income range of 100k-500k
ND-Refused Credit_Application_diff We observe that the data is normalized and most refusals are due to allotting either more or less credit than the application
* Here the Data originally part of application dataset (Eg: Education_type, Income_Range, …)
has been assigned to each of the previous applicant entry with same SK_ID_CURR
Visualisation of Analysis on NAME_CONTRACT_STATUS
Few other Analysis
Column Inferences
NAME_CASH_LOAN_PURPOSE v/s TARGET Most Non Defaulters have the purpose Repairs, Others, Urgent Needs, Building a House or an Annex, Everyday Expenses, Medicine
NAME_CASH_LOAN_PURPOSE v/s NAME_CONTRACT_STATUS Most loans with the purpose Other, Repairs, Urgent Needs, Building a House or an Annex have been refused more often than accepted
NAME_GOODS_CATEGORY v/s NAME_CONTRACT_STATUS There are no cancelled loans with the purpose Other, Vehicles, Gardening, Clothing and Accessories, Office Appliances, Medicines
Step 13: Drawing Inferences
Dataset Overall Inferences
Application Dataset
1] Income Type working, commercial associate or pensioner with income range 100k-500k are Non Defaulters
2] Education type Higher Education, Secondary/Secondary Special can be considered for higher Credit
3] Defaulters opting for Revolving Loan with Credit other than the 0.5M to 1M range and of income range 200k-500k can be considered safe
4] Education Type Academic Degree with income range of 200k-500k opting for higher credit can be considered
5] Income type Students and Businessman, housing type Co-op Apartment living, living with parents and Municipal apartments with income range of 500k+ can safely be considered as Non Defaulters
6] Applicants on Maternity Leave have income <100k and are mostly Defaulters
Merged Dataset
1] Credit bin 0.5M-1.0M with an income range of 100k-200k should be ignored
2] The credit amount should be only 15% more or less than the application amount, else it gets refused
3] ... can be considered safe
4] Consumer Loans with Credit bin 0.5M-1.0M and an income range of 100k-500k tend to fall on the Defaulters side
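The 15% observation for the merged dataset can be checked with a relative difference between credited and applied amounts. The slides do not show how the difference column was defined, so the formula and the names CREDIT_APPLICATION_DIFF_PCT and WITHIN_15_PCT are hypothetical. The amounts are made up.

```python
import pandas as pd

# Toy previous applications (made up).
prev = pd.DataFrame({
    "AMT_APPLICATION": [100000.0, 200000.0, 50000.0],
    "AMT_CREDIT": [110000.0, 260000.0, 40000.0],
})

# Relative difference between credited and applied amount, in percent.
prev["CREDIT_APPLICATION_DIFF_PCT"] = (
    (prev["AMT_CREDIT"] - prev["AMT_APPLICATION"]) / prev["AMT_APPLICATION"] * 100
)

# Flag rows where the credit stays within 15% of the application amount.
prev["WITHIN_15_PCT"] = prev["CREDIT_APPLICATION_DIFF_PCT"].abs() <= 15
print(prev)
```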
Step 14: Conclusion
Safe
1] Revolving Loan with Credit other than 0.5M to 1M, income range 200k-500k
2] Students and Businessman in all income ranges, Pensioners with income 200k+
Risky
1] Consumer Loans, income range 100k-500k, higher credit may be defaulted
2] Credit amount around 15% more or less than the application amount may or may not be accepted
Beneficial
1] Academic Degree with income range of 200k-500k, and Higher Education, Secondary/Secondary Special for Higher Credit
2] Co-op Apartment living, living with parents and Municipal apartments with income range of 500k+, Higher Credit
Unsafe
1] Income type Maternity Leave with income <100k
2] Credit bin 0.5M-1.0M and income range of 100k-200k
Thank you.