
Personal Loan Risk Model

Answer Key Template


DATA QUALITY (DQ) REPORT

FS Training - Acquisition 2
Data Quality Report
Table
Field Name | Data Type | No Of Records | Unique Count | Data Available | Available Percent | Missing | Missing Percent | Minimum | Maximum | Mean | Comments
Age | NUM | 150,000 | 86 | 150,000 | 100% | 0 | 0% | 0 | 109 | 52.30 | There is one single row with '0' age. 109 is too high
Gender | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Region | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Rented_OwnHouse | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Occupation | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Education | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
NumberOfTime30-59DaysPastDueNotWorse | NUM | 150,000 | 16 | 150,000 | 100% | 0 | 0% | 0 | 13 | 0.25 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTime60-89DaysPastDueNotWorse | NUM | 150,000 | 13 | 150,000 | 100% | 0 | 0% | 0 | 11 | 0.06 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTimes90DaysLate | NUM | 150,000 | 19 | 150,000 | 100% | 0 | 0% | 0 | 17 | 0.09 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfOpenCreditLinesAndLoans | NUM | 150,000 | 58 | 150,000 | 100% | 0 | 0% | 0 | 58 | 8.45 | Max Too High
NumberRealEstateLoansOrLines | NUM | 150,000 | 28 | 150,000 | 100% | 0 | 0% | 0 | 54 | 1.02 | Max Too High
NumberOfDependents | NUM | 150,000 | 14 | 146,076 | 97% | 3,924 | 3% | 0 | 20 | 0.76 | The value NA=Missing values
RevolvingUtilizationOfUnsecuredLines | NUM | 150,000 | 125,728 | 150,000 | 100% | 0 | 0% | 0 | 50,708 | 6.05 | Should not be more than 1
DebtRatio | NUM | 150,000 | 114,194 | 150,000 | 100% | 0 | 0% | 0 | 329,664 | 353.01 | Should not be more than 1
MonthlyIncome | NUM | 150,000 | 13,595 | 120,269 | 80% | 29,731 | 20% | 0 | 3,008,750 | 6670.22 | The value NA=Missing values
Good_Bad | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Add the 1%, 5%, 10%, 25%, 50%, 75%, 90%, 95%, and 99% percentile values for numeric
variables to the above table. Also add the % of zeros.
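The extra columns requested above could be computed along these lines. This is a minimal sketch using only the Python standard library; the function names, the linear-interpolation percentile rule, the choice to measure % of zeros against available (non-missing) values, and the toy data are all assumptions, not part of the template.

```python
import math

def percentile(sorted_vals, p):
    """Percentile by linear interpolation between closest ranks (p in 0..100)."""
    if not sorted_vals:
        raise ValueError("no data")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def dq_profile(values, pcts=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Percentiles and % of zeros for one numeric field; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    profile = {f"p{p}": percentile(present, p) for p in pcts}
    # Assumption: % of zeros is taken over available values, not over all records.
    profile["pct_zero"] = 100.0 * sum(1 for v in present if v == 0) / len(present)
    return profile

# Hypothetical toy values standing in for a column such as DebtRatio.
sample = [0, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, None, 2.0]
print(dq_profile(sample))
```

Other percentile conventions (nearest rank, for example) give slightly different values on small samples, so the rule used should be stated in the report.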

UNIVARIATES

Data Quality Report
Treatment
• Missing and default values should be coded as NA and
included in the analysis
• This data has outliers. Keep them as is, since a decision tree will
handle them seamlessly.
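The NA treatment above can be sketched as follows. The codes 96 and 98 come from the DQ table comments; the helper names and the toy column are hypothetical, and the 0/1 flag is just one assumed way of keeping NA records in the analysis as their own group.

```python
# Default codes noted in the DQ table comments: 96='Others', 98='Refused to Say'.
DEFAULT_CODES = {96, 98}

def recode_defaults(values, default_codes=DEFAULT_CODES):
    """Return a copy of the column with default codes replaced by None (NA)."""
    return [None if v in default_codes else v for v in values]

def na_flag(values):
    """1 where the value is NA, else 0, so NA can be analysed as its own group."""
    return [1 if v is None else 0 for v in values]

past_due = [0, 1, 98, 0, 96, 2]   # hypothetical toy column
cleaned = recode_defaults(past_due)
flags = na_flag(cleaned)
print(cleaned)
print(flags)
```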

Univariates – Full File
To be Done for each variable – Age
SAMPLE OUTPUT

Age | Bad Rate | # Obs | Bads
<=22 | 7.9% | 618 | 49
23 to 28 | 12.4% | 6501 | 807
29 to 36 | 10.6% | 16746 | 1772
37 to 43 | 9.0% | 20644 | 1865
44 to 52 | 7.9% | 32861 | 2584
53 to 55 | 6.9% | 10625 | 728
56 to 57 | 5.6% | 6964 | 390
58 to 62 | 4.8% | 17071 | 819
63 to 67 | 3.3% | 14368 | 481
>67 | 2.2% | 23602 | 531
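A table like the one above can be produced once the band edges are known. The sketch below hard-codes the age bands from the sample output and assumes the Good_Bad field has already been converted to a 1 = bad, 0 = good flag; the function name and the toy records are hypothetical.

```python
from bisect import bisect_left

# Upper edges of the age bands in the sample output; anything above 67 is the last band.
AGE_EDGES = [22, 28, 36, 43, 52, 55, 57, 62, 67]
AGE_LABELS = ["<=22", "23 to 28", "29 to 36", "37 to 43", "44 to 52",
              "53 to 55", "56 to 57", "58 to 62", "63 to 67", ">67"]

def bad_rate_table(ages, bad_flags, edges=AGE_EDGES, labels=AGE_LABELS):
    """Return rows of (band, bad rate, # obs, bads); bad_flags are 1 = bad, 0 = good."""
    obs = [0] * len(labels)
    bads = [0] * len(labels)
    for age, bad in zip(ages, bad_flags):
        i = bisect_left(edges, age)   # index of the first edge >= age
        obs[i] += 1
        bads[i] += bad
    # Bands with no observations get a bad rate of None rather than a division error.
    return [(lab, b / n if n else None, n, b) for lab, n, b in zip(labels, obs, bads)]

# Hypothetical toy records.
toy_ages = [21, 22, 25, 25, 70, 68]
toy_bads = [1, 0, 1, 0, 0, 0]
for row in bad_rate_table(toy_ages, toy_bads):
    print(row)
```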

Univariates
To be Done for each variable – Income
Income | Bad Rate | # Obs
<=5320 | 8.62% | 59323
>5320 to <=6643 | 6.66% | 16128
>6643 | 4.90% | 44818
Missing | 5.53% | 29731

[Chart: Bad Rate by Income band (<=5320, >5320 to <=6643, >6643, Missing), y-axis 0.00% to 10.00%]
Note: Forced split for <=33 category.

Income rank orders: lower bad rates for higher income levels.

Univariates – Full File
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT

Tests conducted at 5% level of significance


R Software – Decision tree output

Univariates
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT
Education | No. Of Bads | Total Accounts | Bad %
Matric | 1463 | 11207 | 13%
Graduate | 1607 | 27917 | 6%
Post-Grad | 1704 | 26026 | 7%
PhD | 497 | 4376 | 11%
Professional | 1740 | 35623 | 5%
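One way to decide whether two categories may be grouped is to test whether their bad rates differ at the 5% level. The sketch below uses a Pearson chi-square test on a 2x2 table; this is an assumed choice of test, not necessarily the one the R decision tree applies, and the function name is hypothetical. The counts are taken from the Education table.

```python
# 5% critical value of the chi-square distribution with 1 degree of freedom.
CRITICAL_5PCT_DF1 = 3.841

def chi_square_2x2(bads_a, total_a, bads_b, total_b):
    """Pearson chi-square statistic for bads vs goods across two categories."""
    a, b = bads_a, total_a - bads_a      # category A: bads, goods
    c, d = bads_b, total_b - bads_b      # category B: bads, goods
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Counts from the Education table: Matric vs PhD.
stat = chi_square_2x2(1463, 11207, 497, 4376)
keep_separate = stat > CRITICAL_5PCT_DF1   # True means the bad rates differ at the 5% level
print(round(stat, 2), keep_separate)
```

If the statistic falls below the critical value, the two categories are candidates for merging into one group.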

Bivariate Risk Segmentation
Age and Income
AGE and INCOME ONLY SAMPLE OUTPUT

Age/Income Group | Y1-Y2 | Y3-Y4 | Y5-Y6 | Y7-Y8
X1-X2 | | | |
X3-X4 | | | |
X5-X6 | | | |
X7-X8 | | | |

• Max of 5 groups each
• Populate each cell with
– Sample Size
– Bad Rate
– No. of Bads

Segment in such a way that lowest age and lowest income has highest bad rate. Highest
age and highest income has lowest bad rate.
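The grid above could be populated as sketched below. The band edges in the usage lines are hypothetical placeholders (the income edges reuse the cut points from the income univariate), records with missing income are assumed to have been handled beforehand, and the function names are illustrative only.

```python
from collections import defaultdict

def band(value, edges):
    """Index of the band a value falls in, given ascending upper edges."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)

def bivariate_grid(rows, age_edges, income_edges):
    """rows: iterable of (age, income, bad_flag). Returns {(age_band, income_band): cell}."""
    cells = defaultdict(lambda: {"n": 0, "bads": 0})
    for age, income, bad in rows:
        key = (band(age, age_edges), band(income, income_edges))
        cells[key]["n"] += 1          # sample size
        cells[key]["bads"] += bad     # no. of bads
    for cell in cells.values():
        cell["bad_rate"] = cell["bads"] / cell["n"]
    return dict(cells)

# Hypothetical edges and toy records: (age, income, bad flag).
toy_rows = [(25, 3000, 1), (30, 4000, 0), (70, 9000, 0), (65, 8000, 0)]
grid = bivariate_grid(toy_rows, age_edges=[36, 52, 62], income_edges=[5320, 6643])
print(grid)
```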

Univariates – Decile Binning
(In case Decision Tree results are not coming out)
Debt Ratio - Decile Binning
Bins | Bad Rate | N | Bads
[0,0.0309] | 5.4% | 15000 | 807
(0.0309,0.134] | 6.8% | 15000 | 1023
(0.134,0.214] | 6.0% | 15000 | 905
(0.214,0.287] | 5.4% | 15000 | 811
(0.287,0.367] | 5.6% | 15000 | 842
(0.367,0.468] | 6.7% | 15000 | 1011
(0.468,0.649] | 8.4% | 15001 | 1262
(0.649,4] | 11.3% | 15108 | 1712
(4,1.27e+03] | 6.1% | 14896 | 906
(1.27e+03,3.3e+05] | 5.0% | 14995 | 747
[Chart: Debt Ratio - Deciles, bad rate by decile (1 to 10), y-axis 0.0% to 12.0%]
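Decile binning of this kind can be sketched as below. This version cuts the sorted records by rank into near-equal groups, which is an assumption: a quantile-edge cut, as the uneven N values in the table suggest was used, keeps tied values together and so gives slightly different group sizes. The function name and toy data are hypothetical.

```python
def decile_bins(values, bad_flags, n_bins=10):
    """Sort by value, cut into n_bins near-equal groups, and report each group's bad rate."""
    pairs = sorted(zip(values, bad_flags))
    n = len(pairs)
    table = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue   # fewer records than bins
        bads = sum(flag for _, flag in chunk)
        table.append({"low": chunk[0][0], "high": chunk[-1][0],
                      "n": len(chunk), "bads": bads, "bad_rate": bads / len(chunk)})
    return table

# Hypothetical toy data: 20 records, only the two largest values are bad.
toy_values = list(range(20))
toy_flags = [1 if v >= 18 else 0 for v in toy_values]
deciles = decile_bins(toy_values, toy_flags)
for row in deciles:
    print(row)
```

Adjacent bins with similar bad rates can then be merged by hand, as in the re-binned view.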

Debt Ratio – Re-binned

Debt Ratio | Bad Rate
<=0.367 | 5.9%
(0.367,0.468] | 6.7%
(0.468,0.649] | 8.4%
(0.649,4] | 11.3%
(4,1.27e+03] | 6.1%
(1.27e+03,3.3e+05] | 5.0%
[Chart: bad rate by re-binned Debt Ratio group (1 to 6), y-axis 0.0% to 12.0%]

Predictors for Risk Model
List of Variables
• Identify the predictors that will be used for logistic regression.
Select predictors which have a rank order and a trend (positive or
negative relationship with bad rate) that makes business sense.
• For numeric variables where there is no rank order, try to use them
as dummy variables
• From the bivariate of age and income, create an interaction dummy for
the highest risk segment
• For Numeric Variables:
– Use the variable as is (un-binned)
– Impute missing values using DQ Report / similar bad rate
– Cap the maximum value at the 95th or 99th percentile for outliers
– Cap the minimum value at the 5th or 1st percentile for outliers
• Create dummy variables for categorical variables.
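The numeric and categorical treatments listed above might look like this in code. It is a minimal sketch: the cap limits and the fill value are hypothetical stand-ins for the percentiles and imputation values that would come from the DQ report, and the helper names are not from the source.

```python
def cap(values, low, high):
    """Clamp each non-missing value into [low, high]; None stays None."""
    return [None if v is None else min(max(v, low), high) for v in values]

def impute(values, fill):
    """Replace None with a chosen fill value."""
    return [fill if v is None else v for v in values]

def dummies(values, baseline):
    """One 0/1 column per category except the baseline (reference) level."""
    levels = sorted(set(values) - {baseline})
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values] for lvl in levels}

# Hypothetical limits (standing in for the 1st and 99th percentiles) and fill value.
income = [0, 2500, None, 3008750]
income_capped = cap(income, low=1000, high=20000)
income_ready = impute(income_capped, fill=5000)
education_dummies = dummies(["Matric", "PhD", "Graduate"], baseline="Graduate")
print(income_ready)
print(education_dummies)
```

Capping before imputing keeps the fill value from being distorted by the outliers it is meant to sit alongside.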

RISK MODEL

Data
Development and Validation
• Create a 60% or 70% random file for development of the decision tree
• Create a 30% or 40% random file for validation of the results
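A reproducible random split can be sketched as follows. The function name, the fixed seed, and the use of the complement of the development file as the validation file are assumptions made for illustration.

```python
import random

def split_dev_val(records, dev_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split into development and validation lists."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical record ids 0..99 split 70/30.
dev, val = split_dev_val(range(100), dev_fraction=0.7)
print(len(dev), len(val))
```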

Risk Model Logistic Regression
Results - Equation
Variable Definition | Coefficient Value | Chi Square | P Value | Any Comment (Optional)
Intercept
Variable 1

Variable N
Sorted by Descending Order of Importance or Chi Square Value

At least 50%-60% of final predictors should be continuous variables
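Once the equation is reported, each record's model score is the logistic transform of the intercept plus the weighted predictors. The sketch below shows that step only; the coefficient values are hypothetical placeholders and not results from this exercise.

```python
import math

def score(intercept, coefficients, record):
    """Predicted probability of bad from a fitted logistic equation."""
    z = intercept + sum(coef * record[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only.
coefs = {"Age": -0.03, "NumberOfTimes90DaysLate": 0.5}
younger_late = score(-1.5, coefs, {"Age": 25, "NumberOfTimes90DaysLate": 2})
older_clean = score(-1.5, coefs, {"Age": 65, "NumberOfTimes90DaysLate": 0})
print(younger_late, older_clean)
```

A negative coefficient lowers the score as the variable increases, which is how the expected business trend can be checked against the fitted signs.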


Logistic Regression - Example
Results – Gains Table - Development
Depth/Decile | # N | % of N | # of Responders | % of Responders | Mean Response Rate | Cum_Pct_N | Cume_Pct_Total_Resp | Mean Model Score | Lift = Mean Response/Total
10 | 2918 | 10% | 511 | 31.4% | 17.5% | 10.0% | 31.4% | 17.8% | 3.1
20 | 2918 | 10% | 268 | 16.5% | 9.2% | 20.0% | 47.9% | 9.4% | 1.7
30 | 2919 | 10% | 184 | 11.3% | 6.3% | 30.0% | 59.2% | 6.7% | 1.1
40 | 2918 | 10% | 163 | 10.0% | 5.6% | 40.0% | 69.3% | 5.8% | 1.0
50 | 2919 | 10% | 140 | 8.6% | 4.8% | 50.0% | 77.9% | 4.9% | 0.9
60 | 2918 | 10% | 123 | 7.5% | 4.2% | 60.0% | 85.4% | 4.4% | 0.8
70 | 2918 | 10% | 90 | 5.6% | 3.1% | 70.0% | 91.0% | 3.3% | 0.6
80 | 2919 | 10% | 79 | 4.8% | 2.7% | 80.0% | 95.8% | 2.8% | 0.5
90 | 2918 | 10% | 58 | 3.6% | 2.0% | 90.0% | 99.4% | 2.3% | 0.4
100 | 2919 | 10% | 10 | 0.6% | 0.3% | 100.0% | 100.0% | 0.3% | 0.1
Total | 29184 | | 1626 | 100% | 5.6% | | | |
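A gains table with these columns can be built from model scores and actual outcomes roughly as follows. This is a sketch under assumptions: records are ranked by descending score and cut by rank into equal-sized groups, there are at least as many records as bins, and at least one responder exists. The function name and toy data are hypothetical.

```python
def gains_table(scores, responders, n_bins=10):
    """Rank by descending score, cut into n_bins groups, compute cumulative capture and lift."""
    ranked = sorted(zip(scores, responders), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    total_resp = sum(r for _, r in ranked)
    overall_rate = total_resp / n
    rows, cum_n, cum_resp = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        resp = sum(r for _, r in chunk)
        cum_n += len(chunk)
        cum_resp += resp
        rate = resp / len(chunk)
        rows.append({
            "depth": (b + 1) * 100 // n_bins,
            "n": len(chunk),
            "responders": resp,
            "response_rate": rate,
            "cum_pct_n": cum_n / n,
            "cum_pct_resp": cum_resp / total_resp,
            "mean_score": sum(s for s, _ in chunk) / len(chunk),
            "lift": rate / overall_rate,   # mean response rate / total response rate
        })
    return rows

# Hypothetical toy data: 20 records, the four highest scores are responders.
toy_scores = [i / 20 for i in range(20)]
toy_resp = [1 if i >= 16 else 0 for i in range(20)]
gains = gains_table(toy_scores, toy_resp)
for row in gains:
    print(row)
```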

Gains Chart
[Chart: Cum_Pct_N and Cume_Pct_Total_Resp by decile (1 to 10), y-axis 0.0% to 100.0%]

Report GINI
Logistic Regression
Results – Gains Table - Validation

Repeat Gains Table of Development for Validation Data

Report GINI
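One way to approximate GINI from a gains table is the accuracy ratio: the area between the gains curve and the diagonal, divided by the same area for a perfect model. The sketch below applies the trapezoid rule to decile counts; it is an approximation, since it ignores rank ordering within each decile, and a record-level calculation would usually give a somewhat higher value. The function name is hypothetical; the counts are the development gains table values.

```python
def gini_from_gains(n_per_bin, resp_per_bin):
    """Approximate Gini (accuracy ratio) from a gains table ordered from highest to lowest score."""
    total_n, total_resp = sum(n_per_bin), sum(resp_per_bin)
    area, cum_n, cum_resp = 0.0, 0.0, 0.0
    for n, r in zip(n_per_bin, resp_per_bin):
        next_n = cum_n + n / total_n
        next_resp = cum_resp + r / total_resp
        area += (next_n - cum_n) * (cum_resp + next_resp) / 2   # trapezoid under the gains curve
        cum_n, cum_resp = next_n, next_resp
    resp_rate = total_resp / total_n
    # Area between the curve and the diagonal, relative to the same area for a perfect model.
    return (area - 0.5) / (0.5 - resp_rate / 2)

# Decile counts from the development gains table.
dev_n = [2918, 2918, 2919, 2918, 2919, 2918, 2918, 2919, 2918, 2919]
dev_resp = [511, 268, 184, 163, 140, 123, 90, 79, 58, 10]
dev_gini = gini_from_gains(dev_n, dev_resp)
print(round(dev_gini, 3))
```

Running the same function on the validation deciles gives a like-for-like comparison between the two samples.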

Risk Segmentation - Example
Response Segment | N | # of Responders | Mean Response Rate | % of Sample | % of Responders | Lift
V High | 2918 | 511 | 17.5% | 10.0% | 31.4% | 3.1
High | 2918 | 268 | 9.2% | 10.0% | 16.5% | 1.7
Medium | 11674 | 610 | 5.2% | 40.0% | 37.5% | 0.9
Low | 11674 | 237 | 2.0% | 40.0% | 14.6% | 0.4
Total | 29184 | 1626 | 5.6% | 100.0% | 100.0% |

• V High Response Segment has 3x the sample response rate based on the lift
measure.

• V High and High have a collective response rate of 13.4% and comprise 20%
of the sample but contribute almost 50% of the total responders.

• Low contributes 40% of the sample but only 15% of responders. Their
response rate (2.0%) is less than half of the sample response rate (5.6%).

Create the same from Gains Tables. No. of Risk segments can range from min 3 to
max 5.
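The segments can be derived by collapsing gains table deciles. The mapping below (decile 1 = V High, decile 2 = High, deciles 3 to 6 = Medium, deciles 7 to 10 = Low) is inferred from the example, since it reproduces the N and responder counts shown; the function name is hypothetical and other groupings are equally valid.

```python
# Assumption: decile-to-segment mapping inferred from the example table
# (indices are zero-based positions in the gains table, best decile first).
SEGMENTS = {"V High": [0], "High": [1], "Medium": [2, 3, 4, 5], "Low": [6, 7, 8, 9]}

def segment_table(n_per_decile, resp_per_decile, segments=SEGMENTS):
    """Collapse decile counts into risk segments with rate, share, and lift."""
    total_n, total_resp = sum(n_per_decile), sum(resp_per_decile)
    overall = total_resp / total_n
    table = {}
    for name, idx in segments.items():
        n = sum(n_per_decile[i] for i in idx)
        resp = sum(resp_per_decile[i] for i in idx)
        table[name] = {"n": n, "responders": resp, "rate": resp / n,
                       "pct_sample": n / total_n, "pct_responders": resp / total_resp,
                       "lift": (resp / n) / overall}
    return table

# Decile counts from the development gains table.
decile_n = [2918, 2918, 2919, 2918, 2919, 2918, 2918, 2919, 2918, 2919]
decile_resp = [511, 268, 184, 163, 140, 123, 90, 79, 58, 10]
segments_out = segment_table(decile_n, decile_resp)
for name, row in segments_out.items():
    print(name, row)
```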
