Personal Loan Risk Model: Answer Key Template

DATA QUALITY (DQ) REPORT
FS Training - Acquisition 2
Data Quality Report
Table
Field Name | Data Type | No. of Records | Unique Count | Data Available | Available Percent | Missing | Missing Percent | Minimum | Maximum | Mean | Comments
Age | NUM | 150,000 | 86 | 150,000 | 100% | 0 | 0% | 0 | 109 | 52.30 | There is one single row with '0' age. 109 is too high
Gender | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Region | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Rented_OwnHouse | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Occupation | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
Education | CHAR | 150,000 | 5 | 150,000 | 100% | 0 | 0% | | | |
NumberOfTime30-59DaysPastDueNotWorse | NUM | 150,000 | 16 | 150,000 | 100% | 0 | 0% | 0 | 13 | 0.25 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTime60-89DaysPastDueNotWorse | NUM | 150,000 | 13 | 150,000 | 100% | 0 | 0% | 0 | 11 | 0.06 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfTimes90DaysLate | NUM | 150,000 | 19 | 150,000 | 100% | 0 | 0% | 0 | 17 | 0.09 | The value 96='Others', 98='Refused to Say'. Code this as NA
NumberOfOpenCreditLinesAndLoans | NUM | 150,000 | 58 | 150,000 | 100% | 0 | 0% | 0 | 58 | 8.45 | Max Too High
NumberRealEstateLoansOrLines | NUM | 150,000 | 28 | 150,000 | 100% | 0 | 0% | 0 | 54 | 1.02 | Max Too High
NumberOfDependents | NUM | 150,000 | 14 | 146,076 | 97% | 3,924 | 3% | 0 | 20 | 0.76 | The value NA=Missing values
RevolvingUtilizationOfUnsecuredLines | NUM | 150,000 | 125,728 | 150,000 | 100% | 0 | 0% | 0 | 50,708 | 6.05 | Should not be more than 1
DebtRatio | NUM | 150,000 | 114,194 | 150,000 | 100% | 0 | 0% | 0 | 329,664 | 353.01 | Should not be more than 1
MonthlyIncome | NUM | 150,000 | 13,595 | 120,269 | 80% | 29,731 | 20% | 0 | 3,008,750 | 6670.22 | The value NA=Missing values
Good_Bad | CHAR | 150,000 | 2 | 150,000 | 100% | 0 | 0% | | | |
Add the 1st, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 99th percentile values for numeric variables to the above table. Also add the % of zeros.
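The extra DQ columns could be computed along the following lines. This is a minimal sketch using only the Python standard library; the function names, the linear-interpolation percentile rule and the toy values are assumptions for illustration, and statistical packages may use a slightly different percentile definition.

```python
import math

def percentile(sorted_vals, p):
    """Percentile (p in 0..100) by linear interpolation between closest ranks."""
    if not sorted_vals:
        raise ValueError("no data")
    k = (len(sorted_vals) - 1) * p / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return sorted_vals[lo]
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def profile_numeric(values, pcts=(1, 5, 10, 25, 50, 75, 90, 95, 99)):
    """Summarise one numeric column; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    out = {
        "records": len(values),
        "missing": len(values) - len(present),
        # % of zeros is taken over all records, missing ones included
        "pct_zero": 100.0 * sum(1 for v in present if v == 0) / len(values),
    }
    for p in pcts:
        out[f"p{p}"] = percentile(present, p)
    return out

# Toy values (hypothetical, not taken from the loan file)
ages = [0, 21, 30, 30, 45, 52, 60, 75, None, 109]
report = profile_numeric(ages)
print(report)
```

Running the same function over every NUM field gives one extra row of percentile columns per variable.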
UNIVARIATES
Data Quality Report – Treatment
• Missing and default values should be coded as NA and included in the analysis.
• This data has outliers. Keep them as is, since the decision tree will incorporate them seamlessly.
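As an illustration of the first treatment rule, the default codes 96 and 98 noted in the DQ report comments can be mapped to NA. The sketch below represents NA as Python's `None`; the function name and the toy column are assumptions.

```python
# Codes that the DQ report flags as non-values for the past-due counters.
DEFAULT_CODES = {96, 98}

def recode_defaults(values, default_codes=DEFAULT_CODES):
    """Return a copy of the column with default codes replaced by None (NA)."""
    return [None if v in default_codes else v for v in values]

# Toy column (hypothetical values)
past_due_30_59 = [0, 1, 98, 0, 96, 2, 13]
cleaned = recode_defaults(past_due_30_59)
print(cleaned)  # [0, 1, None, 0, None, 2, 13]
```

The NA rows stay in the file, so they can form their own group in the univariate analysis.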
Univariates – Full File
To be Done for each variable – Age
SAMPLE OUTPUT
Age | Bad Rate | # Obs | Bads
<=22 | 7.9% | 618 | 49
23 to 28 | 12.4% | 6501 | 807
29 to 36 | 10.6% | 16746 | 1772
37 to 43 | 9.0% | 20644 | 1865
44 to 52 | 7.9% | 32861 | 2584
53 to 55 | 6.9% | 10625 | 728
56 to 57 | 5.6% | 6964 | 390
58 to 62 | 4.8% | 17071 | 819
63 to 67 | 3.3% | 14368 | 481
>67 | 2.2% | 23602 | 531
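A univariate table of this shape can be produced once the bin edges are known (for example, from the decision tree splits). The following is a rough sketch only: the function name, the edge values and the toy records are assumptions, and missing values are assumed to have been separated out beforehand.

```python
from bisect import bisect_left

def bad_rate_table(records, upper_edges):
    """Group (value, is_bad) records into bins and compute the bad rate per bin.

    upper_edges are inclusive upper bounds in ascending order; values above
    the last edge fall into a final open-ended bin.
    """
    n_bins = len(upper_edges) + 1
    obs = [0] * n_bins
    bads = [0] * n_bins
    for value, is_bad in records:
        i = bisect_left(upper_edges, value)  # index of first edge >= value
        obs[i] += 1
        bads[i] += int(is_bad)
    labels = [f"<={upper_edges[0]}"]
    labels += [f">{lo} to <={hi}" for lo, hi in zip(upper_edges, upper_edges[1:])]
    labels.append(f">{upper_edges[-1]}")
    return [
        (label, n, b, (b / n if n else None))
        for label, n, b in zip(labels, obs, bads)
    ]

# Toy records (hypothetical): (age, is_bad)
records = [(21, 1), (22, 0), (25, 1), (28, 0), (40, 0), (40, 1), (70, 0), (80, 0)]
table = bad_rate_table(records, upper_edges=[22, 28, 67])
for label, n, b, rate in table:
    print(label, n, b, rate)
```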
Univariates
To be Done for each variable – Income
Income | Bad Rate | # Obs
<=5320 | 8.62% | 59323
>5320 to <=6643 | 6.66% | 16128
>6643 | 4.90% | 44818
Missing | 5.53% | 29731
[Chart: Bad Rate by Income band (<=5320, >5320 to <=6643, >6643, Missing); y-axis from 0.00% to 10.00%]
Note: Forced split for <=33 category.
Income rank orders: bad rates are lower for higher income levels.
Univariates – Full File
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT
Tests conducted at 5% level of significance

R Software – Decision tree output
Univariates
Grouping To be Done for each Variable – Education
SAMPLE OUTPUT
Education | No. of Bads | Total Accounts | Bad %
Matric | 1463 | 11207 | 13%
Graduate | 1607 | 27917 | 6%
Post-Grad | 1704 | 26026 | 7%
PhD | 497 | 4376 | 11%
Professional | 1740 | 35623 | 5%
Bivariate Risk Segmentation
Age and Income
AGE and INCOME ONLY SAMPLE OUTPUT
Age/Income Group | Y1-Y2 | Y3-Y4 | Y5-Y6 | Y7-Y8
X1-X2 | | | |
X3-X4 | | | |
X5-X6 | | | |
X7-X8 | | | |
• Max of 5 groups each
• Populate each cell with
– Sample Size
– Bad Rate
– No. of Bads
Segment in such a way that lowest age and lowest income has highest bad rate. Highest age and highest income has lowest bad rate.
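One way to fill such a grid is to assign every account to an age band and an income band and count per cell. The sketch below is illustrative only: the band edges, function names and toy records are assumptions, and accounts with missing income are assumed to be handled separately.

```python
from bisect import bisect_left
from collections import defaultdict

def band(value, upper_edges):
    """Index of the first band whose inclusive upper edge is >= value."""
    return bisect_left(upper_edges, value)

def bivariate_grid(records, age_edges, income_edges):
    """Return {(age_band, income_band): (sample_size, bads, bad_rate)}."""
    counts = defaultdict(lambda: [0, 0])
    for age, income, is_bad in records:
        cell = (band(age, age_edges), band(income, income_edges))
        counts[cell][0] += 1
        counts[cell][1] += int(is_bad)
    return {cell: (n, b, b / n) for cell, (n, b) in counts.items()}

# Toy records (hypothetical): (age, monthly_income, is_bad)
records = [
    (25, 3000, 1), (25, 3500, 1), (26, 9000, 0),
    (60, 3000, 0), (60, 2000, 1), (65, 9000, 0), (70, 10000, 0),
]
grid = bivariate_grid(records, age_edges=[36], income_edges=[5320])
for cell in sorted(grid):
    print(cell, grid[cell])
```

In this toy data the (lowest age, lowest income) cell has the highest bad rate, which is the pattern the segmentation is meant to produce.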
Univariates – Decile Binning (in case Decision Tree results are not coming out)
Debt Ratio - Decile Binning
Debt Ratio - Deciles
Bins | Bad Rate | N | Bads
[0,0.0309] | 5.4% | 15000 | 807
(0.0309,0.134] | 6.8% | 15000 | 1023
(0.134,0.214] | 6.0% | 15000 | 905
(0.214,0.287] | 5.4% | 15000 | 811
(0.287,0.367] | 5.6% | 15000 | 842
(0.367,0.468] | 6.7% | 15000 | 1011
(0.468,0.649] | 8.4% | 15001 | 1262
(0.649,4] | 11.3% | 15108 | 1712
(4,1.27e+03] | 6.1% | 14896 | 906
(1.27e+03,3.3e+05] | 5.0% | 14995 | 747
[Chart: Bad Rate by Debt Ratio decile (1 to 10); y-axis from 0.0% to 12.0%]
Debt Ratio – Re-binned
Debt Ratio | Bad Rate
<=0.367 | 5.9%
(0.367,0.468] | 6.7%
(0.468,0.649] | 8.4%
(0.649,4] | 11.3%
(4,1.27e+03] | 6.1%
(1.27e+03,3.3e+05] | 5.0%
[Chart: Bad Rate by re-binned Debt Ratio group (1 to 6); y-axis from 0.0% to 12.0%]
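Decile binning and the subsequent re-binning can be sketched as follows. The helper name and the toy records are assumptions; the cut is made purely by rank, so tied values may be split across adjacent bins, which statistical packages often handle differently. The last lines pool the first five Debt Ratio deciles using the N and Bads figures from the deciles table.

```python
def decile_bins(records, n_bins=10):
    """Split (value, is_bad) records into n_bins equal-frequency bins by rank.

    Returns one (low, high, n, bads, bad_rate) tuple per non-empty bin.
    """
    ordered = sorted(records, key=lambda r: r[0])
    total = len(ordered)
    out = []
    for i in range(n_bins):
        chunk = ordered[i * total // n_bins:(i + 1) * total // n_bins]
        if not chunk:
            continue
        bads = sum(int(is_bad) for _, is_bad in chunk)
        out.append((chunk[0][0], chunk[-1][0], len(chunk), bads, bads / len(chunk)))
    return out

# Toy records (hypothetical): 100 debt ratios, every 10th account is bad
records = [(i / 100, 1 if i % 10 == 0 else 0) for i in range(100)]
bins = decile_bins(records)
for low, high, n, bads, rate in bins:
    print(f"[{low}, {high}]", n, bads, rate)

# Re-binning: pool adjacent deciles with similar bad rates.
# (N, Bads) for the first five Debt Ratio deciles in the table.
first_five = [(15000, 807), (15000, 1023), (15000, 905), (15000, 811), (15000, 842)]
pooled_n = sum(n for n, _ in first_five)
pooled_bads = sum(b for _, b in first_five)
pooled_rate = round(100 * pooled_bads / pooled_n, 1)
print(pooled_n, pooled_bads, pooled_rate)  # 75000 4388 5.9
```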
Predictors for Risk Model
List of Variables
• Identify the predictors that will be used for logistic regression. Select predictors that have a rank order and a trend (positive or negative relationship with bad rate) that makes business sense.
• For numeric variables where there is no rank order, try to use them as dummy variables.
• From the bivariate of age and income, create an interaction dummy for the highest-risk segment.
• For numeric variables:
– Use the variable as is (un-binned)
– Impute missing values using the DQ report or a group with a similar bad rate
– Cap the maximum value at the 95th or 99th percentile for outliers
– Cap the minimum value at the 5th or 1st percentile for outliers
• Create dummy variables for categorical variables.
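The capping and dummy-variable steps above might look like this in code. It is a sketch under stated assumptions: the helper names are hypothetical, the percentile uses the nearest-rank rule (other rules give slightly different caps), and dropping one category level as the reference is a common modelling choice rather than something the template specifies.

```python
import math

def nearest_rank(sorted_vals, p):
    """p-th percentile (p in 1..100) by the nearest-rank rule."""
    rank = max(1, math.ceil(p * len(sorted_vals) / 100))
    return sorted_vals[rank - 1]

def cap(values, lower, upper):
    """Clamp each non-missing value into [lower, upper]; None stays None."""
    return [None if v is None else min(max(v, lower), upper) for v in values]

def dummies(values, drop_first=True):
    """One 0/1 indicator list per category level.

    With drop_first=True the first level in sorted order is the reference
    level and gets no column.
    """
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]
    return {lvl: [1 if v == lvl else 0 for v in values] for lvl in levels}

# Toy numeric column (hypothetical) with one low and one high outlier
raw = [-50] + list(range(1, 39)) + [5000]
ordered = sorted(raw)
low_cap = nearest_rank(ordered, 5)    # 5th percentile
high_cap = nearest_rank(ordered, 95)  # 95th percentile
capped = cap(raw, low_cap, high_cap)
print(low_cap, high_cap, min(capped), max(capped))

# Toy categorical column (hypothetical)
education = ["Graduate", "Matric", "PhD", "Graduate"]
edu_dummies = dummies(education)
print(edu_dummies)
```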
RISK MODEL
Data
Development and Validation
• Create a 60% or 70% random file for development of the decision tree
• Create a 30% or 40% random file for validation of the results
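A random development/validation split can be sketched as below. The function name, the 70/30 default and the seed are assumptions; fixing the seed simply makes the split reproducible.

```python
import random

def split_dev_val(rows, dev_fraction=0.7, seed=42):
    """Shuffle a copy of rows and cut it into development and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy row identifiers (hypothetical)
rows = list(range(1000))
dev, val = split_dev_val(rows, dev_fraction=0.7)
print(len(dev), len(val))  # 700 300
```

Every row lands in exactly one of the two files, so the validation results are not contaminated by development records.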
Risk Model Logistic Regression
Results - Equation
Variable Definition | Coefficient Value | Chi Square | P Value | Any Comment (Optional)
Intercept | | | |
Variable 1 | | | |
Variable N | | | |
Sorted by descending order of importance or Chi Square value
At least 50% to 60% of final predictors should be continuous variables
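For reference, the equation that the coefficient table documents has the standard logistic form. The notation is assumed here rather than taken from the template: $\beta_0$ is the intercept, $\beta_i$ is the coefficient of Variable $i$, and $x_i$ is that variable's value for an account.

```latex
\[
P(\text{Bad} = 1 \mid x_1, \dots, x_N)
  = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \sum_{i=1}^{N} \beta_i x_i\right)\right)}
\]
```

A positive coefficient raises the predicted probability of bad as the variable increases, and a negative coefficient lowers it, which is how the sign of each coefficient can be checked against the univariate trend.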

Logistic Regression - Example
Results – Gains Table - Development
Depth/Decile | N | % of N | # of Responders | % of Responders | Mean Response Rate | Cum_Pct_N | Cume_Pct_Total_Resp | Mean Model Score | Lift = Mean Response/Total
10 | 2918 | 10% | 511 | 31.4% | 17.5% | 10.0% | 31.4% | 17.8% | 3.1
20 | 2918 | 10% | 268 | 16.5% | 9.2% | 20.0% | 47.9% | 9.4% | 1.7
30 | 2919 | 10% | 184 | 11.3% | 6.3% | 30.0% | 59.2% | 6.7% | 1.1
40 | 2918 | 10% | 163 | 10.0% | 5.6% | 40.0% | 69.3% | 5.8% | 1.0
50 | 2919 | 10% | 140 | 8.6% | 4.8% | 50.0% | 77.9% | 4.9% | 0.9
60 | 2918 | 10% | 123 | 7.5% | 4.2% | 60.0% | 85.4% | 4.4% | 0.8
70 | 2918 | 10% | 90 | 5.6% | 3.1% | 70.0% | 91.0% | 3.3% | 0.6
80 | 2919 | 10% | 79 | 4.8% | 2.7% | 80.0% | 95.8% | 2.8% | 0.5
90 | 2918 | 10% | 58 | 3.6% | 2.0% | 90.0% | 99.4% | 2.3% | 0.4
100 | 2919 | 10% | 10 | 0.6% | 0.3% | 100.0% | 100.0% | 0.3% | 0.1
Total | 29184 | | 1626 | 100% | 5.6% | | | |
Gains Chart
[Chart: Cum_Pct_N and Cume_Pct_Total_Resp by decile (1 to 10); y-axis from 0.0% to 100.0%]
Report GINI
Logistic Regression
Results – Gains Table - Validation
Repeat Gains Table of Development for Validation Data
Report GINI
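A gains table and a Gini figure can be derived from scored accounts roughly as follows. This is a sketch, not the template's prescribed method: the function names and toy scores are assumptions, Gini is computed here as 2 * AUC - 1 from the account-level scores (other approximations based on the decile table exist), and the data is assumed to contain at least one responder and one non-responder.

```python
def gains_table(scored, n_bins=10):
    """scored: list of (model_score, is_responder). Highest scores go into bin 1."""
    ordered = sorted(scored, key=lambda r: r[0], reverse=True)
    total = len(ordered)
    total_resp = sum(int(r) for _, r in ordered)
    overall_rate = total_resp / total
    rows, cum_n, cum_resp = [], 0, 0
    for i in range(n_bins):
        chunk = ordered[i * total // n_bins:(i + 1) * total // n_bins]
        n = len(chunk)
        resp = sum(int(r) for _, r in chunk)
        cum_n += n
        cum_resp += resp
        rows.append({
            "bin": i + 1,
            "n": n,
            "responders": resp,
            "rate": resp / n,
            "cum_pct_n": cum_n / total,
            "cum_pct_resp": cum_resp / total_resp,
            "lift": (resp / n) / overall_rate,
        })
    return rows

def gini(scored):
    """Gini = 2 * AUC - 1, with AUC from every responder/non-responder pair."""
    pos = [s for s, r in scored if r]
    neg = [s for s, r in scored if not r]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a concordant pair
    auc = wins / (len(pos) * len(neg))
    return 2 * auc - 1

# Toy scored accounts (hypothetical): (model_score, is_responder)
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 0), (0.2, 1), (0.1, 0), (0.05, 0)]
table = gains_table(scored, n_bins=5)
for row in table:
    print(row)
gini_value = gini(scored)
print(gini_value)
```

Applying the same two functions to the validation file gives the second gains table and the validation Gini for comparison with development.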
Risk Segmentation - Example
Response Segment | N | # of Responders | Mean Response Rate | % of Sample | % of Responders | Lift
V High | 2918 | 511 | 17.5% | 10.0% | 31.4% | 3.1
High | 2918 | 268 | 9.2% | 10.0% | 16.5% | 1.7
Medium | 11674 | 610 | 5.2% | 40.0% | 37.5% | 0.9
Low | 11674 | 237 | 2.0% | 40.0% | 14.6% | 0.4
Total | 29184 | 1626 | 5.6% | 100.0% | 100.0% |
• The V High response segment has 3x the sample response rate, based on the lift measure.
• V High and High have a collective response rate of 13.4% and comprise 20% of the sample, but contribute almost 50% of the total responders.
• Low makes up 40% of the sample but only 15% of responders. Its response rate is about 0.4x the sample response rate (lift 0.4).
Create the same from the Gains Tables. The number of risk segments can range from a minimum of 3 to a maximum of 5.
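The segment rows can be derived by pooling deciles of the gains table. The sketch below uses the N and responder counts from the development gains table; the decile-to-segment mapping (first decile, second decile, next four, last four) is inferred from the segment totals, and the variable names are assumptions.

```python
# (N, responders) per decile, copied from the development gains table.
deciles = [(2918, 511), (2918, 268), (2919, 184), (2918, 163), (2919, 140),
           (2918, 123), (2918, 90), (2919, 79), (2918, 58), (2919, 10)]

# Segment name -> decile positions (0-based start, exclusive end).
segments = {"V High": (0, 1), "High": (1, 2), "Medium": (2, 6), "Low": (6, 10)}

total_n = sum(n for n, _ in deciles)
total_resp = sum(r for _, r in deciles)
overall_rate = total_resp / total_n

summary = {}
for name, (start, end) in segments.items():
    seg_n = sum(n for n, _ in deciles[start:end])
    seg_resp = sum(r for _, r in deciles[start:end])
    summary[name] = {
        "n": seg_n,
        "responders": seg_resp,
        "rate": seg_resp / seg_n,
        "pct_sample": seg_n / total_n,
        "pct_responders": seg_resp / total_resp,
        "lift": (seg_resp / seg_n) / overall_rate,
    }

for name, row in summary.items():
    print(name, row["n"], row["responders"],
          f"{row['rate']:.1%}", f"{row['pct_responders']:.1%}")
```

Changing the `segments` mapping is enough to try a 3-segment or 5-segment scheme.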