Session 2-2: Credit Risk Database (CRD) For SME Financial Inclusion by Lan Hoang Nguyen

Credit Risk Database (CRD)
for SME Financial Inclusion

Lan H. Nguyen, PhD
Megumi Sagara
Credit Risk Database (CRD) Association, Tokyo
The views expressed in this presentation are the views of the author and do not necessarily reflect the
views or policies of the Asian Development Bank Institute (ADBI), the Asian Development Bank
(ADB), its Board of Directors, or the governments they represent. ADBI does not guarantee the
accuracy of the data included in this paper and accepts no responsibility for any consequences of their
use. Terminology used may not necessarily be consistent with ADB official terms.
Contents
• Historical Background
• Contribution to SME financial inclusion
  - SME Sector
  - Banking Sector
  - SME Guarantee System
• Application of Machine Learning to credit risk management
  (a joint paper with BOJ and a Japanese Megabank)
  - Differences from traditional scoring model
  - Dataset: Transaction data
  - Machine Learning Models and Results
Historical Background
• Collapse of Bubble Economy: land collateral lost most value
• Credit Crunch in SME Finance
• Asian Financial Crisis
• Banks needed to improve the quality of risk management so as not to depend on collateral: risk-based valuation for lending
• Lack of information on SMEs
• Call for the establishment of a nation-wide financial database for SMEs

[Figure: Year-on-year rate (%) for residential and commercial areas, 1980 to 2010. Source: Ministry of Land, Infrastructure and Transport]

2001: Introduction of CRD
METI and BOJ initiated the establishment of CRD via funding from the Japanese government.
Contribution to SME financial inclusion (1/3):
A sound SME financial infrastructure
• Reliable data provided by nation-wide financial institutions
• Mitigates the information asymmetry problem and provides a benchmark reference for the SME sector
• Important support for SME-related policy making: SME Agency White Paper (2018, 2019), BOJ, FSA

【Membership Composition & Accumulated data】 (※ as of March 2019)

Membership composition
Member type                                    Number
Credit guarantee corporations                      51
Government-affiliated financial institutions        3
Private financial institutions                    103
Credit-rating agencies, etc.                       14
Total                                             171
The governmental institutions                       5

Accumulated data (Unit: 1,000)
                           Number of SMEs   Number of financial statements
Incorporated SMEs                   2,529                           21,362
  (default information)             (397)                          (3,165)
Sole-proprietor SMEs                1,253                            5,665
  (default information)             (187)                            (873)
Reference: Data Fields

Financial statements (the slide gives item counts of 9~26 items and 26~59 items for the two statements)

Balance Sheet
• Assets
  - Current assets: cash and cash equivalents, inventories
  - Fixed assets: tangible fixed assets, intangible assets, investments
  - Deferred assets
• Liabilities
  - Current liabilities: short-term debt
  - Fixed liabilities: long-term debt
• Shareholders’ Equity
  - Capital stock
  - ...

Profit & Loss Statement
• Sales
• Cost of goods sold
• Gross profit
• Operating expenses: salaries expense, depreciation expense, ...
• Operating income
• Non-operating income/expense: interest expense, ...
• Income before provision for income taxes
• Provision for income taxes
• ...
• Net income

Financial Indexes (174 indexes)
• Capital-to-asset ratio
• Degree of borrowing on lending
• Ratio of interest-bearing liabilities
• Ratio of current profits to assets
• ...

Non-financial data & other supplemental items (number of employees)

Default data
• (a) Past-due for 3 months or more; (b) de facto bankruptcy; (c) bankruptcy; and (d) subrogation (applicable for credit guarantee corporations).
• (e) Substandard and (f) potentially bankrupt have been included since April 2003 to correspond to Basel II.

All Rights Reserved. (c) CRD2019
Impact on SME financial inclusion (2/3):
Credit risk management for SME

Credit Scoring:
• An annual credit score based on SME financial performance is calculated for each SME to oversee default risk.
• CRD provides scoring services to all member banks.

Validation:
• In case a bank already has risk models in place, cross-checking (validation) of internal models by CRD models is recommended.

Example:
• An example of what credit scores look like for a subset of the CRD database:
  - Horizontal axis: credit scores 0-100 (0: highest risk, 100: lowest risk)
  - This firm scores 38, ranks 7,200 in its industry, and ranks 220,000 in its sales size group

【Comprehensive evaluation using CRD model】
[Figure: histogram of the number of corporations by credit score (T-Score), with markers for "Your company" and "Default Companies (mean)"]

Fiscal term | CRD rank (A~E) | T-Score, your company | T-Score, default companies (mean) | Ranking in the "Manufacture of transportation equipment" group | Ranking in Tokyo | Company's scale of sales category | Ranking in the same scale of sales category
March 2014 | D | 38 | 37 | 7,200 (in 8,000) | 120,000 (in 140,000) | ¥100 to ¥300 million | 220,000 (in 250,000)
March 2013 | C | 45 | 37 | 6,000 (in 8,000) | 91,000 (in 140,000) | ¥100 to ¥300 million | 175,000 (in 250,000)
March 2012 | C | 46 | 37 | 5,700 (in 8,000) | 85,000 (in 140,000) | ¥100 to ¥300 million | 165,000 (in 250,000)

*These numbers are imaginary. The same hereinafter.

Impact on SME financial inclusion (3/3)
Lesson from Japan: improvement in guarantee policy for SME
• SMEs in Japan can apply for a guarantee for bank lending if eligible
• Rapid expansion of credit guarantee service provision as a result of SME support policy
• Increase in defaults and subrogation
• Need to prevent adverse selection and improve the balance sheet of CGCs
• 2006: CGCs introduced risk-based pricing based on CRD Scoring (Guarantee Fee Rate scheme that takes credit risk into account)

[Figure: Subrogation rate* (%) and outstanding guaranteed liabilities (billion yen), 1997 to 2017]
Application of Machine Learning to credit risk management
• The study is a joint study of BOJ, a Japanese megabank and CRD Association.
• To further improve SME credit risk management, we study SMEs' bank account transaction data for shorter-term monitoring by applying machine learning algorithms.

Traditional Scoring Model vs. Machine Learning

Data
  Traditional Scoring Model:
  • Standardized data: Financial Statements (FS)
  • Frequency: generally yearly data
  Machine Learning:
  • Informative, but complex data: bank accounts data, smartphone (accounting) application data
  • Frequency: high (daily data)

Output
  Traditional Scoring Model:
  • Longer term: Probability of Default (PD) within 1~3 years
  Machine Learning:
  • Short term: PD within 3 months or 6 months

Usage: Loan screening
  Traditional Scoring Model:
  • Basis of evaluation: establishing own internal rating system based on traditional CRD scoring model
  Machine Learning:
  • Evaluation for SMEs without FS (SMEs with underdeveloped accounting, start-ups, etc.)

Usage: Validation
  Traditional Scoring Model:
  • Bank’s own internal rating system cross-checked using CRD scores

Usage: Monitoring
  Machine Learning:
  • Timely and targeted monitoring: close monitoring of troubled SMEs and timely warning

* Paper in Japanese: Miura et al., 2019, 入出金データを用いた信用リスク評価ー機械学習による実証分析ー (Credit risk assessment using deposit and withdrawal data: an empirical analysis with machine learning), BOJ Working paper series
Dataset (1/3)
• Bank Account Transaction Data (a part of our commercial dataset, prepared for research purposes)
  - Monthly data: 2014/10 to 2018/5 (44 months)
  - 7,000 anonymous borrowers (30,000 observations), among which 1,400 are default cases
• Transaction data, originally in high frequency (seconds), is summarized to monthly frequency because:
  - Fixed-cost payments and fixed revenues are often recorded at monthly frequency
  - Default status changes on a monthly basis
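The monthly summarization described above can be sketched as follows. This is a minimal illustration only: the record layout (borrower ID, timestamp, signed amount), the borrower IDs and the function name `summarize_monthly` are assumptions made for the example, not the structure of the actual dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw records: (borrower_id, timestamp, signed amount in yen).
# Positive amounts are cash inflows, negative amounts are cash outflows.
transactions = [
    ("B001", "2016-01-05 09:15:02", 500_000),
    ("B001", "2016-01-25 14:03:47", -200_000),
    ("B001", "2016-02-10 11:30:00", 300_000),
    ("B002", "2016-01-31 23:59:59", -50_000),
]


def summarize_monthly(records):
    """Sum inflows and outflows per (borrower, year-month)."""
    monthly = defaultdict(lambda: {"inflow": 0, "outflow": 0})
    for borrower, stamp, amount in records:
        # Collapse the second-level timestamp to a 'YYYY-MM' key.
        month = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m")
        if amount >= 0:
            monthly[(borrower, month)]["inflow"] += amount
        else:
            monthly[(borrower, month)]["outflow"] += -amount
    return dict(monthly)


monthly_summary = summarize_monthly(transactions)
for key in sorted(monthly_summary):
    print(key, monthly_summary[key])
```

Each (borrower, month) pair then becomes one observation, which lines up with default status that is also updated monthly.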
Dataset (2/3)
• Bank Account Transaction data: 6,000 variables divided into 3 main groups

Group          Category                      Examples
Cash-inflow    Revenue                       Bank transfers for Revenue, Overseas sales revenue…
               Investing activities          Dividend, proceeds from trust funds…
               Financing activities          Loan Borrowing…
               Other cash-inflow             Cash, Credit Card, Interest…
Cash-outflow   Cost of Goods and Services    Bank transfers to suppliers…
               Variable Cost                 Credit Card, Insurance, Tax…
               Fixed Cost                    Utility (Electric, Gas, Water), Cable…
               Investment Activities         Investment trust fund…
               Financing Activities          Loan Repayment, Interest Payment, Guarantee Fees…
               Other cash-outflow            Dishonor fees…
Cash Balance   Cash Balance                  Aggregated cash balance…

• Cash balance: daily average of the monthly balance instead of the end-of-month balance → a better reflection of cash flow
• Best performing variables: variables that produce the highest AR* in a single-variable model
  - Cash Balance, Cash-outflows from loan repayment

* We use AR (Accuracy Ratio) as an indicator of a model’s performance: the higher the AR, the better the model. See Reference 1 for details.
※ For confidentiality reasons, we could not report summary statistics of the data
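To illustrate why a daily average of the balance can reflect cash flow better than the end-of-month balance, here is a small sketch. The firm, the amounts and the function name `daily_average_balance` are hypothetical; the study's exact averaging rule is not given on the slide, so the sketch assumes a simple mean of end-of-day balances over every calendar day.

```python
import calendar


def daily_average_balance(opening_balance, changes, year, month):
    """Return (daily average balance, end-of-month balance).

    `changes` maps day-of-month -> net cash movement on that day.
    The average is taken over the end-of-day balance of every calendar day.
    """
    days_in_month = calendar.monthrange(year, month)[1]
    balance = opening_balance
    running_total = 0
    for day in range(1, days_in_month + 1):
        balance += changes.get(day, 0)
        running_total += balance
    return running_total / days_in_month, balance


# Hypothetical firm: a thin balance for most of April 2016,
# then one large receipt on the 29th.
average, end_of_month = daily_average_balance(100_000, {29: 900_000}, 2016, 4)
print(average, end_of_month)
```

In this toy case the end-of-month balance is 1,000,000 while the daily average is only 160,000, so the average exposes the cash shortage that the month-end snapshot hides.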
Dataset (3/3)
• Default Data: need to match transaction data with default
• Definition of default
  - Default refers to a state in which a business finds difficulty in continuing normal operations.
  - For this paper, Default refers to borrowers classified as “Needs Special Attention” under the Financial Services Agency (FSA) classification
  - Records of repayment status and classification are updated monthly
• Transaction data is matched with default observed within 3 months
  - In fact, we also tried 1-month, 2-month, 3-month, 6-month and 12-month horizons

[Figure: timeline from Jan 2016 to July 2016 showing the month in which a transaction is recorded, followed by the Default Observation Period (3 months) and the Default Observation Period (6 months)]
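The matching of monthly transaction observations with default observed within a given horizon can be sketched as below. The function names and the rule of dropping months at or after the recorded default are assumptions made for illustration; the slide does not state how such months are handled.

```python
def month_index(year_month):
    """Convert 'YYYY-MM' to a running month count so months can be subtracted."""
    year, month = year_month.split("-")
    return int(year) * 12 + int(month) - 1


def label_observations(observation_months, default_month, horizon=3):
    """Label each observation month 1 if default is recorded within
    `horizon` months after it, else 0.

    Assumption: months at or after the default month are dropped,
    because the borrower is no longer a prediction target.
    """
    labels = {}
    for year_month in observation_months:
        if default_month is None:
            labels[year_month] = 0  # borrower never defaulted
            continue
        gap = month_index(default_month) - month_index(year_month)
        if gap <= 0:
            continue
        labels[year_month] = 1 if gap <= horizon else 0
    return labels


months = ["2016-01", "2016-02", "2016-03", "2016-04", "2016-05", "2016-06"]
labels_3m = label_observations(months, default_month="2016-06", horizon=3)
labels_6m = label_observations(months, default_month="2016-06", horizon=6)
print(labels_3m)
print(labels_6m)
```

With a default recorded in June 2016, the 3-month horizon flags March to May, while the 6-month horizon also flags January and February.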
Machine learning: Random Forest
• Random Forests (many decision trees): classification problem of Default or not Default
• Input: each decision tree is trained on a different randomly sampled sub-dataset, with node splits chosen to minimize the impurity** of the resulting subsets.
• Output: the forest's output probability of default is the average of the output probabilities across all trees:

  \[ P(RF) = \frac{1}{T} \sum_{t=1}^{T} p^{(t)}(RF \mid t) \]

  where T is the total number of trees, t is a given decision tree, and P(RF) is the forest default probability
• Parameters to learn from data: number of trees and number of information features (financial variables)

Intuition:
By aggregating the output of slightly different trees, the overall model becomes less prone to variability and therefore generalizes better to new data (ensembling using bagging)

** Gini coefficient
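The two ingredients named on this slide, an impurity measure for choosing splits and averaging of per-tree probabilities, can be sketched in a few lines. The sketch assumes the impurity footnote refers to the standard Gini impurity used in decision trees; the function names and toy numbers are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity of a set of 0/1 default labels: 1 - sum_k p_k^2."""
    if not labels:
        return 0.0
    p_default = sum(labels) / len(labels)
    return 1.0 - p_default ** 2 - (1.0 - p_default) ** 2


def split_impurity(left, right):
    """Size-weighted impurity of the two subsets produced by a candidate split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


def forest_probability(tree_probabilities):
    """P(RF): average of the default probabilities output by the T trees."""
    return sum(tree_probabilities) / len(tree_probabilities)


# A node with two defaults and two non-defaults is maximally impure;
# a split that separates them perfectly brings the impurity to zero.
print(gini_impurity([0, 0, 1, 1]))
print(split_impurity([0, 0], [1, 1]))

# Three hypothetical trees disagree; the forest reports their average.
print(forest_probability([0.25, 0.5, 0.75]))
```

A tree-growing routine would evaluate `split_impurity` for each candidate split and keep the lowest one; the forest then averages the trees as in the formula above.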
Machine Learning: XGBoost
• XGBoost (eXtreme Gradient Boosting): classification problem of Default or not Default
• New trees are added sequentially to correct the errors of the previous trees. Each tree is a weak learner fitted to the remaining error of the trees before it, taking gradient steps towards minimum error.
• Parameters to learn from the dataset: tree’s max depth and learning rate
• Relatively easy to overfit

[Figure: a sequence of trees, each learning to minimize the residuals of the previous trees]

Output = weighted combination of the outputs of all trees, expressed as a probability of default
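The idea of each new tree learning the residuals of the previous ones can be sketched with one-split trees (stumps) on a single feature. This is plain gradient boosting on squared error, written for illustration; XGBoost itself adds regularization and other refinements that are not shown. The data, the learning rate of 0.1 and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Least-squares fit of a one-split tree to the residuals.

    Returns (threshold, value_if_x_le_threshold, value_if_x_gt_threshold).
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Take a small step (scaled by the learning rate) towards the residuals.
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, stumps, predictions


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


# Hypothetical data: a cash-balance feature and a default flag.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 0, 0, 0]

base, stumps, predictions = boost(xs, ys)
initial_mse = mean_squared_error(ys, [base] * len(ys))
final_mse = mean_squared_error(ys, predictions)
print(initial_mse, final_mse)
```

A smaller learning rate shrinks each step and usually needs more trees; together with the tree depth, it is one of the levers against the overfitting risk noted above.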
Results: Random Forest
Data split:
• Modelling datasets (2014/10 ~ 2017/5): Training Dataset (Model Construction) and Testing Dataset (Parameter Tuning)
• Back test datasets (2017/6 ~ 2018/5): 2017/6 Dataset, ......, 2018/5 Dataset

• Parameter tuning: a process to find the best combination of parameters that provides the highest accuracy

Transaction Data | Default Definition | Default Horizon | Hyper Parameter | Accuracy Ratio*
Monthly | Needs Special Attention | 3 months | *(150, Impurity, 300) | 0.699
*(Number of trees, Gini Coefficient, number of information features)

• An AR of 0.707 is relatively high for credit risk models in the banking sector
• Feature importance: Cash Balance, Cash-outflows from loan repayment
Results: XGBoost
• Parameter Tuning:

Transaction Data | Default Definition | Default Horizon | Hyper Parameter | Accuracy Ratio*
Monthly | Needs Special Attention | 3 months | *(3, 0.01) | 0.7266
*(Tree’s max depth, learning rate)

• AR further improves with XGBoost
• Comparison with (traditional) logistic regression model:

Dataset           Model            Default horizon: 3 months
Testing Dataset   Random Forest    0.7070
                  XGBoost          0.7329
                  Logistic         0.7113
Back test         Random Forest    0.7482
                  XGBoost          0.7728
                  Logistic         0.7174
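The time-based separation of modelling data (2014/10 to 2017/5) from monthly back test data (2017/6 to 2018/5) used in these results can be sketched as follows. The row layout and the function name `split_by_period` are assumptions for illustration; how the modelling data is further divided into training and testing sets is not stated on the slides and is left out.

```python
def split_by_period(rows, last_modelling_month="2017-05"):
    """Split rows into modelling data and per-month back test datasets.

    'YYYY-MM' strings sort chronologically, so plain string comparison works.
    """
    modelling = [row for row in rows if row["month"] <= last_modelling_month]
    back_test = {}
    for row in rows:
        if row["month"] > last_modelling_month:
            back_test.setdefault(row["month"], []).append(row)
    return modelling, back_test


# Hypothetical observations: one row per borrower-month.
rows = [
    {"borrower": "B001", "month": "2014-10", "default_3m": 0},
    {"borrower": "B001", "month": "2017-05", "default_3m": 0},
    {"borrower": "B001", "month": "2017-06", "default_3m": 1},
    {"borrower": "B002", "month": "2017-06", "default_3m": 0},
    {"borrower": "B002", "month": "2018-05", "default_3m": 0},
]

modelling, back_test = split_by_period(rows)
print(len(modelling), sorted(back_test))
```

Keeping the back test strictly later than the modelling period means the reported back test AR measures performance on months the model never saw during construction or tuning.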
Conclusion
• Credit Risk Database contributes to SME financial inclusion via:
  - Providing tools to assess SME credit risk
  - Providing important information on the SME sector to policy makers
  - Smoothing the working mechanism of the credit guarantee system
• Japan’s experience could be a case study for countries that are working towards building a guarantee scheme for SMEs
• Besides financial data, SME credit risk could be assessed using bank account transaction data by applying machine learning
  - Designed for shorter-term monitoring
  - Machine learning models such as RF and XGBoost outperform the traditional logistic model
• In countries where SMEs' financial information is not available, a transaction-data-based model could be a powerful alternative
Thank you for your attention!
Reference 1: Accuracy Ratio AR
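The content of this reference slide is not available in the text, so the following is only a sketch under a commonly used definition: AR = 2 × AUC − 1, where AUC is the probability that a randomly chosen defaulter receives a higher predicted PD than a randomly chosen non-defaulter. Whether the study computes AR in exactly this way is an assumption; the scores and function names are hypothetical.

```python
def auc(scores, defaults):
    """Share of (defaulter, non-defaulter) pairs ranked correctly; ties count half."""
    bad = [s for s, d in zip(scores, defaults) if d == 1]
    good = [s for s, d in zip(scores, defaults) if d == 0]
    wins = 0.0
    for b in bad:
        for g in good:
            if b > g:
                wins += 1.0
            elif b == g:
                wins += 0.5
    return wins / (len(bad) * len(good))


def accuracy_ratio(scores, defaults):
    """AR = 2 * AUC - 1: 0 for a random ranking, 1 for a perfect one."""
    return 2.0 * auc(scores, defaults) - 1.0


# Hypothetical predicted PDs and observed default flags.
predicted_pd = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
observed = [1, 1, 0, 0, 1, 0]

ar = accuracy_ratio(predicted_pd, observed)
print(round(ar, 4))
```

Here 7 of the 9 defaulter/non-defaulter pairs are ordered correctly, so AUC is 7/9 and AR is 5/9.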