Abstract
Based on the world’s largest loss database of corporate defaults, we perform a comparative
analysis of machine learning (ML) methods in credit risk modeling across the globe. We find
substantial benefits of ML methods for different credit risk parameters, even though we use a
uniform modeling framework for the ML methods, which potentially facilitates a massive
reduction in operational resources required for model development and validation. We analyze
the economic drivers of the credit risk models using explainable ML methods and find large
variations in feature importance suggested by different ML methods. We propose to implement
a nonlinear forecast ensemble, which not only boosts predictive performance but also produces
more stable forecasts and economic sensitivities, thereby mitigating model uncertainty. Our
results provide guidance for financial institutions, regulatory authorities, and academics.
* We are grateful for helpful comments from Ed Altman, Heiner Beckmeyer, Tim Eisert, Johannes Kriebel, Marek Micuch, Oleg Reichmann, and Piet Usselmann. We would also like to thank seminar participants at the European Central Bank, FMA European Conference 2023, Economics of Financial Technology Conference 2022, 83rd Annual Business Research Conference, 15th RGS Doctoral Conference in Economics, 61st Annual Southwestern Finance Association Conference, 15th International Conference on Computational and Financial Econometrics, Banking Research Workshop Münster 2021, 6th Vietnam Symposium in Banking and Finance, 1st International Conference “Frontiers in International Finance and Banking”, 14th International Risk Management Conference, and University of Duisburg-Essen. We also thank Global Credit Data for granting access to their database. The views expressed herein are those of the authors and should not be associated with the EBA.
§ E-mail: martin.hibbeln@uni-due.de, phone: +49 203 37-92830 (corresponding author)
‡ Mercator School of Management, University of Duisburg-Essen, Lotharstr. 65, 47057 Duisburg, Germany
¶ European Banking Authority, 20 avenue André Prothin, 92927 Paris, France
Our setting provides novel insights into credit risk modeling and leads to three key
conclusions. (i) ML methods consistently outperform not only simple benchmarks (e.g., linear
models) but also sophisticated benchmarks (e.g., mixture models) that take into account the
specific distributions of credit risk parameters (e.g., bimodality) on a global scale. We provide
new benchmarks for the predictive accuracy of various ML methods in quantifying credit risk
parameters as well as economically meaningful benefits of ML methods in dissections across
different asset classes, industries, and regions that manifest in a substantially higher out-of-time $R^2$ ($R^2_{OOT}$), lower mean absolute error (MAE), and lower root mean square error (RMSE).
Numerically, state-of-the-art ML methods yield overall $R^2_{OOT}$ values that are higher by a factor
ranging from 2.1 to 3.5 compared to benchmark models from the credit risk literature, which
include highly specialized models for both credit risk parameters. Tree-based ML methods and, to an even greater extent, the nonlinear ensemble consisting of multiple tree-based ML methods are especially well-suited across different credit risk parameters, even when relying
on a uniform modeling framework. In contrast, the state-of-the-art literature on credit risk
modeling generally implements model structures that are well-suited for modeling a single
credit risk parameter. Rather than relying on inherently different models that fit the specific
(ii) We document large benefits of using model averaging (i.e., forecast ensembles) and,
hence, a boost in predictive performance. For the nonlinear ensemble, we quantify the impact
relative to the individual methods (i.e., state-of-the-art ML methods, including, e.g., random
forest) that form the ensemble, in the range of a 1.5% to 19.1% increase in $R^2_{OOT}$; in comparison
to highly specialized methods from the literature, the positive impact is even several times
higher (with an average increase in $R^2_{OOT}$ of 172.4% to 210.3%). None of the methods
considered—ML methods and highly specialized models—outperform the nonlinear ensemble,
highlighting the benefits of model averaging for predictive accuracy. We also observe that the
sizeable superiority of the nonlinear ensemble is very persistent over time, consistently yielding
highly competitive $R^2_{OOT}$ values in all periods. The consistent outperformance of the nonlinear
ensemble supports the idea that model averaging not only provides better predictive accuracy,
but also produces more stable forecasts, which is important from a practical perspective since
we are operating in a highly regulated domain.
(iii) As a critical enabler for the acceptance and adoption of ML applications in financial
services, we unpack the sources of superior predictability in notoriously ‘black box’ models
1 Since we rely on a loss database, it is infeasible to include the probability of default (PD) in our modeling
exercise, as, by definition, all facilities are already defaulted. Regarding the PD, we refer the interested reader to
Barboza et al. (2017), Berg et al. (2020), and Fuster et al. (2022).
Our study builds on several strands in the literature: (I) credit risk modeling, (II) ML
applications in finance and economics, and (III) explainable ML methods. (I) In the context of
credit risk modeling, recent studies have focused on developing sophisticated statistical
methods to model EaD, particularly attempting to account for its highly skewed distribution.
For this purpose, various forms of mixture models have been proposed (Hon and Bellotti 2016;
Leow and Crook 2016; Thackham and Ma 2019; Betz et al. 2022), different distributions have
been assumed (e.g., a zero-adjusted gamma distribution (Tong et al. 2016)), and panel data
methods have been used (Hon and Bellotti 2016; Leow and Crook 2016). However, to the best
of our knowledge, no formal treatment of various ML methods exists. Thus, we contribute to
this literature by providing a comparative study with a focus on ML methods including model
averaging. Similarly to EaD, many statistical methods exist that specifically address LGD and
its bimodal distribution (e.g., quantile regressions, fractional response regressions, beta
regressions, or different mixture models) (Krüger and Rösch 2017; Min et al. 2020). In contrast
to EaD, the use of ML methods for predicting LGD is more widespread (Bastos 2010; Qi and
Zhao 2011; Altman and Kalotay 2014; Kalotay and Altman 2017; Nazemi and Fabozzi 2018;
Kaposty et al. 2020; Olson et al. 2021; Kellner et al. 2022; Nazemi et al. 2022). In this regard,
Loterman et al. (2012) and Bellotti et al. (2021) are closest to our study, as they benchmarked
(II) Our study adds to the burgeoning literature on ML applications in finance and
economics. Currently, there is an ever-increasing number of studies on benchmarking ML
methods, for example, in the context of asset pricing (Chinco et al. 2019; Feng et al. 2020; Gu
et al. 2020; Bianchi et al. 2021; Leippold et al. 2022), credit scoring (Fuster et al. 2022), and
human decision making (Kleinberg et al. 2018; Erel et al. 2021), and corporate governance
(Bandiera et al. 2020; Li et al. 2021). Providing a benchmarking study in risk management
appears highly beneficial, since, from a practical perspective, this represents one of the most
common applications of ML methods in financial services.
(III) We build upon the emerging stream of literature that addresses the explainability of
artificial intelligence and ML (Bracke et al. 2019; Bellotti et al. 2021; Bussmann et al. 2021;
Bastos and Matos 2022). ML models show superior predictive abilities in many domains, but
their practical implementation is often hindered by stakeholders’ desire for model
interpretability (Horel and Giesecke 2022). This is particularly concerning in a highly regulated
environment such as credit risk modeling. To this end, we use explainable ML methods as a critical component for the acceptance and adoption of ML applications in financial services.
The remainder of this paper is organized as follows. Section 2 outlines the institutional
setting and modeling framework. Section 3 describes our data and methodology. Section 4
reports our results on the benefits of ML in credit risk modeling across the globe and unpacks
the sources of predictability. Section 5 provides robustness checks and extensions. Section 6
concludes.
In this section, we briefly review the institutional setting to outline the regulatory requirements.
The Basel Accords provide a framework for a stable financial system, which the regulator
continuously strengthened in the aftermath of the global financial crisis. Within this
framework, the Basel Committee on Banking Supervision (BCBS) permits the use of internal
ratings, given that banks can ensure “the integrity, reliability, consistency, and accuracy of both
internal rating systems and estimates of risk components” (BCBS 2001, p. 41). Ultimately, in
the so-called advanced internal ratings-based (A-IRB) approach, banks use their own estimates
of different credit risk parameters for each credit facility in the portfolio: PD, EaD, and LGD.
The estimates are required to calculate EL as well as risk-weighted assets and, hence,
regulatory capital requirements.
$$ CCF_{t,\tau} = \begin{cases} \dfrac{EaD_{t,\tau} - e_t}{CL_t - e_t}, & \text{if } CL_t - e_t > 0, \\ 0, & \text{else.} \end{cases} \qquad (1) $$
The exposure at time $t$ is defined as $e_t := \max\{-B_t, 0\}$, where $B_t$ denotes the balance. The corresponding credit limit is $CL_t$. CCF can be transformed into EaD estimates as follows:

$$ EaD_{t,\tau} = e_t + CCF_{t,\tau} \cdot (CL_t - e_t). \qquad (2) $$
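Eqs. (1) and (2) are inverses of each other whenever part of the limit is undrawn; a minimal sketch (function and variable names are ours, not from the paper):

```python
def ccf(ead, e_t, cl_t):
    """Credit conversion factor per eq. (1): the share of the unused limit
    (CL_t - e_t) that is drawn down by the time of default."""
    headroom = cl_t - e_t
    return (ead - e_t) / headroom if headroom > 0 else 0.0

def ead_from_ccf(ccf_t, e_t, cl_t):
    """Transform a CCF estimate into an EaD estimate per eq. (2)."""
    return e_t + ccf_t * (cl_t - e_t)

# Illustration: EUR 35 drawn out of a EUR 50 limit, defaulting at EUR 44:
assert abs(ccf(44.0, 35.0, 50.0) - 0.6) < 1e-12           # 60% of the headroom drawn
assert abs(ead_from_ccf(0.6, 35.0, 50.0) - 44.0) < 1e-12  # round trip back to EaD
```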
In general, banks must consider “all relevant, material and available data, information and
methods” when estimating risk parameters (BCBS 2019, p. 20) and the BCBS particularly
permits the use of pooled data (BCBS 2019, p. 20): “A bank may utilise internal data and data
from external sources (including pooled data).” In our implementation, we adhere closely to
the regulatory requirements and estimate two key risk parameters. Regarding EaD, we follow
the BCBS recommendation and use CCF. The important difference between the indirect (CCF)
and direct approach (i.e., EaD as the response variable) is that under CCF, all observations are
weighted equally, which is not the case at the EaD level because of the absolute default
volumes. We represent the second key parameter in our modeling exercise as $RR = 1 - LGD$ for ease of interpretation (i.e., which proportion of EaD could be recovered), which is common in the literature.
To fix ideas, credit lines are agreements made at time t between a lender and a borrower that
provide a maximum associated Euro amount (i.e., credit limit 𝐶𝐶𝐿𝐿𝑡𝑡 ) that can be drawn down by
the borrower at its own discretion at any time up to the expiration date T, which makes
modeling them particularly challenging in practice (Jiménez et al. 2009; Hibbeln et al. 2020).
We consider a setup with CCLs (i.e., $CCL_i = (CCL_1, \ldots, CCL_n)$), where $n$ is the total number of observed credit lines in $A_i = (A_1, \ldots, A_a)$ different asset classes over the observation period $[t_0, T]$. Let us further define an indicator function $D_t \in \{0, 1\}$, which describes a default of a credit line at time $t$ if and only if $D_t = 1$. 3 The default time is defined as $\tau := \min\{t \mid D_t = 1\}$.
Our objective is to predict the CCF and RR for each defaulted $CCL_i$ in our sample (i.e., $D_t = 1$) based on various observable credit line-specific features $x_t^C$ and macroeconomic features $x_t^M$. Hence, the objective is to identify a functional form $\hat{y} = \hat{f}(x) \in M$ that maps the observable features $x$ into a prediction $\hat{y}$. For concreteness, let $M \in (1, \ldots, m)$ index the set of different ML methods. Let us assume that the default time of a credit line is $t = \tau$, which provides us with
information about the exposure at the time of default (EaD). From a modeling perspective, we
2 The RRs, for example, for loans or credit lines, are determined by the actual recovery cash flows and direct or
indirect costs of the specific debt position and are, therefore, referred to as ‘workout RRs’. In contrast, the RRs,
for example, for corporate bonds are typically determined based on market values and are, therefore, referred to
as ‘market RRs’. For a detailed discussion on the differences, we refer the interested reader to Calabrese and
Zenga (2010) and Gürtler and Hibbeln (2013).
3 According to the Basel II default definition, a default is triggered by two events: the bank considers it unlikely
that the borrower will meet its credit obligations, or the borrower is more than 90 days past due.
We obtain data on CCLs for the time period 2000–2020 from the world’s largest loss database
of corporate defaults provided by Global Credit Data, an international not-for-profit
association. This database currently pools historical credit data from 58 member banks across
the globe, including many systemically important banks. The database is highly representative
of the North American and Western/Northern European regions, as member banks comprise
more than one-third and nearly one-half, respectively, of the total loan assets of the 500 largest
banks worldwide in their respective regions. The data enables member banks to model credit
risk, for example, to calculate regulatory capital requirements or calibrate and benchmark
internal EaD (CCF) and LGD (RR) models.
In our main analysis, we restrict the sample to credit lines in the asset classes of small and
medium-sized enterprises (SMEs) and large corporates (LCs), as these two segments are
categorized as general corporate exposures under the regulatory guidelines. We restrict the
sample to defaults after 2000 to ensure a consistent default definition under Basel II. Our last
observation period, in terms of defaults in the CCF sample, is the end of 2019 to meet a
materiality threshold regarding the number of observations in a specific year. Regarding the
RR sample, we consider defaults until the end of 2017 because we use realized RRs up to three years after default. 4

4 In robustness checks, we use different recovery horizons to rule out that the results are driven by this choice.
We apply the following filters to clean our data. We remove observations with a limit less
than or equal to zero, as these facilities cannot be considered real credit lines. For the CCF
sample, we remove observations for which no information is available for around one year (11
to 13 months) prior to default to ensure a consistent horizon in the estimation. For the RR
sample, we remove observations if banks no longer update information on these defaults
(incomplete portfolios; i.e., if we do not observe transaction data for at least three years and
the case is not resolved). Finally, we remove observations with missing values for both credit
line-specific and macroeconomic features. We implement a floor and cap of CCF at [-5.0, 5.0]
and of RR at [-0.1, 1.1], and winsorize the other credit line-specific features at 0.5% and 99.5%
levels to account for outliers and avoid instability in the parameters (Leow and Crook 2016;
Tong et al. 2016; Gürtler et al. 2018); this is motivated by the fact that some methods (e.g., the
linear regression (LR)), are highly sensitive to outliers (Krüger and Rösch 2017). In summary,
we obtain 12,895 observations in the CCF sample and 14,046 observations in the RR sample.
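The floor/cap and winsorization steps can be sketched as follows (thresholds from the text; function names are ours):

```python
import numpy as np

def floor_cap(x, lo, hi):
    """Hard floor/cap, e.g. CCF to [-5.0, 5.0] and RR to [-0.1, 1.1]."""
    return np.clip(x, lo, hi)

def winsorize(x, lower_pct=0.5, upper_pct=99.5):
    """Winsorize at the 0.5% and 99.5% levels: values beyond the
    percentile thresholds are replaced by the threshold values."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)
```

Winsorizing (rather than dropping) outliers keeps the sample size intact while protecting outlier-sensitive methods such as the linear regression.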
Table 1 shows the descriptive statistics of the credit line-specific features. Panels A and B
present the features used in modeling CCF and RR, respectively. Across most features, we
observe comparable values, but differences can occur for various reasons. First, the timing of
the measurement is different, as Panel A refers to features one year prior to default, while Panel
B refers to features at the time of default. Second, due to various sample restrictions, the credit
lines considered are not completely identical.
To gain better insights into the data composition, we present the share of observations
across asset classes, industries, regions, rating categories, and currencies for both samples in
Table 2. We also present the distributions of all indicator variables in our dataset. More than
two-thirds of the observations are from SMEs. Only a small proportion of observations are
from the financial, real estate, and insurance (FIRE) industry, and most are non-investment-grade.
We collect a broad set of 161 global, local (country-specific), and newspaper-based (i.e., based
on textual analysis) macroeconomic features covering various categories, such as indicators of
economic conditions, stock market conditions, credit market conditions, general corporate
measures, and policy or world uncertainty indices. The features range from daily to annual
frequency and originate from various sources (e.g., Worldbank; FRED (St. Louis Fed);
Refinitiv Eikon; Baker et al. 2016, 2021; Caldara 2020; Ahir et al. 2022). Local features are
only collected above a threshold of 10 credit lines from a country, as a trade-off between the
number of observations and the number of macroeconomic features with a complete time
series. We collect not only the level but also the change in macroeconomic features. 5 From this
set, we select the macroeconomic features using the randomized least absolute shrinkage and
selection operator (lasso), 6 a state-of-the-art feature selection procedure (Meinshausen and
Bühlmann 2010).
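Stability selection with a randomized lasso in the spirit of Meinshausen and Bühlmann (2010) can be sketched as follows; this is an illustrative Python sketch, not the paper's exact R configuration, and all parameter defaults are our assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_lasso_selection(X, y, alpha=0.05, weakness=0.5,
                               n_resamples=100, threshold=0.6, seed=0):
    """Fit the lasso on random half-samples with randomly weakened per-feature
    penalties and keep features selected in a high share of the fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        # Rescaling columns by a factor in [weakness, 1] randomly weakens
        # the effective L1 penalty per feature (the "randomized" part).
        scale = rng.uniform(weakness, 1.0, size=p)
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[idx] * scale, y[idx])
        hits += (model.coef_ != 0).astype(float)
    # Keep features whose selection frequency exceeds the threshold.
    return np.flatnonzero(hits / n_resamples >= threshold)
```

Features that survive only a few random penalty draws are discarded, which makes the selection far more stable than a single lasso fit.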
We choose a model validation approach that preserves the time dependency of defaults to
reflect the most realistic scenario from a practical and regulatory perspective. 7 In our
5 To ensure stationarity of the macroeconomic features, we apply various transformations where necessary.
6 For brevity, we outline the details regarding this procedure in the Internet Appendix Part A.1.
7 We refer the interested reader to Hibbeln et al. (2023), who demonstrate the importance of model validation as a central element of any ML workflow regarding error estimates and model selection.
8 For brevity, we outline the details regarding the methods that form the benchmark ensemble (i.e., LR, FRR, and
mixture models), the ML methods that form the nonlinear ensemble (i.e., Cubist, RF, and S-GBM), and the
$$ \hat{y}^{EN}_{i,t} = \frac{1}{N} \sum_{M \in K} \hat{y}^{M}_{i,t}, \qquad (3) $$
with K representing the set of composite models and N representing the number of models in
the given ensemble (BM-EN and NL-EN). Forming ensembles appears appealing for several
reasons (Steel 2020): (I) a broad strand of literature documents large benefits of ensembles in
various domains, for example, economics or weather forecasts, (II) ensembles inherently allow us to combine the predictive performance of multiple models, thus addressing model uncertainty,
(III) ensembles allow simple comparisons between benchmark models and more sophisticated
ML models, and (IV) ensembles are also appealing to tackle the issue of model multiplicity,
which refers to models with similar predictive accuracy but with a different decision surface
embedded in the model, for example, due to differences in the importance of certain features.
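The equally weighted ensemble of eq. (3) reduces to averaging the composite models' predictions; a sketch with assumed model names:

```python
import numpy as np

def ensemble_forecast(predictions):
    """Equally weighted forecast ensemble per eq. (3): the ensemble
    prediction for each credit line is the mean of the composite models'
    predictions. `predictions` maps model name -> per-credit-line forecasts."""
    stacked = np.vstack(list(predictions.values()))
    return stacked.mean(axis=0)

# NL-EN sketch with assumed per-line CCF forecasts from the three composites:
nl_en = ensemble_forecast({
    "cubist": np.array([0.2, 0.4]),
    "rf":     np.array([0.4, 0.6]),
    "s_gbm":  np.array([0.6, 0.8]),
})
```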
We use several competing models, with $M_0$ being the set of competing objects, to predict the credit risk parameters CCF and RR. To assess the predictive ability of the various methods, we use the out-of-time $R^2$ ($R^2_{OOT}$). 9 Since our implementation uses a validation approach that preserves the time dependency of defaults, we re-estimate a given model at each time $t$ (with $t = 0, 1, 2, \ldots, P$) and, hence, compute the post-estimation evaluation metrics for each period $p$ in the interval $[t, P]$. Thus, the out-of-time performance metrics reflect weighted averages with
hyperparameters in the Internet Appendix Part A.2. For more details on all methods, we refer the interested reader
to Hastie et al. (2009) and Kuhn (2013). For the implementation of the ML methods, we use the package caret in
R (Kuhn 2008); for the BEINF model, we use the package gamlss in R (Stasinopoulos et al. 2007); and for FMM,
we use the package flexmix in R (Gruen et al. 2020).
9 In the Internet Appendix Part A.3 and Part C (Tables C.1-C.3), we discuss and additionally report the root mean square error (RMSE), mean absolute error (MAE), and Hansen et al.'s (2011) model confidence set procedure.
$$ R^2_{OOT,p} = 1 - \frac{\frac{1}{n}\sum_i \left( y_i - \hat{y}^M_i \right)^2}{\frac{1}{n}\sum_i \left( y_i - \bar{y}_{train} \right)^2}, \qquad (4) $$
where $y_i$ is the actual realization of the $i$-th credit line $i \in (1, \ldots, n)$ in $CCL_{test}$, $\hat{y}^M_i$ is the prediction from the respective model $M \in (1, \ldots, m)$, and $\bar{y}_{train}$ is the historical average in $CCL_{train}$. If $R^2_{OOT,p}$ is greater than zero, the considered model $M$ exhibits better predictive accuracy than the historical average. The computation of $R^2_{OOT,p}$ for a given period $p$ follows Campbell and Thompson (2008) and is commonly referred to in the literature as the out-of-sample $R^2$ ($R^2_{OOS}$). The only difference is that, in the computation of $R^2_{OOT,p}$, the time dependency of defaults is preserved, while this is not necessarily the case for $R^2_{OOS}$.
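Eq. (4) translates directly into code; the key point is that the benchmark mean comes from the training window, not the test window (a sketch with names of our choosing):

```python
import numpy as np

def r2_oot(y_test, y_pred, y_train):
    """Out-of-time R^2 per eq. (4): the benchmark is the historical
    average of the training window, not the test-sample mean."""
    y_test, y_pred, y_train = map(np.asarray, (y_test, y_pred, y_train))
    ss_model = np.mean((y_test - y_pred) ** 2)
    ss_hist = np.mean((y_test - np.mean(y_train)) ** 2)
    return 1.0 - ss_model / ss_hist
```

A model that merely reproduces the historical average scores zero; negative values mean the model is worse than that naïve predictor.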
$$ \rho(M1, M2) = \frac{Cov(\hat{e}_{M1}, \hat{e}_{M2})}{\sigma(\hat{e}_{M1})\, \sigma(\hat{e}_{M2})}, \qquad (5) $$

with $\sigma(\cdot)$ denoting the standard deviations of the forecast errors of $M1$ and $M2$, and $Cov(\cdot, \cdot)$ denoting the covariance between the forecast errors of $M1$ and $M2$.
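Eq. (5) is simply the Pearson correlation of the two models' error series; a sketch:

```python
import numpy as np

def forecast_error_correlation(e1, e2):
    """Correlation of two models' forecast errors per eq. (5)."""
    e1, e2 = np.asarray(e1, dtype=float), np.asarray(e2, dtype=float)
    cov = np.mean((e1 - e1.mean()) * (e2 - e2.mean()))
    return cov / (e1.std() * e2.std())
```

Low across-group correlations indicate genuinely different decision surfaces, which is exactly what makes averaging the models worthwhile.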
4. Empirical Results
In this section, we begin by taking a statistical perspective to provide insights into the
distribution and dispersion of performance metrics, the correlations of forecast errors, and the
distribution of actual vs. predicted values for CCLs across the globe. Next, we dissect out-of-time performance across different asset classes, industries, and regions.
Table 3 shows a comparative analysis across different model types for modeling both credit
risk parameters, CCF (Panel A) and RR (Panel B). In our results, we distinguish between the
benchmark and nonlinear ensembles (BM-EN and NL-EN) as well as the averages of the
included composites, tree-based methods, nonlinear methods, and linear methods. 10
Overall, we find large benefits in using sophisticated ML methods. The NL-EN produces an $R^2_{OOT}$ of 43.5% and 19.7% for CCF and RR, respectively, which is higher than the BM-EN
by a factor of 2.7 and 3.1, while being much more flexible and universally applicable in
common banking practice. The downward shift in performance for RR indicates that RR
estimation is relatively more difficult; this is intuitively plausible since CCF is estimated for a
one-year horizon, whereas RR is estimated for a three-year recovery horizon. Turning to the
NL-EN Composite (i.e., Cubist, RF, and S-GBM), we observe an average (minimum;
maximum) performance increase relative to the BM-EN in the order of 160.8% (156.9%;
168.4%) and 169.9% (160.6%; 188.0%) for CCF and RR, respectively. This in turn implies
that, on average (minimum; maximum), the NL-EN Composite falls short in performance
relative to the NL-EN in the order of 4.3% (1.5%; 5.7%) and 13.0% (7.2%; 16.0%) for CCF
and RR, respectively, resulting in a positive impact of model averaging in the range of a 1.5% to 19.1% increase in $R^2_{OOT}$.

10 For completeness, we report the results for the individual methods in the Internet Appendix Part C (Table C.3).
Next, we consider in Figure 1 the dispersion of predictive performance over time, using the annual unweighted $R^2_{OOT,p}$ values. Given that we operate in a highly regulated domain, this
markedly outperform the benchmark models in all out-of-time years. It follows that in certain
years a very naïve predictor, i.e., the historical average, would have been the preferred choice
11 This also provides two avenues for future research—skipping the pre-selection of macroeconomic features and
specific tailoring of the ML methods to the different credit risk parameters—to further boost the performance of
ML methods compared to our benchmarks.
methods forming the NL-EN, both on average (circle) and median (bar), and even by a much
larger margin compared to the benchmark models. As an alternative way to explore the
differences between predictive techniques, we also examine the overall distribution of actual
vs. predicted values for both credit risk parameters. We find that the predicted values of the
nonlinear ensemble better mirror the actual distributions, which explains the better predictive
performance of the sophisticated ML methods. 12
Taken together, these results provide strong evidence for the benefits of a more holistic
approach to credit risk modeling that reinforces the use of identical model implementations for
different credit risk parameters. It is important to recognize, however, that the improvements
achieved through using sophisticated ML methods are substantial not only from a statistical
perspective, but also from an economic perspective for several reasons. One consideration is
that a holistic approach to credit risk modeling has a potentially tremendous impact from an
operational perspective. The ability to rely on uniform modeling approaches for different credit
risk parameters translates into a potentially massive reduction in the operational resources
required for model development and validation for financial institutions, regulatory authorities,
and academics. Another consideration is the monetary perspective, for which we estimate the
lower bound of the impact of applying ML methods to our dataset to be in the order of EUR 5
bn in terms of a more accurate prediction of EaD. 13 As banks apply the calibrated models to
their non-defaulted portfolios, these numbers can translate into values that are many times
12 We report the corresponding figure in the Internet Appendix Part C (Figure C.1).
13 The NL-EN outperforms the BM-EN by more than 30 percentage points, with values of 0.73 and 1.03,
respectively, based on the MAE as performance metric. Considering a cumulative limit and an outstanding amount
one year prior to default of EUR 50 bn and EUR 35 bn, respectively, and recalling the EaD formula in eq. (2), the
application of ML methods leads to a more accurate prediction of the EaD by about EUR 5 bn.
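One way to read the arithmetic in footnote 13, via eq. (2) (a back-of-envelope sketch; the paper's exact aggregation may differ):

```python
# All figures are from footnote 13; this simply retraces its arithmetic.
mae_nl_en, mae_bm_en = 0.73, 1.03        # CCF-level MAE of NL-EN vs. BM-EN
limit_bn, outstanding_bn = 50.0, 35.0    # cumulative CL and e one year prior to default
undrawn_bn = limit_bn - outstanding_bn   # EUR 15 bn of undrawn commitments
# By eq. (2), a CCF error is scaled by (CL - e) when mapped to the EaD level:
ead_gain_bn = (mae_bm_en - mae_nl_en) * undrawn_bn  # about EUR 4.5 bn, i.e., "EUR 5 bn" order
```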
Next, we turn to forecast error correlations according to eq. (5) in an effort to better understand
the differences in predictive abilities across predictive techniques. 14 The main insight is that
two groups clearly show high within-group correlations: one group consists of the benchmark
models, and the other group consists of the sophisticated ML models. Numerically, we find
high correlation values of $\rho > 0.95$ for the group of benchmark models for both credit risk parameters. For the ML models, this insight holds for CCF, but only to a lesser extent for RR, with $\rho \approx 0.90$ for the individual models. Recalling the considerably larger benefits of model averaging for RR, this observation reinforces the idea that ensembles are able to combine the nuances of the different models to create an even more powerful predictive technique. By construction, the correlation levels of the ensembles and the contained models (i.e., composite) are relatively high, with $\rho > 0.96$ and in many cases even $\rho > 0.99$. At the same time, across-group correlations between the NL-EN and BM-EN are significantly lower, with $\rho \approx 0.88$ and $\rho \approx 0.74$ for CCF and RR, respectively. Viewing predictive abilities from this perspective
suggests that the decision surface embedded in the different model types (i.e., benchmark
models and ML models) is inherently different, which is intuitively plausible given that some
benchmark models are highly specialized in predicting specific credit risk parameters, while
ML models are much more flexible in their functional forms. Recognizing that the correlation
levels of the ensembles and the contained models are relatively high, while at the same time
the ensembles produce better predictive performance, in what follows we mainly compare the
nonlinear to the benchmark ensemble.
14 For brevity, we report the forecast error correlations in the Internet Appendix Part C (Table C.4).
To complete the picture, we now investigate whether the superior predictive abilities also hold
across different asset classes, industries, and regions. The analyses build on the models trained
on the respective sample available (expanding window) and are aggregated into a weighted
average, in line with the baseline analysis from Section 4.1.1. To produce the asset class-
specific metrics, we use the average from the training set for a particular asset class in the computation of $R^2_{OOT,p}$. The same procedure is used for industry- and region-specific
dissections. Table 4 shows the dissections for both parameters (CCF and RR). Panel A shows
the predictability across asset classes, Panel B across industries, and Panel C across regions.
Across the different dissections, we reach the same conclusion as in our main analysis.
The NL-EN provides superior predictive abilities across all subsamples, regardless of the risk
parameter considered, with observed performance increases for the NL-EN of at least 83% for
each dissection and even several times higher for both credit risk parameters in multiple
dissections. On average, we find a performance increase from the BM-EN to the NL-EN of
194.9% and 266.6% for CCF and RR, respectively, which is even more pronounced compared
to our main analysis, in which we observed a performance increase of 172.4% and 210.3%,
respectively. This advocates for the flexibility of sophisticated ML methods, which seem to be
able to better capture the nuances of the modeling problem with multiple industries, different
asset classes, and regions.
Superior predictive abilities alone are not sufficient in a highly regulated environment. To gain
a better understanding of the superior predictive abilities in notoriously ‘black box’ models, we
explore techniques from the explainable ML toolbox. To this end, we unpack the main drivers
of predictability for individual features (Figure 2), outline the sensitivity of the impact of the
most important features (Figure 3), and provide insights into the dynamics of feature
importance over time. To measure the importance of a given feature, we rely on Shapley values,
which are based on a concept borrowed from cooperative game theory and approximate the
average marginal contribution of each feature. For implementing Shapley values, we use the
package fastshap in R (Greenwell 2020).
A natural starting point for understanding the superior predictive abilities of an ML model is
the relative importance of the most important features to the model’s performance. The features
in Figure 2 are normalized such that they sum to one across all considered features, allowing
for relative interpretation, and are sorted by their overall importance for the NL-EN. For
brevity, we report only the top-15 features, which cover around 90% (CCF) and 78% (RR) of
the total importance for the NL-EN. We also show the min-max range and median feature
importance of the composite models (i.e., Cubist, RF, and S-GBM). Importantly, by virtue of
the design of the NL-EN, the feature importance for the NL-EN reflects on the credit line-level
the equally weighted average across the composite models; this is ensured by the linearity
axiom for Shapley values. Formally, this can be expressed as follows: the effect of a given feature on a weighted sum of two functions is the same weighted sum of its effects on each function; i.e., $\phi_i(\alpha f + \beta g) = \alpha \phi_i(f) + \beta \phi_i(g)$ for any two models $f$ and $g$ and scalars $\alpha$ and $\beta$.
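The linearity axiom can be verified numerically with exact Shapley values of toy value functions; this self-contained illustration is ours (the paper uses fastshap approximations on fitted models):

```python
from itertools import combinations
from math import factorial

def shapley(value, players):
    """Exact Shapley values of a cooperative game `value` over `players`."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for s in combinations(others, r):
                # Standard Shapley weight for a coalition of size |s|.
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(frozenset(s) | {i}) - value(frozenset(s)))
        phi[i] = total
    return phi

# Two toy games and their equally weighted average (alpha = beta = 1/2),
# mimicking how the NL-EN averages its composite models.
f = lambda s: len(s) ** 2
g = lambda s: 3.0 if 1 in s else 0.0
avg = lambda s: 0.5 * f(s) + 0.5 * g(s)

players = [1, 2, 3]
phi_f, phi_g, phi_avg = shapley(f, players), shapley(g, players), shapley(avg, players)
# Linearity: phi_i(0.5 f + 0.5 g) == 0.5 phi_i(f) + 0.5 phi_i(g) for each i.
assert all(abs(phi_avg[i] - 0.5 * phi_f[i] - 0.5 * phi_g[i]) < 1e-9 for i in players)
```

This is why the NL-EN's feature importance is, at the credit-line level, exactly the equally weighted average of the composites' Shapley values.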
To better understand the feature importance, we start by discussing the overall results
based on the NL-EN. For the CCF, we find that the undrawn percentage is by far the most
important feature, accounting for more than 20% of total importance. The picture is less
Most striking in this figure, however, is the large variation of feature importance for
specific features, indicated by the wide range of min-max values and min-max ranks suggested
by the different models. This clearly emphasizes the suitability of ensembles for tackling the issue of model multiplicity, which refers to models with similar predictive accuracy but different decision surfaces, for example, due to differences in the importance of certain features. We mostly observe a feature importance for the NL-EN within
the range of the composite models, which is reasonable as the NL-EN is an equally weighted
average across the composite models. However, we not only observe cases where the
composite models assign a different importance to certain features (e.g., with rank differences
of up to 27), but also that the feature importance suggested by the NL-EN, with a value of less
than 6% of the total importance for the outstanding amount (Panel A – CCF), is substantially
[15] We report the results of the group feature importance plots in the Internet Appendix Part C (Figure C.2); for this purpose, we allocate each feature to one of four groups: borrower-specific features, credit line-specific features, security-related features, and macroeconomic features.
To provide additional perspective, we analyze the sensitivity of the impact of the most important features. To this end, since we are interested in the impact across feature values (i.e., across the distribution), we first average the Shapley values for a given feature level before dividing the CCLs into 20 buckets according to the feature values.[16] Figure 3 shows the mean and SD of Shapley values across buckets, with feature values normalized between 0 and 1 for visualization.
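The averaging and bucketing steps described above can be sketched as follows (illustrative Python with pandas; the synthetic data stand in for the actual CCL sample):

```python
import numpy as np
import pandas as pd

def shapley_sensitivity(feature, shap, n_buckets=20):
    """Mean and SD of Shapley values across equal-width buckets of a feature,
    with feature values normalized to [0, 1] for visualization."""
    df = pd.DataFrame({"value": feature, "shap": shap})
    # first average the Shapley values for each distinct feature level
    df = df.groupby("value", as_index=False)["shap"].mean()
    v = df["value"]
    df["value_norm"] = (v - v.min()) / (v.max() - v.min())
    # then divide the observations into n_buckets buckets of feature values
    df["bucket"] = pd.cut(df["value_norm"], bins=n_buckets,
                          labels=False, include_lowest=True)
    return df.groupby("bucket")["shap"].agg(["mean", "std"])

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, size=1000)
phi = 0.5 - 0.01 * x + rng.normal(scale=0.05, size=1000)  # synthetic Shapley values
out = shapley_sensitivity(x, phi)
print(out.head())
```

With this synthetic downward-sloping relation, the bucket means decline from positive to negative across the feature range, mirroring the kind of pattern plotted in Figure 3.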
For the credit limit, which is one of the most important features for both credit risk
parameters, we find that relatively smaller credit limits affect the CCF prediction positively,
while relatively larger credit limits affect the CCF prediction negatively. This result nicely
brings together two well-documented facts from the literature: first, smaller firms (with on
average much lower credit limits) rely more heavily on CCLs as a source of financing
[16] To complement this, in the Internet Appendix Part C (Figure C.3) we report bee-swarm plots for the top-15 features that provide further understanding of the directional impact on predictions.
The large overall impact of credit line-specific features, the large variations in the importance of some features, and the fact that model refitting is costly in practice raise the intriguing question of the dynamics of feature importance over time.[18] Overall, we find a general trend with some fluctuations. The undrawn percentage in the CCF panel consistently ranks first, which could be expected given its significantly greater importance compared to all other
[17] In the Internet Appendix Part C (Figure C.4), we present the sensitivities for the credit limit and outstanding amount for the different models, i.e., the NL-EN and the composite models, to illustrate this behavior.
[18] We show the feature importance dynamics over time in the Internet Appendix Part C (Figure C.5).
In summary, we have used various means to open the 'black box' of ML methods in credit risk modeling, particularly based on nonlinear ensembles, to understand the drivers of their superior predictive abilities. We think this is a critical enabler for the acceptance and adoption of ML applications in financial services, where the practical implementation of ML models is often hindered by stakeholders' desire for model interpretability.
[19] In all robustness checks and extensions, we use the macroeconomic features selected in the main analysis unless otherwise specified.
To provide further insight, we extend our main analyses of CCLs to non-core corporate
borrowing segments (i.e., specialized lending) and private banking (i.e., high net worth
individuals). To the best of our knowledge, this is the first treatment of these two asset classes
in the literature with respect to the credit risk parameters CCF and RR. Even though these asset classes are less common, they comprise credit lines with a defaulted exposure of EUR 8 bn for specialized lending and EUR 1 bn for private banking. In this exercise, we simply use the same
approach as for the corporate segment; that is, we use the same credit line-specific features,
selected macroeconomic features, and hyperparameter set and do not address the unique
technicalities of these specific asset classes. The results for both asset classes are qualitatively
consistent with our main conclusions; the nonlinear ensemble yields a positive R²_OOT in all specifications, while for the benchmark ensemble this is the case only for the RR in private banking.
6. Conclusion
Based on the world’s largest loss database of corporate defaults, we perform a comparative
analysis of ML methods in credit risk modeling across the globe, covering many systemically
important banks. We find that ML methods—including a battery of individual methods and
forecast ensembles—provide superior predictive abilities along multiple dimensions (i.e., asset
classes, industries, and regions) and consistently outperform benchmarking methods (e.g.,
mixture models). Providing a consistent modeling exercise and benchmarks that work well for different credit risk parameters around the globe reinforces the benefits of a more holistic approach to credit risk modeling, one that is also much more flexible and universally applicable in common banking practice. From a practical perspective, the ability to rely on uniform modeling
approaches for different credit risk parameters translates into potentially massive reductions in
operational resources required for model development and validation.
Our results are robust to a battery of different specifications and important not only from a statistical point of view but also from an economic and regulatory perspective. Applied to our global default database, we estimate that the more accurate EaD predictions of ML methods amount to at least approximately EUR 5 bn; this figure multiplies when the calibrated models are applied to the banks' portfolios of non-defaulted facilities, potentially by a factor of more than 50, considering that the average default rate of corporate borrowers is
typically less than 2%. Overall, we provide benchmarks that work well for various credit risk
parameters in a holistic credit risk modeling exercise across the globe and present guidelines
for selecting, implementing, and validating ML approaches in credit risk modeling that we
expect to be relevant to financial institutions, regulatory authorities, and academics.
References
Acharya, V. V., & Steffen, S. (2020). The Risk of Being a Fallen Angel and the Corporate
Dash for Cash in the Midst of COVID. Review of Corporate Finance Studies, 9, 430–
471.
Ahir, H., Bloom, N., & Furceri, D. (2022). The World Uncertainty Index. Working Paper.
Altman, E. I., & Kalotay, E. A. (2014). Ultimate Recovery Mixtures. Journal of Banking &
Finance, 40, 116–129.
Athey, S., & Imbens, G. W. (2019). Machine Learning Methods that Economists Should Know
About. Annual Review of Economics, 11, 685–725.
Baker, S. R., Bloom, N., & Davis, S. J. (2016). Measuring Economic Policy Uncertainty.
Quarterly Journal of Economics, 131, 1593–1636.
Baker, S. R., Bloom, N., Davis, S. J., & Kost, K. (2021). Policy News and Stock Market
Volatility. Working Paper.
Bali, T. G., Beckmeyer, H., Moerke, M., & Weigert, F. (2023). Option Return Predictability
with Machine Learning and Big Data. Review of Financial Studies, forthcoming.
Bandiera, O., Prat, A., Hansen, S., & Sadun, R. (2020). CEO Behavior and Firm Performance.
Journal of Political Economy, 128, 1325–1369.
Bank of Canada. (2018). The Bank of Canada’s Financial System Survey.
Bank of England. (2019). Machine Learning in UK Financial Services.
Barboza, F., Kimura, H., & Altman, E. (2017). Machine Learning Models and Bankruptcy
Prediction. Expert Systems with Applications, 83, 405–417.
Basel Committee on Banking Supervision. (2001). The Internal Ratings-Based Approach.
Basel Committee on Banking Supervision. (2019). CRE 36.
Bastos, J. A. (2010). Forecasting Bank Loans Loss-Given-Default. Journal of Banking &
Finance, 34, 2510–2517.
Bastos, J. A., & Matos, S. M. (2022). Explainable Models of Credit Losses. European Journal
of Operational Research, 301, 386–394.
Bellotti, A., Brigo, D., Gambetti, P., & Vrins, F. (2021). Forecasting Recovery Rates on Non-performing Loans with Machine Learning. International Journal of Forecasting, 37, 428–444.
Berg, T., Burg, V., Gombović, A., & Puri, M. (2020). On the Rise of Fintechs: Credit Scoring
Using Digital Footprints. Review of Financial Studies, 33, 2845–2897.
Berg, T., Saunders, A., & Steffen, S. (2021). Trends in Corporate Borrowing. Annual Review
of Financial Economics, 13, 321–340.
Betz, J., Kellner, R., & Rösch, D. (2021). Time Matters: How Default Resolution Times Impact
Final Loss Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics),
70, 619–644.
Betz, J., Nagl, M., & Rösch, D. (2022). Credit Line Exposure at Default Modelling Using
Bayesian Mixed Effect Quantile Regression. Journal of the Royal Statistical Society:
Series A (Statistics in Society), 185, 2035–2072.
Bianchi, D., Büchner, M., & Tamoni, A. (2021). Bond Risk Premiums with Machine Learning.
Review of Financial Studies, 34, 1046–1089.
Bracke, P., Datta, A., Jung, C., & Sen, S. (2019). Machine Learning Explainability in Finance:
An Application to Default Risk Analysis. Bank of England Staff Working Paper No. 816.
[Figure: Out-of-time R² (R²_OOT, in percent) of the benchmark models (LR, FRR, FMM, BEINF, BM-EN) and the ML models (Cubist, RF, S-GBM, NL-EN). Panel A: CCF (axis range −10.0 to 70.0); Panel B: RR (axis range −10.0 to 30.0).]
[Figure 2: Feature importance of the top-15 features, normalized to sum to one. Shown are the NL-EN importance together with the min-max range and median across the composite models; rank_min/rank_max report each feature's best and worst rank across the composite models. Panel A (CCF): rank_min 1, 2, 3, 2, 4, 3, 6, 5, 7, 7, 8, 12, 11, 12, 14; rank_max 1, 18, 8, 5, 6, 4, 10, 9, 11, 11, 13, 16, 14, 39, 21. Panel B (RR): rank_min 1, 1, 4, 2, 2, 2, 5, 8, 7, 10, 12, 9, 11, 13, 15; rank_max 3, 3, 6, 5, 8, 6, 8, 10, 28, 13, 13, 21, 22, 19, 17.]
[Figure 3: Mean and SD of Shapley values across 20 buckets of feature values (normalized between 0 and 1) for the most important features: limit, outstanding_amount, undrawn_pct, and utilization_rate. Panel A: CCF; Panel B: RR.]
In this section, we describe the randomized least absolute shrinkage and selection operator
(lasso) procedure used to select macroeconomic features. Technically, the randomized lasso is
comparable to the adaptive lasso introduced by Zou (2006). However, instead of choosing a
tuning parameter in the first stage as in the adaptive lasso (e.g., by using a ridge regression),
the randomized lasso changes the penalty to a randomly chosen value; the empirical implementation is then straightforward, as it amounts to rescaling the relevant predictors. Meinshausen and
Bühlmann (2010) showed that this generalization of the lasso selects variables consistently
even if the necessary conditions for the consistency of the original lasso are violated.
The randomized lasso, when applied with only one random perturbation, is not a useful
selection algorithm since the selection is strongly influenced by the single randomization of
the input features. However, applying this randomization many times and identifying features
that are regularly selected provides a powerful selection algorithm with desirable properties. In
our implementation, we use the stability selection method (Meinshausen and Bühlmann 2010);
stability selection is a general technique for improving the performance of a feature selection
algorithm (e.g., the randomized lasso) based on the aggregation of the results obtained by applying a selection procedure to N data subsamples of size n/2. A stable set of features is determined by running any preferred selection algorithm on each subsample. A feature is part of the stable feature set if the proportion of times the feature is selected across all N subsamples exceeds a pre-defined threshold; we use a threshold of 0.60 in this paper. For implementing the stability selection method, we use the package stabs in R (Hofner and Hothorn 2021).
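Under this description, the procedure can be sketched as follows (illustrative Python with scikit-learn rather than the stabs R package; the data, penalty level, and the 'weakness' rescaling range are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.05, weakness=0.5, n_subsamples=100,
                        threshold=0.60, rng=None):
    """Stability selection with a randomized lasso: refit on subsamples of size
    n/2 with randomly rescaled predictors and keep features selected in more
    than `threshold` of the runs."""
    rng = np.random.default_rng(rng)
    n, k = X.shape
    hits = np.zeros(k)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        # randomized lasso: rescale each predictor by a random factor in [weakness, 1]
        scale = rng.uniform(weakness, 1.0, size=k)
        model = Lasso(alpha=alpha).fit(X[idx] * scale, y[idx])
        hits += (model.coef_ != 0)
    return hits / n_subsamples > threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=400)
print(stability_selection(X, y, rng=0))  # features 0 and 3 should be stable
```

Features with a genuine signal survive the random penalty perturbation in essentially every subsample, while noise features rarely clear the 0.60 threshold.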
In this section, we describe the predictive techniques used in our comparative analysis. For
brevity, we only outline the methods that form the benchmark ensemble (i.e., linear regression,
fractional response regression, and mixture models) and the machine learning (ML) methods
that form the nonlinear ensemble (i.e., random forest, stochastic gradient boosting machine,
and cubist), which we also discussed in more detail in our main analyses.
Linear Regression: We first consider the linear regression (LR) estimated via ordinary least squares (OLS). The conditional expectation is approximated by a linear function g(·), with β a vector of parameters to be estimated. The baseline specification minimizes the standard least squares objective ℒ(β) = Σᵢ (yᵢ − g(xᵢ, β))².
f(y | x) = Σ_{k=1}^{K} π_k f(y | x, θ_k), where π_k, k = 1, …, K, are the mixture weights. We estimate an FMM for each of the different numbers of components and keep only the one with the highest performance in our final analysis. For the implementation of the FMMs, we use the package flexmix in R (Gruen et al. 2020). However, there is no ready-made implementation of the group lasso procedure in flexmix; thus, we have adapted the corresponding functions where necessary.
Let g_k(·) be a link function relating θ_k to the explanatory features X_k. Following Rigby and Stasinopoulos (2005), we model the parameters of a given distribution as g_k(θ_k) = η_k = X_k β_k + Σ_{j=1}^{J_k} Z_{jk} γ_{jk}, where X_k β_k represents the parametric and Z_{jk} γ_{jk} the nonparametric terms.
Within the GAMLSS framework, we use a beta-inflated (BEINF) distribution to model the CCF and the RR. The BEINF distribution has four parameters: the mean μ and the dispersion σ of a response strictly between zero and one, the probability of a response equal to zero, p₀, and the probability of a response equal to one, p₁. The probability function of the BEINF distribution is given by Rigby et al. (2019) as follows:
f(y | μ, σ, ν, τ) =
  p₀,                                                  if y = 0,
  (1 − p₀ − p₁) · y^(α−1) (1 − y)^(β−1) / B(α, β),     if 0 < y < 1,        (1)
  p₁,                                                  if y = 1,
[1] For a more detailed discussion of FMM and the group lasso procedure, we refer the interested reader to Min et al. (2020) and Yuan and Lin (2006), respectively.
where B(α, β) represents the beta function. Tong et al. (2016) recommend using the BEINF distribution to account for the bimodal distribution of the CCF as an avenue for future research. To the best of our knowledge, however, such a model has not been implemented in the context of credit risk modeling. We do not consider the BEINF model a typical ML approach but rather a method that is particularly well suited to these distributions.
Consequently, it provides an additional sophisticated benchmark for the ML methods. In our
implementation, we use the same transformation as in the FRR, because the BEINF distribution
requires that the response is bounded within the unit interval. For the implementation, we use
the package gamlss in R (Stasinopoulos et al. 2007).
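Eq. (1) can be transcribed directly (a sketch in Python; we parameterize via (α, β, p₀, p₁) here rather than through the (μ, σ, ν, τ) links used by the gamlss package):

```python
import numpy as np
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beinf_pdf(y, alpha, beta, p0, p1):
    """Probability function of the BEINF distribution as in Eq. (1):
    point masses p0 at y = 0 and p1 at y = 1, a rescaled beta density in between."""
    if y == 0.0:
        return p0
    if y == 1.0:
        return p1
    return (1.0 - p0 - p1) * y ** (alpha - 1) * (1.0 - y) ** (beta - 1) / beta_fn(alpha, beta)

# sanity check: the continuous part integrates to 1 - p0 - p1 = 0.7
ys = np.linspace(1e-6, 1.0 - 1e-6, 100001)
vals = np.array([beinf_pdf(y, 2.0, 3.0, 0.1, 0.2) for y in ys])
mass = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(ys))
print(round(mass, 3))  # → 0.7
```

The two point masses capture the frequent boundary outcomes (full recovery or total loss), while the beta component handles the interior of the unit interval.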
Random Forest (RF): The RF (Breiman 2001) is a tree-based method that improves on the variance reduction of bagging (whose essential idea is to average many noisy but approximately unbiased models to reduce the overall variance) by reducing the correlation between individual trees. The general idea is not to consider every possible feature for every split during the growing process, but only a subset of the features, m < k. This procedure helps
to reduce the correlation between individual trees in the forest, which significantly reduces the
variance of the predictor. We tune the number of features considered at each split and the tree
depth using our validation approach.
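This tuning idea can be sketched with scikit-learn (illustrative; our actual implementation uses caret in R, and the grid values below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# tune the number of features considered at each split and the tree depth
grid = {"max_features": [2, 4, 8], "max_depth": [4, 8, None]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    grid, cv=3, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

Restricting `max_features` below the full feature count is precisely what decorrelates the trees relative to plain bagging.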
sample without replacement should be used. This method is called a Stochastic Gradient
Boosting Machine (S-GBM). We implement an S-GBM as the most sophisticated boosting
method and specifically tune the number of boosting iterations, the maximum tree depth, a
shrinkage parameter, and the minimum terminal node size.
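In scikit-learn terms, the stochastic variant corresponds to setting `subsample` below one (a sketch, not our caret setup; the parameter values are hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# stochastic gradient boosting: each tree is fit on a random subsample
# drawn without replacement
sgbm = GradientBoostingRegressor(
    n_estimators=300,        # number of boosting iterations
    max_depth=3,             # maximum tree depth
    learning_rate=0.05,      # shrinkage parameter
    min_samples_leaf=5,      # minimum terminal node size
    subsample=0.7,           # fraction of observations per tree -> 'stochastic'
    random_state=0,
).fit(X, y)
print(round(sgbm.score(X, y), 2))  # in-sample R²
```

The four tuned quantities in the text map one-to-one onto `n_estimators`, `max_depth`, `learning_rate`, and `min_samples_leaf`.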
Cubist: Cubist is an extension of the M5 model tree approach. The M5 model tree differs
considerably from the usual decision trees in several dimensions: First, the criterion for the first
split is the reduction of the standard deviation by splitting the entire dataset. After the initial
partitioning, a linear model is trained at each node using all the features chosen as splitting
criterion in the previous steps. Subsequent splits are based on the (reduction of) error rate of
the linear models. Second, at each terminal node, the outcome is predicted using a linear model
(as opposed to a simple average). Third, when predicting a new sample, the observation moves
down along the path of the tree and all predictions of the linear models in that particular path
are smoothed in a bottom-up procedure. Analogous to the M5 model tree, Cubist smooths predictions by combining multiple linear models along a tree path, where the linear combination of the individual predictions depends on the variance and covariance of the models' residuals. Cubist is a rule-based model: the final model tree is used to construct the initial rule set, and each rule is associated with a smoothed combination of multiple linear models. In a further step, rules can be pruned and/or combined based on the adjusted error rate. Ultimately, a new sample is predicted by averaging the
predictions of all smoothed linear models from the appropriate final rules. Cubist also allows
a process called committees, which is very similar to boosting in that a sequence of rule-based
models is created, with each rule-based model influenced by the previous one. We tune the
number of committees and neighbors via our validation approach.
Bagging, Boosting, Conditional Inference Tree (CI-Tree)). For more details on these methods,
we refer the interested reader to Hastie et al. (2009) and Kuhn (2013). For the implementation
of the ML methods, we use the package caret in R (Kuhn 2008).
In this section, we describe further post-estimation evaluation metrics that we employ in our study. Apart from the out-of-time R² (R²_OOT), we also implement the root mean square error (RMSE) and the mean absolute error (MAE), which are standard performance measures in the literature. The same idea regarding weighted averages also applies to these metrics. The RMSE of a given period p is calculated as RMSE_p = sqrt((1/n) Σᵢ (yᵢ − ŷᵢ^ℳ)²), and the MAE of a given period p as MAE_p = (1/n) Σᵢ |yᵢ − ŷᵢ^ℳ|. For both RMSE_p and MAE_p, smaller values imply better predictive accuracy, with zero being the lower bound.
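The two metrics in code (a trivial sketch with made-up values):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error for one out-of-time period."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    """Mean absolute error for one out-of-time period."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

y = np.array([0.2, 0.5, 0.9])
y_hat = np.array([0.1, 0.5, 0.7])
print(rmse(y, y_hat), mae(y, y_hat))
```

Because the RMSE squares the errors before averaging, it penalizes large misses more heavily than the MAE.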
We complement the standard performance metrics with Hansen et al.'s (2011) model confidence set (MCS) procedure, which aims to identify a set of 'superior' models, M̂*_{1−α} ⊆ M₀, with confidence level 1 − α. The procedure comprises statistical tests that enable the econometrician to identify a 'superior' set of models with a certain probability (i.e., a given confidence level). To this end, sequential hypothesis tests of the null hypothesis of equal predictive ability (EPA) between the competing models in M₀ are employed. The MCS procedure is as follows: (I) Start with M₀ of dimension m. (II) Test the EPA hypothesis; if it is accepted, terminate the algorithm and set M̂*_{1−α} = M₀; otherwise, identify the model with the worst performance. (III) Remove this worst model from the set of potential 'superior' models and return to step (II). The MCS procedure is based on an arbitrary loss function ℒ(y, ŷ) and thus has various applications. The MCS M̂*_{1−α} can include a single 'best' model (m* = 1), multiple 'superior' models (m* < m), or possibly all models (m* = m). We implement the procedure in line with the performance measure in our main analysis, i.e., with a squared error loss function. For the implementation of the MCS procedure, we use the package MCS in R (Bernardi and Catania 2018).
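The elimination logic can be sketched as follows (a heavily simplified Python stand-in for the bootstrap-based EPA tests in the MCS package; the model names and loss series are synthetic):

```python
import math
import numpy as np

def mcs_sketch(losses, names, alpha=0.10):
    """Crude stand-in for the MCS elimination loop: the real procedure uses
    bootstrap-based EPA tests, whereas here a paired t-test (normal
    approximation) between the currently worst and best model decides
    whether to eliminate the worst model."""
    keep = list(range(losses.shape[1]))
    while len(keep) > 1:
        means = losses[:, keep].mean(axis=0)
        worst = keep[int(np.argmax(means))]
        best = keep[int(np.argmin(means))]
        d = losses[:, worst] - losses[:, best]
        t = d.mean() / (d.std(ddof=1) / math.sqrt(len(d)))
        p = math.erfc(abs(t) / math.sqrt(2))   # two-sided p-value, normal approx.
        if p >= alpha:                         # EPA not rejected -> terminate
            break
        keep.remove(worst)                     # eliminate the worst model
    return [names[i] for i in keep]

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n) ** 2                 # squared-error losses of the best model
losses = np.column_stack([base,
                          base + 0.05 + rng.normal(0.0, 0.05, n),
                          base + 0.50 + rng.normal(0.0, 0.05, n)])
print(mcs_sketch(losses, ["NL-EN", "BM-EN", "LR"]))  # → ['NL-EN']
```

With clearly separated mean losses, the loop eliminates the two inferior models one by one; when losses are statistically indistinguishable, several models remain in the set, mirroring the m* < m and m* = m cases above.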
In this section, we report a battery of robustness checks that we performed to verify that our
results are not affected by certain specifications in our main analysis, and we provide several
extensions. In the following analyses, we use the macroeconomic features selected in the main analysis unless indicated otherwise. The results of the robustness checks are presented in
Table C.5 for both parameters (CCF and RR).
Our preferred estimation strategy is the expanding window approach, as the regulatory
guidelines require that banks base their internal estimates on all available data and, in
particular, prescribe “a minimum data observation period that should ideally cover at least one
complete economic cycle but must in any case be no shorter than a period of seven years”
(BCBS 2019, p. 32). However, there is a reasonable concern that observations that are too
distant from the current environment are no longer relevant and, therefore, distort predictive
performance rather than supporting prediction. To address these concerns, we initialize the first
window with all defaults from 2000 to 2009. Thus, the rolling window approach is also in line
with the regulatory guidelines and covers a period of more than seven years. From then on,
however, we gradually shift not only the end of the window but also its beginning on an annual
basis; that is, the training period covers the same number of periods in all re-estimations. The
results are qualitatively and quantitatively similar, so our main conclusions remain unchanged.
The expanding window approach exhibits slightly better accuracy, which supports the idea of
the regulatory guidelines to use all available data, as more distant observations also convey
valuable information.
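The two window schemes can be sketched as follows (illustrative Python; the years mirror our setup, where the first window trains on defaults from 2000 to 2009 and predicts 2010):

```python
def expanding_windows(first_year, last_year, init_end):
    """Training always starts at first_year; the test set is the following year."""
    return [((first_year, end), end + 1) for end in range(init_end, last_year)]

def rolling_windows(first_year, last_year, init_end):
    """Training window keeps a fixed length and shifts annually."""
    length = init_end - first_year + 1
    return [((end - length + 1, end), end + 1) for end in range(init_end, last_year)]

print(expanding_windows(2000, 2012, 2009))
# → [((2000, 2009), 2010), ((2000, 2010), 2011), ((2000, 2011), 2012)]
print(rolling_windows(2000, 2012, 2009))
# → [((2000, 2009), 2010), ((2001, 2010), 2011), ((2002, 2011), 2012)]
```

Both schemes share the same first window, so any performance gap arises purely from whether older observations are retained or discarded.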
In the out-of-time setting of our main analysis, it is implicitly assumed that the models are re-
calibrated annually. This assumption is in line with the regulatory considerations, from, for
example, the European Banking Authority (2016), BCBS (2019), or European Central Bank
(2019), which require the validation of rating systems or the review of internal estimates to be
carried out at least annually and all relevant and available information to be considered.
However, model development is both time-consuming and costly in practice and, therefore, it
can be reasonably argued that financial institutions do not undertake a full re-calibration every
year. For this reason, we use the models not only to predict the subsequent year but extend the
predictive horizon to three years; that is, if the model training is based on defaults up to period
p, the test set contains defaults from periods p+1, p+2, and p+3. This predictive horizon is
motivated by two observations: (i) competent authorities are required under Article 101(1) of
the Capital Requirements Directive to review compliance with the requirements for the use of
internal models at least every three years, and (ii) financial institutions are required to use rating
systems, for example, for internal risk management purposes, in line with the minimum
requirements for at least three years prior to applying for the use of the internal model
(European Banking Authority 2016; BCBS 2019). The results are qualitatively and
quantitatively similar and confirm our main conclusions. Minor deteriorations in performance
metrics could be expected, as the predictive horizon is considerably longer.
In our main analysis, we find that ML methods have superior predictive abilities for the pooled data of the core corporate exposures, SMEs and LCs.[1] However, the regulatory guidelines allow banks to distinguish between these two types of exposure (BCBS 2019). SMEs are defined as exposures in which the consolidated group revenue is less than EUR 50 million.
Therefore, we re-estimate the nonlinear ensemble and the benchmark ensemble for both asset
classes individually and find qualitatively unchanged results; that is, ML methods outperform
the benchmark models by a large margin. Quantitatively, we find comparable results; that is, the R²_OOT values of the nonlinear ensemble are within a range of less than 2 percentage points
compared to the main analysis where we used pooled data. Thereby, we find slightly stronger
performance metrics for the CCF when we use pooled modeling data, suggesting that the
information conveyed by exposures to SMEs and LCs is in this context to some extent
[1] Pooled data in this context means that the training and test data contain information from both asset classes. In the dissections of our main analysis, we use the same training data but report the post-estimation evaluation metrics for both asset classes independently. However, in this robustness check, we already split the training data into the respective asset classes. The same idea applies to the following robustness checks where, for example, different geographic regions are used.
complementary and that the additional information proves useful in model fitting for the
amount drawn down at the time of default (EaD). This is intuitively plausible since, for
example, the SME group also includes credit lines with very large exposures for which the
additional information from the LC group might be useful, and vice versa. In summary,
depending on the specific design of the credit risk modeling exercise, both options—pooled
and individual modeling of asset classes—appear to be comparable and valid, which also
supports the regulatory guidelines that permit, but do not necessarily require, a distinction
between the two exposures.
Most of the credit risk studies have a homogeneous setting, with internal data obtained from a
specific bank (Tong et al. 2016; Gürtler et al. 2018; Bellotti et al. 2021) or geographic
restrictions to, for example, the United States (Nazemi and Fabozzi 2018; Min et al. 2020) or
Europe (Bellotti et al. 2021). Our approach is unique in that we conduct a global benchmarking
study based on the world’s largest loss database. However, to ensure that our key conclusions
are not influenced by our unique global setting, we re-estimate the nonlinear ensemble and the
benchmark ensemble for credit lines from the United States and Europe individually, with
credit lines from the United States accounting for less than half of all observations for both
credit risk parameters and observations from Europe accounting for less than a quarter of all
observations for both credit risk parameters. We find that our main conclusions remain
qualitatively unchanged. However, a direct quantitative comparison with the results of our
main analysis is flawed because we do not have a true comparison group. In our main analysis,
Americas covers both North and South America, while Non-Americas covers all other regions.
However, for both subsamples and both credit risk parameters, we observe an increase in the R²_OOT value of the nonlinear ensemble by at least 105.0%, confirming our main conclusions
even in more homogeneous settings.
In Section 2.2 of the main paper, we have already addressed the problem of a resolution bias
(i.e., an underestimation of recent losses). In our main analysis, we consider a three-year
recovery horizon, as 80% of the workout processes are completed after three years. However,
since three years is somewhat arbitrary, we re-estimate our analyses with a two-year and a five-
year recovery horizon. After two years, around half of the cases are resolved, whereas after
five years, approximately 95% of the cases are resolved. Our main conclusions remain
qualitatively unchanged for the different recovery horizons, with quantitatively slightly weaker and stronger R²_OOT values for the two-year and five-year horizons, respectively.
In our main analysis, we split our observations into a training set and a test set based on the
default date of the credit lines. One concern with this procedure, however, is that from an
operational perspective, banks must follow a specific credit line over a certain period (e.g.,
three years) to obtain the RR for a three-year recovery horizon. This means that, in practice,
banks must ensure that the observations in the training set are at least three years prior to the
start of the test set. In our specific example, we estimate the first expanding window with
training data from 2000 to 2009 and consider a three-year recovery horizon. This means that
our first out-of-time set of defaults would be from 2013, rather than 2010, resulting in a massive
loss of observations. However, to rule out the possibility that the procedure in our main analysis
affects our main conclusions, we re-estimate the nonlinear ensemble and the benchmark
ensemble with a waiting period of three years (i.e., our first out-of-time set is from 2013). A priori, the results of this re-estimation and our main analysis are not comparable because they
average over different out-of-time periods. To ensure comparability, we use only the results
from the same period as our main analysis. We find that our conclusions remain qualitatively unchanged for this specification. For R²_OOT, we see a slight performance decrease for the nonlinear ensemble of less than 2 percentage points, but our main specification seems to benefit rather than hurt the benchmark ensemble: in our main analysis, the R²_OOT value for the nonlinear ensemble is higher by a factor of more than 3 compared to the benchmark ensemble, and in our waiting scenario, this factor increases to more than 8.5. Small deteriorations in performance could be expected, as in our re-estimation, we predict the RR effectively seven years after the training data, and not just four years after. Consequently, the introduction of a waiting period for our RR analysis does not affect the overall conclusions of our study.
Finally, we offer an extension in terms of the non-core corporate borrowing segments (i.e.,
specialized lending) and private banking (i.e., high net worth individuals). To the best of our
knowledge, this is the first treatment of these two asset classes in the literature with respect to
the credit risk parameters CCF and RR. Although these asset classes are less common, in our
dataset, they still comprise credit lines with a total EaD of EUR 8 bn for specialized lending
and EUR 1 bn for private banking. Specialized lending covers five sub-classes: (1) project
finance, (2) object finance, (3) commodities finance, (4) income-producing real estate, and (5)
high-volatility commercial real estate (BCBS 2019). In our estimation, we simply use the same
approach as for the corporate segment; that is, we use the same credit line-specific features,
selected macroeconomic features, and hyperparameter set and do not address the unique
technicalities of these specific asset classes. Therefore, our estimates represent at most a lower
bound of the predictive abilities of ML methods for these asset classes. For the specialized lending segment, the benchmark ensemble yields a negative R²_OOT value for both credit risk parameters, while the nonlinear ensemble yields a positive R²_OOT value. These observations
confirm our main conclusions regarding the superiority of the nonlinear ensemble and strongly
support the fact that ML methods are much more flexible and universally applicable in common
banking practice.
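A forecast ensemble like those compared here is built by combining member forecasts. A minimal sketch in the spirit of the nonlinear ensemble, assuming an equal-weight average of the ML members (RF, Cubist, S-GBM) as the combination rule; the forecast values are hypothetical:

```python
import numpy as np

# Hypothetical RR forecasts of the three ML members for five credit lines
forecasts = {
    "RF":     np.array([0.62, 0.31, 0.80, 0.15, 0.47]),
    "Cubist": np.array([0.58, 0.35, 0.77, 0.20, 0.50]),
    "S-GBM":  np.array([0.60, 0.28, 0.83, 0.12, 0.44]),
}

# Equal-weight combination of the nonlinear members
nl_ensemble = np.mean(list(forecasts.values()), axis=0)
print(np.round(nl_ensemble, 3))
```

Averaging over members with only moderately correlated errors is what stabilizes the combined forecast relative to any single model.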
For private banking, our general modeling approach is similar, but due to the specific characteristics of the asset class, we exclude the industry features, the operating company indicator, the syndicate indicator, and the number of guarantors, all of which vary little or not at all, as they are simply not applicable to this asset class in most cases. Our simple approach could thus lead to even greater performance deteriorations; since state-of-the-art ML methods can typically handle large feature spaces, additionally including asset class-tailored features would likely further increase the benefit of ML methods. Nevertheless, a comparable picture emerges when we compare the nonlinear ensemble with the benchmark ensemble. The nonlinear ensemble again yields positive R²_OOS values for both credit risk parameters, while this is the case for the benchmark ensemble only for the RR. Even in this case, however, the performance increase of the nonlinear ensemble exceeds 24%.
In summary, despite our simple approach, our core results are applicable outside the core
corporate banking segment, with superior predictive abilities of ML methods across different
segments and credit risk parameters.
Panel B: RR

         FRR    BEINF  FMM    BM-EN  RF     Cubist S-GBM  NL-EN
LR       0.996  0.979  0.986  0.995  0.703  0.654  0.762  0.727
FRR             0.982  0.992  0.998  0.711  0.659  0.770  0.734
BEINF                  0.978  0.990  0.720  0.644  0.766  0.728
FMM                           0.995  0.721  0.670  0.778  0.744
BM-EN                                0.718  0.661  0.773  0.737
RF                                          0.884  0.916  0.962
Cubist                                             0.882  0.966
S-GBM                                                     0.963
[Figure: Forecast densities. Panel A: CCF — density of forecasts (x-axis: CCF, −5.0 to 5.0). Panel B: RR — densities of the realized RR and of the NL-EN and BM-EN forecasts (x-axis: RR, −0.1 to 1.1).]
[Figure: Feature importance by feature group. Panel A: CCF — credit line-specific, macros, securities, borrower-specific. Panel B: RR — credit line-specific, securities, borrower-specific, macros.]
[Figure: Shapley value summary plots, colored by feature value (low to high); x-axis: Shapley value (sensitivity of impact on model output). Panel A: CCF — seniority, credit_line_age, committed, currency, asset_class, collateral, utilization_rate_h, wui_gdpweighted (x-axis range −2.00 to 2.00). Panel B: RR — committed, limit, operating_firm, outstanding_amount, limit_increase, gdp_deflator, undrawn_pct, collateral, utilization_rate, numb_cl, credit_line_age, currency, seniority, ted_spread, industry (x-axis range −0.50 to 0.20).]
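Shapley values attribute each forecast to the input features by averaging marginal contributions over feature orderings. The paper uses the approximation implemented in the fastshap package; purely as an illustration of the underlying definition, the following brute-force sketch enumerates all orderings exactly (feasible only for a handful of features; the toy model and baseline are hypothetical):

```python
from itertools import permutations

def exact_shapley(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    `model` maps a full feature vector to a prediction; features not yet
    'revealed' in an ordering are held at their baseline values.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]          # reveal feature j
            new = model(current)
            phi[j] += new - prev       # marginal contribution of j
            prev = new
    return [p / len(orderings) for p in phi]

# Toy model with an interaction term between the two features
model = lambda v: v[0] + 2 * v[1] + v[0] * v[1]
phi = exact_shapley(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
print([round(p, 3) for p in phi])  # → [1.5, 2.5]
```

By construction, the attributions sum to the difference between the prediction at x and at the baseline, which is what makes Shapley values additive decompositions of individual forecasts.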
[Figure: Shapley dependence plots (legend: NL-EN, RF, Cubist, S-GBM; x-axis: feature value, 0.00 to 1.00; y-axis: Shapley value). Panel A: CCF (feature names not recoverable). Panel B: RR — limit and outstanding_amount.]
Rank                2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  Avg.
seniority              6     8     6     7     7     7     8    11    11    10   8.1
credit_line_age       10     9    10    10     9     9     9    10    10    11   9.7
committed             11    10    12     9    10    10    11     9     9     9  10.0
currency               9    11     9    12    11    11    10     8     8     8   9.7
asset_class           12    13    13    13    14    12    12    12    12    12  12.5
collateral            14    12    11    11    12    13    15    15    15    15  13.3
utilization_rate_h    13    14    15    15    15    15    14    13    13    13  14.0
wui_gdpweighted       15    15    14    14    13    14    13    14    14    14  14.0

Panel B: RR

Rank                2010  2011  2012  2013  2014  2015  2016  2017  Avg.
committed              1     1     2     1     1     1     2     1   1.2
limit                  2     2     1     2     2     3     1     2   1.9
operating_firm         3     3     5     4     7     6     3     3   4.2
outstanding_amount     6     4     4     5     3     2     7     6   4.6
limit_increase         5     5     3     3     4     7     8     8   5.4
gdp_deflator           4     6     6     6     6     5     4     5   5.2
undrawn_pct            7     7     7     7     5     4     6     7   6.2
collateral             8     8     8     8     9    10    12    10   9.1
utilization_rate      15    15     9     9     8     8     5     4   9.1
numb_cl                9     9    10    10    11    12     9    11  10.1
credit_line_age       11    10    11    11    10    11    13    14  11.4
currency              12    11    12    12    12    14    11    15  12.4
seniority             14    14    15    13    14     9    10     9  12.2
ted_spread            10    12    13    14    13    13    15    13  12.9
industry              13    13    14    15    15    15    14    12  13.9
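The "Avg." column is the simple mean of the yearly ranks, and sorting by it recovers the ordering of the rows. A quick check on three rows of Panel B:

```python
# Yearly importance ranks for three RR features, taken from Panel B (2010-2017)
ranks = {
    "committed":    [1, 1, 2, 1, 1, 1, 2, 1],
    "limit":        [2, 2, 1, 2, 2, 3, 1, 2],
    "gdp_deflator": [4, 6, 6, 6, 6, 5, 4, 5],
}

# Average rank per feature; lower means more important on average
avg_rank = {f: sum(r) / len(r) for f, r in ranks.items()}
print(sorted(avg_rank, key=avg_rank.get))  # → ['committed', 'limit', 'gdp_deflator']
```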
Feature: Description

Dependent Variables
Recovery Rate (RR): Proportion of EaD that was recovered (in decimal percentage)
Credit Conversion Factor (CCF): Proportion of the currently undrawn amount of the committed limit that is expected to be drawn down at the default date (in decimal percentage)

Borrower-specific Features
Americas: 1 if the country of residence is 'Americas', and 0 otherwise
Asset Class: Divided into small and medium-sized enterprises (SME) and large corporates (LC)
Currency: Denomination on which the credit line is based: U.S. dollar, Euro, and others
Industry Code: Divided into services; finance, insurance, and real estate (FIRE); commerce; manufacturing; construction; and others
No. Credit Lines: Number of credit lines of the borrower that defaulted before or on the event date (frequency)
Operating: 1 if the entity is an operating company (income from sales to third parties), and 0 otherwise
Rating: Borrower risk rating divided into investment grade (IG), non-investment grade (non-IG), no rating (none), and unknown rating

Credit Line-specific Features
Credit Line Age: Age of the credit line between origination and event date (in months)
Limit: Limit advised to the obligor, or the bank's accepted share of the syndicate (in thousand EUR)
Limit Increase: 1 if the limit was increased between origination and event date, and 0 otherwise
Maturity: 1 if the maturity of the credit line is ≥ 1 year, and 0 otherwise
Outstanding Amount: Amount of the current principal outstanding plus past-due interest of the credit line (in thousand EUR)
Syndication: 1 if the credit line is part of a syndication, and 0 otherwise
Undrawn Amount: Amount of the advised limit not utilized at the event date (in thousand EUR)
Undrawn Percentage: Percentage of the advised limit not utilized at the event date (in decimal percentage)
Utilization Rate: Percentage of the limit that is drawn at the event date (in decimal percentage)
Utilization Rate High: 1 if the utilization rate is ≥ 0.95, and 0 otherwise

Security-related Features
Collateral: 1 if the credit line has underlying protection in the form of collateral or security, and 0 otherwise
No. Collaterals: Number of collaterals protecting the credit line (frequency)
Committed: 1 if there is a contractual obligation for the bank to make the funds available when the facility is drawn by the obligor, and 0 otherwise
Guarantee: 1 if the credit line has underlying protection in the form of a guarantee, a credit default swap, or support from a key party, and 0 otherwise
No. Guarantors: Number of guarantors protecting the credit line (frequency)
Seniority: 1 if the seniority code is 'super senior', and 0 otherwise
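The two dependent variables follow directly from the quantities defined in this table. A minimal sketch, assuming the usual convention that the CCF relates the additional drawdown until default to the undrawn limit at the observation date (the toy figures are hypothetical, not from the dataset):

```python
def ccf(ead, outstanding_obs, limit_obs):
    """Credit conversion factor: share of the undrawn limit at the
    observation date that is additionally drawn by the default date."""
    undrawn = limit_obs - outstanding_obs
    if undrawn <= 0:
        raise ValueError("CCF is undefined without an undrawn limit")
    return (ead - outstanding_obs) / undrawn

def recovery_rate(recovered, ead):
    """Proportion of EaD that was recovered (in decimal percentage)."""
    return recovered / ead

# Toy credit line: limit 1000, outstanding 600, EaD 900, recovered 450
print(round(ccf(900, 600, 1000), 3))      # → 0.75
print(round(recovery_rate(450, 900), 2))  # → 0.5
```

The guard against a non-positive undrawn amount mirrors why the table carries a separate Utilization Rate High indicator: near-fully drawn lines make the CCF denominator degenerate.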
References
Basel Committee on Banking Supervision. (2019). CRE 30, CRE 31, and CRE 36.
Bellotti, A., Brigo, D., Gambetti, P., & Vrins, F. (2021). Forecasting Recovery Rates on Non-
performing Loans with Machine Learning. International Journal of Forecasting, 37, 428–
444.
Bernardi, M., & Catania, L. (2018). The Model Confidence Set Package for R. International
Journal of Computational Economics and Econometrics, 8, 144–158.
Betz, J., Kellner, R., & Rösch, D. (2018). Systematic Effects among Loss Given Defaults and
their Implications on Downturn Estimation. European Journal of Operational Research,
271, 1113–1144.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32.
Caputo, B., Sim, K., Furesjö, F., & Smola, A. (2002). Appearance-based Object Recognition Using SVMs: Which Kernel Should I Use? Proceedings of the NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision.
European Banking Authority (2016). Final Draft Regulatory Technical Standards on the
Specification of the Assessment Methodology for Competent Authorities regarding
Compliance of an Institution with the Requirements to use the IRB Approach in
Accordance with Articles 144(2), 173(3) and 180(3)(b) of Regulation (EU) No 575/2013.
European Central Bank (2019). Instructions for Reporting the Validation Results of Internal
Models: IRB Pillar I Models for Credit Risk.
Greenwell, B. (2020). fastshap: Fast Approximate Shapley Values. Retrieved from
https://cran.r-project.org/package=fastshap
Greenwell, B., Boehmke, B., Cunningham, J., & Developers, G. (2020). gbm: Generalized
Boosted Regression Models. Retrieved from https://github.com/gbm-developers/gbm.
Gruen, B., Leisch, F., Sarkar, D., Mortier, F., & Picard, N. (2020). flexmix: Flexible Mixture
Modeling. Retrieved from https://cran.r-project.org/web/packages/flexmix/
Gürtler, M., Hibbeln, M., & Usselmann, P. (2018). Exposure at Default Modeling: A Theoretical and Empirical Assessment of Estimation Approaches and Parameter Choice. Journal of Banking & Finance, 91, 176–188.
Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The Model Confidence Set. Econometrica,
79, 453–497.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Science & Business Media.
Hofner, B., & Hothorn, T. (2021). Stabs: Stability Selection with Error Control. Retrieved from
https://CRAN.R-project.org/package=stabs.
Kuhn, M. (2008). Building Predictive Models in R Using the Caret Package. Journal of
Statistical Software, 28, 1–26.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
Meinshausen, N., & Bühlmann, P. (2010). Stability Selection. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72, 417–473.
Min, A., Scherer, M., Schischke, A., & Zagst, R. (2020). Modeling Recovery Rates of Small-
and Medium-Sized Entities in the US. Mathematics, 8, 1856.
Nazemi, A., & Fabozzi, F. J. (2018). Macroeconomic Variable Selection for Creditor Recovery
Rates. Journal of Banking & Finance, 89, 14–25.
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale
and Shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54,
507–554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., & De Bastiani, F. (2019). Distributions for
Modeling Location, Scale, and Shape. CRC Press.
Stasinopoulos, D. M., Rigby, R. A., & Others. (2007). Generalized Additive Models for
Location, Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46.
Tong, E. N., Mues, C., Brown, I., & Thomas, L. C. (2016). Exposure at Default Models With
and Without the Credit Conversion Factor. European Journal of Operational Research,
252, 910–920.
Yuan, M., & Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped
Variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American
Statistical Association, 101, 1418–1429.