Abstract
Based on the world’s largest loss database of corporate defaults, we perform a comparative
analysis of machine learning (ML) methods in credit risk modeling across the globe. We find
substantial benefits of ML methods for different credit risk parameters, even though we use a
uniform modeling framework for the ML methods, which potentially facilitates a massive
reduction in operational resources required for model development and validation. We analyze
the economic drivers of the credit risk models using explainable ML methods and find large
variations in feature importance suggested by different ML methods. We propose to implement
a nonlinear forecast ensemble, which not only boosts predictive performance but also produces
more stable forecasts and economic sensitivities, thereby mitigating model uncertainty. Our
results provide guidance for financial institutions, regulatory authorities, and academics.
* We are grateful for helpful comments from Ed Altman, Heiner Beckmeyer, Tim Eisert, Johannes Kriebel, Marek Micuch, Oleg Reichmann, and Piet Usselmann. We would also like to thank seminar participants at the European Central Bank, FMA European Conference 2023, Economics of Financial Technology Conference 2022, 83rd Annual Business Research Conference, 15th RGS Doctoral Conference in Economics, 61st Annual Southwestern Finance Association Conference, 15th International Conference on Computational and Financial Econometrics, Banking Research Workshop Münster 2021, 6th Vietnam Symposium in Banking and Finance, 1st International Conference “Frontiers in International Finance and Banking”, 14th International Risk Management Conference, and University of Duisburg-Essen. We also thank Global Credit Data for granting access to their database. The views expressed herein are those of the authors and should not be associated with the EBA.
§ E-mail: martin.hibbeln@uni-due.de, phone: +49 203 37-92830 (corresponding author)
‡ Mercator School of Management, University of Duisburg-Essen, Lotharstr. 65, 47057 Duisburg, Germany
¶ European Banking Authority, 20 avenue André Prothin, 92927 Paris, France
Our setting provides novel insights into credit risk modeling and leads to three key
conclusions. (i) ML methods consistently outperform not only simple benchmarks (e.g., linear
models) but also sophisticated benchmarks (e.g., mixture models) that take into account the
specific distributions of credit risk parameters (e.g., bimodality) on a global scale. We provide
new benchmarks for the predictive accuracy of various ML methods in quantifying credit risk
parameters as well as economically meaningful benefits of ML methods in dissections across
different asset classes, industries, and regions that manifest in a substantially higher out-of-time $R^2$ ($R^2_{OOT}$), lower mean absolute error (MAE), and lower root mean square error (RMSE).
Numerically, state-of-the-art ML methods yield overall $R^2_{OOT}$ values that are higher by a factor
ranging from 2.1 to 3.5 compared to benchmark models from the credit risk literature, which
include highly specialized models for both credit risk parameters. Tree-based ML methods and, to an even greater extent, the nonlinear ensemble consisting of multiple tree-based ML methods are especially well-suited across different credit risk parameters, even when relying
on a uniform modeling framework. In contrast, the state-of-the-art literature on credit risk
modeling generally implements model structures that are well-suited for modeling a single
credit risk parameter. Rather than relying on inherently different models that fit the specific
(ii) We document large benefits of using model averaging (i.e., forecast ensembles) and,
hence, a boost in predictive performance. For the nonlinear ensemble, we quantify the impact
relative to the individual methods (i.e., state-of-the-art ML methods, including, e.g., random
forest) that form the ensemble, in the range of a 1.5% to 19.1% increase in $R^2_{OOT}$; in comparison
to highly specialized methods from the literature, the positive impact is even several times
higher (with an average increase in $R^2_{OOT}$ of 172.4% to 210.3%). None of the methods
considered—ML methods and highly specialized models—outperform the nonlinear ensemble,
highlighting the benefits of model averaging for predictive accuracy. We also observe that the
sizeable superiority of the nonlinear ensemble is very persistent over time, consistently yielding
highly competitive $R^2_{OOT}$ values in all periods. The consistent outperformance of the nonlinear
ensemble supports the idea that model averaging not only provides better predictive accuracy,
but also produces more stable forecasts, which is important from a practical perspective since
we are operating in a highly regulated domain.
(iii) As a critical enabler for the acceptance and adoption of ML applications in financial
services, we unpack the sources of superior predictability in notoriously ‘black box’ models
1 Since we rely on a loss database, it is infeasible to include the probability of default (PD) in our modeling
exercise, as, by definition, all facilities are already defaulted. Regarding the PD, we refer the interested reader to
Barboza et al. (2017), Berg et al. (2020), and Fuster et al. (2022).
Our study builds on several strands in the literature: (I) credit risk modeling, (II) ML
applications in finance and economics, and (III) explainable ML methods. (I) In the context of
credit risk modeling, recent studies have focused on developing sophisticated statistical
methods to model EaD, particularly attempting to account for its highly skewed distribution.
For this purpose, various forms of mixture models have been proposed (Hon and Bellotti 2016;
Leow and Crook 2016; Thackham and Ma 2019; Betz et al. 2022), different distributions have
been assumed (e.g., a zero-adjusted gamma distribution (Tong et al. 2016)), and panel data
methods have been used (Hon and Bellotti 2016; Leow and Crook 2016). However, to the best
of our knowledge, no formal treatment of various ML methods exists. Thus, we contribute to
this literature by providing a comparative study with a focus on ML methods including model
averaging. Similarly to EaD, many statistical methods exist that specifically address LGD and
its bimodal distribution (e.g., quantile regressions, fractional response regressions, beta
regressions, or different mixture models) (Krüger and Rösch 2017; Min et al. 2020). In contrast
to EaD, the use of ML methods for predicting LGD is more widespread (Bastos 2010; Qi and
Zhao 2011; Altman and Kalotay 2014; Kalotay and Altman 2017; Nazemi and Fabozzi 2018;
Kaposty et al. 2020; Olson et al. 2021; Kellner et al. 2022; Nazemi et al. 2022). In this regard,
Loterman et al. (2012) and Bellotti et al. (2021) are closest to our study, as they benchmarked
(II) Our study adds to the burgeoning literature on ML applications in finance and
economics. Currently, there is an ever-increasing number of studies on benchmarking ML
methods, for example, in the context of asset pricing (Chinco et al. 2019; Feng et al. 2020; Gu
et al. 2020; Bianchi et al. 2021; Leippold et al. 2022), credit scoring (Fuster et al. 2022), and
human decision making (Kleinberg et al. 2018; Erel et al. 2021), and corporate governance
(Bandiera et al. 2020; Li et al. 2021). Providing a benchmarking study in risk management
appears highly beneficial, since, from a practical perspective, this represents one of the most
common applications of ML methods in financial services.
(III) We build upon the emerging stream of literature that addresses the explainability of
artificial intelligence and ML (Bracke et al. 2019; Bellotti et al. 2021; Bussmann et al. 2021;
Bastos and Matos 2022). ML models show superior predictive abilities in many domains, but
their practical implementation is often hindered by stakeholders’ desire for model
interpretability (Horel and Giesecke 2022). This is particularly concerning in a highly regulated
environment such as credit risk modeling. To this end, we use explainable ML methods as a critical component for the acceptance and adoption of ML applications in financial services.
The remainder of this paper is organized as follows. Section 2 outlines the institutional
setting and modeling framework. Section 3 describes our data and methodology. Section 4
reports our results on the benefits of ML in credit risk modeling across the globe and unpacks
the sources of predictability. Section 5 provides robustness checks and extensions. Section 6
concludes.
In this section, we briefly review the institutional setting to outline the regulatory requirements.
The Basel Accords provide a framework for a stable financial system, which the regulator
continuously strengthened in the aftermath of the global financial crisis. Within this
framework, the Basel Committee on Banking Supervision (BCBS) permits the use of internal
ratings, given that banks can ensure “the integrity, reliability, consistency, and accuracy of both
internal rating systems and estimates of risk components” (BCBS 2001, p. 41). Ultimately, in
the so-called advanced internal ratings-based (A-IRB) approach, banks use their own estimates
of different credit risk parameters for each credit facility in the portfolio: PD, EaD, and LGD.
The estimates are required to calculate EL as well as risk-weighted assets and, hence,
regulatory capital requirements.
$$ CCF_{t,\tau} = \begin{cases} \dfrac{EaD_{t,\tau} - e_t}{CL_t - e_t}, & \text{if } CL_t - e_t > 0, \\ 0, & \text{else.} \end{cases} \qquad (1) $$
The exposure at time $t$ is defined as $e_t := \max\{-B_t, 0\}$, where $B_t$ denotes the balance. The corresponding credit limit is $CL_t$. CCF can be transformed into EaD estimates as follows:

$$ EaD_{t,\tau} = e_t + CCF_{t,\tau} \cdot (CL_t - e_t). \qquad (2) $$
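Eqs. (1) and (2) are inverses of each other whenever part of the limit is undrawn; a minimal sketch (function and variable names are ours, not from the paper):

```python
def ccf(ead, e_t, cl_t):
    """Credit conversion factor per eq. (1): the share of the unused limit
    (CL_t - e_t) that is drawn down by the time of default."""
    headroom = cl_t - e_t
    return (ead - e_t) / headroom if headroom > 0 else 0.0

def ead_from_ccf(ccf_t, e_t, cl_t):
    """Transform a CCF estimate into an EaD estimate per eq. (2)."""
    return e_t + ccf_t * (cl_t - e_t)

# Illustration: EUR 35 drawn out of a EUR 50 limit, defaulting at EUR 44:
assert abs(ccf(44.0, 35.0, 50.0) - 0.6) < 1e-12           # 60% of the headroom drawn
assert abs(ead_from_ccf(0.6, 35.0, 50.0) - 44.0) < 1e-12  # round trip back to EaD
```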
In general, banks must consider “all relevant, material and available data, information and
methods” when estimating risk parameters (BCBS 2019, p. 20) and the BCBS particularly
permits the use of pooled data (BCBS 2019, p. 20): “A bank may utilise internal data and data
from external sources (including pooled data).” In our implementation, we adhere closely to
the regulatory requirements and estimate two key risk parameters. Regarding EaD, we follow
the BCBS recommendation and use CCF. The important difference between the indirect (CCF)
and direct approach (i.e., EaD as the response variable) is that under CCF, all observations are
weighted equally, which is not the case at the EaD level because of the absolute default
volumes. We represent the second key parameter in our modeling exercise as $RR = 1 - LGD$ for ease of interpretation (i.e., which proportion of EaD could be recovered), which is common in the literature.
To fix ideas, credit lines are agreements made at time t between a lender and a borrower that
provide a maximum associated Euro amount (i.e., credit limit 𝐶𝐶𝐿𝐿𝑡𝑡 ) that can be drawn down by
the borrower at its own discretion at any time up to the expiration date T, which makes
modeling them particularly challenging in practice (Jiménez et al. 2009; Hibbeln et al. 2020).
We consider a setup with CCLs (i.e., $CCL_i = (CCL_1, \ldots, CCL_n)$), where $n$ is the total number of observed credit lines in $A_i = (A_1, \ldots, A_a)$ different asset classes over the observation period $[t_0, T]$. Let us further define an indicator function $D_t \in \{0, 1\}$, which describes a default of a credit line at time $t$ if and only if $D_t = 1$. 3 The default time is defined as $\tau := \min\{t \mid D_t = 1\}$.
Our objective is to predict the CCF and RR for each defaulted $CCL_i$ in our sample (i.e., $D_t = 1$) based on various observable credit line-specific features $x_t^C$ and macroeconomic features $x_t^M$. Hence, the objective is to identify a functional form $\hat{y} = \hat{f}(x) \in M$ that maps the observable features $x$ into a prediction $\hat{y}$. For concreteness, let $M \in (1, \ldots, m)$ index the set of different ML methods. Let us assume that the default time of a credit line is $t = \tau$, which provides us with
information about the exposure at the time of default (EaD). From a modeling perspective, we
2 The RRs, for example, for loans or credit lines, are determined by the actual recovery cash flows and direct or
indirect costs of the specific debt position and are, therefore, referred to as ‘workout RRs’. In contrast, the RRs,
for example, for corporate bonds are typically determined based on market values and are, therefore, referred to
as ‘market RRs’. For a detailed discussion on the differences, we refer the interested reader to Calabrese and
Zenga (2010) and Gürtler and Hibbeln (2013).
3 According to the Basel II default definition, a default is triggered by two events: the bank considers it unlikely
that the borrower will meet its credit obligations, or the borrower is more than 90 days past due.
We obtain data on CCLs for the time period 2000–2020 from the world’s largest loss database
of corporate defaults provided by Global Credit Data, an international not-for-profit
association. This database currently pools historical credit data from 58 member banks across
the globe, including many systemically important banks. The database is highly representative
of the North American and Western/Northern European regions, as member banks comprise
more than one-third and nearly one-half, respectively, of the total loan assets of the 500 largest
banks worldwide in their respective regions. The data enables member banks to model credit
risk, for example, to calculate regulatory capital requirements or calibrate and benchmark
internal EaD (CCF) and LGD (RR) models.
In our main analysis, we restrict the sample to credit lines in the asset classes of small and
medium-sized enterprises (SMEs) and large corporates (LCs), as these two segments are
categorized as general corporate exposures under the regulatory guidelines. We restrict the
sample to defaults after 2000 to ensure a consistent default definition under Basel II. Our last
observation period, in terms of defaults in the CCF sample, is the end of 2019 to meet a
materiality threshold regarding the number of observations in a specific year. Regarding the
RR sample, we consider defaults until the end of 2017 because we use realized RRs up to three years after default. 4

4 In robustness checks, we use different recovery horizons to rule out that the results are driven by this choice.
We apply the following filters to clean our data. We remove observations with a limit less
than or equal to zero, as these facilities cannot be considered real credit lines. For the CCF
sample, we remove observations for which no information is available for around one year (11
to 13 months) prior to default to ensure a consistent horizon in the estimation. For the RR
sample, we remove observations if banks no longer update information on these defaults
(incomplete portfolios; i.e., if we do not observe transaction data for at least three years and
the case is not resolved). Finally, we remove observations with missing values for both credit
line-specific and macroeconomic features. We implement a floor and cap of CCF at [-5.0, 5.0]
and of RR at [-0.1, 1.1], and winsorize the other credit line-specific features at 0.5% and 99.5%
levels to account for outliers and avoid instability in the parameters (Leow and Crook 2016;
Tong et al. 2016; Gürtler et al. 2018); this is motivated by the fact that some methods (e.g., the
linear regression (LR)), are highly sensitive to outliers (Krüger and Rösch 2017). In summary,
we obtain 12,895 observations in the CCF sample and 14,046 observations in the RR sample.
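The floor/cap and winsorization steps can be sketched as follows (thresholds from the text; function names are ours):

```python
import numpy as np

def floor_cap(x, lo, hi):
    """Hard floor/cap, e.g. CCF to [-5.0, 5.0] and RR to [-0.1, 1.1]."""
    return np.clip(x, lo, hi)

def winsorize(x, lower_pct=0.5, upper_pct=99.5):
    """Winsorize at the 0.5% and 99.5% levels: values beyond the
    percentile thresholds are replaced by the threshold values."""
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)
```

Winsorizing (rather than dropping) outliers keeps the sample size intact while protecting outlier-sensitive methods such as the linear regression.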
Table 1 shows the descriptive statistics of the credit line-specific features. Panels A and B
present the features used in modeling CCF and RR, respectively. Across most features, we
observe comparable values, but differences can occur for various reasons. First, the timing of
the measurement is different, as Panel A refers to features one year prior to default, while Panel
B refers to features at the time of default. Second, due to various sample restrictions, the credit
lines considered are not completely identical.
To gain better insights into the data composition, we present the share of observations
across asset classes, industries, regions, rating categories, and currencies for both samples in
Table 2. We also present the distributions of all indicator variables in our dataset. More than
two-thirds of the observations are from SMEs. Only a small proportion of observations are
from the financial, real estate, and insurance (FIRE) industry, and most are non-investment-grade.
We collect a broad set of 161 global, local (country-specific), and newspaper-based (i.e., based
on textual analysis) macroeconomic features covering various categories, such as indicators of
economic conditions, stock market conditions, credit market conditions, general corporate
measures, and policy or world uncertainty indices. The features range from daily to annual
frequency and originate from various sources (e.g., Worldbank; FRED (St. Louis Fed);
Refinitiv Eikon; Baker et al. 2016, 2021; Caldara 2020; Ahir et al. 2022). Local features are
only collected above a threshold of 10 credit lines from a country, as a trade-off between the
number of observations and the number of macroeconomic features with a complete time
series. We collect not only the level but also the change in macroeconomic features. 5 From this
set, we select the macroeconomic features using the randomized least absolute shrinkage and
selection operator (lasso), 6 a state-of-the-art feature selection procedure (Meinshausen and
Bühlmann 2010).
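Stability selection with a randomized lasso in the spirit of Meinshausen and Bühlmann (2010) can be sketched as follows; this is an illustrative Python sketch, not the paper's exact R configuration, and all parameter defaults are our assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_lasso_selection(X, y, alpha=0.05, weakness=0.5,
                               n_resamples=100, threshold=0.6, seed=0):
    """Fit the lasso on random half-samples with randomly weakened per-feature
    penalties and keep features selected in a high share of the fits."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    hits = np.zeros(p)
    for _ in range(n_resamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        # Rescaling columns by a factor in [weakness, 1] randomly weakens
        # the effective L1 penalty per feature (the "randomized" part).
        scale = rng.uniform(weakness, 1.0, size=p)
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X[idx] * scale, y[idx])
        hits += (model.coef_ != 0).astype(float)
    # Keep features whose selection frequency exceeds the threshold.
    return np.flatnonzero(hits / n_resamples >= threshold)
```

Features that survive only a few random penalty draws are discarded, which makes the selection far more stable than a single lasso fit.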
We choose a model validation approach that preserves the time dependency of defaults to
reflect the most realistic scenario from a practical and regulatory perspective. 7 In our
5 To ensure stationarity of the macroeconomic features, we apply various transformations where necessary.
6 For brevity, we outline the details regarding this procedure in the Internet Appendix Part A.1.
7 We refer the interested reader to Hibbeln et al. (2023), who demonstrate the importance of model validation as a central element of any ML workflow regarding error estimates and model selection.
8 For brevity, we outline the details regarding the methods that form the benchmark ensemble (i.e., LR, FRR, and
mixture models), the ML methods that form the nonlinear ensemble (i.e., Cubist, RF, and S-GBM), and the
$$ \hat{y}^{EN}_{i,t} = \frac{1}{N} \sum_{M \in K} \hat{y}^{M}_{i,t}, \qquad (3) $$
with K representing the set of composite models and N representing the number of models in
the given ensemble (BM-EN and NL-EN). Forming ensembles appears appealing for several
reasons (Steel 2020): (I) a broad strand of literature documents large benefits of ensembles in
various domains, for example, economics or weather forecasts, (II) ensembles inherently allow us to combine the predictive performance of multiple models, thus addressing model uncertainty,
(III) ensembles allow simple comparisons between benchmark models and more sophisticated
ML models, and (IV) ensembles are also appealing to tackle the issue of model multiplicity,
which refers to models with similar predictive accuracy but with a different decision surface
embedded in the model, for example, due to differences in the importance of certain features.
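The equally weighted ensemble of eq. (3) reduces to averaging the composite models' predictions; a sketch with assumed model names:

```python
import numpy as np

def ensemble_forecast(predictions):
    """Equally weighted forecast ensemble per eq. (3): the ensemble
    prediction for each credit line is the mean of the composite models'
    predictions. `predictions` maps model name -> per-credit-line forecasts."""
    stacked = np.vstack(list(predictions.values()))
    return stacked.mean(axis=0)

# NL-EN sketch with assumed per-line CCF forecasts from the three composites:
nl_en = ensemble_forecast({
    "cubist": np.array([0.2, 0.4]),
    "rf":     np.array([0.4, 0.6]),
    "s_gbm":  np.array([0.6, 0.8]),
})
```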
We use several competing models, with $M_0$ being the set of competing objects, to predict the credit risk parameters CCF and RR. To assess the predictive ability of the various methods, we use the out-of-time $R^2$ ($R^2_{OOT}$). 9 Since our implementation uses a validation approach that preserves the time dependency of defaults, we re-estimate a given model at each time $t$ (with $t = 0, 1, 2, \ldots, P$) and, hence, compute the post-estimation evaluation metrics for each period $p$ in the interval $[t, P]$. Thus, the out-of-time performance metrics reflect weighted averages with
hyperparameters in the Internet Appendix Part A.2. For more details on all methods, we refer the interested reader
to Hastie et al. (2009) and Kuhn (2013). For the implementation of the ML methods, we use the package caret in
R (Kuhn 2008); for the BEINF model, we use the package gamlss in R (Stasinopoulos et al. 2007); and for FMM,
we use the package flexmix in R (Gruen et al. 2020).
9 In the Internet Appendix Part A.3 and Part C (Tables C.1-C.3), we discuss and additionally report the root mean square error (RMSE), mean absolute error (MAE), and Hansen et al.'s (2011) model confidence set procedure.
$$ R^2_{OOT,p} = 1 - \frac{\frac{1}{n}\sum_i \left( y_i - \hat{y}^M_i \right)^2}{\frac{1}{n}\sum_i \left( y_i - \bar{y}_{train} \right)^2}, \qquad (4) $$
where $y_i$ is the actual realization of the $i$-th credit line $i \in (1, \ldots, n)$ in $CCL_{test}$, $\hat{y}^M_i$ is the prediction from the respective model $M \in (1, \ldots, m)$, and $\bar{y}_{train}$ is the historical average in $CCL_{train}$. If $R^2_{OOT,p}$ is greater than zero, the considered model $M$ exhibits better predictive accuracy than the historical average. The computation of $R^2_{OOT,p}$ for a given period $p$ follows Campbell and Thompson (2008) and is commonly referred to in the literature as the out-of-sample $R^2$ ($R^2_{OOS}$). The only difference is that, in the computation of $R^2_{OOT,p}$, the time dependency of defaults is preserved, while this is not necessarily the case for $R^2_{OOS}$.
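Eq. (4) translates directly into code; the key point is that the benchmark mean comes from the training window, not the test window (a sketch with names of our choosing):

```python
import numpy as np

def r2_oot(y_test, y_pred, y_train):
    """Out-of-time R^2 per eq. (4): the benchmark is the historical
    average of the training window, not the test-sample mean."""
    y_test, y_pred, y_train = map(np.asarray, (y_test, y_pred, y_train))
    ss_model = np.mean((y_test - y_pred) ** 2)
    ss_hist = np.mean((y_test - np.mean(y_train)) ** 2)
    return 1.0 - ss_model / ss_hist
```

A model that merely reproduces the historical average scores zero; negative values mean the model is worse than that naïve predictor.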
$$ \rho(M1, M2) = \frac{Cov(\hat{e}_{M1}, \hat{e}_{M2})}{\sigma(\hat{e}_{M1})\, \sigma(\hat{e}_{M2})}, \qquad (5) $$

with $\sigma(\cdot)$ denoting the standard deviations of the forecast errors of $M1$ and $M2$, and $Cov(\cdot, \cdot)$ denoting the covariance between the forecast errors of $M1$ and $M2$.
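Eq. (5) is simply the Pearson correlation of the two models' error series; a sketch:

```python
import numpy as np

def forecast_error_correlation(e1, e2):
    """Correlation of two models' forecast errors per eq. (5)."""
    e1, e2 = np.asarray(e1, dtype=float), np.asarray(e2, dtype=float)
    cov = np.mean((e1 - e1.mean()) * (e2 - e2.mean()))
    return cov / (e1.std() * e2.std())
```

Low across-group correlations indicate genuinely different decision surfaces, which is exactly what makes averaging the models worthwhile.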
4. Empirical Results
In this section, we begin by taking a statistical perspective to provide insights into the
distribution and dispersion of performance metrics, the correlations of forecast errors, and the
distribution of actual vs. predicted values for CCLs across the globe. Next, we dissect out-of-time performance across different asset classes, industries, and regions.
Table 3 shows a comparative analysis across different model types for modeling both credit
risk parameters, CCF (Panel A) and RR (Panel B). In our results, we distinguish between the
benchmark and nonlinear ensembles (BM-EN and NL-EN) as well as the averages of the
included composites, tree-based methods, nonlinear methods, and linear methods. 10
Overall, we find large benefits in using sophisticated ML methods. The NL-EN produces an $R^2_{OOT}$ of 43.5% and 19.7% for CCF and RR, respectively, which is higher than the BM-EN
by a factor of 2.7 and 3.1, while being much more flexible and universally applicable in
common banking practice. The downward shift in performance for RR indicates that RR
estimation is relatively more difficult; this is intuitively plausible since CCF is estimated for a
one-year horizon, whereas RR is estimated for a three-year recovery horizon. Turning to the
NL-EN Composite (i.e., Cubist, RF, and S-GBM), we observe an average (minimum;
maximum) performance increase relative to the BM-EN in the order of 160.8% (156.9%;
168.4%) and 169.9% (160.6%; 188.0%) for CCF and RR, respectively. This in turn implies
that, on average (minimum; maximum), the NL-EN Composite falls short in performance
relative to the NL-EN in the order of 4.3% (1.5%; 5.7%) and 13.0% (7.2%; 16.0%) for CCF
and RR, respectively, resulting in a positive impact of model averaging in the range of a 1.5% to 19.1% increase in $R^2_{OOT}$.

10 For completeness, we report the results for the individual methods in the Internet Appendix Part C (Table C.3).
Next, we consider in Figure 1 the dispersion of predictive performance over time, using the annual unweighted $R^2_{OOT,p}$ values. Given that we operate in a highly regulated domain, this
markedly outperform the benchmark models in all out-of-time years. It follows that in certain
years a very naïve predictor, i.e., the historical average, would have been the preferred choice
11 This also provides two avenues for future research—skipping the pre-selection of macroeconomic features and
specific tailoring of the ML methods to the different credit risk parameters—to further boost the performance of
ML methods compared to our benchmarks.
methods forming the NL-EN, both on average (circle) and median (bar), and even by a much
larger margin compared to the benchmark models. As an alternative way to explore the
differences between predictive techniques, we also examine the overall distribution of actual
vs. predicted values for both credit risk parameters. We find that the predicted values of the
nonlinear ensemble better mirror the actual distributions, which explains the better predictive
performance of the sophisticated ML methods. 12
Taken together, these results provide strong evidence for the benefits of a more holistic
approach to credit risk modeling that reinforces the use of identical model implementations for
different credit risk parameters. It is important to recognize, however, that the improvements
achieved through using sophisticated ML methods are substantial not only from a statistical
perspective, but also from an economic perspective for several reasons. One consideration is
that a holistic approach to credit risk modeling has a potentially tremendous impact from an
operational perspective. The ability to rely on uniform modeling approaches for different credit
risk parameters translates into a potentially massive reduction in the operational resources
required for model development and validation for financial institutions, regulatory authorities,
and academics. Another consideration is the monetary perspective, for which we estimate the
lower bound of the impact of applying ML methods to our dataset to be in the order of EUR 5
bn in terms of a more accurate prediction of EaD. 13 As banks apply the calibrated models to
their non-defaulted portfolios, these numbers can translate into values that are many times
12 We report the corresponding figure in the Internet Appendix Part C (Figure C.1).
13 The NL-EN outperforms the BM-EN by more than 30 percentage points, with values of 0.73 and 1.03,
respectively, based on the MAE as performance metric. Considering a cumulative limit and an outstanding amount
one year prior to default of EUR 50 bn and EUR 35 bn, respectively, and recalling the EaD formula in eq. (2), the
application of ML methods leads to a more accurate prediction of the EaD by about EUR 5 bn.
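One way to read the arithmetic in footnote 13, via eq. (2) (a back-of-envelope sketch; the paper's exact aggregation may differ):

```python
# All figures are from footnote 13; this simply retraces its arithmetic.
mae_nl_en, mae_bm_en = 0.73, 1.03        # CCF-level MAE of NL-EN vs. BM-EN
limit_bn, outstanding_bn = 50.0, 35.0    # cumulative CL and e one year prior to default
undrawn_bn = limit_bn - outstanding_bn   # EUR 15 bn of undrawn commitments
# By eq. (2), a CCF error is scaled by (CL - e) when mapped to the EaD level:
ead_gain_bn = (mae_bm_en - mae_nl_en) * undrawn_bn  # about EUR 4.5 bn, i.e., "EUR 5 bn" order
```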
Next, we turn to forecast error correlations according to eq. (5) in an effort to better understand
the differences in predictive abilities across predictive techniques. 14 The main insight is that
two groups clearly show high within-group correlations: one group consists of the benchmark
models, and the other group consists of the sophisticated ML models. Numerically, we find
high correlation values of $\rho > 0.95$ for the group of benchmark models for both credit risk parameters. For the ML models, this insight holds for CCF, but only to a lesser extent for RR, with $\rho \approx 0.90$ for the individual models. Recalling the considerably larger benefits of model averaging for RR, this observation reinforces the idea that ensembles are able to combine the nuances of the different models to create an even more powerful predictive technique. By construction, the correlation levels of the ensembles and the contained models (i.e., composite) are relatively high, with $\rho > 0.96$ and in many cases even $\rho > 0.99$. At the same time, across-group correlations between the NL-EN and BM-EN are significantly lower, with $\rho \approx 0.88$ and $\rho \approx 0.74$ for CCF and RR, respectively. Viewing predictive abilities from this perspective
suggests that the decision surface embedded in the different model types (i.e., benchmark
models and ML models) is inherently different, which is intuitively plausible given that some
benchmark models are highly specialized in predicting specific credit risk parameters, while
ML models are much more flexible in their functional forms. Recognizing that the correlation
levels of the ensembles and the contained models are relatively high, while at the same time
the ensembles produce better predictive performance, in what follows we mainly compare the
nonlinear to the benchmark ensemble.
14 For brevity, we report the forecast error correlations in the Internet Appendix Part C (Table C.4).
To complete the picture, we now investigate whether the superior predictive abilities also hold
across different asset classes, industries, and regions. The analyses build on the models trained
on the respective sample available (expanding window) and are aggregated into a weighted
average, in line with the baseline analysis from Section 4.1.1. To produce the asset class-
specific metrics, we use the average from the training set for a particular asset class in the computation of $R^2_{OOT,p}$. The same procedure is used for industry- and region-specific
dissections. Table 4 shows the dissections for both parameters (CCF and RR). Panel A shows
the predictability across asset classes, Panel B across industries, and Panel C across regions.
Across the different dissections, we reach the same conclusion as in our main analysis.
The NL-EN provides superior predictive abilities across all subsamples, regardless of the risk
parameter considered, with observed performance increases for the NL-EN of at least 83% for
each dissection and even several times higher for both credit risk parameters in multiple
dissections. On average, we find a performance increase from the BM-EN to the NL-EN of
194.9% and 266.6% for CCF and RR, respectively, which is even more pronounced compared
to our main analysis, in which we observed a performance increase of 172.4% and 210.3%,
respectively. This advocates for the flexibility of sophisticated ML methods, which seem to be
able to better capture the nuances of the modeling problem with multiple industries, different
asset classes, and regions.
Superior predictive abilities alone are not sufficient in a highly regulated environment. To gain
a better understanding of the superior predictive abilities in notoriously ‘black box’ models, we
explore techniques from the explainable ML toolbox. To this end, we unpack the main drivers
of predictability for individual features (Figure 2), outline the sensitivity of the impact of the
most important features (Figure 3), and provide insights into the dynamics of feature
importance over time. To measure the importance of a given feature, we rely on Shapley values,
which are based on a concept borrowed from cooperative game theory and approximate the
average marginal contribution of each feature. For implementing Shapley values, we use the
package fastshap in R (Greenwell 2020).
A natural starting point for understanding the superior predictive abilities of an ML model is
the relative importance of the most important features to the model’s performance. The features
in Figure 2 are normalized such that they sum to one across all considered features, allowing
for relative interpretation, and are sorted by their overall importance for the NL-EN. For
brevity, we report only the top-15 features, which cover around 90% (CCF) and 78% (RR) of
the total importance for the NL-EN. We also show the min-max range and median feature
importance of the composite models (i.e., Cubist, RF, and S-GBM). Importantly, by virtue of
the design of the NL-EN, the feature importance for the NL-EN reflects on the credit line-level
the equally weighted average across the composite models; this is ensured by the linearity
axiom for Shapley values. Formally, this can be expressed as follows: the effect of a given feature on a weighted sum of two functions is the same weighted sum of its effects on each function; i.e., $\phi_i(\alpha f + \beta g) = \alpha \phi_i(f) + \beta \phi_i(g)$ for any two models $f$ and $g$ and scalars $\alpha$ and $\beta$.
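The linearity axiom can be verified numerically with exact Shapley values of toy value functions; this self-contained illustration is ours (the paper uses fastshap approximations on fitted models):

```python
from itertools import combinations
from math import factorial

def shapley(value, players):
    """Exact Shapley values of a cooperative game `value` over `players`."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(n):
            for s in combinations(others, r):
                # Standard Shapley weight for a coalition of size |s|.
                w = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                total += w * (value(frozenset(s) | {i}) - value(frozenset(s)))
        phi[i] = total
    return phi

# Two toy games and their equally weighted average (alpha = beta = 1/2),
# mimicking how the NL-EN averages its composite models.
f = lambda s: len(s) ** 2
g = lambda s: 3.0 if 1 in s else 0.0
avg = lambda s: 0.5 * f(s) + 0.5 * g(s)

players = [1, 2, 3]
phi_f, phi_g, phi_avg = shapley(f, players), shapley(g, players), shapley(avg, players)
# Linearity: phi_i(0.5 f + 0.5 g) == 0.5 phi_i(f) + 0.5 phi_i(g) for each i.
assert all(abs(phi_avg[i] - 0.5 * phi_f[i] - 0.5 * phi_g[i]) < 1e-9 for i in players)
```

This is why the NL-EN's feature importance is, at the credit-line level, exactly the equally weighted average of the composites' Shapley values.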
To better understand the feature importance, we start by discussing the overall results
based on the NL-EN. For the CCF, we find that the undrawn percentage is by far the most
important feature, accounting for more than 20% of total importance. The picture is less
Most striking in this figure, however, is the large variation of feature importance for
specific features, indicated by the wide range of min-max values and min-max ranks suggested
by the different models. This clearly emphasizes the suitability of ensembles for tackling the issue of model multiplicity, which refers to models with similar predictive accuracy but different decision surfaces, for example, due to differences in the importance of certain features. We mostly observe a feature importance for the NL-EN within
the range of the composite models, which is reasonable as the NL-EN is an equally weighted
average across the composite models. However, we not only observe cases where the
composite models assign a different importance to certain features (e.g., with rank differences
of up to 27), but also that the feature importance suggested by the NL-EN, with a value of less
than 6% of the total importance for the outstanding amount (Panel A – CCF), is substantially
[15] We report the results of the group feature importance plots in the Internet Appendix Part C (Figure C.2); for this purpose, we allocate each feature to one of four groups: borrower-specific features, credit line-specific features, security-related features, and macroeconomic features.
To provide additional perspective, we analyze the sensitivity of the impact of the most important features. To this end, since we are interested in the impact across feature values (i.e., across the distribution), we first average the Shapley values for a given feature level before dividing the CCLs into 20 buckets according to the feature values.[16] Figure 3 shows the mean and SD of Shapley values across buckets, with feature values normalized between 0 and 1 for visualization.
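The averaging and bucketing steps described above can be sketched as follows (illustrative Python with pandas; the synthetic data stand in for the actual CCL sample):

```python
import numpy as np
import pandas as pd

def shapley_sensitivity(feature, shap, n_buckets=20):
    """Mean and SD of Shapley values across equal-width buckets of a feature,
    with feature values normalized to [0, 1] for visualization."""
    df = pd.DataFrame({"value": feature, "shap": shap})
    # first average the Shapley values for each distinct feature level
    df = df.groupby("value", as_index=False)["shap"].mean()
    v = df["value"]
    df["value_norm"] = (v - v.min()) / (v.max() - v.min())
    # then divide the observations into n_buckets buckets of feature values
    df["bucket"] = pd.cut(df["value_norm"], bins=n_buckets,
                          labels=False, include_lowest=True)
    return df.groupby("bucket")["shap"].agg(["mean", "std"])

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 100.0, size=1000)
phi = 0.5 - 0.01 * x + rng.normal(scale=0.05, size=1000)  # synthetic Shapley values
out = shapley_sensitivity(x, phi)
print(out.head())
```

With this synthetic downward-sloping relation, the bucket means decline from positive to negative across the feature range, mirroring the kind of pattern plotted in Figure 3.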
For the credit limit, which is one of the most important features for both credit risk
parameters, we find that relatively smaller credit limits affect the CCF prediction positively,
while relatively larger credit limits affect the CCF prediction negatively. This result nicely
brings together two well-documented facts from the literature: first, smaller firms (with on
average much lower credit limits) rely more heavily on CCLs as a source of financing
[16] To complement this, in the Internet Appendix Part C (Figure C.3) we report bee-swarm plots for the top-15 features that provide further understanding of the directional impact on predictions.
The large overall impact of credit line-specific features, the large variations in the importance of some features, and the fact that model refitting is costly in practice raise the intriguing question of the dynamics of feature importance over time.[18] Overall, we find a general trend with some fluctuations. The undrawn percentage in the CCF panel consistently ranks first, which could be expected given its significantly greater importance compared to all other
[17] In the Internet Appendix Part C (Figure C.4), we present the sensitivities for the credit limit and outstanding amount for the different models, i.e., the NL-EN and the composite models, to illustrate this behavior.
[18] We show the feature importance dynamics over time in the Internet Appendix Part C (Figure C.5).
In summary, we have used various means to open the 'black box' of ML methods in credit risk modeling, particularly based on nonlinear ensembles, to understand the drivers of their superior predictive abilities. We think this is a critical enabler for the acceptance and adoption of ML applications in financial services, where the practical implementation of ML models is often hindered by stakeholders' desire for model interpretability.
[19] In all robustness checks and extensions, we use the macroeconomic features selected in the main analysis unless otherwise specified.
To provide further insight, we extend our main analyses of CCLs to non-core corporate
borrowing segments (i.e., specialized lending) and private banking (i.e., high net worth
individuals). To the best of our knowledge, this is the first treatment of these two asset classes
in the literature with respect to the credit risk parameters CCF and RR. Even though these asset classes are less common, they comprise credit lines with a defaulted exposure of EUR 8 bn for specialized lending and EUR 1 bn for private banking. In this exercise, we simply use the same
approach as for the corporate segment; that is, we use the same credit line-specific features,
selected macroeconomic features, and hyperparameter set and do not address the unique
technicalities of these specific asset classes. The results for both asset classes are qualitatively
consistent with our main conclusions; the nonlinear ensemble yields a positive R²_OOT in all specifications, while for the benchmark ensemble this is the case only for the RR in private banking.
6. Conclusion
Based on the world’s largest loss database of corporate defaults, we perform a comparative
analysis of ML methods in credit risk modeling across the globe, covering many systemically
important banks. We find that ML methods—including a battery of individual methods and
forecast ensembles—provide superior predictive abilities along multiple dimensions (i.e., asset
classes, industries, and regions) and consistently outperform benchmarking methods (e.g.,
mixture models). Providing a consistent modeling exercise and benchmarks that work well for different credit risk parameters around the globe reinforces the benefits of a more holistic approach to credit risk modeling, one that is also much more flexible and universally applicable in common banking practice. From a practical perspective, the ability to rely on uniform modeling
approaches for different credit risk parameters translates into potentially massive reductions in
operational resources required for model development and validation.
Our results are robust to a battery of different specifications and important not only from a statistical point of view but also from an economic and regulatory perspective. Applied to our global default database, we estimate that the more accurate EaD predictions of ML methods amount to at least approximately EUR 5 bn; this figure multiplies when the calibrated models are applied to the banks' portfolios of non-defaulted facilities, potentially by a factor of more than 50, considering that the average default rate of corporate borrowers is
typically less than 2%. Overall, we provide benchmarks that work well for various credit risk
parameters in a holistic credit risk modeling exercise across the globe and present guidelines
for selecting, implementing, and validating ML approaches in credit risk modeling that we
expect to be relevant to financial institutions, regulatory authorities, and academics.
References
Acharya, V. V., & Steffen, S. (2020). The Risk of Being a Fallen Angel and the Corporate
Dash for Cash in the Midst of COVID. Review of Corporate Finance Studies, 9, 430–
471.
Ahir, H., Bloom, N., & Furceri, D. (2022). The World Uncertainty Index. Working Paper.
Altman, E. I., & Kalotay, E. A. (2014). Ultimate Recovery Mixtures. Journal of Banking &
Finance, 40, 116–129.
Athey, S., & Imbens, G. W. (2019). Machine Learning Methods that Economists Should Know
About. Annual Review of Economics, 11, 685–725.
Baker, S. R., Bloom, N., & Davis, S. J. (2016). Measuring Economic Policy Uncertainty.
Quarterly Journal of Economics, 131, 1593–1636.
Baker, S. R., Bloom, N., Davis, S. J., & Kost, K. (2021). Policy News and Stock Market
Volatility. Working Paper.
Bali, T. G., Beckmeyer, H., Moerke, M., & Weigert, F. (2023). Option Return Predictability
with Machine Learning and Big Data. Review of Financial Studies, forthcoming.
Bandiera, O., Prat, A., Hansen, S., & Sadun, R. (2020). CEO Behavior and Firm Performance.
Journal of Political Economy, 128, 1325–1369.
Bank of Canada. (2018). The Bank of Canada’s Financial System Survey.
Bank of England. (2019). Machine Learning in UK Financial Services.
Barboza, F., Kimura, H., & Altman, E. (2017). Machine Learning Models and Bankruptcy
Prediction. Expert Systems with Applications, 83, 405–417.
Basel Committee on Banking Supervision. (2001). The Internal Ratings-Based Approach.
Basel Committee on Banking Supervision. (2019). CRE 36.
Bastos, J. A. (2010). Forecasting Bank Loans Loss-Given-Default. Journal of Banking &
Finance, 34, 2510–2517.
Bastos, J. A., & Matos, S. M. (2022). Explainable Models of Credit Losses. European Journal
of Operational Research, 301, 386–394.
Bellotti, A., Brigo, D., Gambetti, P., & Vrins, F. (2021). Forecasting Recovery Rates on Non-performing Loans with Machine Learning. International Journal of Forecasting, 37, 428–444.
Berg, T., Burg, V., Gombović, A., & Puri, M. (2020). On the Rise of Fintechs: Credit Scoring
Using Digital Footprints. Review of Financial Studies, 33, 2845–2897.
Berg, T., Saunders, A., & Steffen, S. (2021). Trends in Corporate Borrowing. Annual Review
of Financial Economics, 13, 321–340.
Betz, J., Kellner, R., & Rösch, D. (2021). Time Matters: How Default Resolution Times Impact
Final Loss Rates. Journal of the Royal Statistical Society: Series C (Applied Statistics),
70, 619–644.
Betz, J., Nagl, M., & Rösch, D. (2022). Credit Line Exposure at Default Modelling Using
Bayesian Mixed Effect Quantile Regression. Journal of the Royal Statistical Society:
Series A (Statistics in Society), 185, 2035–2072.
Bianchi, D., Büchner, M., & Tamoni, A. (2021). Bond Risk Premiums with Machine Learning.
Review of Financial Studies, 34, 1046–1089.
Bracke, P., Datta, A., Jung, C., & Sen, S. (2019). Machine Learning Explainability in Finance:
An Application to Default Risk Analysis. Bank of England Staff Working Paper No. 816.
[Figure: Out-of-time R² (R²_OOT, in percent) of the benchmark models (LR, FRR, FMM, BEINF, BM-EN) and the ML models (Cubist, RF, S-GBM, NL-EN). Panel A: CCF (axis range −10.0 to 70.0); Panel B: RR (axis range −10.0 to 30.0).]
[Figure 2: Feature importance of the top-15 features, normalized to sum to one. Shown are the NL-EN importance together with the min-max range and median across the composite models; rank_min/rank_max report each feature's best and worst rank across the composite models. Panel A (CCF): rank_min 1, 2, 3, 2, 4, 3, 6, 5, 7, 7, 8, 12, 11, 12, 14; rank_max 1, 18, 8, 5, 6, 4, 10, 9, 11, 11, 13, 16, 14, 39, 21. Panel B (RR): rank_min 1, 1, 4, 2, 2, 2, 5, 8, 7, 10, 12, 9, 11, 13, 15; rank_max 3, 3, 6, 5, 8, 6, 8, 10, 28, 13, 13, 21, 22, 19, 17.]
[Figure 3: Mean and SD of Shapley values across 20 buckets of feature values (normalized between 0 and 1) for the most important features: limit, outstanding_amount, undrawn_pct, and utilization_rate. Panel A: CCF; Panel B: RR.]
In this section, we describe the randomized least absolute shrinkage and selection operator
(lasso) procedure used to select macroeconomic features. Technically, the randomized lasso is
comparable to the adaptive lasso introduced by Zou (2006). However, instead of choosing a
tuning parameter in the first stage as in the adaptive lasso (e.g., by using a ridge regression),
the randomized lasso changes the penalty to a randomly chosen value; the empirical implementation is then straightforward, as it amounts to rescaling the relevant predictors. Meinshausen and
Bühlmann (2010) showed that this generalization of the lasso selects variables consistently
even if the necessary conditions for the consistency of the original lasso are violated.
The randomized lasso, when applied with only one random perturbation, is not a useful
selection algorithm since the selection is strongly influenced by the single randomization of
the input features. However, applying this randomization many times and identifying features
that are regularly selected provides a powerful selection algorithm with desirable properties. In
our implementation, we use the stability selection method (Meinshausen and Bühlmann 2010);
stability selection is a general technique for improving the performance of a feature selection
algorithm (e.g., the randomized lasso) based on the aggregation of the results obtained by applying a selection procedure to N data subsamples of size n/2. A stable set of features is determined by running any preferred selection algorithm on each subsample. A feature is part of the stable feature set if the proportion of times the feature is selected across all N subsamples exceeds a pre-defined threshold; we use a threshold of 0.60 in this paper. For implementing the stability selection method, we use the package stabs in R (Hofner and Hothorn 2021).
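Under this description, the procedure can be sketched as follows (illustrative Python with scikit-learn rather than the stabs R package; the data, penalty level, and the 'weakness' rescaling range are hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.05, weakness=0.5, n_subsamples=100,
                        threshold=0.60, rng=None):
    """Stability selection with a randomized lasso: refit on subsamples of size
    n/2 with randomly rescaled predictors and keep features selected in more
    than `threshold` of the runs."""
    rng = np.random.default_rng(rng)
    n, k = X.shape
    hits = np.zeros(k)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        # randomized lasso: rescale each predictor by a random factor in [weakness, 1]
        scale = rng.uniform(weakness, 1.0, size=k)
        model = Lasso(alpha=alpha).fit(X[idx] * scale, y[idx])
        hits += (model.coef_ != 0)
    return hits / n_subsamples > threshold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.5, size=400)
print(stability_selection(X, y, rng=0))  # features 0 and 3 should be stable
```

Features with a genuine signal survive the random penalty perturbation in essentially every subsample, while noise features rarely clear the 0.60 threshold.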
In this section, we describe the predictive techniques used in our comparative analysis. For
brevity, we only outline the methods that form the benchmark ensemble (i.e., linear regression,
fractional response regression, and mixture models) and the machine learning (ML) methods
that form the nonlinear ensemble (i.e., random forest, stochastic gradient boosting machine,
and cubist), which we also discussed in more detail in our main analyses.
Linear Regression: We first consider the linear regression (LR) estimated via ordinary least squares (OLS). The conditional expectation is approximated by a linear function g(·), with β a vector of parameters to be estimated. The baseline specification minimizes the standard least squares objective ℒ(β) = Σᵢ (yᵢ − g(xᵢ, β))².
f(y | x) = Σ_{k=1}^{K} π_k f(y | x, θ_k), where π_k, k = 1, …, K, are the mixture weights. We estimate an FMM for each of the different numbers of components and keep only the one with the highest performance in our final analysis. For the implementation of the FMMs, we use the package flexmix in R (Gruen et al. 2020). However, there is no ready-made implementation of the group lasso procedure in flexmix; thus, we have adapted the corresponding functions where necessary.
Let g_k(·) be a link function relating θ_k to the explanatory features X_k. Following Rigby and Stasinopoulos (2005), we model the parameters of a given distribution as g_k(θ_k) = η_k = X_k β_k + Σ_{j=1}^{J_k} Z_{jk} γ_{jk}, where X_k β_k represents the parametric and Z_{jk} γ_{jk} the nonparametric terms.
Within the GAMLSS framework, we use a beta-inflated (BEINF) distribution to model the CCF and the RR. The BEINF distribution has four parameters: the mean μ and the dispersion σ of a response strictly between zero and one, the probability of a response equal to zero, p₀, and the probability of a response equal to one, p₁. The probability function of the BEINF distribution is given by Rigby et al. (2019) as follows:
f(y | μ, σ, ν, τ) =
  p₀,                                                  if y = 0,
  (1 − p₀ − p₁) · y^(α−1) (1 − y)^(β−1) / B(α, β),     if 0 < y < 1,        (1)
  p₁,                                                  if y = 1,
[1] For a more detailed discussion of FMM and the group lasso procedure, we refer the interested reader to Min et al. (2020) and Yuan and Lin (2006), respectively.
where B(α, β) represents the beta function. Tong et al. (2016) recommend using the BEINF distribution to account for the bimodal distribution of the CCF as an avenue for future research. To the best of our knowledge, however, such a model has not been implemented in the context of credit risk modeling. We do not consider the BEINF model a typical ML approach but rather a method that is particularly well suited to these distributions.
Consequently, it provides an additional sophisticated benchmark for the ML methods. In our
implementation, we use the same transformation as in the FRR, because the BEINF distribution
requires that the response is bounded within the unit interval. For the implementation, we use
the package gamlss in R (Stasinopoulos et al. 2007).
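Eq. (1) can be transcribed directly (a sketch in Python; we parameterize via (α, β, p₀, p₁) here rather than through the (μ, σ, ν, τ) links used by the gamlss package):

```python
import numpy as np
from math import gamma

def beta_fn(a, b):
    """Beta function B(a, b) = Γ(a)Γ(b)/Γ(a + b)."""
    return gamma(a) * gamma(b) / gamma(a + b)

def beinf_pdf(y, alpha, beta, p0, p1):
    """Probability function of the BEINF distribution as in Eq. (1):
    point masses p0 at y = 0 and p1 at y = 1, a rescaled beta density in between."""
    if y == 0.0:
        return p0
    if y == 1.0:
        return p1
    return (1.0 - p0 - p1) * y ** (alpha - 1) * (1.0 - y) ** (beta - 1) / beta_fn(alpha, beta)

# sanity check: the continuous part integrates to 1 - p0 - p1 = 0.7
ys = np.linspace(1e-6, 1.0 - 1e-6, 100001)
vals = np.array([beinf_pdf(y, 2.0, 3.0, 0.1, 0.2) for y in ys])
mass = np.sum((vals[1:] + vals[:-1]) / 2.0 * np.diff(ys))
print(round(mass, 3))  # → 0.7
```

The two point masses capture the frequent boundary outcomes (full recovery or total loss), while the beta component handles the interior of the unit interval.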
Random Forest (RF): The RF (Breiman 2001) is a tree-based method that improves on the variance reduction of bagging (whose essential idea is to average many noisy but approximately unbiased models to reduce the overall variance) by reducing the correlation between individual trees. The general idea is not to consider every possible feature for every split during the growing process, but only a subset of the features, m < k. This procedure helps
to reduce the correlation between individual trees in the forest, which significantly reduces the
variance of the predictor. We tune the number of features considered at each split and the tree
depth using our validation approach.
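This tuning idea can be sketched with scikit-learn (illustrative; our actual implementation uses caret in R, and the grid values below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

# tune the number of features considered at each split and the tree depth
grid = {"max_features": [2, 4, 8], "max_depth": [4, 8, None]}
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    grid, cv=3, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```

Restricting `max_features` below the full feature count is precisely what decorrelates the trees relative to plain bagging.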
sample without replacement should be used. This method is called a Stochastic Gradient
Boosting Machine (S-GBM). We implement an S-GBM as the most sophisticated boosting
method and specifically tune the number of boosting iterations, the maximum tree depth, a
shrinkage parameter, and the minimum terminal node size.
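In scikit-learn terms, the stochastic variant corresponds to setting `subsample` below one (a sketch, not our caret setup; the parameter values are hypothetical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

# stochastic gradient boosting: each tree is fit on a random subsample
# drawn without replacement
sgbm = GradientBoostingRegressor(
    n_estimators=300,        # number of boosting iterations
    max_depth=3,             # maximum tree depth
    learning_rate=0.05,      # shrinkage parameter
    min_samples_leaf=5,      # minimum terminal node size
    subsample=0.7,           # fraction of observations per tree -> 'stochastic'
    random_state=0,
).fit(X, y)
print(round(sgbm.score(X, y), 2))  # in-sample R²
```

The four tuned quantities in the text map one-to-one onto `n_estimators`, `max_depth`, `learning_rate`, and `min_samples_leaf`.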
Cubist: Cubist is an extension of the M5 model tree approach. The M5 model tree differs
considerably from the usual decision trees in several dimensions: First, the criterion for the first
split is the reduction of the standard deviation by splitting the entire dataset. After the initial
partitioning, a linear model is trained at each node using all the features chosen as splitting
criterion in the previous steps. Subsequent splits are based on the (reduction of) error rate of
the linear models. Second, at each terminal node, the outcome is predicted using a linear model
(as opposed to a simple average). Third, when predicting a new sample, the observation moves
down along the path of the tree and all predictions of the linear models in that particular path
are smoothed in a bottom-up procedure. Analogous to the M5 model tree, Cubist smooths predictions by combining multiple linear models along a tree path, where the linear combination of the individual predictions depends on the variance and covariance of the models' residuals. Cubist is a rule-based model: the final model tree is used to construct the initial rule set, and each rule is associated with a smoothed combination of multiple linear models. In a further step, rules can be pruned and/or combined based on the adjusted error rate. Ultimately, a new sample is predicted by averaging the
predictions of all smoothed linear models from the appropriate final rules. Cubist also allows
a process called committees, which is very similar to boosting in that a sequence of rule-based
models is created, with each rule-based model influenced by the previous one. We tune the
number of committees and neighbors via our validation approach.
Bagging, Boosting, Conditional Inference Tree (CI-Tree)). For more details on these methods,
we refer the interested reader to Hastie et al. (2009) and Kuhn (2013). For the implementation
of the ML methods, we use the package caret in R (Kuhn 2008).
In this section, we describe further post-estimation evaluation metrics that we employ in our study. Apart from the out-of-time R² (R²_OOT), we also implement the root mean square error (RMSE) and the mean absolute error (MAE), which are standard performance measures in the literature. The same idea regarding weighted averages also applies to these metrics. The RMSE of a given period p is calculated as RMSE_p = sqrt((1/n) Σᵢ (yᵢ − ŷᵢ^ℳ)²), and the MAE of a given period p as MAE_p = (1/n) Σᵢ |yᵢ − ŷᵢ^ℳ|. For both RMSE_p and MAE_p, smaller values imply better predictive accuracy, with zero being the lower bound.
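The two metrics in code (a trivial sketch with made-up values):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error for one out-of-time period."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))

def mae(y, y_hat):
    """Mean absolute error for one out-of-time period."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(y_hat)))

y = np.array([0.2, 0.5, 0.9])
y_hat = np.array([0.1, 0.5, 0.7])
print(rmse(y, y_hat), mae(y, y_hat))
```

Because the RMSE squares the errors before averaging, it penalizes large misses more heavily than the MAE.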
We complement the standard performance metrics with Hansen et al.'s (2011) model confidence set (MCS) procedure, which aims to identify a set of 'superior' models, M̂*_{1−α} ⊆ M₀, with confidence level 1 − α. The procedure comprises statistical tests that enable the econometrician to identify a 'superior' set of models with a certain probability (i.e., a given confidence level). To this end, sequential hypothesis tests of the null hypothesis of equal predictive ability (EPA) between the competing models in M₀ are employed. The MCS procedure is as follows: (I) Start with M₀ of dimension m. (II) Test the EPA hypothesis; if it is accepted, terminate the algorithm and set M̂*_{1−α} = M₀; otherwise, identify the model with the worst performance. (III) Remove this worst model from the set of potential 'superior' models and return to step (II). The MCS procedure is based on an arbitrary loss function ℒ(y, ŷ) and thus has various applications. The MCS M̂*_{1−α} can include a single 'best' model (m* = 1), multiple 'superior' models (m* < m), or possibly all models (m* = m). We implement the procedure in line with the performance measure in our main analysis, i.e., with a squared error loss function. For the implementation of the MCS procedure, we use the package MCS in R (Bernardi and Catania 2018).
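The elimination logic can be sketched as follows (a heavily simplified Python stand-in for the bootstrap-based EPA tests in the MCS package; the model names and loss series are synthetic):

```python
import math
import numpy as np

def mcs_sketch(losses, names, alpha=0.10):
    """Crude stand-in for the MCS elimination loop: the real procedure uses
    bootstrap-based EPA tests, whereas here a paired t-test (normal
    approximation) between the currently worst and best model decides
    whether to eliminate the worst model."""
    keep = list(range(losses.shape[1]))
    while len(keep) > 1:
        means = losses[:, keep].mean(axis=0)
        worst = keep[int(np.argmax(means))]
        best = keep[int(np.argmin(means))]
        d = losses[:, worst] - losses[:, best]
        t = d.mean() / (d.std(ddof=1) / math.sqrt(len(d)))
        p = math.erfc(abs(t) / math.sqrt(2))   # two-sided p-value, normal approx.
        if p >= alpha:                         # EPA not rejected -> terminate
            break
        keep.remove(worst)                     # eliminate the worst model
    return [names[i] for i in keep]

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n) ** 2                 # squared-error losses of the best model
losses = np.column_stack([base,
                          base + 0.05 + rng.normal(0.0, 0.05, n),
                          base + 0.50 + rng.normal(0.0, 0.05, n)])
print(mcs_sketch(losses, ["NL-EN", "BM-EN", "LR"]))  # → ['NL-EN']
```

With clearly separated mean losses, the loop eliminates the two inferior models one by one; when losses are statistically indistinguishable, several models remain in the set, mirroring the m* < m and m* = m cases above.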
In this section, we report a battery of robustness checks that we performed to verify that our
results are not affected by certain specifications in our main analysis, and we provide several
extensions. In the following analyses, we use the macroeconomic features selected in the main analysis unless indicated otherwise. The results of the robustness checks are presented in
Table C.5 for both parameters (CCF and RR).
Our preferred estimation strategy is the expanding window approach, as the regulatory
guidelines require that banks base their internal estimates on all available data and, in
particular, prescribe “a minimum data observation period that should ideally cover at least one
complete economic cycle but must in any case be no shorter than a period of seven years”
(BCBS 2019, p. 32). However, there is a reasonable concern that observations that are too
distant from the current environment are no longer relevant and, therefore, distort predictive
performance rather than supporting prediction. To address these concerns, we initialize the first
window with all defaults from 2000 to 2009. Thus, the rolling window approach is also in line
with the regulatory guidelines and covers a period of more than seven years. From then on,
however, we gradually shift not only the end of the window but also its beginning on an annual
basis; that is, the training period covers the same number of periods in all re-estimations. The
results are qualitatively and quantitatively similar, so our main conclusions remain unchanged.
The expanding window approach exhibits slightly better accuracy, which supports the idea of
the regulatory guidelines to use all available data, as more distant observations also convey
valuable information.
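The two window schemes can be sketched as follows (illustrative Python; the years mirror our setup, where the first window trains on defaults from 2000 to 2009 and predicts 2010):

```python
def expanding_windows(first_year, last_year, init_end):
    """Training always starts at first_year; the test set is the following year."""
    return [((first_year, end), end + 1) for end in range(init_end, last_year)]

def rolling_windows(first_year, last_year, init_end):
    """Training window keeps a fixed length and shifts annually."""
    length = init_end - first_year + 1
    return [((end - length + 1, end), end + 1) for end in range(init_end, last_year)]

print(expanding_windows(2000, 2012, 2009))
# → [((2000, 2009), 2010), ((2000, 2010), 2011), ((2000, 2011), 2012)]
print(rolling_windows(2000, 2012, 2009))
# → [((2000, 2009), 2010), ((2001, 2010), 2011), ((2002, 2011), 2012)]
```

Both schemes share the same first window, so any performance gap arises purely from whether older observations are retained or discarded.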
In the out-of-time setting of our main analysis, it is implicitly assumed that the models are re-
calibrated annually. This assumption is in line with the regulatory considerations, from, for
example, the European Banking Authority (2016), BCBS (2019), or European Central Bank
(2019), which require the validation of rating systems or the review of internal estimates to be
carried out at least annually and all relevant and available information to be considered.
However, model development is both time-consuming and costly in practice and, therefore, it
can be reasonably argued that financial institutions do not undertake a full re-calibration every
year. For this reason, we use the models not only to predict the subsequent year but extend the
predictive horizon to three years; that is, if the model training is based on defaults up to period
p, the test set contains defaults from periods p+1, p+2, and p+3. This predictive horizon is
motivated by two observations: (i) competent authorities are required under Article 101(1) of
the Capital Requirements Directive to review compliance with the requirements for the use of
internal models at least every three years, and (ii) financial institutions are required to use rating
systems, for example, for internal risk management purposes, in line with the minimum
requirements for at least three years prior to applying for the use of the internal model
(European Banking Authority 2016; BCBS 2019). The results are qualitatively and
quantitatively similar and confirm our main conclusions. Minor deteriorations in performance
metrics could be expected, as the predictive horizon is considerably longer.
In our main analysis, we find that ML methods have superior predictive abilities for the pooled data of the core corporate exposures, SMEs and LCs.[1] However, the regulatory guidelines allow banks to distinguish between these two types of exposure (BCBS 2019). SMEs are defined as exposures in which the consolidated group revenue is less than EUR 50 million.
Therefore, we re-estimate the nonlinear ensemble and the benchmark ensemble for both asset
classes individually and find qualitatively unchanged results; that is, ML methods outperform
the benchmark models by a large margin. Quantitatively, we find comparable results; that is, the R²_OOT values of the nonlinear ensemble are within a range of less than 2 percentage points
compared to the main analysis where we used pooled data. Thereby, we find slightly stronger
performance metrics for the CCF when we use pooled modeling data, suggesting that the
information conveyed by exposures to SMEs and LCs is in this context to some extent
[1] Pooled data in this context means that the training and test data contain information from both asset classes. In the dissections of our main analysis, we use the same training data but report the post-estimation evaluation metrics for both asset classes independently. However, in this robustness check, we already split the training data into the respective asset classes. The same idea applies to the following robustness checks where, for example, different geographic regions are used.
complementary and that the additional information proves useful in model fitting for the
amount drawn down at the time of default (EaD). This is intuitively plausible since, for
example, the SME group also includes credit lines with very large exposures for which the
additional information from the LC group might be useful, and vice versa. In summary,
depending on the specific design of the credit risk modeling exercise, both options—pooled
and individual modeling of asset classes—appear to be comparable and valid, which also
supports the regulatory guidelines that permit, but do not necessarily require, a distinction
between the two exposures.
Most of the credit risk studies have a homogeneous setting, with internal data obtained from a
specific bank (Tong et al. 2016; Gürtler et al. 2018; Bellotti et al. 2021) or geographic
restrictions to, for example, the United States (Nazemi and Fabozzi 2018; Min et al. 2020) or
Europe (Bellotti et al. 2021). Our approach is unique in that we conduct a global benchmarking
study based on the world’s largest loss database. However, to ensure that our key conclusions
are not influenced by our unique global setting, we re-estimate the nonlinear ensemble and the
benchmark ensemble for credit lines from the United States and Europe individually, with
credit lines from the United States accounting for less than half of all observations for both
credit risk parameters and observations from Europe accounting for less than a quarter of all
observations for both credit risk parameters. We find that our main conclusions remain
qualitatively unchanged. However, a direct quantitative comparison with the results of our
main analysis is flawed because we do not have a true comparison group. In our main analysis,
Americas covers both North and South America, while Non-Americas covers all other regions.
However, for both subsamples and both credit risk parameters, we observe an increase in the R²_OOT value of the nonlinear ensemble by at least 105.0%, confirming our main conclusions
even in more homogeneous settings.
In Section 2.2 of the main paper, we have already addressed the problem of a resolution bias
(i.e., an underestimation of recent losses). In our main analysis, we consider a three-year
recovery horizon, as 80% of the workout processes are completed after three years. However,
since three years is somewhat arbitrary, we re-estimate our analyses with a two-year and a five-
year recovery horizon. After two years, around half of the cases are resolved, whereas after
five years, approximately 95% of the cases are resolved. Our main conclusions remain
qualitatively unchanged for the different recovery horizons, with quantitatively slightly weaker and stronger R²_OOT values for the two-year and five-year horizons, respectively.
In our main analysis, we split our observations into a training set and a test set based on the
default date of the credit lines. One concern with this procedure, however, is that from an
operational perspective, banks must follow a specific credit line over a certain period (e.g.,
three years) to obtain the RR for a three-year recovery horizon. This means that, in practice,
banks must ensure that the observations in the training set are at least three years prior to the
start of the test set. In our specific example, we estimate the first expanding window with
training data from 2000 to 2009 and consider a three-year recovery horizon. This means that
our first out-of-time set of defaults would be from 2013, rather than 2010, resulting in a massive
loss of observations. However, to rule out the possibility that the procedure in our main analysis
affects our main conclusions, we re-estimate the nonlinear ensemble and the benchmark
ensemble with a waiting period of three years (i.e., our first out-of-time set is from 2013). A priori, the results of this re-estimation and our main analysis are not comparable because they
average over different out-of-time periods. To ensure comparability, we use only the results
from the same period as our main analysis. We find that our conclusions remain qualitatively unchanged for this specification. For R²_OOT, we see a slight performance decrease for the nonlinear ensemble of less than 2 percentage points, but our main specification seems to benefit rather than hurt the benchmark ensemble: in our main analysis, the R²_OOT value for the nonlinear ensemble is higher by a factor of more than 3 compared to the benchmark ensemble, and in our waiting scenario, this factor increases to more than 8.5. Small deteriorations in performance could be expected, as in our re-estimation, we predict the RR effectively seven years after the training data, and not just four years after. Consequently, the introduction of a waiting period for our RR analysis does not affect the overall conclusions of our study.
Finally, we offer an extension in terms of the non-core corporate borrowing segments (i.e.,
specialized lending) and private banking (i.e., high net worth individuals). To the best of our
knowledge, this is the first treatment of these two asset classes in the literature with respect to
the credit risk parameters CCF and RR. Although these asset classes are less common, in our
dataset, they still comprise credit lines with a total EaD of EUR 8 bn for specialized lending
and EUR 1 bn for private banking. Specialized lending covers five sub-classes: (1) project
finance, (2) object finance, (3) commodities finance, (4) income-producing real estate, and (5)
high-volatility commercial real estate (BCBS 2019). In our estimation, we simply use the same
approach as for the corporate segment; that is, we use the same credit line-specific features,
selected macroeconomic features, and hyperparameter set and do not address the unique
technicalities of these specific asset classes. Therefore, our estimates represent at most a lower
bound of the predictive abilities of ML methods for these asset classes. For the specialized lending segment, the benchmark ensemble yields a negative R²_OOT value for both credit risk parameters, while the nonlinear ensemble yields a positive R²_OOT value. These observations
confirm our main conclusions regarding the superiority of the nonlinear ensemble and strongly
support the fact that ML methods are much more flexible and universally applicable in common
banking practice.
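A forecast ensemble like those compared here is built by combining member forecasts. A minimal sketch in the spirit of the nonlinear ensemble, assuming an equal-weight average of the ML members (RF, Cubist, S-GBM) as the combination rule; the forecast values are hypothetical:

```python
import numpy as np

# Hypothetical RR forecasts of the three ML members for five credit lines
forecasts = {
    "RF":     np.array([0.62, 0.31, 0.80, 0.15, 0.47]),
    "Cubist": np.array([0.58, 0.35, 0.77, 0.20, 0.50]),
    "S-GBM":  np.array([0.60, 0.28, 0.83, 0.12, 0.44]),
}

# Equal-weight combination of the nonlinear members
nl_ensemble = np.mean(list(forecasts.values()), axis=0)
print(np.round(nl_ensemble, 3))
```

Averaging over members with only moderately correlated errors is what stabilizes the combined forecast relative to any single model.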
For private banking, our general modeling approach is similar, but due to the specific characteristics of the asset class, we exclude the industry features, the operating company indicator, the syndicate indicator, and the number of guarantors, all of which vary little or not at all, as they are simply not applicable to this asset class in most cases. Our simple approach could thus lead to even greater performance deteriorations; since state-of-the-art ML methods can typically handle large feature spaces, additionally including asset class-tailored features would likely further increase the benefit of ML methods. Nevertheless, a comparable picture emerges when we compare the nonlinear ensemble with the benchmark ensemble. The nonlinear ensemble again yields positive R²_OOS values for both credit risk parameters, while this is the case for the benchmark ensemble only for the RR. Even in this case, however, the performance increase of the nonlinear ensemble exceeds 24%.
In summary, despite our simple approach, our core results are applicable outside the core
corporate banking segment, with superior predictive abilities of ML methods across different
segments and credit risk parameters.
Panel B: RR

         FRR    BEINF  FMM    BM-EN  RF     Cubist S-GBM  NL-EN
LR       0.996  0.979  0.986  0.995  0.703  0.654  0.762  0.727
FRR             0.982  0.992  0.998  0.711  0.659  0.770  0.734
BEINF                  0.978  0.990  0.720  0.644  0.766  0.728
FMM                           0.995  0.721  0.670  0.778  0.744
BM-EN                                0.718  0.661  0.773  0.737
RF                                          0.884  0.916  0.962
Cubist                                             0.882  0.966
S-GBM                                                     0.963
[Figure: Forecast densities. Panel A: CCF — density of forecasts (x-axis: CCF, −5.0 to 5.0). Panel B: RR — densities of the realized RR and of the NL-EN and BM-EN forecasts (x-axis: RR, −0.1 to 1.1).]
[Figure: Feature importance by feature group. Panel A: CCF — credit line-specific, macros, securities, borrower-specific. Panel B: RR — credit line-specific, securities, borrower-specific, macros.]
[Figure: Shapley value summary plots, colored by feature value (low to high); x-axis: Shapley value (sensitivity of impact on model output). Panel A: CCF — seniority, credit_line_age, committed, currency, asset_class, collateral, utilization_rate_h, wui_gdpweighted (x-axis range −2.00 to 2.00). Panel B: RR — committed, limit, operating_firm, outstanding_amount, limit_increase, gdp_deflator, undrawn_pct, collateral, utilization_rate, numb_cl, credit_line_age, currency, seniority, ted_spread, industry (x-axis range −0.50 to 0.20).]
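Shapley values attribute each forecast to the input features by averaging marginal contributions over feature orderings. The paper uses the approximation implemented in the fastshap package; purely as an illustration of the underlying definition, the following brute-force sketch enumerates all orderings exactly (feasible only for a handful of features; the toy model and baseline are hypothetical):

```python
from itertools import permutations

def exact_shapley(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    `model` maps a full feature vector to a prediction; features not yet
    'revealed' in an ordering are held at their baseline values.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = model(current)
        for j in order:
            current[j] = x[j]          # reveal feature j
            new = model(current)
            phi[j] += new - prev       # marginal contribution of j
            prev = new
    return [p / len(orderings) for p in phi]

# Toy model with an interaction term between the two features
model = lambda v: v[0] + 2 * v[1] + v[0] * v[1]
phi = exact_shapley(model, x=[1.0, 1.0], baseline=[0.0, 0.0])
print([round(p, 3) for p in phi])  # → [1.5, 2.5]
```

By construction, the attributions sum to the difference between the prediction at x and at the baseline, which is what makes Shapley values additive decompositions of individual forecasts.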
[Figure: Shapley dependence plots (legend: NL-EN, RF, Cubist, S-GBM; x-axis: feature value, 0.00 to 1.00; y-axis: Shapley value). Panel A: CCF (feature names not recoverable). Panel B: RR — limit and outstanding_amount.]
Rank                2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  Avg.
seniority              6     8     6     7     7     7     8    11    11    10   8.1
credit_line_age       10     9    10    10     9     9     9    10    10    11   9.7
committed             11    10    12     9    10    10    11     9     9     9  10.0
currency               9    11     9    12    11    11    10     8     8     8   9.7
asset_class           12    13    13    13    14    12    12    12    12    12  12.5
collateral            14    12    11    11    12    13    15    15    15    15  13.3
utilization_rate_h    13    14    15    15    15    15    14    13    13    13  14.0
wui_gdpweighted       15    15    14    14    13    14    13    14    14    14  14.0

Panel B: RR

Rank                2010  2011  2012  2013  2014  2015  2016  2017  Avg.
committed              1     1     2     1     1     1     2     1   1.2
limit                  2     2     1     2     2     3     1     2   1.9
operating_firm         3     3     5     4     7     6     3     3   4.2
outstanding_amount     6     4     4     5     3     2     7     6   4.6
limit_increase         5     5     3     3     4     7     8     8   5.4
gdp_deflator           4     6     6     6     6     5     4     5   5.2
undrawn_pct            7     7     7     7     5     4     6     7   6.2
collateral             8     8     8     8     9    10    12    10   9.1
utilization_rate      15    15     9     9     8     8     5     4   9.1
numb_cl                9     9    10    10    11    12     9    11  10.1
credit_line_age       11    10    11    11    10    11    13    14  11.4
currency              12    11    12    12    12    14    11    15  12.4
seniority             14    14    15    13    14     9    10     9  12.2
ted_spread            10    12    13    14    13    13    15    13  12.9
industry              13    13    14    15    15    15    14    12  13.9
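The "Avg." column is the simple mean of the yearly ranks, and sorting by it recovers the ordering of the rows. A quick check on three rows of Panel B:

```python
# Yearly importance ranks for three RR features, taken from Panel B (2010-2017)
ranks = {
    "committed":    [1, 1, 2, 1, 1, 1, 2, 1],
    "limit":        [2, 2, 1, 2, 2, 3, 1, 2],
    "gdp_deflator": [4, 6, 6, 6, 6, 5, 4, 5],
}

# Average rank per feature; lower means more important on average
avg_rank = {f: sum(r) / len(r) for f, r in ranks.items()}
print(sorted(avg_rank, key=avg_rank.get))  # → ['committed', 'limit', 'gdp_deflator']
```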
Feature: Description

Dependent Variables
Recovery Rate (RR): Proportion of EaD that was recovered (in decimal percentage)
Credit Conversion Factor (CCF): Proportion of the currently undrawn amount of the committed limit that is expected to be drawn down at the default date (in decimal percentage)

Borrower-specific Features
Americas: 1 if the country of residence is 'Americas', and 0 otherwise
Asset Class: Divided into small and medium-sized enterprises (SME) and large corporates (LC)
Currency: Denomination on which the credit line is based: U.S. dollar, Euro, and others
Industry Code: Divided into services; finance, insurance, and real estate (FIRE); commerce; manufacturing; construction; and others
No. Credit Lines: Number of credit lines of the borrower that defaulted before or on the event date (frequency)
Operating: 1 if the entity is an operating company (income from sales to third parties), and 0 otherwise
Rating: Borrower risk rating divided into investment grade (IG), non-investment grade (non-IG), no rating (none), and unknown rating

Credit Line-specific Features
Credit Line Age: Age of the credit line between origination and event date (in months)
Limit: Limit advised to the obligor, or the bank's accepted share of the syndicate (in thousand EUR)
Limit Increase: 1 if the limit was increased between origination and event date, and 0 otherwise
Maturity: 1 if the maturity of the credit line is ≥ 1 year, and 0 otherwise
Outstanding Amount: Amount of the current principal outstanding plus past-due interest of the credit line (in thousand EUR)
Syndication: 1 if the credit line is part of a syndication, and 0 otherwise
Undrawn Amount: Amount of the advised limit not utilized at the event date (in thousand EUR)
Undrawn Percentage: Percentage of the advised limit not utilized at the event date (in decimal percentage)
Utilization Rate: Percentage of the limit that is drawn at the event date (in decimal percentage)
Utilization Rate High: 1 if the utilization rate is ≥ 0.95, and 0 otherwise

Security-related Features
Collateral: 1 if the credit line has underlying protection in the form of collateral or security, and 0 otherwise
No. Collaterals: Number of collaterals protecting the credit line (frequency)
Committed: 1 if there is a contractual obligation for the bank to make the funds available when the facility is drawn by the obligor, and 0 otherwise
Guarantee: 1 if the credit line has underlying protection in the form of a guarantee, a credit default swap, or support from a key party, and 0 otherwise
No. Guarantors: Number of guarantors protecting the credit line (frequency)
Seniority: 1 if the seniority code is 'super senior', and 0 otherwise
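The two dependent variables follow directly from the quantities defined in this table. A minimal sketch, assuming the usual convention that the CCF relates the additional drawdown until default to the undrawn limit at the observation date (the toy figures are hypothetical, not from the dataset):

```python
def ccf(ead, outstanding_obs, limit_obs):
    """Credit conversion factor: share of the undrawn limit at the
    observation date that is additionally drawn by the default date."""
    undrawn = limit_obs - outstanding_obs
    if undrawn <= 0:
        raise ValueError("CCF is undefined without an undrawn limit")
    return (ead - outstanding_obs) / undrawn

def recovery_rate(recovered, ead):
    """Proportion of EaD that was recovered (in decimal percentage)."""
    return recovered / ead

# Toy credit line: limit 1000, outstanding 600, EaD 900, recovered 450
print(round(ccf(900, 600, 1000), 3))      # → 0.75
print(round(recovery_rate(450, 900), 2))  # → 0.5
```

The guard against a non-positive undrawn amount mirrors why the table carries a separate Utilization Rate High indicator: near-fully drawn lines make the CCF denominator degenerate.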
References
Basel Committee on Banking Supervision. (2019). CRE 30, CRE 31, and CRE 36.
Bellotti, A., Brigo, D., Gambetti, P., & Vrins, F. (2021). Forecasting Recovery Rates on Non-
performing Loans with Machine Learning. International Journal of Forecasting, 37, 428–
444.
Bernardi, M., & Catania, L. (2018). The Model Confidence Set Package for R. International
Journal of Computational Economics and Econometrics, 8, 144–158.
Betz, J., Kellner, R., & Rösch, D. (2018). Systematic Effects among Loss Given Defaults and
their Implications on Downturn Estimation. European Journal of Operational Research,
271, 1113–1144.
Breiman, L. (2001). Random Forests. Machine Learning, 45, 5–32.
Caputo, B., Sim, K., Furesjö, F., & Smola, A. (2002). Appearance-based Object Recognition Using SVMs: Which Kernel Should I Use? Proceedings of the NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision.
European Banking Authority (2016). Final Draft Regulatory Technical Standards on the
Specification of the Assessment Methodology for Competent Authorities regarding
Compliance of an Institution with the Requirements to use the IRB Approach in
Accordance with Articles 144(2), 173(3) and 180(3)(b) of Regulation (EU) No 575/2013.
European Central Bank (2019). Instructions for Reporting the Validation Results of Internal
Models: IRB Pillar I Models for Credit Risk.
Greenwell, B. (2020). fastshap: Fast Approximate Shapley Values. Retrieved from
https://cran.r-project.org/package=fastshap
Greenwell, B., Boehmke, B., Cunningham, J., & Developers, G. (2020). gbm: Generalized
Boosted Regression Models. Retrieved from https://github.com/gbm-developers/gbm.
Gruen, B., Leisch, F., Sarkar, D., Mortier, F., & Picard, N. (2020). flexmix: Flexible Mixture
Modeling. Retrieved from https://cran.r-project.org/web/packages/flexmix/
Gürtler, M., Hibbeln, M., & Usselmann, P. (2018). Exposure at Default Modeling: A Theoretical and Empirical Assessment of Estimation Approaches and Parameter Choice. Journal of Banking & Finance, 91, 176–188.
Hansen, P. R., Lunde, A., & Nason, J. M. (2011). The Model Confidence Set. Econometrica,
79, 453–497.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. Springer Science & Business Media.
Hofner, B., & Hothorn, T. (2021). Stabs: Stability Selection with Error Control. Retrieved from
https://CRAN.R-project.org/package=stabs.
Kuhn, M. (2008). Building Predictive Models in R Using the Caret Package. Journal of
Statistical Software, 28, 1–26.
Kuhn, M., & Johnson, K. (2013). Applied Predictive Modeling. Springer.
Meinshausen, N., & Bühlmann, P. (2010). Stability Selection. Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 72, 417–473.
Min, A., Scherer, M., Schischke, A., & Zagst, R. (2020). Modeling Recovery Rates of Small-
and Medium-Sized Entities in the US. Mathematics, 8, 1856.
Nazemi, A., & Fabozzi, F. J. (2018). Macroeconomic Variable Selection for Creditor Recovery
Rates. Journal of Banking & Finance, 89, 14–25.
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized Additive Models for Location, Scale
and Shape. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54,
507–554.
Rigby, R. A., Stasinopoulos, D. M., Heller, G. Z., & De Bastiani, F. (2019). Distributions for
Modeling Location, Scale, and Shape. CRC Press.
Stasinopoulos, D. M., Rigby, R. A., & Others. (2007). Generalized Additive Models for
Location, Scale and Shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46.
Tong, E. N., Mues, C., Brown, I., & Thomas, L. C. (2016). Exposure at Default Models With
and Without the Credit Conversion Factor. European Journal of Operational Research,
252, 910–920.
Yuan, M., & Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped
Variables. Journal of the Royal Statistical Society: Series B, 68, 49–67.
Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American
Statistical Association, 101, 1418–1429.