
Multiperiod Corporate Default Prediction Through Neural Parametric Family Learning


Wei-Lun Luo∗ Yu-Ming Lu∗ Jheng-Hong Yang∗ Jin-Chuan Duan†
Chuan-Ju Wang∗

Abstract

Default analysis plays an essential role in financial markets because it narrows the information gap between borrowers and lenders. Of late, machine learning-based methods have found their way into default analysis and typically view it as a risk classification task by slotting obligors into risk categories. The quality of such an approach is assessed by its prediction accuracy in risk rankings. Rarely considered but important are issues concerning the predicted numbers of default occurrences and the term structure of cumulative default probabilities, on which classification tools are by nature silent. In this paper, we depart from the typical practice of risk classification and focus on employing machine learning to estimate the term structure of cumulative default probabilities—a structured estimation that contains default probabilities from short-term to long-term periods. To this end, we formulate the task as a problem of parametric family learning via a neural model consisting of two segments: parameter generation and parametric family determination. The proposed neural approach offers added flexibility in improving long-term default predictions. Moreover, the carefully designed model successfully maintains vital economic characteristics of its predictions. Experiments on a US corporate default dataset show that our approach achieves measurably better prediction performance both in risk classification and in matching the predicted numbers of default occurrences with the actual ones.

1 Introduction

Credit risk is inherent in all financial markets. For example, a lender takes on credit risk when it initiates or extends a trade to an obligor: it is exposed to the borrower defaulting on its debt or failing to meet its obligations in accordance with the agreed terms [26]. Naturally, the Basel Accords, a global regulatory framework for banks, treat credit risk as one of the three primary risks faced by banks. As the financial crisis of 2008 demonstrated, a comprehensive understanding of credit risk and the ability to manage it are vital to our efforts to avoid future catastrophic consequences for society.

Effective credit risk management entails many components, and measuring default risk in terms of structured probability is key to all of them. As companies can have dissimilar short-term and long-term credit risk profiles attributable to their different debt structures and other characteristics, a good default prediction model should provide a term structure of cumulative default probabilities (CDP, henceforth) [22]. We take Figure 1(a) as an example to illustrate the concept of the CDP term structure, for which it is intuitive and reasonable to require the CDP of a company at any time point to be an increasing function of the prediction horizon. The task of providing the CDP as a function of the prediction horizon is called multiperiod default prediction [15, 14]. Traditional credit ratings provided by credit rating agencies, which offer only loosely defined short-term and long-term credit quality assessments, are obviously deficient in this regard. Although there is a long history of academic literature on default analysis in finance, economics, and statistics, largely relying on statistical/econometric models [3, 4, 5, 8, 16], most such prior art fails to address multiperiod default prediction. For example, quantitative statistical models such as logit and probit regression [29, 42] are ill-suited for modeling the CDP term structure: counterintuitively, such a model may generate a 3-month default probability for a company that is higher than its 6-month default probability, which undermines its credibility.

On the other hand, in this era of big data, banks and P2P lending platforms have shown great interest in exploring the potential of machine learning for credit analysis [6, 36]. Machine learning methods usually require rather more relaxed assumptions than statistical models, making them suitable for almost any kind of problem in the real world. However, thus far, most machine learning approaches in the literature, like the earlier statistical approaches, have failed to address multiperiod default prediction, as they view default prediction as a naive single-period classification exercise for use in risk ranking obligors [38, 39, 20, 30, 2, 32, 33, 17]. They offer neither the CDP nor consistency in the term structure [24], which are, however, indispensable factors for default analysis in practice.

∗ Research Center for Information Technology Innovation, Academia Sinica, Taiwan. (awilliea@citi.sinica.edu.tw, d08922008@ntu.edu.tw, justram.ep96@g2.nctu.edu.tw, cjwang@citi.sinica.edu.tw)
† Asian Institute of Digital Finance, National University of Singapore. (bizdjc@nus.edu.sg)

In prior art, the forward intensity model (abbreviated as FIM henceforth) is considered the state-of-the-art statistical model for multiperiod default prediction. This approach has been demonstrated in [14, 10] and on the live corporate default prediction platform of the Credit Research Initiative (CRI).¹ FIM demonstrates satisfactory performance both in risk-ranking accuracy and in matching reasonably well the actual numbers of default occurrences of large corporate pools in many economies over a long period. Moreover, the embedded structural assumptions of the Poisson process in FIM make it suited for generating consistent CDP term structures. However, these assumptions also restrict the model from accommodating the complexity of "big" data, and thus hinder the model's performance. For example, Figure 1(b) shows that although FIM yields an excellent accuracy ratio (exceeding 90%) for 1-month default prediction, its performance deteriorates rapidly when the prediction horizon is extended. Figure 1(c) demonstrates the challenge faced by the model when it comes to default prediction for a long prediction horizon (e.g., 60 months). The predicted numbers of default occurrences on a pool of firms (the red curves in the figures) clearly deviate from the actual numbers for the long prediction horizon.

Figure 1: Empirical results of the state-of-the-art FIM model evaluated on the US corporate dataset (see Section 5.4 for more details): (a) term structure of cumulative default probability (CDP); (b) cumulative accuracy profile (CAP) for different prediction horizons; (c) aggregate default distribution for prediction horizons of 1 month and 60 months (legend values in panel (c): FIM AR = 0.94 at 1 month and 0.52 at 60 months). Note that the results of FIM in the figure are provided by CRI.

To take the above factors into account, we formulate multiperiod default prediction as a parametric family learning problem, for which we propose a neural model that incorporates two separate units. The first is a parameter generation unit for deciding the relations between the input features and the family parameters, and the second is a family mapping unit for generating valid CDP term structures. In this paper, the parameter generation unit is built on a variant of the recurrent neural network (RNN) module to capture the sequential dynamics behind input features moving along the time dimension. In addition, the family mapping unit is based on a carefully designed network structure that simultaneously generates default probabilities for different prediction horizons with consistent CDP term structures. To evaluate the proposed approach, we conduct an empirical analysis on a US corporate default dataset [14].² Our empirical results show that the proposed approach achieves significantly better prediction performance both in terms of the accuracy of risk ranking between firms and in matching the model-generated probabilities with the actual numbers of default occurrences over the sample period, especially for long prediction horizons.

To further demonstrate the superiority of the proposed method for long prediction horizons, we detail the limitations of traditional methods as follows. FIM and other traditional default models ground themselves in rigorous statistical assumptions, where the short-term and long-term models share one functional form that lies within a well-known parametric family. For example, FIM uses the exponential (Poisson) distribution to model the default time (event, respectively); although such a distribution yields satisfactory performance in short-term prediction, it clearly fails in the long-term case, which may be because this distribution is suitable only for modeling the short-term default behavior of companies. However, it is difficult or even infeasible to explicitly find a distribution, or an ensemble of several distributions, that better fits both cases.

¹ CRI, founded in 2009 at the Risk Management Institute of National University of Singapore, is a non-profit undertaking offering credit ratings for exchange-listed companies around the world.
² One can obtain the dataset by registering an account on the CRI website https://nuscri.org/en/home/.

In contrast to traditional statistical methods, the proposed neural parametric family learning is flexible enough to determine the functional form (i.e., the distribution family) of the default time endogenously; in other words, in our framework, the distribution family is decided on the basis of a large amount of data, which loosens the exponential assumption and thus better describes companies' default behavior not only in the short term but also in the long term.

In sum, the proposed approach advances the state of the art in default modeling along three dimensions:

1. The proposed model successfully uses neural networks for multiperiod default prediction and goes beyond mere risk classification.

2. The model's inherent design prevents the estimated default probabilities from violating term structure consistency, i.e., a monotonically increasing default probability with respect to the prediction horizon, which is vital for default analysis in practice.

3. The model greatly improves performance both in terms of accuracy ratios and in matching the actual numbers of default occurrences for long prediction horizons, as compared to the state-of-the-art FIM.

Furthermore, to the best of our knowledge, this paper is the first work to frame this class of problems, which require consistent probability structures, for exploration with neural networks, thus providing vision for research and applications involving such a constraint.

2 Related Work

Default analysis should take into account the type of obligor; for example, corporations, sovereigns, local governments, supranationals, and individuals differ greatly in their characteristics, and thus the attributes available for default analysis differ fundamentally. This paper focuses on corporate defaults. The mainstream literature on corporate default prediction operates on the basis of stochastic models and relies on classical statistical tools. The first generation of such models combines ratio analysis with discriminant analysis and provides a credit score for each company as a kind of ranking metric [4, 5, 3], which is, however, too rough for default analysis. Logit and probit regression approaches predict a firm's likelihood of default in the next period [29, 42], but neither addresses multiperiod predictions. Poisson intensity models later became the main trend in solving the default prediction problem; they specify a stochastic formulation of the point process for default [15, 14].

Among the statistical approaches, Poisson intensity models reflect the fact that default should be modeled with a naturally dynamic, dependent structure. Poisson intensity models are thus ideally suited for generating self-consistent default predictions along the term-structure dimension. Duffie et al. (2007) [15] adopt a spot intensity approach, meaning that they deploy a universal intensity function of some covariates for all firms at all time points, for which, however, a high-dimensional auxiliary model for the covariates is needed when considering multiperiod default prediction. Duan et al. (2012) [14] propose the FIM to approach the problem with a family of forward intensity functions, each of which corresponds to a specific forward period. In short, the covariate values are the same, but the forward intensity functions differ across periods, thus negating the need to build an extremely high-dimensional auxiliary model for the covariates. Such an approach establishes a fundamental decomposability property that permits the parameters of each forward intensity function to be estimated separately, and thus to be practically implemented on a large scale by the CRI [10].

Apart from the aforementioned statistical methods, machine learning methods have also been applied in default analysis [1, 27]. Most machine learning methods focus on algorithms to identify the complex function underlying the relationship between the covariates and the outcome (e.g., survival or default). Well-known traditional machine learning methods such as support vector machines (SVM) and random forests (RF) have been applied to various problems in finance and particularly in default prediction [12, 18, 37]. Most machine learning approaches see default prediction as a naive single-period classification exercise for use in risk ranking obligors [20, 30, 2, 38, 39, 17] instead of as multiperiod prediction.

3 Problem Definition

3.1 Multiperiod Default Prediction  Measuring default risk for entities in terms of probability is the key to credit risk management. Here we formulate this problem from the perspective of time to default. We first define the prediction horizon as follows.

Definition 1. (Prediction Horizon) At time t, the prediction horizon τ defines a forward-looking time period. The default prediction model estimates the likelihood of a default event occurring within the time period (t, t + τ].

At time t, let the random variable T_i be the time to default of company i with support (0, ∞); that is, T_i = t_i denotes that company i defaults at time t + t_i. Note that t is the starting time point and t_i is the prediction horizon defined above. Then, the joint (cumulative) distribution function of the time to default for n companies is defined as

F_{T_1,…,T_n}(t_1, t_2, …, t_n) = Pr(T_1 ≤ t_1, T_2 ≤ t_2, …, T_n ≤ t_n),

where n denotes the number of companies that survive at time t. For company i, the default probability for prediction horizon t_i, i.e., the marginal (cumulative) distribution function, is written as

F_{T_i}(t_i) = lim_{t_j→∞, ∀j≠i} F_{T_1,…,T_n}(t_1, …, t_j, …, t_n) = Pr(T_i ≤ t_i).

Note that when τ → ∞, we have

(3.1)  lim_{τ→∞} F_{T_i}(τ) = 1

for i = 1, …, n.

Definition 2. (Multiperiod Default Prediction) The time-t default term structure for a company i is defined as its marginal distribution function of the time to default, F_{t,T_i}(t_i), at time t. The task of multiperiod default prediction is to generate the default term structure for each active firm.

Note that to be a valid marginal distribution function, any time-t default term structure should be monotonically increasing for each company; that is,

∀ s, ℓ such that s < ℓ, one has F_{t,T_i}(s) ≤ F_{t,T_i}(ℓ).

This constraint is similar to the problem of quantile crossing in quantile regression [35].

3.2 Marginal Distribution Parameterization  In this paper, we consider a parametric family of probability distributions F_Θ that includes the marginal distribution functions for the time to default of all companies at all time points. For simplicity, we assume a company can default only once;³ we thus have

(3.2)  F_Θ : ℝ⁺ → [0, 1],

where the domain of F_Θ, ℝ⁺, denotes that default can happen only in the future from time point t. In other words, at each time point t, each company i corresponds to a parameter set θ_{t,i} ∈ Θ and therefore to a distribution function

F_{θ_{t,i}} = F_{t,T_i}(t_i) : ℝ⁺ → [0, 1],

for i = 1, …, n.

Illustrating a specific distribution function in the family F_Θ actually involves two aspects: 1) determining the form of the parametric family, and 2) determining the parameters for the family. For the former, we propose a neural model N_F to describe the form of the parametric family. For the latter, we consider the case in which the parameter set Θ for the decided parametric family F is governed by covariates from each company. Specifically, we assume that at time point t each company i corresponds to a covariate matrix X_{t,i} ∈ X, where X denotes the set of matrices X_{t,i} for all companies at all time points. Note that we here use a matrix to include the cases in which covariates observed at different time points are used as input features (see Section 4.1 for more details). We then build the connection between different samples (corresponding to different covariate matrices) and the parameters with the use of a function

(3.3)  g : X → Θ.

Above, g(·) can be any function that maps X_{t,i} to a θ_{t,i}. In this paper, we propose using recurrent neural models for N_Θ to approximate g(·) so as to better capture the sequential dynamics behind input features moving along the time dimension. Note that for improved presentation, we describe the neural models N_F and N_Θ in reverse order in Section 4.

Remark 1. This parameterization is further elaborated via a link to the state-of-the-art statistical default prediction model, FIM, as follows. FIM considers default a rare event and assumes Eq. (3.2) to be a member of the exponential distribution parameterized by the intensity λ, which is the probability distribution describing the time between events in a Poisson point process. Moreover, in FIM, the parameters for company i at time t, λ_{t,i}, are governed by covariates (the same as those used in our model), and the mapping g(·) in Eq. (3.3) is assumed to be a linear model with an exponential activation function. In this paper, we use neural networks to remove the limitations resulting from both the exponential distribution and the linear model, which successfully advances the model fit with significant performance improvements.

³ Framing a corporate default as a singular event is a conventional modeling choice in most of the default analysis literature [11, 13, 14, 15, 16].
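To make the contrast in Remark 1 concrete, the sketch below illustrates the FIM-style parameterization in its simplest, constant-intensity form: a linear model with an exponential activation maps covariates to an intensity λ, and the exponential family then fixes the entire CDP term structure as F(τ) = 1 − exp(−λτ). The covariate values and coefficients are made up for illustration; this is not the full forward-intensity construction of [14].

```python
import numpy as np

# Hypothetical covariate vector for one firm at time t (values made up).
x_t = np.array([0.8, -1.2, 0.3, 2.1])

# FIM-style parameter generation: linear model + exponential activation,
# with illustrative (not estimated) coefficients.
beta = np.array([0.1, -0.4, 0.2, 0.05])
beta0 = -6.0
lam = np.exp(beta0 + beta @ x_t)          # monthly default intensity lambda_{t,i}

# With the exponential family fixed, the whole CDP term structure follows
# from the single parameter lambda: F(tau) = 1 - exp(-lambda * tau).
horizons = np.array([1, 3, 6, 12, 24, 36, 48, 60])   # months
cdp = 1.0 - np.exp(-lam * horizons)

print(np.round(cdp, 5))   # increasing by construction, but its shape is
                          # dictated entirely by the exponential form
```

The neural approach introduced next replaces both the linear map and the fixed exponential form.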
4 Network Architecture

4.1 Networks for Parametric Family Parameter Generation  We first introduce the neural model N_Θ used to approximate g(·) in Eq. (3.3). Similar to many tasks dealing with financial time-series data [35, 41, 23], the parameter θ_{t,i} is likely to depend on past series. To model this characteristic, we consider recurrent neural architectures such as long short-term memory (LSTM) and gated recurrent units (GRU) [19, 9] for the parameter generation unit N_Θ.

Define x_{t,i} as the d-dimensional covariate vector observed at time t for company i. We then take the δ − 1 lagged observations x_{t−1,i}, x_{t−2,i}, …, x_{t−δ+1,i} together with x_{t,i} to construct a δ-length feature-vector sequence and apply δ RNN units as our front-end unit on the sequence. Specifically, we take a snapshot of the d-dimensional feature vector observed at time t for each company i, along with the set of time-lagged observations at t − δ + 1, …, t − 1, to form the input covariate matrix, i.e., X_{t,i} = [x_{t−δ+1,i}, …, x_{t−1,i}, x_{t,i}] ∈ ℝ^{d×δ} (corresponding to the domain of g(·) in Eq. (3.3)), which serves as the input of the module. With the input matrix X_{t,i}, we then model the parameter θ_{t,i}, assumed to be p-dimensional, as the hidden state of the last RNN unit:

(4.1)  θ_{t,i} = ĝ(X_{t,i}) = RNN_Λ(X_{t,i}),

where Λ is the set of RNN parameters. Note that for notational simplicity, the notation RNN above stands for any variant model in the class of artificial neural networks whose connections between nodes form a directed graph along a temporal sequence.
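As a concrete reference, the following PyTorch-style sketch shows one possible parameter generation unit N_Θ along the lines of Eq. (4.1): an LSTM consumes the δ-length covariate sequence X_{t,i}, and the hidden state of the last step is taken as the p-dimensional parameter vector θ_{t,i}. The layer choice, sizes, and tensor shapes are illustrative assumptions, not the authors' released configuration.

```python
import torch
import torch.nn as nn

class ParameterGenerationUnit(nn.Module):
    """N_Theta: maps a delta-length covariate sequence to theta in R^p (sketch)."""
    def __init__(self, d: int, p: int):
        super().__init__()
        # batch_first=True: input is (batch, delta, d)
        self.rnn = nn.LSTM(input_size=d, hidden_size=p, batch_first=True)

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (batch, delta, d) = [x_{t-delta+1}, ..., x_{t-1}, x_t]
        _, (h_last, _) = self.rnn(x_seq)      # h_last: (1, batch, p)
        theta = h_last.squeeze(0)             # theta_{t,i}: (batch, p)
        return theta

# Toy usage with assumed sizes: d = 14 covariates, delta = 12 months, p = 64.
if __name__ == "__main__":
    gen = ParameterGenerationUnit(d=14, p=64)
    X = torch.randn(32, 12, 14)               # a mini-batch of covariate matrices
    theta = gen(X)
    print(theta.shape)                         # torch.Size([32, 64])
```

Swapping nn.LSTM for nn.GRU (or a plain MLP over the flattened matrix) gives the other parameter generation variants compared in Section 5.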
4.2 Networks for Parametric Family Determination  This section introduces the details of the neural model N_F. To learn the form of a parametric family F that includes the marginal distribution functions for the time to default of all companies at all time points, we select m time points τ_1, τ_2, …, τ_m to discretize the support of T_i, where each τ_ℓ ∈ ℝ⁺ and τ_j < τ_k if j < k. For ease of presentation, we omit the subscripts t and i for different time points and companies, respectively, in the following description.

Specifically, inspired by earlier studies [7] that focus on delivering cumulative probabilistic representations satisfying inequality events in terms of the monotonicity described in Eq. (3.1), we propose a neural model with a differentiable objective function that satisfies the indispensable properties of multiperiod default prediction as defined in Definition 2. First, we adopt a hidden layer of perceptrons to learn the distribution; that is,

(4.2)  y = ϕ(Wθ + b),

where θ ∈ ℝ^p is the hidden state of the last RNN unit (i.e., the output of the first step) denoting the parameters of the family in Eq. (4.1), W ∈ ℝ^{(m+1)×p} is the weight matrix, b ∈ ℝ^{m+1} is the bias vector, and ϕ is the softmax function. Note that the output vector y in Eq. (4.2) is an (m + 1)-dimensional vector specifically designed for the subsequent cumulative default probability calculation. Second, to maintain the monotonicity of CDP term structures, we estimate the discretized marginal distribution functions as

(4.4)  F̂(τ_ℓ) = Σ_{k=1}^{ℓ} y[k],  for ℓ = 1, 2, …, m + 1,

where y[k] denotes the k-th component of y. Note that we add an additional prediction horizon τ_{m+1} → ∞ (corresponding to the additional node y[m + 1]) to represent the case that every company will default in the infinite future.

We now introduce the objective function in our model. Before doing so, we present the ground truth function used to describe the default events. In the real-world scenario, the observations of the default problem are simple: at a given time point, a company is either alive or bankrupt. We therefore use the shifted Heaviside step function as the ground truth function:

(4.5)  H_{t,i}(s) = 1 if s ≥ ζ_i, and 0 if s < ζ_i,

where ζ_i denotes the time to default from time t for company i.

To approximate the ground truth function in Eq. (4.5), we minimize the following objective function with backpropagation [31]:

L = Σ_{t∈T} Σ_{i=1}^{n_t} Σ_{k=1}^{m+1} CrossEntropy(F̂_{t,i}(τ_k), H_{t,i}(τ_k)),

where T and n_t denote the set of chosen time points and the number of companies that survive at time t in the training data, respectively. Note that we here restore the subscripts t and i to better depict the above objective function.
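The softmax-then-cumulative-sum construction in Eqs. (4.2) and (4.4) is what guarantees a valid, monotonically increasing CDP term structure: the softmax outputs are non-negative and sum to one, so their running sums are non-decreasing and end at one, consistent with Eq. (3.1). A minimal PyTorch-style sketch of this family mapping unit and a binary cross-entropy reading of the objective follows; shapes and label encoding are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class FamilyMappingUnit(nn.Module):
    """N_F: maps theta in R^p to a discretized CDP term structure (sketch)."""
    def __init__(self, p: int, m: int):
        super().__init__()
        self.linear = nn.Linear(p, m + 1)                     # W theta + b, Eq. (4.2)

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        y = torch.softmax(self.linear(theta), dim=-1)         # non-negative, sums to 1
        cdp = torch.cumsum(y, dim=-1).clamp(max=1.0)          # Eq. (4.4); clamp guards float round-off
        return cdp                                            # (batch, m + 1), monotone, ends at 1

def term_structure_loss(cdp: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between predicted F_hat(tau_k) and Heaviside labels
    H(tau_k) in {0, 1}, summed over the m + 1 horizons (sketch of the objective)."""
    return nn.functional.binary_cross_entropy(cdp, h, reduction="sum")

# Toy usage with assumed sizes: p = 64, m = 8 horizons plus the "infinite" bucket.
if __name__ == "__main__":
    fam = FamilyMappingUnit(p=64, m=8)
    theta = torch.randn(32, 64)
    cdp = fam(theta)
    assert torch.all(cdp[:, 1:] >= cdp[:, :-1])               # term-structure consistency
    labels = torch.zeros(32, 9); labels[:, -1] = 1.0          # e.g., firms surviving all finite horizons
    print(term_structure_loss(cdp, labels).item())
```

Chaining ParameterGenerationUnit and FamilyMappingUnit end to end yields the full model, trained jointly on the objective above.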
5 Experiments

5.1 Experimental Settings  We conducted experiments on a real-world default and bankruptcy dataset provided by CRI, which is publicly available and contains 1.5 million monthly samples of US public companies over the period from January 1990 to December 2017. For each company in a specific month, there are 14 covariates, of which 12—2 common and 10 firm-specific factors—are also used in [14]; the remaining two are related to current assets/current liabilities for non-financial firms.⁴ The three corresponding event labels are 0 (alive), 1 (default), and 2 (other exit). Note that as these labels indicate the status of a company in any given month, they can be used directly for one-month prediction.

⁴ The dataset and detailed definitions of the covariates can be found in [14]; our dataset also aligns with CRI's definitions up to the end of 2017 (see the technical report [10] for more details).

For prediction horizons exceeding one month, we construct the event labels for each horizon cumulatively; that is, for a given company, if a default (or other exit) event occurs in a certain future period, the corresponding cumulative event label is set to 1 (or 2, respectively) for all prediction horizons thereafter, which is consistent with the idea of our ground truth function in Eq. (4.5).⁵ In this paper, as we attempt to estimate the marginal distribution functions of the time to default (i.e., the default term structures) for all companies, we combine labels 0 and 2 and focus on the problem of interest in a one-vs-all fashion.

⁵ We set m = 8 and {τ_1, τ_2, …, τ_8} = {1, 3, 6, 12, 24, 36, 48, 60}, in months.
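To make the cumulative labeling concrete, the sketch below builds, for one firm, the per-horizon binary labels used in this one-vs-all setup: once a default is observed ζ months ahead, every horizon τ_k ≥ ζ is labeled 1, while "alive" and "other exit" months fall into the non-default class. The helper function and the example event sequence are ours, for illustration only.

```python
# Horizons (in months) used in the paper: tau_1 ... tau_8.
HORIZONS = [1, 3, 6, 12, 24, 36, 48, 60]

def cumulative_default_labels(monthly_status, horizons=HORIZONS):
    """monthly_status[j] is the firm's status j+1 months after time t:
    0 = alive, 1 = default, 2 = other exit.  Returns one binary label per
    horizon: 1 if a default occurs within (t, t + tau_k], else 0
    (labels 0 and 2 are merged into the non-default class)."""
    # Month index (1-based) of the first default, if any.
    first_default = next(
        (j + 1 for j, s in enumerate(monthly_status) if s == 1), None
    )
    labels = []
    for tau in horizons:
        labels.append(1 if first_default is not None and first_default <= tau else 0)
    return labels

# Example: a firm that defaults 18 months after time t (made-up sequence).
status = [0] * 17 + [1] + [0] * 42          # 60 future months
print(cumulative_default_labels(status))     # [0, 0, 0, 0, 1, 1, 1, 1]
```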
To evaluate our model, we used two different settings when splitting the data into training and testing sets. The first setting, referred to as the "cross-sectional experiment," mixes the 1.5 million monthly samples and separates them randomly into thirteen folds. Note that in this setting, data samples from different periods are mixed, which is a commonly adopted approach in the literature to attest model capacity and compare model performance. The second setting, referred to as the "cross-time experiment," uses a rolling-window scheme along the time axis. Note that this is a common and practical setting for scenarios involving time effects. Specifically, the dataset was divided along the time axis into thirteen folds of training and testing sets using a one-year step size; each fold contained ten years of monthly samples for training and the subsequent year of samples for testing. For example, the first (second) fold contained firm-month samples from 1990 to 1999 (from 1991 to 2000, respectively) for training and those in year 2000 (year 2001, respectively) for testing. Note that as our longest prediction horizon was 60 months, for the last fold, the prediction labels involved events occurring before the end of 2017. It is worth noting that for a prediction task involving data across such a long time period, the data distributions are by nature extremely volatile across different periods; the purpose of the second experimental setting is therefore to evaluate the model's ability to react to new, incoming data. For both settings, the hyperparameters were selected on the model trained on the first fold with respect to the accuracy ratio from the cumulative accuracy profile [11, 34]; they were then applied to all models for the remaining folds. The reported performance metrics are summarized as the mean of the results over the thirteen folds.
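As an illustration of the cross-time protocol described above, the following sketch enumerates the thirteen rolling ten-year training windows and their one-year test windows; the fold boundaries follow the example in the text (1990–1999 train / 2000 test for the first fold), and the data-loading details are omitted since they depend on the CRI dataset format.

```python
# Rolling-window folds for the cross-time experiment (sketch).
FIRST_TRAIN_START = 1990
N_FOLDS = 13
TRAIN_YEARS = 10

def cross_time_folds():
    folds = []
    for k in range(N_FOLDS):
        train_start = FIRST_TRAIN_START + k
        train_end = train_start + TRAIN_YEARS - 1     # inclusive
        test_year = train_end + 1
        folds.append(((train_start, train_end), test_year))
    return folds

for (train_range, test_year) in cross_time_folds():
    print(f"train {train_range[0]}-{train_range[1]}  ->  test {test_year}")
# First fold: train 1990-1999 -> test 2000
# Last fold:  train 2002-2011 -> test 2012
```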
5.2 Implementation  We compared our approach with the FIM model, the state-of-the-art statistical model for multiperiod default prediction. To ensure a fair comparison, we implemented the likelihood function of FIM from [14] and calculated the CDP, but used the Adam optimizer [25] and a batch normalization layer [21] for the input covariates to reflect the settings of our neural model. A two-layer multilayer perceptron (MLP) with the sigmoid activation function was also adopted as a parameter generation unit to demonstrate the effectiveness of the RNN module in capturing the temporal dynamics underlying the covariates. For the variants of the RNN module, we chose GRU and LSTM. Thus, the following experiments include MLP, GRU, and LSTM as the selected parameter generation units.

Note, however, that the overparameterized nature of neural networks can come with poorer generalization. As the distributions of default events and econometric covariates vary over time (i.e., they are non-stationary processes), it is important to address overfitting in the overparameterized neural model. We therefore used dropout [40] in our neural units and weight decay regularization in the Adam optimizer [25, 28]. The hyperparameters were tuned over the following sets: the number of MLP/LSTM/GRU hidden units in {32, 64, 128}, the learning rate in {10⁻³, 10⁻⁴, 10⁻⁵}, the dropout rate in {0.25, 0.5, 0.75}, and the weight decay in {10⁻⁴, 10⁻⁵, 10⁻⁶}.

5.3 Quantitative Evaluation  To assess the effectiveness of the overall fit of the proposed model, we consider two quantitative aspects: the discriminatory power of the risk ranking among companies, and how well the estimated default occurrences match the actual ones. First, to evaluate model performance in terms of risk ranking, we employed the cumulative accuracy profile (CAP) and its associated accuracy ratio (AR), both of which examine a model's performance based on the risk ranking implied by companies' default probabilities. The accuracy ratio is a summary measure of the discriminatory power of a classification model based on its CAP curve. A good model should provide an accuracy ratio close to one, meaning that most of the companies that default in reality receive higher model-estimated default probabilities. Additionally, the AR is related to the area under the receiver operating characteristic (ROC) curve by AR = 2AUC − 1.⁶

⁶ Both CAP and ROC are commonly applied by banks and regulators to analyze the discriminatory ability of rating systems that evaluate credit risk [11, 34].
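Because of the identity AR = 2·AUC − 1 mentioned above, the accuracy ratio can be obtained directly from a standard AUC routine; a minimal sketch using scikit-learn (with made-up labels and scores) is shown below.

```python
from sklearn.metrics import roc_auc_score

def accuracy_ratio(defaulted, predicted_pd):
    """Accuracy ratio (AR) from the CAP/ROC relation AR = 2 * AUC - 1.
    `defaulted` holds 1 for firms that actually defaulted within the horizon,
    0 otherwise; `predicted_pd` holds the model's default probabilities."""
    return 2.0 * roc_auc_score(defaulted, predicted_pd) - 1.0

# Toy example (made-up numbers): a perfect ranking gives AR = 1.
print(accuracy_ratio([0, 0, 1, 1], [0.01, 0.05, 0.40, 0.90]))   # 1.0
```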

For the second aspect, we followed [14] in employing the convolution-based default aggregation algorithm [13] to estimate the number of defaults from the predicted probabilities for a given prediction horizon (1, 3, …, or 60 months). (Recall that Figure 1(c) plots this comparison for the 1- and 60-month prediction horizons, where the bars depict the actual numbers of defaults and the lines correspond to the model estimates.) To measure the distance between the two distributions of the number of defaults, we use the root mean square normalized error (RMSNE) to compare our estimate D̂_i with the observed (actual) number of default occurrences D_i at each month i over the T starting time points of the whole sample period. Specifically, the RMSNE is the RMSE with each error term normalized by D_i:⁷

RMSNE = √( (1/T) Σ_{i=1}^{T} ( (D̂_i − D_i) / D_i )² ).

⁷ Note that while the RMSE focuses more on monthly instances with large default numbers (e.g., collective defaults during the 2008 financial crisis), the RMSNE evaluates overall performance by treating instances with fewer defaults fairly.
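One way to obtain both an estimated default count D̂_i and the RMSNE is sketched below: treating firms' horizon-τ default probabilities as Bernoulli events in the spirit of the convolution-based aggregation of [13], the expected number of defaults is simply their sum (the full count distribution could likewise be built by convolving the individual Bernoulli distributions), and the RMSNE then compares these estimates with the realized counts month by month. This is our simplified illustration, not the exact algorithm of [13].

```python
import numpy as np

def expected_default_count(pd_tau):
    """Expected number of defaults in a pool, given each firm's probability of
    defaulting within horizon tau (the mean of the Poisson-binomial count)."""
    return float(np.sum(pd_tau))

def rmsne(estimated, actual):
    """Root mean square normalized error between estimated and actual monthly
    default counts; months with zero actual defaults would need special care."""
    estimated, actual = np.asarray(estimated, float), np.asarray(actual, float)
    return float(np.sqrt(np.mean(((estimated - actual) / actual) ** 2)))

# Toy example with made-up numbers for three starting months.
pools = [np.array([0.02, 0.10, 0.01, 0.30]),    # month 1: predicted PDs
         np.array([0.05, 0.05, 0.20]),          # month 2
         np.array([0.01, 0.02, 0.02, 0.02])]    # month 3
actual_counts = [1, 1, 1]                        # hypothetical observed counts
estimates = [expected_default_count(p) for p in pools]
print(estimates, rmsne(estimates, actual_counts))
```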

Table 1: Results of cross-sectional experiments

Horizons (months)   1      3      6      12     24     36     48     60
Panel A: Accuracy ratio (AR) (%)
FIM                 94.57  92.37  88.74  81.45  70.85  63.46  58.33  53.37
MLP (δ = 1)         94.48  92.85  90.43  85.10  75.63  68.08  62.87  58.26
MLP (δ = 6)         94.29  92.76  90.47  85.73  76.88  69.73  64.55  60.07
MLP (δ = 12)        93.99  92.64  90.55  86.05  77.67  70.81  65.93  61.45
LSTM (δ = 1)        94.78  93.17  90.87  86.11  77.47  70.69  65.70  61.09
LSTM (δ = 6)        94.63  93.29  91.23  87.05  79.00  72.63  67.55  62.96
LSTM (δ = 12)       94.68  93.48  91.77  87.91  80.79  74.76  69.91  65.32
GRU (δ = 1)         94.66  93.03  90.77  85.94  77.21  70.34  65.39  60.79
GRU (δ = 6)         94.41  92.97  90.84  86.54  78.26  71.60  66.45  61.91
GRU (δ = 12)        94.26  92.94  91.12  86.98  79.22  72.77  67.80  63.27
Improvement (%)     0.22   1.20   3.41   7.93   14.03  17.81  19.85  22.39
Panel B: Root mean square normalized error (RMSNE)
FIM                 0.74   0.64   0.62   0.84   1.23   1.18   1.06   0.96
MLP (δ = 1)         0.63   0.58   0.62   0.88   1.03   1.30   1.24   1.11
MLP (δ = 6)         0.64   0.58   0.61   0.86   1.23   1.32   1.26   1.12
MLP (δ = 12)        0.63   0.57   0.60   0.83   1.21   1.27   1.17   1.03
LSTM (δ = 1)        0.62   0.60   0.64   0.89   1.26   1.30   1.23   1.11
LSTM (δ = 6)        0.64   0.61   0.62   0.86   1.23   1.25   1.19   1.07
LSTM (δ = 12)       0.64   0.62   0.61   0.81   1.11   1.12   1.03   0.90
GRU (δ = 1)         0.61   0.61   0.65   0.91   1.25   1.32   1.23   1.11
GRU (δ = 6)         0.64   0.63   0.64   0.87   1.24   1.29   1.22   1.11
GRU (δ = 12)        0.64   0.64   0.64   0.83   1.13   1.18   1.10   0.98
Improvement (%)     17.57  10.94  3.23   3.57   9.76   5.08   2.83   6.25

Table 1 shows the quantitative results of the cross-sectional experiments for each prediction horizon, where the improvement is computed between the metric of the best-performing model and that of FIM. From the table we observe that although the ARs of the neural models are similar to those of FIM in short-term prediction, all the neural models (i.e., MLP,⁸ LSTM, GRU) surpass FIM in long-term prediction, demonstrating the great potential of neural networks for multiperiod default prediction. In addition, among the three neural models, LSTM with δ = 12 yields the best performance. For example, the AR increases from 53.37% to 65.32% for the 60-month prediction horizon, and the improvement in RMSNE across prediction horizons ranges from 3.23% to 17.57%, which is commendable progress for corporate default prediction.

⁸ Note that here we directly concatenate the input covariates of the past 6 and 12 months for a fair comparison with the recurrent neural models with δ = 6, 12.

Table 2: Results of cross-time experiments

Horizons (months)   1      3      6      12     24     36     48     60
Panel A: Accuracy ratio (AR) (%)
FIM                 94.08  91.86  87.74  81.88  74.86  69.20  64.40  59.61
MLP (δ = 1)         93.69  91.76  89.26  84.92  78.06  72.16  67.30  62.63
MLP (δ = 6)         93.30  91.52  89.10  85.05  78.44  72.48  67.45  62.72
MLP (δ = 12)        92.77  91.11  88.78  85.05  78.61  72.81  67.92  63.21
LSTM (δ = 1)        93.67  92.03  89.54  85.45  78.67  72.88  67.89  63.38
LSTM (δ = 6)        93.46  91.84  89.41  85.43  78.70  72.87  67.90  63.26
LSTM (δ = 12)       92.81  91.27  88.96  85.27  78.57  72.79  67.70  62.77
GRU (δ = 1)         93.54  91.87  89.53  85.53  78.63  72.79  68.05  63.49
GRU (δ = 6)         93.48  91.91  89.51  85.45  78.65  72.83  67.86  63.25
GRU (δ = 12)        93.03  91.45  89.26  85.34  78.76  72.89  67.98  63.35
Improvement (%)     0      0.19   2.05   4.46   5.21   5.33   5.67   6.51
Panel B: Root mean square normalized error (RMSNE)
FIM                 1.09   0.77   0.51   0.47   0.40   0.36   0.39   0.39
MLP (δ = 1)         0.83   0.60   0.43   0.44   0.38   0.34   0.35   0.34
MLP (δ = 6)         0.73   0.60   0.40   0.40   0.34   0.33   0.35   0.33
MLP (δ = 12)        0.72   0.62   0.40   0.37   0.34   0.31   0.32   0.32
LSTM (δ = 1)        1.00   0.67   0.43   0.40   0.37   0.34   0.35   0.35
LSTM (δ = 6)        0.82   0.64   0.41   0.38   0.32   0.32   0.33   0.31
LSTM (δ = 12)       0.97   0.61   0.34   0.33   0.28   0.26   0.27   0.25
GRU (δ = 1)         1.08   0.69   0.41   0.39   0.36   0.34   0.34   0.33
GRU (δ = 6)         0.86   0.60   0.39   0.36   0.32   0.32   0.32   0.31
GRU (δ = 12)        1.14   0.60   0.34   0.28   0.26   0.26   0.26   0.26
Improvement (%)     33.95  22.08  33.33  40.43  35.00  27.78  33.33  35.90

The results of the cross-time experiments are listed in Table 2. From this table, we observe that the recurrent architectures—LSTMs or GRUs with different time-lagged δ-lengths—yield the best performance, especially for long prediction horizons. More importantly, the results demonstrate that the recurrent architectures have stronger expressive ability both for risk ranking and for matching the aggregate default distribution on new incoming data, especially for long prediction horizons; for example, for 60-month default prediction, the improvements in AR and RMSNE are 6.51% and 35.90%, respectively. Furthermore, the difference between MLP and LSTM/GRU is evident: although the AR results fluctuate as δ changes, only LSTM or GRU with larger δ delivers significant improvements in terms of RMSNE.

5.4 Discussion on aggregate default distributions  Figure 2 shows the aggregate default distributions of FIM and of models with the three types of parameter generation units (i.e., MLP, LSTM, and GRU). Here we use the cross-time experimental setting and the 48-month prediction horizon as an example.

Figure 2: Aggregate default distributions (panels: FIM, MLP, LSTM, and GRU with δ = 1, 6, 12 versus the actual default numbers; x-axis: year, 2000–2012; y-axis: default numbers).

Owing to how we split the data in the cross-time experiments, the result shown is concatenated from the individual testing folds. For example, the default numbers in year 2000 of each subfigure come from the first testing fold; similarly, those in year 2001 come from the second fold. The blue bars in the figure indicate the actual default numbers, and the curves correspond to the estimates of the different models. First, it is clear that the FIM estimate departs from reality for such a long prediction horizon, especially for the period around year 2000. It is worth mentioning that not only FIM but also the neural models with small δ yield poor performance. However, as δ increases, the neural models begin to approach the realized default distribution. Furthermore, we observe that the two RNN-based units—LSTM and GRU—both fit the actual distribution much better than the MLP does as δ grows. These observations suggest that RNN models better capture the dynamic patterns behind the input features and make for better estimators of future uncertainty.

6 Conclusion

In this paper, we develop a multiperiod default prediction framework with parametric family learning through deep neural models. The effectiveness of the proposed method is attested by experiments on a large-scale, real-world corporate default dataset spanning a long period, the results of which suggest that incorporating neural networks in default prediction yields significantly better performance than the state-of-the-art statistical model. In addition, we show that applying recurrent neural architectures to capture the temporal dynamics within economic covariates is promising for multiperiod default prediction. Along with the contributions made to default analysis, this paper provides vision and approaches for research and applications that require monotonicity in cumulative probability estimation, which is indispensable in multiperiod prediction.

Future work includes revising the proposed model to accommodate problems such as multi-step-ahead extreme weather forecasting, customer churn prediction, or predicting patients' risk of future re-admissions over different future periods.

References

[1] Peter Martey Addo, Dominique Guegan, and Bertrand Hassani. Credit Risk Analysis Using Machine and Deep Learning Models. Risks, 6(2):38, 2018.
[2] Hafiz A. Alaka, Lukumon O. Oyedele, Hakeem Owolabi, Vikas Kumar, Saheed Ajayi, Olugbenga O. Akinade, and Muhammad Bilal. Systematic Review of Bankruptcy Prediction Models. Expert Systems with Applications, 94:164–184, 2018.
[3] Edward I. Altman. Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. Journal of Finance, 23(4):589–609, 1968.
[4] William H. Beaver. Financial Ratios as Predictors of Failure. Journal of Accounting Research, 4:71–111, 1966.
[5] William H. Beaver. Market Prices, Financial Ratios, and the Prediction of Failure. Journal of Accounting Research, 6(2):179–192, 1968.
[6] Ajay Byanjankar, Markku Heikkilä, and József Mezei. Predicting Credit Risk in Peer-to-Peer Lending: A Neural Network Approach. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, pages 719–725, 2015.
[7] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007.
[8] Sudheer Chava and Robert A. Jarrow. Bankruptcy Prediction with Industry Effects. Review of Finance, 8(4):537–569, 2004.
[9] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[10] Credit Research Initiative. NUS Credit Research Initiative Technical Report. Technical report, Credit Research Initiative, National University of Singapore, July 2020.
[11] Peter Crosbie and Jeffrey Bohn. Modeling Default Risk. Moody's KMV White Paper, 2003.
[12] Paulius Danenas and Gintautas Garsva. Selection of Support Vector Machines Based Classifiers for Credit Risk Domain. Expert Systems with Applications, 42(6):3194–3204, 2015.

[13] Jin-Chuan Duan. Clustered Defaults. National University of Singapore Working Paper, 2010.
[14] Jin-Chuan Duan, Jie Sun, and Tao Wang. Multiperiod Corporate Default Prediction: A Forward Intensity Approach. Journal of Econometrics, 170(1):191–209, 2012.
[15] Darrell Duffie, Leandro Saita, and Ke Wang. Multi-Period Corporate Default Prediction with Stochastic Covariates. Journal of Financial Economics, 83(3):635–665, 2007.
[16] Darrell Duffie and Kenneth Singleton. Modeling Term Structures of Defaultable Bonds. Review of Financial Studies, 12(3):687–720, 1999.
[17] Haneul Eom, Jaeseong Kim, and Sangok Choi. Machine Learning-Based Corporate Default Risk Prediction Model Verification and Policy Recommendation: Focusing on Improvement Through Stacking Ensemble Model. Journal of Intelligence and Information Systems, 26(2):105–129, 2020.
[18] Silvia Figini, Roberto Savona, and Marika Vezzoli. Corporate Default Prediction Model Averaging: A Normative Linear Pooling Approach. Intelligent Systems in Accounting, Finance and Management, 23(1-2):6–20, 2016.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[20] Zan Huang, Hsinchun Chen, Chia-Jung Hsu, Wun-Hwa Chen, and Soushan Wu. Credit Rating Analysis with Support Vector Machines and Neural Networks: A Market Comparative Study. Decision Support Systems, 37(4):543–558, 2004.
[21] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, pages 448–456, 2015.
[22] Robert Jarrow, David Lando, and Stuart M. Turnbull. A Markov Model for the Term Structure of Credit Risk Spreads. Review of Financial Studies, 10(2):481–523, 1997.
[23] Hengjian Jia. Investigation into the Effectiveness of Long Short Term Memory Networks for Stock Price Prediction. arXiv preprint arXiv:1603.07893, 2016.
[24] Hyeongjun Kim, Hoon Cho, and Doojin Ryu. Corporate Default Predictions Using Machine Learning: Literature Review. Sustainability, 12(16), 2020.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] David Lando. Credit Risk Modeling, pages 787–798, 2009.
[27] Wei-Yang Lin, Ya-Han Hu, and Chih-Fong Tsai. Machine Learning in Financial Crisis Prediction: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421–436, 2012.
[28] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, 2017.
[29] James A. Ohlson. Financial Ratios and the Probabilistic Prediction of Bankruptcy. Journal of Accounting Research, 18(1):109–131, 1980.
[30] Bernardete Ribeiro, Catarina Silva, Ning Chen, Armando Vieira, and João Carvalho das Neves. Enhanced Default Risk Models with SVM+. Expert Systems with Applications, 39(11):10140–10152, 2012.
[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.
[32] Suproteem K. Sarkar, Kojin Oshiba, Daniel Giebisch, and Yaron Singer. Robust Classification of Financial Risk. arXiv preprint arXiv:1811.11079, 2018.
[33] Justin Sirignano, Apaar Sadhwani, and Kay Giesecke. Deep Learning for Mortgage Risk. arXiv preprint arXiv:1607.02470, 2016.
[34] Maria Vassalou and Yuhang Xing. Default Risk in Equity Returns. Journal of Finance, 59(2):831–868, 2004.
[35] Xing Yan, Weizhong Zhang, Lin Ma, Wei Liu, and Qi Wu. Parsimonious Quantile Regression of Financial Asset Tail Dynamics via Sequential Learning. In Advances in Neural Information Processing Systems 31, pages 1575–1585, 2018.
[36] Zhi Yang, Yusi Zhang, Binghui Guo, Ben Y. Zhao, and Yafei Dai. DeepCredit: Exploiting User Clickstream for Loan Risk Prediction in P2P Lending. In Proceedings of the 12th International Conference on Web and Social Media, pages 444–453, 2018.
[37] Ching-Chiang Yeh, Der-Jang Chi, and Yi-Rong Lin. Going-Concern Prediction Using Hybrid Random Forests and Rough Set Approach. Information Sciences, 254:98–110, 2014.
[38] Shu-Hao Yeh, Chuan-Ju Wang, and Ming-Feng Tsai. Corporate Default Prediction via Deep Learning, 2014.
[39] Shu-Hao Yeh, Chuan-Ju Wang, and Ming-Feng Tsai. Deep Belief Networks for Predicting Corporate Defaults. In 2015 24th Wireless and Optical Communication Conference (WOCC), pages 159–163, 2015.
[40] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. arXiv preprint arXiv:1409.2329, 2014.
[41] Qiang Zhang, Rui Luo, Yaodong Yang, and Yuanyuan Liu. Benchmarking Deep Sequential Models on Volatility Predictions for Financial Time Series. arXiv preprint arXiv:1811.03711, 2018.
[42] Mark E. Zmijewski. Methodological Issues Related to the Estimation of Financial Distress Prediction Models. Journal of Accounting Research, 22:59–82, 1984.

