Downloaded 05/21/22 to 59.153.235.224 . Redistribution subject to SIAM license or copyright; see https://epubs.siam.org/terms-privacy
"big" data, and thus hinder a model's performance. For default prediction, its performance deteriorates rapidly when the prediction horizon is extended. Figure 1(c) demonstrates the challenge faced by the model when it comes to default prediction for a long prediction horizon (e.g., 60 months). The predicted numbers of default occurrences on a pool of firms (the red curves in the figures) clearly deviate from the actual numbers for the long prediction horizon.

[Figure 1: (a) CDP term structure (Merrill Lynch; horizons of 1, 6, 24, and 60 months); (b) CAP curve over the cumulative number of companies; (c) predicted versus actual default numbers. Legends include FIM (AR = 0.94), FIM (AR = 0.52), and Actual.]
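For reference, the accuracy ratios (AR) quoted in the figure legends can be recovered from pairwise rank comparisons between defaulters and survivors, using the relation AR = 2·AUC − 1 noted in Section 5.3. A plain-Python sketch with hypothetical default probabilities (not values from the paper):

```python
def auc(scores_pos, scores_neg):
    """Probability that a defaulting firm's estimated PD exceeds a surviving
    firm's PD (ties count half) -- the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

def accuracy_ratio(scores_pos, scores_neg):
    """AR summarizes the CAP curve; it relates to the ROC area as AR = 2*AUC - 1."""
    return 2.0 * auc(scores_pos, scores_neg) - 1.0

# Hypothetical PDs: defaulters (first list) are mostly ranked above survivors.
ar = accuracy_ratio([0.9, 0.8, 0.4], [0.3, 0.2, 0.1, 0.4])
assert 0.0 < ar <= 1.0
```

A perfect ranking gives AR = 1; a random ranking gives AR ≈ 0, matching the interpretation of the CAP curve below.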
ible enough to determine the functional form (i.e., the distribution family) of the default time endogenously; in other words, in our framework, the distribution family is decided based on a large amount of data, which loosens the exponential assumption and thus better describes companies' default behavior not only in the short term but also in the long term.

In sum, the proposed approach advances the state of the art in default modeling along three dimensions:

1. The proposed model successfully uses neural networks for multiperiod default prediction and goes beyond mere risk classification.

2. The model's inherent design prevents the estimated default probabilities from violating term structure consistency, i.e., a monotonically increasing default probability with respect to the prediction horizon, which is vital for default analysis in practice.

3. The model greatly improves performance both in terms of accuracy ratios and in matching the actual numbers of default occurrences for long prediction horizons, as compared to the state-of-the-art FIM.

Furthermore, to the best of our knowledge, our paper is the first work to frame a class of problems that require consistent probability structures along with a neural network exploration, thus providing vision for research and applications under such a constraint.

2 Related Work

Default analysis should take into account the type of obligor; for example, corporations, sovereigns, local governments, supranationals, and individuals differ greatly in their characteristics. Thus, the attributes that are available for default analysis differ fundamentally. This paper focuses on corporate defaults. The mainstream literature on corporate default prediction operates on the basis of stochastic models and relies on classical statistical tools. The first generation of such models combines ratio analysis with discriminant analysis and provides a credit score for each company as a kind of ranking metric [4, 5, 3], which is however too rough for default analysis. Logit and probit regression approaches predict a firm's likelihood of default in the next period [29, 42], but neither addresses multiperiod predictions. Poisson intensity models later became the main trend in solving the default prediction problem, where they specify a stochastic formulation of the point process for default [15, 14].

Among the statistical approaches, Poisson intensity models differ mainly in how they handle the term structure dimension. Duffie et al. (2007) [15] adopt a spot intensity approach, meaning that they deploy a universal intensity function of some covariates for all firms at all time points, for which, however, a high-dimension auxiliary model for covariates is needed when considering multiperiod default prediction. Duan et al. (2012) [14] propose the FIM to approach the problem with a family of forward intensity functions, each of which corresponds to a specific forward period. In short, covariate values are the same, but forward intensity functions are different in each period, thus negating the need to build an extremely high-dimension auxiliary model for covariates. Such an approach establishes a fundamental decomposability property that permits the parameters of each forward intensity function to be estimated separately, and thus to be practically implemented on a large scale by the CRI [10].

Apart from the aforementioned statistical methods, machine learning methods have also been applied in default analysis [1, 27]. Most machine learning methods focus on algorithms to identify the complex function underlying the relationship between the covariates and the outcome (e.g., survival or default). Well-known examples of traditional machine learning methods such as support vector machines (SVM) and random forests (RF) have been applied to solve various problems in finance and particularly in default prediction [12, 18, 37]. Most machine learning approaches see default prediction as a naive single-period classification exercise for use in risk ranking obligors [20, 30, 2, 38, 39, 17] instead of a multiperiod prediction.

3 Problem Definition

3.1 Multiperiod Default Prediction Measuring default risk for entities in terms of probability is the key to credit risk management. Here we formulate this problem from the perspective of time to default. We first define the prediction horizon as follows.

Definition 1. (Prediction Horizon) At time t, the prediction horizon τ defines a forward-looking time period. The default prediction model estimates the likelihood of a default event occurring within the time period (t, t + τ].

At time t, let random variable T_i be the time to default of company i with support (0, ∞); that is, T_i = t_i denotes that company i defaults at time t + t_i. Note that t is the starting time point and t_i is the prediction horizon defined above. Then, the joint distribution function is

F_{T_1,···,T_n}(t_1, ···, t_n) = Pr(T_1 ≤ t_1, T_2 ≤ t_2, ···, T_n ≤ t_n),

where n denotes the number of companies that survive at time t. For company i, the default probability for prediction horizon t_i, i.e., the marginal (cumulative) distribution function, is written as

F_{T_i}(t_i) = lim_{t_j→∞, ∀j≠i} F_{T_1,···,T_n}(t_1, ···, t_j, ···, t_n) = Pr(T_i ≤ t_i).

Note that when τ → ∞, we have

(3.1) lim_{τ→∞} F_{T_i}(τ) = 1

for i = 1, ···, n.

the parameters for the family. For the former part, we propose a neural model N_F to describe the form of the parametric family. For the second part, we consider the case in which the parameter set Θ for the decided parametric family F is governed by covariates from each company. Specifically, we assume that at time point t each company i corresponds to a covariate matrix X_{t,i} ∈ X, where X denotes the set of matrices X_{t,i} for all companies at all time points. Note that we here use a matrix to include the cases when the covariates observed from different time points are used as input features (see Section 4.1 for more details). We then build the connection between different samples (corresponding to different covariate matrices) and the parameters with the use of a function
sequence, and apply δ RNN units as our front-end unit on the sequence. Specifically, we take a snapshot of the d-dimensional feature vector observed at time t for each company i along with a set of time-lagged observations at t − δ + 1, ···, t − 1 to form the input covariate matrix, i.e., X_{t,i} = [x_{t−δ+1,i}, ···, x_{t−1,i}, x_{t,i}] ∈ R^{d×δ} (corresponding to the domain of g(·) in Eq. (3.3)), to serve as the input of the module. With the input matrix X_{t,i}, we then model the parameter θ_{t,i}, assumed to be p-dimensional, as the hidden state of the last RNN unit:

θ_{t,i} = ĝ(X_{t,i}) = RNN_Λ(X_{t,i}),

where Λ is the set of RNN parameters. Note that for notational simplicity, the notation RNN above stands for any variant model in the class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence.

4.2 Networks for Parametric Family Determination This section introduces the details of the neural model N_F. To learn the form of a parametric family F that includes the marginal distribution functions for time to default of all companies at all time points, we select m time points τ_1, τ_2, ···, τ_m to discretize the support of T_i, where τ_i ∈ R+ and τ_j < τ_k if j < k. For easy presentation, we omit subscripts t and i for different time points and companies, respectively, in the following description.

Specifically, inspired by earlier studies [7] that focus on delivering cumulative probabilistic representations satisfying inequality events in terms of the monotonicity described in Eq. (3.1), we propose a neural model with a differentiable objective function that satisfies the indispensable properties in multiperiod default prediction as defined in Definition 2. First, we adopt a hidden layer of perceptrons to learn the distribution; that is,

(4.2) y = ϕ(Wθ + b),

where θ ∈ R^p is the hidden state of the last RNN unit (i.e., the output of the first step) denoting the parameters of the family in Eq. (4.1), W ∈ R^{(m+1)×p} denotes the matrix of weights, b ∈ R^{m+1} is the bias vector, and ϕ is the softmax function. Note that the output vector y in Eq. (4.2) is an (m + 1)-dimensional vector specifically designed for the later cumulative default probability calculation. Second, to maintain the monotonicity of CDP term structures, we estimate the cumulative default probability for each horizon τ_j as the partial sum

F̂(τ_j) = Σ_{k=1}^{j} y[k],

where y[k] denotes the k-th component of y. Note that we add an additional prediction horizon τ_{m+1} → ∞ (corresponding to the additional node y[m + 1]) to represent the case that every company will default in the infinite future.

We now introduce the objective function in our model. Before doing so, we present the ground truth function used to describe the default events. In the real-world scenario, the observations of the default problem are simple: at a given time point, a company is either alive or bankrupt. We here use the shifted Heaviside step function to describe the ground truth function:

(4.5) H_{t,i}(s) = { 1, if s ≥ ζ_i; 0, if s < ζ_i },

where ζ_i denotes the time to default from time t for company i.

To approximate the ground truth function in Eq. (4.5), we minimize the following objective function with backpropagation [31]:

L = Σ_{t∈T} Σ_{i=1}^{n_t} Σ_{k=1}^{m+1} CrossEntropy(F̂_{t,i}(τ_k), H_{t,i}(τ_k)),

where T and n_t denote the set of chosen time points and the number of companies that survive at time t in the training data, respectively. Note that we here restore subscripts t and i to better depict the above objective function.

5 Experiments

5.1 Experimental Settings We conducted experiments on a real-world default and bankruptcy dataset provided by CRI, which is publicly available and contains 1.5 million monthly samples of US public companies over the period from January 1990 to December 2017. For each company in a specific month, there are 14 covariates, in which 12 covariates—2 common and 10 firm-specific factors—are also used in [14]; the remaining two are related to current assets/current liabilities for non-financial firms.⁴ The three corresponding event labels are 0 (alive), 1 (default), or 2 (other exit). Note that as these labels indicate the status of a company in any given month, they can be directly used for

⁴ Dataset and detailed definitions of the covariates can be found in [14]; our dataset also aligns with CRI's definition up to the end of 2017 (see the technical report [10] for more details).
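The Section 4.2 construction above — softmax outputs accumulated into a cumulative default probability curve whose final node (for τ_{m+1} → ∞) pins the value at 1 — can be sketched in plain Python. The logits and the default horizon below are toy values, not outputs of the paper's trained model:

```python
import math

def softmax(z):
    """Numerically stable softmax: nonnegative components summing to one."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cdp_term_structure(logits):
    """Softmax over m+1 nodes, then a running sum: a monotone CDP curve
    ending at 1 (the extra node stands for the horizon tau_{m+1} -> infinity)."""
    y = softmax(logits)
    cdf, acc = [], 0.0
    for yk in y:
        acc += yk
        cdf.append(acc)
    return cdf

def heaviside_targets(default_index, m_plus_1):
    """Shifted Heaviside ground truth: 0 before the default horizon, 1 from it on."""
    return [1.0 if k >= default_index else 0.0 for k in range(m_plus_1)]

def loss(cdf, targets, eps=1e-12):
    """Per-horizon binary cross-entropy between the estimated CDP and ground truth."""
    return -sum(h * math.log(max(f, eps)) + (1 - h) * math.log(max(1 - f, eps))
                for f, h in zip(cdf, targets))

F = cdp_term_structure([0.3, -0.8, 0.1, 1.2])   # m = 3 horizons + one node for infinity
assert all(F[k] <= F[k + 1] + 1e-12 for k in range(len(F) - 1))  # term-structure consistency
assert abs(F[-1] - 1.0) < 1e-9                  # every firm defaults "eventually"
H = heaviside_targets(2, 4)                     # toy firm defaulting at the third horizon
assert loss(F, H) > 0.0
```

Because the softmax components are nonnegative and sum to one, monotonicity and F̂(τ_{m+1}) = 1 hold by construction rather than by a penalty term, which is the term-structure consistency property the paper emphasizes.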
ture period, the corresponding cumulative event label is set to 1 (or 2, respectively) for all of the prediction horizons afterwards, which is consistent with the idea of our ground truth function in Eq. (4.5).⁵ In this paper, as we attempt to estimate the marginal distribution functions of the time to default (i.e., the default term structures) for all companies, we combine labels 0 and 2 and focus on the problem of interest in a one-vs-all fashion.

To evaluate our model, we used two different settings when splitting the data into training and testing sets. The first experimental setting is referred to as the "cross-sectional experiment," in which we mix the 1.5 million monthly samples and separate them into thirteen folds randomly. Note that in this setting, the data samples from different periods are mixed, which is a commonly adopted approach in the literature to attest model capacity and compare model performance. The second setting is referred to as the "cross-time experiment," for which we use a rolling-window setting along the time axis to conduct the experiments. Note that this is a commonly used and practical setting for scenarios involving time effects. Specifically, the dataset was divided along the time axis into thirteen folds of training and testing sets using a one-year step size; each fold contained ten years of monthly samples for training and the subsequent year of samples for testing. For example, the first (second) fold contained firm-month samples during the period from years 1990 to 1999 (from years 1991 to 2000, respectively) for training and those in year 2000 (those in year 2001, respectively) for testing. Note that as our longest prediction horizon was 60 months, for the last fold, the prediction labels involved the events occurring before the end of year 2017. It is worth noting that for such a prediction task involving data across a very long time period, the data distributions are by nature extremely volatile across different time periods; therefore, the purpose of the second experimental setting is to evaluate the model's ability to react to new, incoming data. Note that for both settings, the hyperparameters were selected on the model trained on the first fold with respect to the accuracy ratio from the cumulative accuracy profile [11, 34]; they were applied to all of the models for the remaining folds of data. The reported performance metrics are summarized with the mean of the results of the thirteen folds.

As the benchmark, we implemented the forward intensity function of FIM from [14] and calculated the CDP, but used the Adam optimizer [25] and a batch normalization layer [21] for input covariates to reflect the settings of our neural model. A two-layer multilayer perceptron (MLP) with the sigmoid activation function was also adopted as the parameter generation unit to demonstrate the effectiveness of the RNN module in capturing the temporal dynamics underlying the covariates. For the variants of the RNN module, we chose GRU and LSTM to evaluate the performance. Thus, the following experiments included MLP, GRU, and LSTM as our selected parameter generation units.

Note, however, that with the overparameterized nature of neural networks comes a poorer generalization ability. As the distributions of default events and econometric covariates vary over time (i.e., they are non-stationary processes), it is important to address overfitting in the overparameterized neural model. Thus we used dropout [40] in our neural units and weight decay regularization in the Adam optimizer [25, 28]. The hyperparameters were tuned over the following sets: the number of MLP/LSTM/GRU hidden units in {32, 64, 128}, the learning rate in {10⁻³, 10⁻⁴, 10⁻⁵}, the dropout rate in {0.25, 0.5, 0.75}, and the weight decay in {10⁻⁴, 10⁻⁵, 10⁻⁶}.

5.3 Quantitative Evaluation To assess the effectiveness of the overall fit of the proposed model, we deployed two quantitative aspects: the discriminatory power of risk ranking among companies and the matching ability between actual default occurrences and estimated ones. First, to evaluate model performance in terms of risk ranking, we employed the cumulative accuracy profile (CAP) and its associated accuracy ratio (AR), both of which examine a model's performance based on risk ranking among companies' default probabilities. The accuracy ratio is a summarized quantitative measure of the discriminatory power of classification models based on the CAP curve. Note that a good model should provide an accuracy ratio close to one, meaning that most of the companies that default in reality receive higher model-estimated default probabilities. Additionally, there exists a relation between the AR and the area under the receiver operating characteristic (ROC) curve: AR = 2AUC − 1.⁶ For the second aspect, we followed [14] in employing the convolution-
60-month prediction horizons, where the bars depict the actual number of defaults and the lines correspond to the model estimations.) To measure the distance between the two distributions of the numbers of defaults, we use the root mean square normalized error (RMSNE) to compare our estimation (D̂_i) with the observed (actual) number of default occurrences (D_i) at each month i with respect to the total number of starting time points (T) across the whole sample period. Specifically, the RMSNE is defined as the error terms in the RMSE normalized by D_i as⁷

RMSNE = sqrt( (1/T) Σ_{i=1}^{T} ((D̂_i − D_i) / D_i)² ).

Table 1 (fragment): cross-sectional experiment results; columns are the eight prediction horizons from 1 to 60 months.

Panel A  Accuracy ratio (AR) (%)
MLP (δ = 6)    94.29 92.76 90.47 85.73 76.88 69.73 64.55 60.07
MLP (δ = 12)   93.99 92.64 90.55 86.05 77.67 70.81 65.93 61.45
LSTM (δ = 1)   94.78 93.17 90.87 86.11 77.47 70.69 65.70 61.09
LSTM (δ = 6)   94.63 93.29 91.23 87.05 79.00 72.63 67.55 62.96
LSTM (δ = 12)  94.68 93.48 91.77 87.91 80.79 74.76 69.91 65.32
GRU (δ = 1)    94.66 93.03 90.77 85.94 77.21 70.34 65.39 60.79
GRU (δ = 6)    94.41 92.97 90.84 86.54 78.26 71.60 66.45 61.91
GRU (δ = 12)   94.26 92.94 91.12 86.98 79.22 72.77 67.80 63.27
Improvement (%)  0.22  1.20  3.41  7.93 14.03 17.81 19.85 22.39

Panel B  Root mean square normalized error (RMSNE)
FIM            0.74 0.64 0.62 0.84 1.23 1.18 1.06 0.96
MLP (δ = 1)    0.63 0.58 0.62 0.88 1.03 1.30 1.24 1.11
MLP (δ = 6)    0.64 0.58 0.61 0.86 1.23 1.32 1.26 1.12
MLP (δ = 12)   0.63 0.57 0.60 0.83 1.21 1.27 1.17 1.03
LSTM (δ = 1)   0.62 0.60 0.64 0.89 1.26 1.30 1.23 1.11
LSTM (δ = 6)   0.64 0.61 0.62 0.86 1.23 1.25 1.19 1.07
LSTM (δ = 12)  0.64 0.62 0.61 0.81 1.11 1.12 1.03 0.90
GRU (δ = 1)    0.61 0.61 0.65 0.91 1.25 1.32 1.23 1.11
GRU (δ = 6)    0.64 0.63 0.64 0.87 1.24 1.29 1.22 1.11
GRU (δ = 12)   0.64 0.64 0.64 0.83 1.13 1.18 1.10 0.98
Improvement (%) 17.57 10.94  3.23  3.57  9.76  5.08  2.83  6.25
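The RMSNE above is straightforward to compute; a plain-Python sketch with hypothetical monthly default counts (not the paper's data):

```python
import math

def rmsne(estimated, actual):
    """Root mean square *normalized* error: each monthly error (D_hat_i - D_i)
    is divided by the actual count D_i before squaring, so months with few
    defaults weigh as much as crisis months with many (cf. footnote 7)."""
    assert len(estimated) == len(actual)
    T = len(actual)
    return math.sqrt(sum(((dh - d) / d) ** 2 for dh, d in zip(estimated, actual)) / T)

# Toy monthly default counts over T = 3 starting time points.
err = rmsne([11, 18, 30], [10, 20, 30])
assert abs(err - math.sqrt((0.1 ** 2 + 0.1 ** 2 + 0.0) / 3)) < 1e-9
```

Dividing by D_i is what distinguishes the RMSNE from the plain RMSE, which would be dominated by high-default months such as the 2008 crisis.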
The "Improvement (%)" row of each panel is computed from the best models (denoted in bold) and FIM. From the table we observe that although the ARs of the neural models are similar to the FIM model in short-term prediction, all the neural models (i.e., MLP,⁸ LSTM, GRU) surpass FIM in long-term prediction, demonstrating the great potential of neural networks applied to the task of multiperiod default prediction. In addition, among the three neural models, LSTM with δ = 12 yields the best performance. For example, the AR increases from 53.37% to 65.32% for the 60-month prediction horizon; the improvement in RMSNE for each prediction horizon ranges from 3.23% to 17.57%, which is also commendable progress for corporate default prediction.

The results of the cross-time experiments are listed in Table 2. From this table, we observe that the recurrent architectures—LSTMs or GRUs with different time-lagged δ-lengths—yield the best performance, especially for long prediction horizons. More importantly, the results demonstrate that the recurrent architectures have a stronger expressive ability both for risk ranking and in terms of matching the aggregate default distribution for new incoming data, especially in long prediction horizons; for example, for the 60-month default prediction, the improvements on AR and RMSNE are 6.51% and 35.90%, respectively. Furthermore, the difference between MLP and LSTM/GRU is evident. Although the results of AR fluctuate when δ changes, only LSTM or GRU with a larger δ still delivers more significant improvements in terms of RMSNE.

Table 2: cross-time experiment results; columns are the eight prediction horizons from 1 to 60 months.

Panel A  Accuracy ratio (AR) (%)
FIM            94.08 91.86 87.74 81.88 74.86 69.20 64.40 59.61
MLP (δ = 1)    93.69 91.76 89.26 84.92 78.06 72.16 67.30 62.63
MLP (δ = 6)    93.30 91.52 89.10 85.05 78.44 72.48 67.45 62.72
MLP (δ = 12)   92.77 91.11 88.78 85.05 78.61 72.81 67.92 63.21
LSTM (δ = 1)   93.67 92.03 89.54 85.45 78.67 72.88 67.89 63.38
LSTM (δ = 6)   93.46 91.84 89.41 85.43 78.70 72.87 67.90 63.26
LSTM (δ = 12)  92.81 91.27 88.96 85.27 78.57 72.79 67.70 62.77
GRU (δ = 1)    93.54 91.87 89.53 85.53 78.63 72.79 68.05 63.49
GRU (δ = 6)    93.48 91.91 89.51 85.45 78.65 72.83 67.86 63.25
GRU (δ = 12)   93.03 91.45 89.26 85.34 78.76 72.89 67.98 63.35
Improvement (%)  0     0.19  2.05  4.46  5.21  5.33  5.67  6.51

Panel B  Root mean square normalized error (RMSNE)
FIM            1.09 0.77 0.51 0.47 0.40 0.36 0.39 0.39
MLP (δ = 1)    0.83 0.60 0.43 0.44 0.38 0.34 0.35 0.34
MLP (δ = 6)    0.73 0.60 0.40 0.40 0.34 0.33 0.35 0.33
MLP (δ = 12)   0.72 0.62 0.40 0.37 0.34 0.31 0.32 0.32
LSTM (δ = 1)   1.00 0.67 0.43 0.40 0.37 0.34 0.35 0.35
LSTM (δ = 6)   0.82 0.64 0.41 0.38 0.32 0.32 0.33 0.31
LSTM (δ = 12)  0.97 0.61 0.34 0.33 0.28 0.26 0.27 0.25
GRU (δ = 1)    1.08 0.69 0.41 0.39 0.36 0.34 0.34 0.33
GRU (δ = 6)    0.86 0.60 0.39 0.36 0.32 0.32 0.32 0.31
GRU (δ = 12)   1.14 0.60 0.34 0.28 0.26 0.26 0.26 0.26
Improvement (%) 33.95 22.08 33.33 40.43 35.00 27.78 33.33 35.90
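The δ-lagged inputs behind these comparisons are assembled as described in Section 4.1 — a d × δ matrix X_{t,i} = [x_{t−δ+1,i}, ···, x_{t,i}] for the recurrent units and, per footnote 8, a flat d·δ concatenation for the MLP. A minimal sketch with hypothetical covariate values:

```python
def covariate_matrix(history, t, delta):
    """Stack the monthly d-dimensional feature vectors x_{t-delta+1}, ..., x_t
    into a d x delta matrix (input for the RNN units); also return the flat
    concatenation of the same window (input for the MLP baseline)."""
    window = [history[s] for s in range(t - delta + 1, t + 1)]   # delta vectors
    X = [[x[j] for x in window] for j in range(len(window[0]))]  # d rows, delta columns
    flat = [v for x in window for v in x]                        # length d * delta
    return X, flat

# Toy history: 4 months of d = 2 covariates (hypothetical values).
hist = {0: [1.0, 2.0], 1: [1.5, 2.5], 2: [2.0, 3.0], 3: [2.5, 3.5]}
X, flat = covariate_matrix(hist, t=3, delta=2)
assert X == [[2.0, 2.5], [3.0, 3.5]]      # two features over the last two months
assert flat == [2.0, 3.0, 2.5, 3.5]       # same information, order flattened
```

Both inputs carry the same values; the tables above suggest that preserving the temporal axis (the matrix form consumed by LSTM/GRU) matters most at long horizons.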
⁷ Note that while the RMSE focuses more on monthly instances with large default numbers (e.g., collective defaults during the 2008 financial crisis), the RMSNE evaluates the overall performance by fairly treating those instances with fewer default numbers.
⁸ Note that here we directly concatenate the input covariates for the past 6 and 12 months for a fair comparison with the recurrent neural models with δ = 6, 12.

5.4 Discussion on Aggregate Default Distributions Figure 2 shows the aggregate default distributions of FIM and models with three types of parameter generation units (i.e., MLP, LSTM, and GRU). Here we use the cross-time experimental setting and the 48-month prediction horizon as an example. Due to how we split the data in cross-time experiments, the result shown is concatenated from each testing fold. For example, the default numbers in year 2000 of each subfigure come from the first testing fold; similarly, the ones in year 2001 come from the second fold. The blue bars in the figure indicate the actual default numbers and the curves correspond to the estimation of different models. First, it is clear that the FIM estimation departs from reality for such a long prediction horizon, especially for the period around year 2000. It is worth mentioning that not only FIM but also neural models with small δ yield poor performance. However, as δ increases, the neural models begin to approach the realized default distribution. Furthermore, we observe that the two RNN-based units—LSTM and GRU—both fit the actual distribution much better than MLP does when δ grows. These observations suggest that RNN models better capture the dynamic patterns behind the input features and make for a better estimator of future uncertainty.

[Figure 2: Aggregate default distributions — actual default numbers (bars) versus model estimates (curves) over the years 2000–2012; panels include LSTM and GRU with δ = 1, 6, 12 against the actual counts.]

6 Conclusion

In this paper, we develop a multiperiod default prediction framework with parametric family learning through deep neural models. The effectiveness of the proposed method is attested by experiments on a large-scale real-world corporate default dataset covering a long period, the results of which suggest that incorporating neural networks in default prediction yields significantly better performance than the state-of-the-art statistical model. In addition, we show that applying recurrent neural architectures to capture temporal dynamics within economic covariates is promising for multiperiod default prediction. Along with the contributions made in default analysis, this paper provides vision and approaches for research and applications that require monotonicity for cumulative probability estimation, which is indispensable in multiperiod prediction, e.g., extreme weather forecasting, customer churn prediction, or predicting patients' risk of future re-admissions concerning different future periods.

References

[1] Peter Martey Addo, Dominique Guegan, and Bertrand Hassani. Credit risk analysis using machine and deep learning models. Risks, 6(2):38, 2018.
[2] Hafiz Alaka, Lukumon O. Oyedele, Hakeem Owolabi, Vikas Kumar, Saheed Ajayi, Olúgbénga O. Akinadé, and Muhammad Bilal. Systematic Review of Bankruptcy Prediction Models. Expert Systems With Applications, 94:164–184, 2018.
[3] Edward I. Altman. Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. Journal of Finance, 23(4):589–609, 1968.
[4] William H. Beaver. Financial Ratios as Predictors of Failure. Journal of Accounting Research, 4:71–111, 1966.
[5] William H. Beaver. Market Prices, Financial Ratios, and the Prediction of Failure. Journal of Accounting Research, 6(2):179–192, 1968.
[6] Ajay Byanjankar, Markku Heikkilä, and József Mezei. Predicting Credit Risk in Peer-to-Peer Lending: A Neural Network Approach. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, pages 719–725, 2015.
[7] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. Learning to Rank: From Pairwise Approach to Listwise Approach. In Proceedings of the 24th International Conference on Machine Learning, pages 129–136, 2007.
[8] Sudheer Chava and Robert A. Jarrow. Bankruptcy Prediction with Industry Effects. Review of Finance, 8(4):537–569, 2004.
[9] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[10] Credit Research Initiative. NUS Credit Research Initiative Technical Report. Technical report, Credit Research Initiative, National University of Singapore, July 2020.
[11] Peter Crosbie and Jeffrey Bohn. Modeling Default Risk. Moody's KMV White Paper, 2003.
[12] Paulius Danenas and Gintautas Garsva. Selection of Support Vector Machines Based Classifiers for Credit Risk Domain. Expert Systems with Applications, 42(6):3194–3204, 2015.
proach. Journal of Econometrics, 170(1):191–209, 2012.
[15] Darrell Duffie, Leandro Saita, and Ke Wang. Multi-Period Corporate Default Prediction with Stochastic Covariates. Journal of Financial Economics, 83(3):635–665, 2007.
[16] Darrell Duffie and Kenneth Singleton. Modeling Term Structure of Defaultable Bonds. Review of Financial Studies, 12(3):687–720, 1999.
[17] Haneul Eom, Jaeseong Kim, and Sangok Choi. Machine Learning-Based Corporate Default Risk Prediction Model Verification and Policy Recommendation: Focusing on Improvement Through Stacking Ensemble Model. Journal of Intelligence and Information Systems, 26(2):105–129, June 2020.
[18] Silvia Figini, Roberto Savona, and Marika Vezzoli. Corporate Default Prediction Model Averaging: A Normative Linear Pooling Approach. Intelligent Systems in Accounting, Finance and Management, 23(1-2):6–20, 2016.
[19] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735–1780, 1997.
[20] Zan Huang, Hsinchun Chen, Chia-Jung Hsu, Wun-Hwa Chen, and Soushan Wu. Credit Rating Analysis with Support Vector Machines and Neural Networks: A Market Comparative Study. Decision Support Systems, 37(4):543–558, 2004.
[21] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 448–456, 2015.
[22] Robert Jarrow, David Lando, and Stuart M. Turnbull. A Markov Model for the Term Structure of Credit Risk Spreads. Review of Financial Studies, 10(2):481–523, 1997.
[23] Hengjian Jia. Investigation into the Effectiveness of Long Short Term Memory Networks for Stock Price Prediction. arXiv preprint arXiv:1603.07893, 2016.
[24] Hyeongjun Kim, Hoon Cho, and Doojin Ryu. Corporate Default Predictions Using Machine Learning: Literature Review. Sustainability, 12(16), 2020.
[25] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] David Lando. Credit Risk Modeling, pages 787–798. 2009.
[27] Wei-Yang Lin, Ya-Han Hu, and Chih-Fong Tsai. Machine Learning in Financial Crisis Prediction: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4):421–436, 2012.
[28] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. arXiv preprint
[30] Bernardete Ribeiro, Catarina Silva, Ning Chen, Armando Vieira, and João Carvalho das Neves. Enhanced Default Risk Models with SVM+. Expert Systems With Applications, 39(11):10140–10152, 2012.
[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning Representations by Back-propagating Errors. Nature, 323(6088):533–536, 1986.
[32] Suproteem K. Sarkar, Kojin Oshiba, Daniel Giebisch, and Yaron Singer. Robust Classification of Financial Risk. arXiv preprint arXiv:1811.11079, 2018.
[33] Justin Sirignano, Apaar Sadhwani, and Kay Giesecke. Deep Learning for Mortgage Risk. arXiv preprint arXiv:1607.02470, 2016.
[34] Maria Vassalou and Yuhang Xing. Default Risk in Equity Returns. Journal of Finance, 59(2):831–868, 2004.
[35] Xing Yan, Weizhong Zhang, Lin Ma, Wei Liu, and Qi Wu. Parsimonious Quantile Regression of Financial Asset Tail Dynamics via Sequential Learning. In Advances in Neural Information Processing Systems 31, pages 1575–1585, 2018.
[36] Zhi Yang, Yusi Zhang, Binghui Guo, Ben Y. Zhao, and Yafei Dai. DeepCredit: Exploiting User Clickstream for Loan Risk Prediction in P2P Lending. In Proceedings of the 12th International Conference on Web and Social Media, pages 444–453, 2018.
[37] Ching-Chiang Yeh, Der-Jang Chi, and Yi-Rong Lin. Going-Concern Prediction Using Hybrid Random Forests and Rough Set Approach. Information Sciences, 254:98–110, 2014.
[38] Shu-Hao Yeh, Chuan-Ju Wang, and Ming-Feng Tsai. Corporate Default Prediction via Deep Learning. January 2014.
[39] Shu-Hao Yeh, Chuan-Ju Wang, and Ming-Feng Tsai. Deep Belief Networks for Predicting Corporate Defaults. In 2015 24th Wireless and Optical Communication Conference (WOCC), pages 159–163, 2015.
[40] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization. arXiv preprint arXiv:1409.2329, 2014.
[41] Qiang Zhang, Rui Luo, Yaodong Yang, and Yuanyuan Liu. Benchmarking Deep Sequential Models on Volatility Predictions for Financial Time Series. arXiv preprint arXiv:1811.03711, 2018.
[42] Mark E. Zmijewski. Methodological Issues Related to the Estimation of Financial Distress Prediction Models. Journal of Accounting Research, 22:59–82, 1984.