Statistical Methods in Credit Risk Modeling

by
Aijun Zhang
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Statistics)
in The University of Michigan
2009
Doctoral Committee:
Professor Vijayan N. Nair, Co-Chair
Agus Sudjianto, Co-Chair, Bank of America
Professor Tailen Hsing
Associate Professor Jionghua Jin
Associate Professor Ji Zhu
© Aijun Zhang 2009
All Rights Reserved
To my elementary school, high school and university teachers
ACKNOWLEDGEMENTS
First of all, I would like to express my gratitude to my advisor Prof. Vijay Nair for guiding
me through the entire PhD research. I appreciate his inspiration, encouragement and
protection during these valuable years at the University of Michigan. I am thankful
to Julian Faraway for his encouragement during the first years of my PhD journey. I
would also like to thank Ji Zhu, Judy Jin and Tailen Hsing for serving on my doctoral
committee and for helpful discussions on this thesis and other research work.
I am grateful to Dr. Agus Sudjianto, my co-advisor from Bank of America, for
giving me the opportunity to work with him during the summers of 2006 and 2007
and for offering me a full-time position. I appreciate his guidance, active support
and his many illuminating ideas. I would also like to thank Tony Nobili, Mike Bonn,
Ruilong He, Shelly Ennis, Xuejun Zhou, Arun Pinto, and others I first met in 2006
at the Bank. They all persuaded me to jump into the area of credit risk research; I
did it a year later and finally came up with this thesis within two more years.
I would extend my acknowledgement to Prof. Kai-Tai Fang for his consistent
encouragement ever since I graduated from his class in Hong Kong 5 years ago.
This thesis would have been impossible without the love and remote support of my parents in
China. To them, I am most indebted.
TABLE OF CONTENTS
DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . iii
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
CHAPTER
I. An Introduction to Credit Risk Modeling . . . . . . . . . . . . 1
1.1 Two Worlds of Credit Risk . . . . . . . . . . . . . . . . . . . 2
1.1.1 Credit Spread Puzzle . . . . . . . . . . . . . . . . . 2
1.1.2 Actual Defaults . . . . . . . . . . . . . . . . . . . . 5
1.2 Credit Risk Models . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Structural Approach . . . . . . . . . . . . . . . . . 6
1.2.2 Intensity-based Approach . . . . . . . . . . . . . . . 9
1.3 Survival Models . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Parametrizing Default Intensity . . . . . . . . . . . 12
1.3.2 Incorporating Covariates . . . . . . . . . . . . . . . 13
1.3.3 Correlating Credit Defaults . . . . . . . . . . . . . . 15
1.4 Scope of the Thesis . . . . . . . . . . . . . . . . . . . . . . . 18
II. Adaptive Smoothing Spline . . . . . . . . . . . . . . . . . . . . . 24
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2 AdaSS: Adaptive Smoothing Spline . . . . . . . . . . . . . . . 28
2.2.1 Reproducing Kernels . . . . . . . . . . . . . . . . . 30
2.2.2 Local Penalty Adaptation . . . . . . . . . . . . . . . 32
2.3 AdaSS Properties . . . . . . . . . . . . . . . . . . . . . . . . 36
2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 38
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6 Technical Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 46
III. Vintage Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2 MEV Decomposition Framework . . . . . . . . . . . . . . . . 51
3.3 Gaussian Process Models . . . . . . . . . . . . . . . . . . . . 58
3.3.1 Covariance Kernels . . . . . . . . . . . . . . . . . . 58
3.3.2 Kriging, Spline and Kernel Methods . . . . . . . . . 60
3.3.3 MEV Backfitting Algorithm . . . . . . . . . . . . . 65
3.4 Semiparametric Regression . . . . . . . . . . . . . . . . . . . 67
3.4.1 Single Segment . . . . . . . . . . . . . . . . . . . . 68
3.4.2 Multiple Segments . . . . . . . . . . . . . . . . . . . 69
3.5 Applications in Credit Risk Modeling . . . . . . . . . . . . . 73
3.5.1 Simulation Study . . . . . . . . . . . . . . . . . . . 74
3.5.2 Corporate Default Rates . . . . . . . . . . . . . . . 78
3.5.3 Retail Loan Loss Rates . . . . . . . . . . . . . . . . 80
3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.7 Technical Proofs . . . . . . . . . . . . . . . . . . . . . . . . . 86
IV. Dual-time Survival Analysis . . . . . . . . . . . . . . . . . . . . . 88
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 Nonparametric Methods . . . . . . . . . . . . . . . . . . . . . 93
4.2.1 Empirical Hazards . . . . . . . . . . . . . . . . . . . 94
4.2.2 DtBreslow Estimator . . . . . . . . . . . . . . . . . 95
4.2.3 MEV Decomposition . . . . . . . . . . . . . . . . . 97
4.3 Structural Models . . . . . . . . . . . . . . . . . . . . . . . . 104
4.3.1 First-passage-time Parameterization . . . . . . . . . 104
4.3.2 Incorporation of Covariate Effects . . . . . . . . . . 109
4.4 Dual-time Cox Regression . . . . . . . . . . . . . . . . . . . . 111
4.4.1 Dual-time Cox Models . . . . . . . . . . . . . . . . 112
4.4.2 Partial Likelihood Estimation . . . . . . . . . . . . 114
4.4.3 Frailty-type Vintage Effects . . . . . . . . . . . . . . 116
4.5 Applications in Retail Credit Risk Modeling . . . . . . . . . . 119
4.5.1 Credit Card Portfolios . . . . . . . . . . . . . . . . 119
4.5.2 Mortgage Competing Risks . . . . . . . . . . . . . . 124
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.7 Supplementary Materials . . . . . . . . . . . . . . . . . . . . 130
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
LIST OF FIGURES
Figure
1.1 Moody’s-rated corporate default rates, bond spreads and NBER-
dated recessions. Data sources: a) Moody’s Baa & Aaa corpo-
rate bond yields (http://research.stlouisfed.org/fred2/categories/119);
b) Moody’s Special Comment on Corporate Default and Recovery
Rates, 1920-2008 (http://www.moodys.com/); c) NBER-dated reces-
sions (http://www.nber.org/cycles/). . . . . . . . . . . . . . . . . . . 4
1.2 Simulated drifted Wiener process, first-passage-time and hazard rate. 8
1.3 Illustration of retail credit portfolios and vintage diagram. . . . . . 19
1.4 Moody’s speculative-grade default rates for annual cohorts 1970-2008:
projection views in lifetime, calendar and vintage origination time. . 21
1.5 A road map of thesis developments of statistical methods in credit
risk modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1 Heterogeneous smooth functions: (1) Doppler function simulated with
noise, (2) HeaviSine function simulated with noise, (3) Motorcycle-
Accident experimental data, and (4) Credit-Risk tweaked sample. . . 26
2.2 OrdSS function estimate and its 2nd-order derivative (upon standardization):
scaled signal from f(t) = sin(ωt) (upper panel) and f(t) = sin(ωt⁴) (lower panel),
ω = 10π, n = 100 and snr = 7. The sin(ωt⁴) signal resembles the Doppler function
in Figure 2.1; both have time-varying frequency. . . . . . . . . . . . . . . . . . . . 34
2.3 OrdSS curve fitting with m = 2 (shown by solid lines). The dashed
lines represent 95% confidence intervals. In the credit-risk case, the
log loss rates are considered as the responses, and the time-dependent
weights are specified proportional to the number of replicates. . . . 39
2.4 Simulation study of Doppler and HeaviSine functions: OrdSS (blue),
AdaSS (red) and the heterogeneous truth (light background). . . . . 40
2.5 Non-stationary OrdSS and AdaSS for Motorcycle-Accident Data. . . 42
2.6 OrdSS, non-stationary OrdSS and AdaSS: performance in comparison. 42
2.7 AdaSS estimate of maturation curve for Credit-Risk sample: piecewise-
constant ρ⁻¹(t) (upper panel) and piecewise-linear ρ⁻¹(t) (lower panel). . . . . . . 44
3.1 Vintage diagram upon truncation and exemplified prototypes. . . . 50
3.2 Synthetic vintage data analysis: (top) underlying true marginal effects;
(2nd) simulation with noise; (3rd) MEV Backfitting algorithm upon convergence;
(bottom) estimation compared to the underlying truth. . . . . . . . . . . . . . . . 76
3.3 Synthetic vintage data analysis: (top) projection views of fitted val-
ues; (bottom) data smoothing on vintage diagram. . . . . . . . . . . 77
3.4 Results of MEV Backfitting algorithm upon convergence (right panel)
based on the squared exponential kernels. Shown in the left panel is
the GCV selection of smoothing and structural parameters. . . . . . 79
3.5 MEV fitted values: Moody’s-rated Corporate Default Rates . . . . . 79
3.6 Vintage data analysis of retail loan loss rates: (top) projection views
of empirical loss rates in lifetime m, calendar t and vintage origination
time v; (bottom) MEV decomposition effects f̂(m), ĝ(t) and ĥ(v) (at
log scale). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.1 Dual-time-to-default: (left) Lexis diagram of sub-sampled simula-
tions; (right) empirical hazard rates in either lifetime or calendar time. 90
4.2 DtBreslow estimator vs one-way nonparametric estimator using the
data of (top) both pre-2005 and post-2005 vintages; (middle) pre-
2005 vintages only; (bottom) post-2005 vintages only. . . . . . . . . 98
4.3 MEV modeling of empirical hazard rates, based on the dual-time-to-
default data of (top) both pre-2005 and post-2005 vintages; (middle)
pre-2005 vintages only; (bottom) post-2005 vintages only. . . . . . . 101
4.4 Nonparametric analysis of credit card risk: (top) one-way empirical
hazards, (middle) DtBreslow estimation, (bottom) MEV decompo-
sition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.5 MEV decomposition of credit card risk for low, medium and high
FICO buckets: Segment 1 (top panel); Segment 2 (bottom panel). . 123
4.6 Home prices and unemployment rate of California state: 2000-2008. 125
4.7 MEV decomposition of mortgage hazards: default vs. prepayment . 125
4.8 Split time-varying covariates into little time segments, illustrated. . 133
LIST OF TABLES
Table
1.1 Moody’s speculative-grade default rates. Data source: Moody’s spe-
cial comment (release: February 2009) on corporate default and re-
covery rates, 1920-2008 (http://www.moodys.com/) and author’s cal-
culations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1 AdaSS parameters in Doppler and HeaviSine simulation study. . . . . 41
3.1 Synthetic vintage data analysis: MEV modeling exercise with GCV-
selected structural and smoothing parameters. . . . . . . . . . . . . 77
4.1 Illustrative data format of pooled credit card loans . . . . . . . . . . 120
4.2 Loan-level covariates considered in mortgage credit risk modeling,
where NoteRate could be dynamic and others are static upon origi-
nation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.3 Maximum partial likelihood estimation of mortgage covariate effects
in dual-time Cox regression models. . . . . . . . . . . . . . . . . . . 128
ABSTRACT
This research deals with some statistical modeling problems that are motivated
by credit risk analysis. Credit risk modeling has been the subject of considerable
research interest in finance and has recently drawn the attention of statistical re-
searchers. In the first chapter, we provide an up-to-date review of credit risk models
and demonstrate their close connection to survival analysis.
The first statistical problem considered is the development of adaptive smooth-
ing spline (AdaSS) for heterogeneously smooth function estimation. Two challenging
issues that arise in this context are evaluation of reproducing kernel and determi-
nation of local penalty, for which we derive an explicit solution based on piecewise
type of local adaptation. Our nonparametric AdaSS technique is capable of fitting a
diverse set of ‘smooth’ functions including possible jumps, and it plays a key role in
subsequent work in the thesis.
The second topic is the development of dual-time analytics for observations in-
volving both lifetime and calendar timescale. It includes “vintage data analysis”
(VDA) for continuous type of responses in the third chapter, and “dual-time survival
analysis” (DtSA) in the fourth chapter. We propose a maturation-exogenous-vintage
(MEV) decomposition strategy in order to understand the risk determinants in terms
of self-maturation in lifetime, exogenous influence by macroeconomic conditions, and
heterogeneity induced from vintage originations. The intrinsic identification problem
is discussed for both VDA and DtSA. Specifically, we consider VDA under Gaus-
sian process models, provide an efficient MEV backfitting algorithm and assess its
performance with both simulation and real examples.
DtSA on Lexis diagram is of particular importance in credit risk modeling where
the default events could be triggered by both endogenous and exogenous hazards.
We consider nonparametric estimators, first-passage-time parameterization and semi-
parametric Cox regression. These developments extend the family of models for both
credit risk modeling and survival analysis. We demonstrate the application of DtSA
to credit card and mortgage risk analysis in retail banking, and shed some light on
understanding the ongoing credit crisis.
CHAPTER I
An Introduction to Credit Risk Modeling
Credit risk is a critical area in banking and is of concern to a variety of stakehold-
ers: institutions, consumers and regulators. It has been the subject of considerable
research interest in banking and finance communities, and has recently drawn the
attention of statistical researchers. By Wikipedia’s definition,
“Credit risk is the risk of loss due to a debtor’s non-payment of a loan
or other line of credit.” (Wikipedia.org, as of March 2009)
Central to credit risk is the default event, which occurs if the debtor is unable to
meet its legal obligations according to the debt contract. Examples of default
events include bond default, corporate bankruptcy, credit card charge-off, and
mortgage foreclosure. Other forms of credit risk include repayment delinquency
in retail loans, loss severity upon the default event, and unexpected changes of
credit rating.
An enormous literature in credit risk has been fostered by both academics in
finance and practitioners in industry. There are two parallel worlds based upon a
simple dichotomous rule of data availability: (a) the direct measurements of credit
performance and (b) the prices observed in the credit market. The data availability
leads to two streams of credit risk modeling that have key distinctions.
1.1 Two Worlds of Credit Risk
The two worlds of credit risk can be simply characterized by the types of default
probability, one being actual and the other being implied. The former corresponds
to the direct observations of defaults, also known as the physical default probability
in finance. The latter refers to the risk-neutral default probability implied from the
credit market data, e.g. corporate bond yields.
The academic literature on corporate credit risk has been inclined toward the study of
implied defaults, which is yet a puzzling world.
1.1.1 Credit Spread Puzzle
The credit spread of a defaultable corporate bond is its excess yield over the
default-free Treasury bond of the same time to maturity. Consider a zero-coupon
corporate bond with unit face value and maturity date T. The yield-to-maturity at
t < T is defined by
Y(t, T) = −log B(t, T)/(T − t)   (1.1)
where B(t, T) is the bond price of the form
B(t, T) = E[ exp( −∫_t^T (r(s) + λ(s)) ds ) ],   (1.2)
given the independent term structures of the interest rate r(·) and the default rate
λ(·). Setting λ(t) ≡ 0 gives the benchmark price B_0(t, T) and the yield Y_0(t, T) of
the Treasury bond. Then, the credit spread can be calculated as
Spread(t, T) = Y(t, T) − Y_0(t, T) = −log( B(t, T)/B_0(t, T) )/(T − t) = −log(1 − q(t, T))/(T − t)   (1.3)
where q(t, T) = 1 − E[ exp( −∫_t^T λ(s) ds ) ] is the conditional default probability P[τ ≤ T | τ > t] and
τ denotes the time-to-default, to be detailed in the next section.
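For a quick numerical illustration of (1.1)-(1.3), consider the following Python sketch (not part of the thesis analysis; the constant short rate r and default intensity lam are assumed toy values), which prices a zero-coupon bond under constant independent term structures and recovers the credit spread.

    import numpy as np

    def bond_price(T, r, lam=0.0, t=0.0):
        # zero-coupon bond price (1.2) with constant short rate r and default intensity lam
        return np.exp(-(r + lam) * (T - t))

    T, r, lam = 5.0, 0.04, 0.02                  # assumed toy parameters
    B, B0 = bond_price(T, r, lam), bond_price(T, r)
    Y, Y0 = -np.log(B) / T, -np.log(B0) / T      # yields-to-maturity (1.1), with t = 0
    q = 1.0 - np.exp(-lam * T)                   # conditional default probability q(0, T)
    print(-np.log(1.0 - q) / T, Y - Y0)          # credit spread (1.3); both equal lam here

With a constant intensity the spread reduces to the intensity itself, which is the simplest sanity check of (1.3).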
The credit spread is supposed to co-move with the default rate. For illustration,
Figure 1.1 (top panel) plots the Moody’s-rated corporate default rates and Baa-
Aaa bond spreads ranging from 1920-2008. The shaded backgrounds correspond to
NBER’s latest announcement of recession dates. Most spikes of the default rates and
the credit spreads coincide with the recessions, but it is clear that the movements of
the two time series differ in both level and change. Such a mismatch is the so-called
credit spread puzzle in the recent corporate finance literature: none of the existing
structural credit risk models can successfully imply the actual default rates from the
market data of credit spreads.
The default rates implied from credit spreads mostly overestimate the expected
default rates. One factor that helps explain the gap is liquidity risk, the risk that a
security cannot be traded quickly enough in the market to prevent a loss. Figure 1.1
(bottom panel) plots, in fine resolution, the Moody’s speculative-grade default rates
versus the high-yield credit spreads: 1994-2008. It illustrates the phenomenon where
the spread changed in response to liquidity-risk events, while the default rates did not
react significantly until quarters later; see the liquidity crises triggered in 1998 (Rus-
sian/LTCM shock) and 2007 (Subprime Mortgage meltdown). Besides the liquidity
risk, there exist other known (e.g. tax treatments of corporate bonds vs. government
bonds) and unknown factors that make incomplete the implied world of credit risk.
As of today, we lack a thorough understanding of the credit spread puzzle; see e.g.
Chen, Lesmond and Wei (2007) and references therein.
The shaky foundation of default risk implied from market credit spreads, without
looking at the historical defaults, leads to further questions about credit derivatives,
e.g. the Credit Default Swap (CDS) and the Collateralized Debt Obligation (CDO).
The 2007-08 collapse of the credit market on Wall Street is partly due to over-complication
in "innovating" such complex financial instruments on the one hand, and
over-simplification in quantifying their embedded risks on the other.
Figure 1.1: Moody’s-rated corporate default rates, bond spreads and NBER-
dated recessions. Data sources: a) Moody’s Baa & Aaa cor-
porate bond yields (http://research.stlouisfed.org/fred2/categories/119);
b) Moody’s Special Comment on Corporate Default and Recovery
Rates, 1920-2008 (http://www.moodys.com/); c) NBER-dated recessions
(http://www.nber.org/cycles/).
1.1.2 Actual Defaults
The other world of credit risk is the study of default probability bottom up from
the actual credit performance. It includes the popular industry practices of
a) credit rating in corporate finance, by e.g. the three major U.S. rating agencies:
Moody’s, Standard & Poor’s, and Fitch.
b) credit scoring in consumer lending, by e.g. the three major U.S. credit bureaus:
Equifax, Experian and TransUnion.
Both credit ratings and scores represent the creditworthiness of individual corpora-
tions and consumers. The final evaluations are based on statistical models of the
expected default probability, as well as judgement by rating/scoring specialists. Let
us describe very briefly some rating and scoring basics related to the thesis.
The letters Aaa and Baa in Figure 1.1 are examples of Moody's rating system,
which uses Aaa, Aa, A, Baa, Ba, B, Caa, Ca and C to represent the likelihood of default
from lowest to highest. The speculative grade in Figure 1.1 refers to Ba and
worse ratings. Speculative-grade corporate bonds are sometimes said to be
high-yield or junk.
FICO, developed by Fair Isaac Corporation, is the best-known consumer credit
score and the one most widely used by U.S. banks and other credit card or mortgage
lenders. It ranges from 300 (very poor) to 850 (best), and is intended to represent the
creditworthiness of a borrower, i.e. the likelihood that he or she will repay the debt. For the same
borrower, the three major U.S. credit bureaus often report inconsistent FICO scores
based on their own proprietary models.
Compared to both the industry practices mentioned above and the academic study of
implied default probability, the academic literature based on actual defaults
is much smaller, which we believe is largely due to academic researchers' limited
access to proprietary internal data on historical defaults. A few academic
works will be reviewed later in Section 1.3. In this thesis, we make an attempt
to develop statistical methods based on the actual credit performance data. For
demonstration, we shall use the synthetic or tweaked samples of retail credit portfolios,
as well as the public release of corporate default rates by Moody’s.
1.2 Credit Risk Models
This section reviews the finance literature of credit risk models, including both
structural and intensity-based approaches. Our focus is placed on the probability of
default and the hazard rate of time-to-default.
1.2.1 Structural Approach
In credit risk modeling, the structural approach is also known as the firm-value
approach, since a firm's inability to meet its contractual debt is assumed to be
determined by its asset value. It was inspired by the 1970s Black-Scholes-Merton method-
ology for financial option pricing. Two classic structural models are the Merton model
(Merton, 1974) and the first-passage-time model (Black and Cox, 1976).
The Merton model assumes that the default event occurs at the maturity date
of debt if the asset value is less than the debt level. Let D be the debt level with
maturity date T, and let V (t) be the latent asset value following a geometric Brownian
motion
dV (t) = µV (t)dt + σV (t)dW(t), (1.4)
with drift µ, volatility σ and the standard Wiener process W(t). Recall that EW(t) =
0, EW(t)W(s) = min(t, s). Given the initial asset value V(0) > D, by Itô's lemma,
V(t)/V(0) = exp{ (µ − σ²/2)t + σW(t) } ∼ Lognormal( (µ − σ²/2)t, σ²t ),   (1.5)
from which one may evaluate the default probability P(V (T) ≤ D).
The notion of distance-to-default facilitates the computation of conditional default
probability. Given the sample path of asset values up to t, one may first estimate the
unknown parameters in (1.4) by the maximum likelihood method. According to Duffie
and Singleton (2003), let us define the distance-to-default X(t) as the number of
standard deviations by which log V(t) exceeds log D, i.e.
X(t) = (log V(t) − log D)/σ.   (1.6)
Clearly, X(t) is a drifted Wiener process of the form
X(t) = c + bt + W(t), t ≥ 0 (1.7)
with b = (µ − σ²/2)/σ and c = (log V(0) − log D)/σ. Then, it is easy to verify that the conditional
probability of default at maturity date T is
P(V(T) ≤ D | V(t) > D) = P(X(T) ≤ 0 | X(t) > 0) = Φ( −(X(t) + b(T − t))/√(T − t) ),   (1.8)
where Φ(·) is the cumulative normal distribution function.
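To make the distance-to-default computation concrete, a minimal Python sketch of (1.6)-(1.8) follows; the asset value, debt level and dynamics parameters are assumed purely for illustration.

    import numpy as np
    from scipy.stats import norm

    def conditional_default_prob(V_t, D, mu, sigma, t, T):
        # P(V(T) <= D | V(t) > D) via the distance-to-default, cf. (1.6)-(1.8)
        b = (mu - 0.5 * sigma**2) / sigma           # drift of the process X(t)
        X_t = (np.log(V_t) - np.log(D)) / sigma     # distance-to-default (1.6)
        return norm.cdf(-(X_t + b * (T - t)) / np.sqrt(T - t))   # (1.8)

    print(conditional_default_prob(V_t=120.0, D=100.0, mu=0.05, sigma=0.25, t=1.0, T=5.0))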
The first-passage-time model by Black and Cox (1976) extends the Merton model
so that the default event could occur as soon as the asset value reaches a pre-specified
debt barrier. Figure 1.2 (left panel) illustrates the first-passage-time of a drifted
Wiener process by simulation, where we set the parameters b = −0.02 and c = 8 (s.t.
a constant debt barrier). The key difference between the Merton model and the Black-Cox
model lies in the green path (the middle realization viewed at T), which is treated as a
default in one model but not the other.
By (1.7) and (1.8), V (t) hits the debt barrier once the distance-to-default process
X(t) hits zero.
Figure 1.2: Simulated drifted Wiener process, first-passage-time and hazard rate.
Given the initial distance-to-default c ≡ X(0) > 0, consider the first-passage-time
τ = inf{ t ≥ 0 : X(t) ≤ 0 },   (1.9)
where inf ∅ = ∞ as usual. It is well known that τ follows the inverse Gaussian
distribution (Schrödinger, 1915; Tweedie, 1957; Chhikara and Folks, 1989) with the
density
f(t) = (c/√(2π)) t^{−3/2} exp{ −(c + bt)²/(2t) },   t ≥ 0.   (1.10)
See also other types of parametrization in Marshall and Olkin (2007; §13). The
survival function S(t) is defined by P(τ > t) for any t ≥ 0 and is given by
S(t) = Φ( (c + bt)/√t ) − e^{−2bc} Φ( (−c + bt)/√t ).   (1.11)
The hazard rate, or the conditional default rate, is defined by the instantaneous
rate of default conditional on the survivorship,
λ(t) = lim_{Δt↓0} (1/Δt) P( t ≤ τ < t + Δt | τ ≥ t ) = f(t)/S(t).   (1.12)
Using the inverse Gaussian density and survival functions, we obtain the form of the
first-passage-time hazard rate:
λ(t; c, b) = (c/√(2πt³)) exp{ −(c + bt)²/(2t) } / [ Φ( (c + bt)/√t ) − e^{−2bc} Φ( (−c + bt)/√t ) ],   c > 0.   (1.13)
This is one of the most important forms of hazard function in the structural approach
to credit risk modeling. Figure 1.2 (right panel) plots λ(t; c, b) for b = −0.02 and
c = 4, 6, 8, 10, 12, which resemble the default rates from low to high credit qualities
in terms of credit ratings or FICO scores. Both the trend parameter b and the initial
distance-to-default parameter c provide insights into understanding the shape of the
hazard rate; see the details in Chapter IV or Aalen, Borgan and Gjessing (2008; §10).
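The first-passage-time quantities (1.10), (1.11) and (1.13) are straightforward to evaluate numerically. The following Python sketch (with b and c chosen to mimic, not reproduce, Figure 1.2) computes the inverse Gaussian density, survival and hazard functions.

    import numpy as np
    from scipy.stats import norm

    def fpt_density(t, c, b):
        # inverse Gaussian first-passage-time density (1.10)
        return c / np.sqrt(2 * np.pi * t**3) * np.exp(-(c + b * t)**2 / (2 * t))

    def fpt_survival(t, c, b):
        # survival function (1.11)
        return norm.cdf((c + b * t) / np.sqrt(t)) - np.exp(-2 * b * c) * norm.cdf((-c + b * t) / np.sqrt(t))

    def fpt_hazard(t, c, b):
        # hazard rate (1.13) = density / survival
        return fpt_density(t, c, b) / fpt_survival(t, c, b)

    t = np.linspace(0.5, 20, 100)
    for c in (4, 6, 8, 10, 12):   # initial distance-to-default, from low to high credit quality
        print(c, fpt_hazard(t, c, b=-0.02).max())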
Modern developments of structural models based on the Merton and Black-Cox
models are reviewed in Bielecki and Rutkowski (2004; §3). Later in Chapter IV, we
will discuss the dual-time extension of first-passage-time parameterization with both
endogenous and exogenous hazards, as well as non-constant default barrier and in-
complete information about structural parameters.
1.2.2 Intensity-based Approach
The intensity-based approach is also called the reduced-form approach, proposed
independently by Jarrow and Turnbull (1995) and Madan and Unal (1998). Many
follow-up papers can be found in Lando (2004), Bielecki and Rutkowski (2004) and
references therein. Unlike the structural approach that assumes the default to be
completely determined by the asset value subject to a barrier, the default event in the
reduced-form approach is governed by an externally specified intensity process that
may or may not be related to the asset value. The default is treated as an unexpected
event that comes ‘by surprise’. This is a practically appealing feature, since in the
real world the default event (e.g. the 2001 bankruptcy of Enron Corporation) often
happens all of a sudden, without announcement.
The default intensity corresponds to the hazard rate λ(t) = f(t)/S(t) defined in
(1.12) and it has roots in statistical reliability and survival analysis of time-to-failure.
When S(t) is absolutely continuous with f(t) = d(1 −S(t))/dt, we have that
λ(t) = −dS(t)/(S(t)dt) = −d[log S(t)]/dt,   S(t) = exp{ −∫_0^t λ(s) ds },   t ≥ 0.
In survival analysis, λ(t) is usually assumed to be a deterministic function in time.
In credit risk modeling, λ(t) is often treated as stochastic. Thus, the default time
τ is doubly stochastic. Note that Lando (1998) adopted the term “doubly stochastic
Poisson process” (or, Cox process) that refers to a counting process with possibly
recurrent events. What matters in modeling defaults is only the first jump of the
counting process, in which case the default intensity is equivalent to the hazard rate.
In finance, the intensity-based models are mostly the term-structure models bor-
rowed from the literature of interest-rate modeling. Below is an incomplete list:
Vasicek: dλ(t) = κ(θ − λ(t))dt + σ dW_t
Cox-Ingersoll-Ross: dλ(t) = κ(θ − λ(t))dt + σ √λ(t) dW_t   (1.14)
Affine jump: dλ(t) = µ(λ(t))dt + σ(λ(t))dW_t + dJ_t
with reference to Vasicek (1977), Cox, Ingersoll and Ross (1985) and Duffie, Pan and
Singleton (2000). The last model involves a pure-jump process J_t, and it covers both
the mean-reverting Vasicek and CIR models by setting µ(λ(t)) = κ(θ − λ(t)) and
σ(λ(t)) = √(σ_0² + σ_1²λ(t)).
The term-structure models provide straightforward ways to simulate the future
default intensity for the purpose of predicting the conditional default probability.
However, they are ad hoc models lacking fundamental interpretation of the default
event. The choices (1.14) are popular because they could yield closed-form pricing
formulas for (1.2), while the real-world default intensities deserve more flexible and
meaningful forms. For instance, the intensity models in (1.14) cannot be used to
model the endogenous shapes of first-passage-time hazards illustrated in Figure 1.2.
The dependence of default intensity on state variables z(t) (e.g. macroeconomic
covariates) is usually treated through a multivariate term-structure model for the
joint process (λ(t), z(t)). This approach essentially presumes a linear dependence in the
diffusion components, e.g. by correlated Wiener processes. In practice, the effect of
a state variable on the default intensity can be non-linear in many other ways.
The intensity-based approach also includes the duration models for econometric
analysis of actual historical defaults. They correspond to the classical survival analysis
in statistics, which opens another door for approaching credit risk.
1.3 Survival Models
Credit risk modeling in finance is closely related to survival analysis in statis-
tics, including both the first-passage-time structural models and the duration type
of intensity-based models. In a rough sense, default risk models based on the actual
credit performance data exclusively belong to survival analysis, since the latter by
definition is the analysis of time-to-failure data. Here, the failure refers to the default
event in either corporate or retail risk exposures; see e.g. Duffie, Saita and Wang
(2007) and Deng, Quigley and Van Order (2000).
We find that the survival models have at least four advantages in approaching
credit risk modeling:
1. flexibility in parametrizing the default intensity,
2. flexibility in incorporating various types of covariates,
3. effectiveness in modeling the credit portfolios, and
4. being straightforward to enrich and extend the family of credit risk models.
The following material is organized to make these points one by one.
1.3.1 Parametrizing Default Intensity
In survival analysis, there is a rich set of lifetime distributions for parametrizing
the default intensity (i.e. hazard rate) λ(t), including
Exponential: λ(t) = α
Weibull: λ(t) = α t^{β−1}   (1.15)
Lognormal: λ(t) = (1/(αt)) φ(log t/α)/Φ(−log t/α)
Log-logistic: λ(t) = (β/α)(t/α)^{β−1} / [1 + (t/α)^β],   α, β > 0
where φ(x) = Φ′(x) is the density function of the normal distribution. Another example
is the inverse Gaussian type of hazard rate function in (1.13). More examples of
parametric models can be found in Lawless (2003) and Marshall and Olkin (2007)
among many other texts. They all correspond to distributional assumptions on the
default time τ, while the inverse Gaussian distribution has a beautiful first-passage-time
interpretation.
The log-location-scale transform can be used to model the hazard rates. Given a
latent default time τ_0 with baseline hazard rate λ_0(t) and survival function S_0(t) =
exp{ −∫_0^t λ_0(s) ds }, one may model the firm-specific default time τ_i by
log τ_i = µ_i + σ_i log τ_0,   −∞ < µ_i < ∞, σ_i > 0   (1.16)
with individual risk parameters (µ_i, σ_i). Then, the firm-specific survival function
takes the form
S_i(t) = S_0( (t e^{−µ_i})^{1/σ_i} )   (1.17)
and the firm-specific hazard rate is
λ_i(t) = −d log S_i(t)/dt = (1/(σ_i t)) (t e^{−µ_i})^{1/σ_i} λ_0( (t e^{−µ_i})^{1/σ_i} ).   (1.18)
For example, given the Weibull baseline hazard rate λ_0(t) = α t^{β−1}, the log-location-scale
transformed hazard rate is given by
λ_i(t) = (α/σ_i) t^{β/σ_i − 1} exp{ −βµ_i/σ_i }.   (1.19)
1.3.2 Incorporating Covariates
Consider firm-specific (or consumer-specific) covariates z_i ∈ ℝ^p that are either
static or time-varying. The dependence of default intensity on z_i can be studied by
regression models. In the context of survival analysis, there are two popular classes
of regression models, namely
1. the multiplicative hazards regression models, and
2. the accelerated failure time (AFT) regression models.
The hazard rate is taken as the modeling basis in the first class of regression models:
λ_i(t) = λ_0(t) r(z_i(t))   (1.20)
where λ_0(t) is a baseline hazard function and r(z_i(t)) is a firm-specific relative-risk
multiplier (both must be positive). For example, one may use the inverse Gaussian
hazard rate (1.13) or pick one from (1.15) as the baseline λ_0(t). The relative-risk term
is often specified by r(z_i(t)) = exp{θᵀz_i(t)} with parameter θ ∈ ℝ^p. Then, the hazard
ratio between any two firms, λ_i(t)/λ_j(t) = exp{θᵀ[z_i(t) − z_j(t)]}, is constant if the
covariate difference z_i(t) − z_j(t) is constant in time, thus ending up with proportional
hazard rates. Note that the covariates z_i(t) are sometimes transformed before entering
the model. For example, an econometric modeler usually takes the difference of the Gross
Domestic Product (GDP) index and uses the GDP growth as the entering covariate.
For a survival analyst, the introduction of the multiplicative hazards model (1.20)
is incomplete without the Cox proportional hazards (CoxPH) model.
The latter is among the most popular models in modern survival analysis. Rather
than using a parametric form of the baseline hazard, Cox (1972, 1975) considered
λ_i(t) = λ_0(t) exp{θᵀz_i(t)} by leaving the baseline hazard λ_0(t) unspecified. Such
semiparametric modeling could reduce the estimation bias of the covariate effects due
to baseline misspecification. One may use the partial likelihood to estimate θ, and
the nonparametric likelihood to estimate λ_0(t). The estimated baseline λ̂_0(t) can be
further smoothed upon parametric, term-structure or data-driven techniques.
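For intuition on the partial likelihood just mentioned, here is a minimal Python sketch (static covariates, no ties, no censoring; a toy illustration rather than the estimator used later in Chapter IV) of the negative log partial likelihood and its numerical maximization.

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_partial_likelihood(theta, times, events, Z):
        # -sum over events of [theta'z_i - log sum_{j in risk set} exp(theta'z_j)], assuming no ties
        eta = Z @ theta
        order = np.argsort(times)
        times, events, eta = times[order], events[order], eta[order]
        loglik = 0.0
        for i in np.flatnonzero(events):
            loglik += eta[i] - np.log(np.exp(eta[i:]).sum())   # eta[i:] = linear predictors of the risk set
        return -loglik

    rng = np.random.default_rng(1)
    n, theta_true = 200, np.array([0.8, -0.5])
    Z = rng.normal(size=(n, 2))
    times = rng.exponential(1.0 / np.exp(Z @ theta_true))      # exponential baseline hazard
    events = np.ones(n, dtype=int)                             # no censoring in this toy example
    fit = minimize(neg_log_partial_likelihood, x0=np.zeros(2), args=(times, events, Z))
    print(fit.x)                                               # should be close to theta_true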
The second class of AFT regression models is based on the aforementioned log-location-scale
transform of the default time. Suppose that z_i is not time-varying. Given
τ_0 with baseline hazard rate λ_0(t), consider (1.16) with the location µ_i = θᵀz_i and
the scale σ_i = 1 (for simplicity), i.e. log τ_i = θᵀz_i + log τ_0. Then, it is straightforward
to get the survival function and hazard rate by (1.17) and (1.18):
S_i(t) = S_0( t exp{−θᵀz_i} ),   λ_i(t) = λ_0( t exp{−θᵀz_i} ) exp{−θᵀz_i}.   (1.21)
An interesting phenomenon arises when using the Weibull baseline: by (1.19), we have
that λ_i(t) = λ_0(t) exp{−βθᵀz_i}. This is equivalent to using the multiplicative hazards
model (1.20), a unique property of the Weibull lifetime distribution. For more
details about AFT regression, see e.g. Lawless (2003; §6).
Thus, the covariates z_i could induce the firm-specific heterogeneity via either a
relative-risk multiplier λ_i(t)/λ_0(t) = e^{θᵀz_i(t)} or the default time acceleration τ_i/τ_0 =
e^{θᵀz_i}. In situations where certain hidden covariates are not accessible, the unexplained
variation also leads to heterogeneity. The frailty models introduced by Vaupel, et al.
(1979) are designed to model such unobserved heterogeneity as a random quantity,
say, Z ∼ Gamma(δ⁻¹, δ) with shape δ⁻¹ and scale δ > 0, where a Gamma(k, δ)
distribution has the density
p(z; k, δ) = z^{k−1} exp{−z/δ} / (Γ(k)δ^k),   z > 0   (1.22)
with mean kδ and variance kδ². Then, the proportional frailty model extends (1.20)
to be
λ_i(t) = Z λ_0(t) r(z_i(t)),   (1.23)
for each single obligor i subject to the frailty effect Z; see Aalen, et al. (2008; §6).
On the other hand, the frailty models can be also used to characterize the default
correlation among multiple obligors, which is the next topic.
1.3.3 Correlating Credit Defaults
The issue of default correlation embedded in credit portfolios has drawn intense
discussions in the recent credit risk literature, in particular along with today’s credit
crisis involving problematic basket CDS or CDO financial instruments; see e.g. Bluhm
and Overbeck (2007). In the eyes of a fund manager, the positive correlation of defaults
would increase the level of total volatility given the same level of total expectation (cf.
Markowitz portfolio theory). In academia, among other works, Das, et al. (2007)
performed an empirical analysis of default times for U.S. corporations and provided
evidence for the importance of default correlation.
The default correlation could be effectively characterized by multivariate survival
analysis. In a broad sense, there exist two different approaches:
1. correlating the default intensities through the common covariates,
2. correlating the default times through the copulas.
The first approach is better known as the conditionally independent intensity-based
approach, in the sense that the default times are independent conditional on the com-
mon covariates. Examples of common covariates include the market-wide variables,
e.g. the GDP growth, the short-term interest rate, and the house-price appreciation
index. Here, the default correlation induced by the common covariates is in contrast
to the aforementioned default heterogeneity induced by the firm-specific covariates.
As noted earlier, the frailty models (1.23) can also be used to capture the
dependence of multi-obligor default intensities in the case of hidden common covariates.
They are usually rephrased as shared frailty models, for differentiation from
single-obligor modeling purposes. The frailty effect Z is usually assumed to be time-invariant.
One may refer to Hougaard (2000) for a comprehensive treatment of multivariate
survival analysis from a frailty point of view. See also Aalen et al. (2008; §11) for
some intriguing discussion of dynamic frailty modeled by diffusion and Lévy processes.
In finance, Duffie, et al. (2008) applied the shared frailty effect across firms, as well
as a dynamic frailty effect to represent the unobservable macroeconomic covariates.
The copula approach to default correlation considers the default times as the
modeling basis. A copula C : [0, 1]^n → [0, 1] is a function used to formulate the
multivariate joint distribution based on the marginal distributions, e.g.,
Gaussian: C_Σ(u_1, …, u_n) = Φ_Σ( Φ^{−1}(u_1), …, Φ^{−1}(u_n) )
Student-t: C_{Σ,ν}(u_1, …, u_n) = Θ_{Σ,ν}( Θ_ν^{−1}(u_1), …, Θ_ν^{−1}(u_n) )   (1.24)
Archimedean: C_Ψ(u_1, …, u_n) = Ψ^{−1}( Σ_{i=1}^n Ψ(u_i) )
where Σ ∈ ℝ^{n×n}, Φ_Σ denotes the multivariate normal distribution, Θ_{Σ,ν} denotes the
multivariate Student-t distribution with ν degrees of freedom, and Ψ is the generator
of Archimedean copulas; see Nelsen (2006) for details. When Ψ(u) = −log u, the
Archimedean copula reduces to ∏_{i=1}^n u_i (the independence copula). By Sklar's theorem,
for a multivariate joint distribution, there always exists a copula that can link the
joint distribution to its univariate marginals. Therefore, the joint survival distribution
of (τ_1, …, τ_n) can be characterized by S_joint(t_1, …, t_n) = C(S_1(t_1), …, S_n(t_n)), upon
an appropriate selection of copula C.
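The Gaussian copula in (1.24) is easy to simulate from. The sketch below (with assumed exponential marginal intensities and a single assumed correlation parameter) couples exponential default times through a Gaussian copula and estimates a joint default probability.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)

    def gaussian_copula_default_times(lam, Sigma, n_sims=100000):
        # default times with exponential(lam_i) marginals coupled by a Gaussian copula
        X = rng.multivariate_normal(np.zeros(len(lam)), Sigma, size=n_sims)
        U = norm.cdf(X)                              # uniform marginals, Gaussian dependence
        return -np.log(1.0 - U) / np.asarray(lam)    # inverse of the exponential CDF

    lam, rho = [0.02, 0.03, 0.05], 0.4               # assumed intensities and correlation
    Sigma = rho * np.ones((3, 3)) + (1 - rho) * np.eye(3)
    tau = gaussian_copula_default_times(lam, Sigma)
    print((tau < 5.0).all(axis=1).mean())            # estimated joint 5-year default probability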
Interestingly enough, the shared frailty models correspond to the Archimedean
copula, as observed by Hougaard (2000; §13) and recast by Aalen, et al. (2008; §7).
For each τ_i with underlying hazard rate Zλ_i(t), assume λ_i(t) to be deterministic and
Z to be random. Then, the marginal survival distributions are found by integrating
over the distribution of Z,
S_i(t) = E_Z[ exp{ −∫_0^t Zλ_i(s) ds } ] = L(Λ_i(t)),   i = 1, …, n   (1.25)
where L(x) = E_Z[exp{−xZ}] is the Laplace transform of Z and Λ_i(t) is the cumulative
hazard. Similarly, the joint survival distribution is given by
S_joint(t_1, …, t_n) = L( Σ_{i=1}^n Λ_i(t_i) ) = L( Σ_{i=1}^n L^{−1}(S_i(t_i)) ),   (1.26)
where the second equality follows from (1.25) and leads to an Archimedean copula. For
example, given the Gamma frailty Z ∼ Gamma(δ⁻¹, δ), we have that L(x) = (1 + δx)^{−1/δ},
so
S_joint(t_1, …, t_n) = ( 1 + δ Σ_{i=1}^n ∫_0^{t_i} λ_i(s) ds )^{−1/δ}.
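The frailty-copula correspondence can be checked numerically. The following sketch (assumed toy intensities and frailty parameter δ) simulates default times from a shared Gamma frailty and compares the empirical joint survival probability with the closed form above.

    import numpy as np

    rng = np.random.default_rng(3)

    def frailty_default_times(base_lams, delta, n_sims=200000):
        # shared Gamma(1/delta, delta) frailty Z; conditional on Z, tau_i ~ Exponential(rate = Z * lam_i)
        Z = rng.gamma(shape=1.0 / delta, scale=delta, size=(n_sims, 1))
        U = rng.uniform(size=(n_sims, len(base_lams)))
        return -np.log(U) / (Z * np.asarray(base_lams))

    base_lams, delta, t = [0.02, 0.04], 0.8, np.array([5.0, 5.0])   # assumed toy values
    tau = frailty_default_times(base_lams, delta)
    empirical = (tau > t).all(axis=1).mean()
    closed_form = (1.0 + delta * np.sum(np.asarray(base_lams) * t)) ** (-1.0 / delta)
    print(empirical, closed_form)     # should agree up to Monte Carlo error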
In the practice of modeling credit portfolios, one must first of all check the
appropriateness of the assumption behind either the frailty approach or the copula approach.
An obvious counterexample is the recent collapse of the basket CDS and CDO market,
for which the rating agencies had abused the over-simplified Gaussian copula.
Besides, the correlation risk in these complex financial instruments is
highly contingent upon macroeconomic conditions and black-swan events, which
are rather dynamic, volatile and unpredictable.
Generalization
The above survey of survival models is too brief to cover the vast literature of
survival analysis that deserves potential applications in credit risk modeling. To be
selective, we have only cited Hougaard (2000), Lawless (2003) and Aalen, et al. (2008),
but there are numerous texts and monographs on survival analysis and reliability.
It is straightforward to enrich and extend the family of credit risk models, by
either the existing devices in survival analysis, or the new developments that would
benefit both fields. The latter is one of our objectives in this thesis, and we will focus
on the dual-time generalization.
1.4 Scope of the Thesis
Both old and new challenges arising in the context of credit risk modeling call
for developments of statistical methodologies. Let us first preview the complex data
structures in real-world credit risk problems before defining the scope of the thesis.
The vintage time series are among the most popular versions of economic and
risk management data; see e.g. the ALFRED digital archive (http://alfred.stlouisfed.org/) hosted by the U.S. Federal
Reserve Bank of St. Louis. For risk management in retail banking, consider for in-
stance the revolving exposures of credit cards. Subject to the market share of the
card-issuing bank, thousands to hundreds of thousands of new cards are issued monthly,
spanning product types (e.g. specialty, reward, co-brand and secured cards),
acquisition channels (e.g. banking center, direct mail, telephone and internet), and
other distinguishable characteristics (e.g. geographic region where the card holder
resides). By convention, the accounts originated from the same month constitute a
monthly vintage (quarterly and yearly vintages can be defined in the same fashion),
Figure 1.3: Illustration of retail credit portfolios and vintage diagram.
where "vintage" is a term borrowed from the wine-making industry to denote the
shared origination. Then, a vintage time series refers to the longitudinal observa-
tions and performance measurements for the cohort of accounts that share the same
origination date and therefore the same age.
Figure 1.3 provides an illustration of retail credit portfolios and the scheme of
vintage data collection, where we use only the geographic MSA (U.S. Metropolitan
Statistical Area) as a representative of segmentation variables. Each segment consists
of multiple vintages that have different origination dates, and each vintage consists of
varying numbers of accounts. The most important feature of the vintage data is its
dual-time coordinates. Let the calendar time be denoted by t, each vintage be denoted
by its origination time v, then the lifetime (i.e. age, or called the “month-on-book”
for a monthly vintage) is calculated by m = t − v for t ≥ v. Such (m, t; v) tuples
lie in the dual-time domain, called the vintage diagram, with the x-axis representing the
calendar time t and the y-axis representing the lifetime m. Unlike a usual 2D space, the
vintage diagram consists of unit-slope trajectories, each representing the evolution of
a different vintage. On a vintage diagram, all kinds of data can be collected including
the continuous, binary or count observations. In particular, when the dual-time
observations are the individual lives together with birth and death time points, and
when these observations are subject to truncation and censoring, one may draw a line
segment for each individual to represent the vintage data graphically. Such a graph
is known as the Lexis diagram in demography and survival analysis; see Figure 4.1
based on a dual-time-to-default simulation.
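The (m, t; v) bookkeeping on the vintage diagram is simple to set up; below is a small pandas sketch with hypothetical monthly vintages and calendar months, purely for illustration.

    import pandas as pd

    # hypothetical monthly vintages v and calendar observation months t
    vintages = pd.period_range("2005-01", "2005-06", freq="M")
    calendar = pd.period_range("2005-01", "2006-12", freq="M")

    rows = []
    for v in vintages:
        for t in calendar:
            m = t.ordinal - v.ordinal          # lifetime in months-on-book, m = t - v
            if m >= 0:                         # a vintage is only observed after its origination
                rows.append({"vintage": v, "calendar": t, "month_on_book": m})

    vintage_diagram = pd.DataFrame(rows)
    print(vintage_diagram.head())              # each row is one (m, t; v) cell of the diagram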
For corporate bond and loan issuers, the observations of default events or default
rates share the same vintage data structure. Table 1.1 shows the Moody’s speculative-
grade default rates for annual cohorts 1970-2008, which are calculated from the cu-
mulative rates released recently by Moody’s Global Credit Policy (February 2009).
Each cohort is observed by year end of 2008, and truncated by 20 years maximum in
lifetime. Thus, the performance window for the cohorts originated in the latest years
becomes shorter and shorter. The default rates in Table 1.1 are aligned in lifetime,
while they can be also aligned in calendar time. To compare the cohort performances
in dual-time coordinates, the marginal plots in both lifetime and calendar time are
given in Figure 1.4, together with the side-by-side box-plots for checking the vintage (i.e.
cohort) heterogeneity. It is clear that the average default rates gradually decrease in
lifetime, but are very volatile in calendar time. Note that the calendar dynamics follow
closely the NBER-dated recessions shown in Figure 1.1. Besides, there seems to be
heterogeneity among vintages. These projection views are only an initial exploration.
Further analysis of this vintage data will be performed in Chapter III.
The main purpose of the thesis is to develop a dual-time modeling framework
based on observations on the vintage diagram, or “dual-time analytics”. First of all,
it is worth mentioning that the vintage data structure is by no means an exclusive
feature of economic and financial data. It also appears in clinical trials with staggered
entry, demography, epidemiology, dendrochronology, etc. As an interesting example
in dendrochronology, Esper, Cook and Schweingruber (Science, 2002) analyzed the
long historical tree-ring records in a dual-time manner in order to reconstruct the
Figure 1.4: Moody’s speculative-grade default rates for annual cohorts 1970-2008:
projection views in lifetime, calendar and vintage origination time.
past temperature variability. We believe that the dual-time analytics developed in
the thesis can be also applied to non-financial fields.
The road map in Figure 1.5 outlines the scope of the thesis. In Chapter II, we
discuss the adaptive smoothing spline (AdaSS) for heterogeneously smooth function
estimation. Two challenging issues are evaluation of reproducing kernel and determi-
nation of local penalty, for which we derive an explicit solution based upon piecewise
type of local adaptation. Four different examples, each having a specific feature of
heterogeneous smoothness, are used to demonstrate the new AdaSS technique. Note
that the AdaSS may estimate ‘smooth’ functions including possible jumps, and it
plays a key role in the subsequent work in the thesis.
In Chapter III, we develop the vintage data analysis (VDA) for continuous type
of dual-time responses (e.g. loss rates). We propose an MEV decomposition frame-
work based on Gaussian process models, where M stands for the maturation curve,
E for the exogenous influence and V for the vintage heterogeneity. The intrinsic
identification problem is discussed. Also discussed is the semiparametric extension
of MEV models in the presence of dual-time covariates. Such models are motivated
from the practical needs in financial risk management. An efficient MEV Backfitting
algorithm is provided for model estimation, and its performance is assessed by a sim-
Figure 1.5: A road map of thesis developments of statistical methods in credit risk
modeling. (Boxes in the diagram: Chapter I: Introduction, covering Credit Risk Modeling and
Survival Analysis; Chapter II: AdaSS, covering Smoothing Spline and Local Adaptation;
Chapter III: VDA, covering MEV Decomposition, Gaussian Process Models and Semiparametric
Regression; Chapter IV: DtSA, covering Nonparametric Estimation, Structural Parametrization
and Dual-time Cox Regression.)
ulation study. Then, we apply the MEV risk decomposition strategy to analyze both
corporate default rates and retail loan loss rates.
In Chapter IV, we study the dual-time survival analysis (DtSA) for time-to-default
data observed on Lexis diagram. It is of particular importance in credit risk mod-
eling where the default events could be triggered by both endogenous and exoge-
nous hazards. We consider (a) nonparametric estimators under one-way, two-way
and three-way underlying hazards models, (b) structural parameterization via the
first-passage-time triggering system with an endogenous distance-to-default process
associated with exogenous time-transformation, and (c) dual-time semiparametric
Cox regression with both endogenous and exogenous baseline hazards and covari-
ate effects, for which the method of partial likelihood estimation is discussed. Also
discussed is the random-effect vintage heterogeneity modeled by the shared frailty.
Finally, we demonstrate the application of DtSA to credit card and mortgage risk
analysis in retail banking, and shed some light on understanding the ongoing credit
crisis from a new dual-time analytic perspective.
Table 1.1: Moody's speculative-grade default rates. Data source: Moody's special comment
(release: February 2009) on corporate default and recovery rates, 1920-2008
(http://www.moodys.com/) and author's calculations.
[Table body: one row per annual cohort 1970-2008; the columns give each cohort's default
rate (%) in lifetime Year 1 through Year 20, observed through year-end 2008, so the later
cohorts have progressively shorter performance windows.]
CHAPTER II
Adaptive Smoothing Spline
2.1 Introduction
For observational data, statisticians are generally interested in smoothing rather
than interpolation. Let the observations be generated by the signal plus noise model
y_i = f(t_i) + ε(t_i),   i = 1, …, n   (2.1)
where f is an unknown smooth function with unit interval support [0, 1], and ε(t_i) is
the random error. Function estimation from noisy data is a central problem in mod-
ern nonparametric statistics, and includes the methods of kernel smoothing [Wand
and Jones (1995)], local polynomial smoothing [Fan and Gijbels (1996)] and wavelet
shrinkage [Vidakovic (1999)]. In this thesis, we concentrate on the method of smooth-
ing spline that has an elegant formulation through the mathematical device of repro-
ducing kernel Hilbert space (RKHS). There is a rich body of literature on smoothing
spline since the pioneering work of Kimeldorf and Wahba (1970); see the mono-
graphs by Wahba (1990), Green and Silverman (1994) and Gu (2002). Ordinarily,
the smoothing spline is formulated by the following variational problem
$$\min_{f\in\mathcal{W}_m}\ \bigg\{\frac{1}{n}\sum_{i=1}^{n} w(t_i)\,[y_i - f(t_i)]^2 + \lambda\int_0^1 \big[f^{(m)}(t)\big]^2\,dt\bigg\}, \quad \lambda > 0 \qquad (2.2)$$
where the target function is an element of the m-order Sobolev space
$$\mathcal{W}_m = \big\{f : f, f', \ldots, f^{(m-1)} \text{ absolutely continuous},\ f^{(m)} \in L^2[0,1]\big\}. \qquad (2.3)$$
The objective in (2.2) is a sum of two functionals, the mean squared error $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} w(t_i)[y_i - f(t_i)]^2$ and the roughness penalty $\mathrm{PEN} = \int_0^1 [f^{(m)}(t)]^2\,dt$; hence the tuning parameter $\lambda$ controls the trade-off between fidelity to the data and the smoothness of the estimate. Let us call the ordinary smoothing spline (2.2) OrdSS for short.
The weights $w(t_i)$ in MSE come from the non-stationary error assumption of (2.1), where $\varepsilon(t) \sim N(0, \sigma^2/w(t))$. They can also be used for aggregating data with replicates. Suppose the data are generated by (2.1) with stationary error but time-varying sampling frequency $r_k$ for $k = 1, \ldots, K$ distinct points; then
$$\mathrm{MSE} = \frac{1}{n}\sum_{k=1}^{K}\sum_{j=1}^{r_k}\big[y_{kj} - f(t_k)\big]^2 = \frac{1}{K}\sum_{k=1}^{K} w(t_k)\big[\bar{y}_k - f(t_k)\big]^2 + \text{Const.} \qquad (2.4)$$
where $\bar{y}_k \sim N(f(t_k), \sigma^2/r_k)$ for $k = 1, \ldots, K$ are the pointwise averages. To train OrdSS, the raw data can be replaced by the $\bar{y}_k$'s together with the weights $w(t_k) \propto r_k$.
The OrdSS assumes that f(t) is homogeneously smooth (or rough), then performs
regularization based on the simple integral of squared derivatives $[f^{(m)}(t)]^2$. It excludes target functions with spatially varying degrees of smoothness, as noted by Wahba (1985) and Nychka (1988). In practice, there are various types of func-
tions of non-homogeneous smoothness, and four such scenarios are illustrated in Fig-
ure 2.1. The first two are the simulation examples following exactly the same setting
as Donoho and Johnstone (1994), by adding the N(0, 1) noise to n = 2048 equally
spaced signal responses that are re-scaled to attain signal-to-noise ratio 7. The last
two are the sampled data from automotive engineering and financial engineering.
1. Doppler function: $f(t) = \sqrt{t(1-t)}\,\sin\big(2\pi(1+a)/(t+a)\big)$, $a = 0.05$, $t \in [0,1]$. It
Figure 2.1: Heterogeneous smooth functions: (1) Doppler function simulated with
noise, (2) HeaviSine function simulated with noise, (3) Motorcycle-Accident
experimental data, and (4) Credit-Risk tweaked sample.
has both time-varying magnitudes and time-varying frequencies.
2. HeaviSine function: f(t) = 3 sin 4πt − sgn(t − 0.3) − sgn(0.72 − t), t ∈ [0, 1]. It
has a downward jump at t = 0.3 and an upward jump at 0.72.
3. Motorcycle-Accident experiment: a classic data from Silverman (1985). It con-
sists of accelerometer readings taken through time from a simulated motorcycle
crash experiment, in which the time points are irregularly spaced and the noises
vary over time. In Figure 2.1, a hypothesized smooth shape is overlaid on the
observations, and it illustrates the fact that the acceleration stays relatively
constant until the crash impact.
4. Credit-Risk tweaked sample: retail loan loss rates for the revolving credit expo-
sures in retail banking. For academic use, business background is removed and
the sample is tweaked. The tweaked sample includes monthly vintages booked
in 7 years (2000-2006), where a vintage refers to a group of credit accounts
originated from the same month. They are all observed up to the end of 2006,
so a vintage booked earlier has a longer observation length. In Figure 2.1, the
x-axis represents the months-on-book, and the circles are the vintage loss rates.
The point-wise averages are plotted as solid dots, which are wiggly on large
months-on-book due to decreasing sample sizes. Our purpose is to estimate an
underlying shape (illustrated by the solid line), whose degree of smoothness, by
assumption, increases upon growth.
The first three examples have often been discussed in the nonparametric analysis
literature, where function estimation with non-homogeneous smoothness has been
of interest for decades. A variety of adaptive methods has been proposed, such as
variable-bandwidth kernel smoothing (Müller and Stadtmüller, 1987), multivariate
adaptive regression splines (Friedman, 1991), adaptive wavelet shrinkage (Donoho
and Johnstone, 1994), local-penalty regression spline (Ruppert and Carroll, 2000), as
well as the treed Gaussian process modeling (Gramacy and Lee, 2008). In the context
of smoothing spline, Luo and Wahba (1997) proposed a hybrid adaptive spline with
knot selection, rather than using every sampling point as a knot. More naturally,
Wahba (1995), in a discussion, suggested to use the local penalty
$$\mathrm{PEN} = \int_0^1 \rho(t)\big[f^{(m)}(t)\big]^2\,dt, \quad \text{s.t. } \rho(t) > 0 \text{ and } \int_0^1 \rho(t)\,dt = 1 \qquad (2.5)$$
where the add-on function $\rho(t)$ is adaptive in response to the local behavior of $f(t)$, so that heavier penalty is imposed on regions of lower curvature. This local penalty
functional is the foundation of the whole chapter. Relevant works include Abramovich
and Steinberg (1996) and Pintore, Speckman and Holmes (2006).
This chapter is organized as follows. In Section 2.2 we formulate the AdaSS (adaptive smoothing spline) through the local penalty (2.5), derive its solution via RKHS
and generalized ridge regression, then discuss how to determine both the smoothing
parameter and the local penalty adaptation by the method of cross validation. Sec-
tion 2.3 covers some basic properties of the AdaSS. The experimental results for the
aforementioned examples are presented in Section 2.4. We summarize the chapter
with Section 2.5 and give technical proofs in the last section.
2.2 AdaSS: Adaptive Smoothing Spline
The AdaSS extends the OrdSS (2.2) based on the local penalty (2.5). It is formu-
lated as
$$\min_{f\in\mathcal{W}_m}\ \bigg\{\frac{1}{n}\sum_{i=1}^{n} w(t_i)\big[y_i - f(t_i)\big]^2 + \lambda\int_0^1 \rho(t)\big[f^{(m)}(t)\big]^2\,dt\bigg\}, \quad \lambda > 0. \qquad (2.6)$$
The following proposition has its root in Kimeldorf and Wahba (1971); or see Wahba (1990; §1).
Proposition 2.1 (AdaSS). Given a bounded integrable positive function ρ(t), define
the r.k.
$$K(t, s) = \int_0^1 \rho^{-1}(u)\,G_m(t, u)\,G_m(s, u)\,du, \quad t, s \in [0,1] \qquad (2.7)$$
where $G_m(t, u) = \frac{(t-u)_+^{m-1}}{(m-1)!}$. Then, the solution of (2.6) can be expressed as
$$f(t) = \sum_{j=0}^{m-1} \alpha_j\,\phi_j(t) + \sum_{i=1}^{n} c_i\,K(t, t_i) \equiv \boldsymbol{\alpha}^T\boldsymbol{\phi}(t) + \mathbf{c}^T\boldsymbol{\xi}_n(t) \qquad (2.8)$$
where $\phi_j(t) = t^j/j!$ for $j = 0, \ldots, m-1$, $\boldsymbol{\alpha} \in \mathbb{R}^m$, $\mathbf{c} \in \mathbb{R}^n$.
Since the AdaSS has a finite-dimensional linear expression (2.8), it can be solved
by generalized ridge regression. Write $\mathbf{y} = (y_1, \ldots, y_n)^T$, $B = (\boldsymbol{\phi}(t_1), \ldots, \boldsymbol{\phi}(t_n))^T$, $W = \mathrm{diag}(w(t_1), \ldots, w(t_n))$ and $\Sigma = [K(t_i, t_j)]_{n\times n}$. Substituting (2.8) into (2.6), we
have the regularized least squares problem
$$\min_{\boldsymbol{\alpha},\,\mathbf{c}}\ \bigg\{\frac{1}{n}\big\|\mathbf{y} - B\boldsymbol{\alpha} - \Sigma\mathbf{c}\big\|^2_W + \lambda\,\big\|\mathbf{c}\big\|^2_{\Sigma}\bigg\}, \qquad (2.9)$$
where $\|\mathbf{x}\|^2_A = \mathbf{x}^T A\,\mathbf{x}$ for $A \geq 0$. By setting partial derivatives w.r.t. $\boldsymbol{\alpha}$ and $\mathbf{c}$ to be
zero, we obtain the equations
$$B^T W\big(B\boldsymbol{\alpha} + \Sigma\mathbf{c}\big) = B^T W\mathbf{y}, \qquad \Sigma W\big(B\boldsymbol{\alpha} + (\Sigma + n\lambda W^{-1})\mathbf{c}\big) = \Sigma W\mathbf{y},$$
or
$$B\boldsymbol{\alpha} + (\Sigma + n\lambda W^{-1})\mathbf{c} = \mathbf{y}, \qquad B^T\mathbf{c} = \mathbf{0}. \qquad (2.10)$$
Consider the QR-decomposition $B = (Q_1 \,\vdots\, Q_2)(R^T \,\vdots\, 0^T)^T = Q_1 R$ with orthogonal $(Q_1 \,\vdots\, Q_2)_{n\times n}$ and upper-triangular $R_{m\times m}$, where $Q_1$ is $n\times m$ and $Q_2$ is $n\times(n-m)$. Multiplying both sides of $\mathbf{y} = B\boldsymbol{\alpha} + (\Sigma + n\lambda W^{-1})\mathbf{c}$ by $Q_2^T$ and using the fact that $\mathbf{c} = Q_2 Q_2^T\mathbf{c}$ (since $B^T\mathbf{c} = \mathbf{0}$), we have that $Q_2^T\mathbf{y} = Q_2^T(\Sigma + n\lambda W^{-1})\mathbf{c} = Q_2^T(\Sigma + n\lambda W^{-1})Q_2 Q_2^T\mathbf{c}$. Then, simple algebra yields
$$\hat{\mathbf{c}} = Q_2\big[Q_2^T(\Sigma + n\lambda W^{-1})Q_2\big]^{-1} Q_2^T\mathbf{y}, \qquad \hat{\boldsymbol{\alpha}} = R^{-1} Q_1^T\big[\mathbf{y} - (\Sigma + n\lambda W^{-1})\hat{\mathbf{c}}\big]. \qquad (2.11)$$
The fitted values at the design points are given by $\hat{\mathbf{y}} = B\hat{\boldsymbol{\alpha}} + \Sigma\hat{\mathbf{c}}$. By (2.10), $\mathbf{y} = B\hat{\boldsymbol{\alpha}} + (\Sigma + n\lambda W^{-1})\hat{\mathbf{c}}$, so the residuals are given by
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = n\lambda W^{-1}\hat{\mathbf{c}} \equiv \big(I - S_{\lambda,\rho}\big)\mathbf{y} \qquad (2.12)$$
where $S_{\lambda,\rho} = I - n\lambda W^{-1} Q_2\big[Q_2^T(\Sigma + n\lambda W^{-1})Q_2\big]^{-1} Q_2^T$ is often called the smoothing matrix in the sense that
$$\hat{\mathbf{y}} = S_{\lambda,\rho}\,\mathbf{y} = \Big(I - n\lambda W^{-1} Q_2\big[Q_2^T(\Sigma + n\lambda W^{-1})Q_2\big]^{-1} Q_2^T\Big)\mathbf{y}. \qquad (2.13)$$
We use the notation $S_{\lambda,\rho}$ to denote its explicit dependence on both the smoothing parameter $\lambda$ and the local adaptation $\rho(\cdot)$, where $\rho$ is needed for the evaluation of the matrix $\Sigma = [K(t_i, t_j)]$ based on the r.k. (2.7).
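To make the algebra concrete, here is a minimal numerical sketch of the generalized ridge solution (2.11)–(2.13) in Python/NumPy. It assumes the basis matrix B, kernel matrix Σ, weights w and smoothing parameter λ are already available; the function name and interface are illustrative only and do not come from the thesis.

```python
import numpy as np

def adass_fit(y, B, Sigma, w, lam):
    """Solve the AdaSS/OrdSS normal equations (2.10) via the
    QR-based formulas (2.11)-(2.13).  A sketch, not thesis code."""
    n, m = B.shape
    Winv = np.diag(1.0 / w)                       # W^{-1}, w = diagonal weights
    Q, R = np.linalg.qr(B, mode='complete')       # full QR of B
    Q1, Q2 = Q[:, :m], Q[:, m:]
    R1 = R[:m, :]
    M = Sigma + n * lam * Winv                    # Sigma + n*lambda*W^{-1}
    # c-hat and alpha-hat from (2.11)
    inner = np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ y)
    c_hat = Q2 @ inner
    a_hat = np.linalg.solve(R1, Q1.T @ (y - M @ c_hat))
    # smoothing matrix S_{lambda,rho} and fitted values, as in (2.12)-(2.13)
    S = np.eye(n) - n * lam * Winv @ Q2 @ np.linalg.solve(Q2.T @ M @ Q2, Q2.T)
    y_fit = B @ a_hat + Sigma @ c_hat
    return a_hat, c_hat, S, y_fit
```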
The AdaSS is analogous to OrdSS in constructing the confidence interval under a
Bayesian interpretation by Wahba (1983). For the sampling time points in particular,
we may construct the 95% interval pointwise by
$$\hat{f}(t_i) \pm z_{0.025}\sqrt{\hat{\sigma}^2\, w^{-1}(t_i)\, S_{\lambda,\rho}[i,i]}, \quad i = 1, \ldots, n \qquad (2.14)$$
where $S_{\lambda,\rho}[i,i]$ represents the $i$-th diagonal element of the smoothing matrix. The estimate of the noise variance is given by $\hat{\sigma}^2 = \big\|(I - S_{\lambda,\rho})\mathbf{y}\big\|^2_W \big/ (n - p)$, where $p = \mathrm{trace}(S_{\lambda,\rho})$ can be interpreted as the equivalent degrees of freedom.
2.2.1 Reproducing Kernels
The AdaSS depends on the evaluation of the r.k. of the integral form (2.7). For
general ρ(t), we must resort to numerical integration techniques. In what follows we
present a closed-form expression of $K(t,s)$ when $\rho^{-1}(t)$ is piecewise linear.
Proposition 2.2 (Reproducing Kernel). Let $\rho^{-1}(t) = a_k + b_k t$, $t \in [\tau_{k-1}, \tau_k)$ for some fixed $(a_k, b_k)$ and the knots $\{0 \equiv \tau_0 < \tau_1 < \cdots < \tau_K \equiv 1\}$. Then, the AdaSS kernel (2.7) can be explicitly evaluated by
$$K(t,s) = \sum_{k=1}^{\kappa_{t\wedge s}} \sum_{j=0}^{m-1} (-1)^j \Bigg[ (a_k + b_k u)\frac{(u-t)^{m-1-j}(u-s)^{m+j}}{(m-1-j)!\,(m+j)!} - b_k \sum_{l=0}^{m-1-j} (-1)^l \frac{(u-t)^{m-1-j-l}(u-s)^{m+1+j+l}}{(m-1-j-l)!\,(m+1+j+l)!} \Bigg]\Bigg|_{\tau_{k-1}}^{t\wedge s\wedge\tau_k^-} \qquad (2.15)$$
where $\kappa_{t\wedge s} = \min\{k : \tau_k > t\wedge s\}$, $\kappa_{1\wedge 1} = K$, and $\{F(u)\}\big|_{\tau}^{\tau'} = F(\tau') - F(\tau)$.
Proposition 2.2 is general enough to cover many types of r.k.’s. For example,
setting $b_k \equiv 0$ gives the piecewise-constant results for $\rho^{-1}(t)$ obtained by Pintore, et al. (2006). For another example, set $m = 2$, $K = 1$ and $\rho^{-1}(t) = a + bt$ for $t \in [0,1]$. (By (2.5), $a = b/(e^b - 1)$ and $b \in \mathbb{R}$.) Then,
$$K(t,s) = \begin{cases} \dfrac{a}{6}(3t^2 s - t^3) + \dfrac{b}{12}(2t^3 s - t^4), & \text{if } t \le s,\\[2mm] \dfrac{a}{6}(3t s^2 - s^3) + \dfrac{b}{12}(2t s^3 - s^4), & \text{otherwise.} \end{cases} \qquad (2.16)$$
When b = 0, (2.16) reduces to the OrdSS kernel for fitting cubic smoothing splines.
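As a quick illustration, the closed-form kernel (2.16) can be evaluated directly; the small sketch below (an illustrative example written for this exposition, not code from the thesis) builds the Gram matrix Σ needed in (2.9) for the m = 2, K = 1 case with ρ⁻¹(t) = a + bt.

```python
import numpy as np

def adass_kernel_m2(t, s, b=0.0):
    """Closed-form AdaSS r.k. (2.16) for m=2, K=1, rho^{-1}(t) = a + b*t.
    Setting b=0 recovers the OrdSS cubic smoothing spline kernel."""
    a = 1.0 if b == 0.0 else b / np.expm1(b)     # a = b/(e^b - 1), so that int rho = 1
    t, s = (t, s) if t <= s else (s, t)          # use symmetry K(t,s) = K(s,t)
    return a / 6.0 * (3 * t**2 * s - t**3) + b / 12.0 * (2 * t**3 * s - t**4)

# Gram matrix over the design points, as required in (2.9)
ts = np.linspace(0.0, 1.0, 11)
Sigma = np.array([[adass_kernel_m2(ti, tj, b=2.0) for tj in ts] for ti in ts])
```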
The derivatives of the r.k. are also of interest. In solving boundary value differential equations, the Green's function $G_m(t,u)$ has the property that $f(t) = \int_0^1 G_m(t,u)\,f^{(m)}(u)\,du$ for $f \in \mathcal{W}_m$ with zero $f^{(j)}(0)$, $j = 0, \ldots, m-1$. The equation (2.7) is such an example of applying $G_m(t,\cdot)$ to $K(t,s)$ ($s$ fixed), implying that
$$\frac{\partial^m K(t,s)}{\partial t^m} = \rho^{-1}(t)\,G_m(s,t) = \begin{cases} (a_{\kappa_t} + b_{\kappa_t} t)\,\dfrac{(s-t)^{m-1}}{(m-1)!}, & \text{if } t \le s,\\[2mm] 0, & \text{otherwise.} \end{cases} \qquad (2.17)$$
The well-defined higher-order derivatives of the kernel can be derived from (2.17),
and they should be treated from knot to knot if $\rho^{-1}(t)$ is piecewise.
Lower-order $l$-th derivatives ($l < m$) of (2.15) can be obtained by straightforward but tedious calculations. Here, we present only the OrdSS kernels (by setting $\rho(t) \equiv 1$ in Proposition 2.2) and their derivatives to be used in the next section:
$$K(t,s) = \sum_{j=0}^{m-1} (-1)^j \frac{t^{m-1-j}\, s^{m+j}}{(m-1-j)!\,(m+j)!} + \frac{(-1)^m (s-t)^{2m-1}}{(2m-1)!}\, I(t \le s), \qquad (2.18)$$
$$\frac{\partial^l K(t,s)}{\partial t^l} = \sum_{j=0}^{m-1-l} (-1)^j \frac{t^{m-1-j-l}\, s^{m+j}}{(m-1-j-l)!\,(m+j)!} + \frac{(-1)^{m+l} (s-t)^{2m-1-l}}{(2m-1-l)!}\, I(t \le s). \qquad (2.19)$$
Note that the kernel derivative coincides with (2.17) when l = m.
Remarks: Numerical methods are inevitable for approximating (2.7) when $\rho^{-1}(t)$ takes general forms. Rather than directly approximating (2.7), one may approximate $\rho^{-1}(t)$ first by a piecewise function with knot selection, then use the closed-form results derived in Proposition 2.2.
2.2.2 Local Penalty Adaptation
One of the most challenging problems for the AdaSS is selection of smoothing
parameter λ and local penalty adaptation function ρ(t). We use the method of cross-
validation based on the leave-one-out scheme.
Conditional on $(\lambda, \rho(\cdot))$, let $\hat{f}(t)$ denote the AdaSS trained on the complete sample of size $n$, and $\hat{f}^{[k]}(t)$ denote the AdaSS trained on the leave-one-out sample $\{(t_i, y_i)\}_{i\neq k}$ for $k = 1, \ldots, n$. Define the score of cross-validation by
$$\mathrm{CV}(\lambda, \rho) = \frac{1}{n}\sum_{i=1}^{n} w(t_i)\Big[y_i - \hat{f}^{[i]}(t_i)\Big]^2. \qquad (2.20)$$
For the linear predictor (2.13) with smoothing matrix $S_{\lambda,\rho}$, Craven and Wahba (1979) justified that $\hat{f}(t_i) - \hat{f}^{[i]}(t_i) = S_{\lambda,\rho}[i,i]\,\big(y_i - \hat{f}^{[i]}(t_i)\big)$ for $i = 1, \ldots, n$. It follows that
$$\mathrm{CV}(\lambda, \rho) = \frac{1}{n}\sum_{i=1}^{n} \frac{w(t_i)\big(y_i - \hat{f}(t_i)\big)^2}{\big(1 - S_{\lambda,\rho}[i,i]\big)^2} = \frac{1}{n}\,\mathbf{y}^T W\big(I - \mathrm{diag}(S_{\lambda,\rho})\big)^{-2}\big(I - S_{\lambda,\rho}\big)^2\,\mathbf{y}, \qquad (2.21)$$
where $\mathrm{diag}(S_{\lambda,\rho})$ is the diagonal part of the smoothing matrix. Replacing the elements of $\mathrm{diag}(S_{\lambda,\rho})$ by their average $n^{-1}\mathrm{trace}(S_{\lambda,\rho})$, we obtain the so-called generalized cross-validation score
$$\mathrm{GCV}(\lambda, \rho) = \frac{1}{n}\sum_{i=1}^{n} \frac{w(t_i)\big(y_i - \hat{f}_\lambda(t_i)\big)^2}{\big(1 - n^{-1}\mathrm{trace}(S_{\lambda,\rho})\big)^2}. \qquad (2.22)$$
For fixed $\rho(\cdot)$, the best tuning parameter $\lambda$ is usually found by minimizing (2.21) or
(2.22) on its log scale, i.e. by varying log λ on the real line.
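The cross-validation scores (2.21)–(2.22) are cheap to evaluate once the smoothing matrix is available. A minimal sketch of a log-scale grid search for λ is given below; it relies on the adass_fit helper sketched earlier, which is an assumption of this exposition rather than a routine from the thesis.

```python
import numpy as np

def cv_scores(y, S, w):
    """Leave-one-out CV (2.21) and GCV (2.22) from the smoothing matrix S."""
    n = len(y)
    resid = y - S @ y
    cv = np.mean(w * (resid / (1.0 - np.diag(S)))**2)
    gcv = np.mean(w * resid**2) / (1.0 - np.trace(S) / n)**2
    return cv, gcv

def select_lambda(y, B, Sigma, w, log_lam_grid=np.linspace(-15, 5, 41)):
    """Scan lambda on the log scale and return the GCV-minimizing value."""
    best_gcv, best_lam = np.inf, None
    for loglam in log_lam_grid:
        _, _, S, _ = adass_fit(y, B, Sigma, w, np.exp(loglam))
        _, gcv = cv_scores(y, S, w)
        if gcv < best_gcv:
            best_gcv, best_lam = gcv, np.exp(loglam)
    return best_lam
```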
To select $\rho(\cdot)$, consider the rules: (1) choose a smaller value of $\rho(t)$ if $f(\cdot)$ is less smooth at $t$; and (2) the local roughness of $f(\cdot)$ is quantified by the squared $m$-th derivative $\big[f^{(m)}(t)\big]^2$. If the true $f^{(m)}(t)$ were known, it would be reasonable to determine the shape of $\rho(t)$ by
$$\rho^{-1}(t) \propto \big[f^{(m)}(t)\big]^2. \qquad (2.23)$$
Since the true function f(t) is unknown a priori, so are its derivatives. Let us consider
the estimation of the derivatives. In the context of smoothing splines, nonparametric
derivative estimation is usually obtained by taking derivatives of the spline estimate.
Using that as an initial estimate of local roughness, we may adopt a two-step proce-
dure for selection of ρ(t), as follows.
Step 1: run the OrdSS with $\tilde{m} = m + a$ ($a = 0, 1, 2$) and cross-validated $\lambda$ to obtain the initial function estimate $\tilde{f}(t) = \sum_{j=0}^{m+a-1}\tilde{\alpha}_j\phi_j(t) + \sum_{i=1}^{n}\tilde{c}_i\bar{K}(t, t_i)$, where $\bar{K}(t,s) = \int_0^1 G_{m+a}(t,u)\,G_{m+a}(s,u)\,du$.
Step 2: calculate $\tilde{f}^{(m)}(t)$ and use it to determine the shape of $\rho^{-1}(t)$.
In Step 1, with no prior knowledge, we start with OrdSS upon the improper
adaptation $\tilde{\rho}(t) \equiv 1$. The extra order $a = 0, 1, 2$ imposed on the penalization of derivatives $\int_0^1 [f^{(m+a)}(t)]^2\,dt$ is to control the smoothness of the $m$-th derivative estimate. So, the choice of $a$ depends on how smooth the estimated $\tilde{f}^{(m)}(t)$ is desired to be.
In Step 2, the $m$-th derivative of $\tilde{f}(t)$ can be obtained by differentiating the bases,
$$\tilde{f}^{(m)}(t) = \sum_{j=0}^{a-1}\tilde{\alpha}_{m+j}\,\phi_j(t) + \sum_{i=1}^{n}\tilde{c}_i\,\frac{\partial^m}{\partial t^m}\bar{K}(t, t_i) \qquad (2.24)$$
where $\frac{\partial^m}{\partial t^m}\bar{K}(t, t_i)$ is calculated according to (2.19) after replacing $m$ by $m + a$. The
goodness of the spline derivative depends largely on the spline estimate itself. Rice
and Rosenblatt (1983) studied the large-sample properties of spline derivatives (2.24)
that have slower convergence rates than the spline itself. Given the finite sample, we
find by simulation study that (2.24) tends to under-smooth, but it may capture the
rough shape of the derivatives upon standardization. Figure 2.2 demonstrates the
Figure 2.2: OrdSS function estimate and its 2nd-order derivative (upon standardization): scaled signal from $f(t) = \sin(\omega t)$ (upper panel) and $f(t) = \sin(\omega t^4)$ (lower panel), $\omega = 10\pi$, $n = 100$ and snr = 7. The $\sin(\omega t^4)$ signal resembles the Doppler function in Figure 2.1; both have time-varying frequency.
estimation of the 2nd-order derivative for $f(t) = \sin(\omega t)$ and $f(t) = \sin(\omega t^4)$ using
the OrdSS with m = 2, 3. In both sine-wave examples, the order-2 OrdSS could yield
rough estimates of derivative, and the order-3 OrdSS could give smooth estimates but
large bias on the boundary. Despite these downside effects, the shape of the derivative
can be more or less captured, meeting our needs for estimating ρ(t) below.
In determining the shape of $\rho^{-1}(t)$ from $\tilde{f}^{(m)}(t)$, we restrict ourselves to the piecewise class of functions, as in Proposition 2.2. By (2.23), we suggest estimating $\rho^{-1}(t)$ piecewise by the method of constrained least squares:
$$\big|\tilde{f}^{(m)}(t)\big|^{\gamma} + \varepsilon = a_0\,\rho^{-1}(t) = a_{\kappa_t} + b_{\kappa_t} t, \qquad \kappa_t = \min\{k : \tau_k \ge t\} \qquad (2.25)$$
subject to the constraints (1) positivity: $\rho^{-1}(t) > 0$ for $t \in [0,1]$; (2) continuity: $a_k + b_k\tau_k = a_{k+1} + b_{k+1}\tau_k$ for $k = 1, \ldots, K-1$; and (3) normalization: $a_0 = a_0\int_0^1 \rho(t)\,dt = \sum_{k=1}^{K}\int_{\tau_{k-1}}^{\tau_k}\frac{dt}{a_k + b_k t} > 0$, for given $\gamma \ge 1$ and pre-specified knots $\{0 \equiv \tau_0 < \tau_1 < \ldots < \tau_{K-1} < 1\}$ and $\tau_K = 1$. Clearly, $\rho^{-1}(t)$ is an example of a linear regression spline. If the continuity condition is relaxed and $b_k \equiv 0$, we obtain the piecewise-constant $\rho^{-1}(t)$.
The power transform $\big|\tilde{f}^{(m)}(t)\big|^{\gamma}$, $\gamma \ge 1$, is used to stretch $[\tilde{f}^{(m)}(t)]^2$, in order to make up for (a) overestimation of $[f^{(m)}(t)]^2$ in slightly oscillating regions and (b) underestimation of $[f^{(m)}(t)]^2$ in highly oscillating regions, since the initial OrdSS is trained under the improper adaptation $\tilde{\rho}(t) \equiv 1$. The lower-right panel of Figure 2.2 is such an example of overestimating the roughness by the OrdSS.
The knots $\{\tau_k\}_{k=1}^{K} \subset [0,1]$ can be specified either regularly or irregularly. In most
situations, we may choose equally spaced knots with K to be determined. For the
third example in Figure 2.1 that shows flat vs. rugged heterogeneity, one may choose
sparse knots for the flat region and dense knots for the rugged region, and may also
choose regular knots such that $\rho^{-1}(t)$ is nearly constant in the flat region. However,
for the second example in Figure 2.1 that shows jump-type heterogeneity, one should
shrink the size of the interval containing the jump.
Given the knots, the parameters (λ, γ) can be jointly determined by the cross-
validation criteria (2.21) or (2.22).
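For the piecewise-constant special case ($b_k \equiv 0$), the constrained fit (2.25) essentially reduces to averaging the stretched derivative within each knot interval and rescaling so that $\int\rho(t)\,dt = 1$. The sketch below illustrates this simplified recipe; it is an assumption of this rewrite rather than the thesis implementation, and the derivative estimate $\tilde{f}^{(m)}$ is taken as given.

```python
import numpy as np

def piecewise_const_rho_inv(t, df_m, knots, gamma=2.0):
    """Piecewise-constant rho^{-1}(t) from |f~^(m)(t)|^gamma, normalized so
    that the integral of rho(t) over [0,1] equals one (cf. (2.5), (2.25))."""
    stretch = np.abs(df_m) ** gamma
    a = np.empty(len(knots) - 1)
    for k in range(len(knots) - 1):
        in_bin = (t >= knots[k]) & (t < knots[k + 1])
        a[k] = max(stretch[in_bin].mean(), 1e-8)      # positivity constraint
    # normalization: sum_k (tau_k - tau_{k-1}) / a_k must equal 1
    widths = np.diff(knots)
    a *= np.sum(widths / a)                           # rescale a_k so that int rho = 1
    return a                                          # rho^{-1}(t) = a_k on [tau_{k-1}, tau_k)
```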
Remarks: The selection of the local adaptation $\rho^{-1}(t)$ is a non-trivial problem, for which we suggest using $\rho^{-1}(t) \propto \big|\tilde{f}^{(m)}(t)\big|^{\gamma}$ together with piecewise approximation. This can be viewed as a joint force of Abramovich and Steinberg (1996) and Pintore, et al. (2006), where the former directly executed (2.23) on each sampled $t_i$, and the latter directly imposed the piecewise-constant $\rho^{-1}(t)$ without utilizing the fact (2.23).
Note that the approach of Abramovich and Steinberg (1996) may have the estimation bias of $\big|\tilde{f}^{(m)}(t)\big|^2$ as discussed above, and the r.k. evaluation by approximation based on a discrete set of $\rho^{-1}(t_i)$ remains another issue. As for Pintore, et al. (2006), the high-dimensional parametrization of $\rho^{-1}(t)$ requires nonlinear optimization whose
stability is usually of concern. All these issues can be resolved by our approach, which
therefore has both analytical and computational advantages.
2.3 AdaSS Properties
Recall that the OrdSS (2.2) has many interesting statistical properties, e.g.
(a) it is a natural polynomial spline of degree $(2m-1)$ with knots placed at $t_1, \ldots, t_n$ (hence, the OrdSS with $m = 2$ is usually referred to as the cubic smoothing spline);
(b) it has a Bayesian interpretation through the duality between RKHS and Gaus-
sian stochastic processes (Wahba, 1990, §1.4–1.5);
(c) its large-sample properties of consistency and rate of convergence depend on the smoothing parameter (Wahba, 1990, §4 and references therein);
(d) it has an equivalent kernel smoother with variable bandwidth (Silverman, 1984).
Similar properties are desired to be established for the AdaSS. We leave (c) and (d) to
future investigation. For (a), the AdaSS properties depend on the r.k. (2.7) and the
structure of local adaptation $\rho^{-1}(t)$. Pintore, et al. (2006) investigated the piecewise-constant case of $\rho^{-1}(t)$ and argued that the AdaSS is also a piecewise polynomial spline of degree $(2m-1)$ with knots on $t_1, \ldots, t_n$, while it allows for more flexible lower-order derivatives than the OrdSS. The piecewise-linear case of $\rho^{-1}(t)$ can be analyzed in a similar manner. For the AdaSS (2.8) with the r.k. (2.15),
$$f(t) = \sum_{j=0}^{m-1}\alpha_j\,\phi_j(t) + \sum_{i:\,t_i \le t} c_i\,K(t, t_i) + \sum_{i:\,t_i > t} c_i\,K(t, t_i)$$
with degree (m − 1) in the first two terms, and degree 2m in the third term (after
one degree increase due to the trend of $\rho^{-1}(t)$).
For (b), the AdaSS based on the r.k. (2.7) corresponds to the following zero-mean
Gaussian process
$$Z(t) = \int_0^1 \rho^{-1/2}(u)\,G_m(t,u)\,dW(u), \quad t \in [0,1] \qquad (2.26)$$
where W(t) denotes the standard Wiener process. The covariance kernel is given by
E[Z(t)Z(s)] = K(t, s). When ρ(t) ≡ 1, the Gaussian process (2.26) is the (m− 1)-
fold integrated Wiener process studied by Shepp (1966). Following Wahba (1990; §1.5), consider a random effect model $F(t) = \boldsymbol{\alpha}^T\boldsymbol{\phi}(t) + b^{1/2}Z(t)$; then the best linear unbiased estimate of $F(t)$ from noisy data $Y(t) = F(t) + \varepsilon$ corresponds to the AdaSS solution (2.8), provided that the smoothing parameter $\lambda = \sigma^2/nb$. Moreover, one may set an improper prior on the coefficients $\boldsymbol{\alpha}$, derive the posterior of $F(t)$, then obtain a Bayesian type of confidence interval estimate; see Wahba (1990; §5). Such confidence intervals on the sampling points are given in (2.14).
The inverse function $\rho^{-1}(t)$ in (2.26) can be viewed as the measurement of non-stationary volatility (or variance) of the $m$-th derivative process such that
$$\frac{d^m Z(t)}{dt^m} = \rho^{-1/2}(t)\,\frac{dW(t)}{dt}. \qquad (2.27)$$
By the duality between Gaussian processes and RKHS, $d^m Z(t)/dt^m$ corresponds to $f^{(m)}(t)$ in (2.6). This provides an alternative reasoning for (2.23) in the sense that $\rho^{-1}(t)$ can be determined locally by the second-order moments of $d^m Z(t)/dt^m$.
To illustrate the connection between the AdaSS and the Gaussian process with
nonstationary volatility, consider the case $m = 1$ and piecewise-constant $\rho^{-1}(t) = a_k$, $t \in [\tau_{k-1}, \tau_k)$ for some $a_k > 0$ and pre-specified knots $0 \equiv \tau_0 < \tau_1 < \cdots < \tau_K \equiv 1$.
By (2.15), the r.k. is given by
$$K(t,s) = \sum_{k=1}^{\kappa_{t\wedge s}} a_k\big(t\wedge s - \tau_{k-1}\big), \quad t, s \in [0,1]. \qquad (2.28)$$
By (2.26) and (2.27), the corresponding Gaussian process satisfies that
$$dZ(t) = \sqrt{a_k}\,dW(t), \quad \text{for } t \in [\tau_{k-1}, \tau_k),\ k = 1, \ldots, K \qquad (2.29)$$
where $a_k$ is the local volatility from knot to knot. Clearly, $Z(t)$ reduces to the standard Wiener process $W(t)$ when $a_k \equiv 1$. As one of the most appealing features, the construction (2.29) takes into account the jump diffusion, which occurs in $[\tau_{k-1}, \tau_k)$ when the interval shrinks and the volatility $a_k$ blows up.
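The construction (2.29) is easy to simulate on a fine grid, which helps visualize how a shrinking interval with exploding volatility mimics a jump. The following is a small illustrative sketch; the grid size and knot values are arbitrary choices made for this illustration, not taken from the thesis.

```python
import numpy as np

def simulate_piecewise_wiener(knots, a, n=2048, seed=0):
    """Simulate Z(t) on [0,1] with dZ(t) = sqrt(a_k) dW(t) on [tau_{k-1}, tau_k),
    as in (2.29).  a[k] is the local volatility on the k-th interval."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n + 1)
    dt = 1.0 / n
    k = np.searchsorted(knots, t[:-1], side='right') - 1   # interval index of each step
    dZ = np.sqrt(a[k]) * rng.normal(0.0, np.sqrt(dt), n)
    return t, np.concatenate([[0.0], np.cumsum(dZ)])

# a near-jump around t = 0.3: a tiny interval with blown-up volatility
knots = np.array([0.0, 0.295, 0.305, 1.0])
a = np.array([1.0, 400.0, 1.0])
t, Z = simulate_piecewise_wiener(knots, a)
```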
2.4 Experimental Results
The implementation of OrdSS and AdaSS differs only in the formation of Σ
through the r.k. K(t, s). It is most crucial for the AdaSS to select the local adaptation
$\rho^{-1}(t)$. Consider the four examples in Figure 2.1 with various scenarios of heteroge-
neous smoothness. The curve fitting results by the OrdSS with m = 2 (i.e. cubic
smoothing spline) are shown in Figure 2.3, as well as the 95% confidence intervals for
the last two cases. In what follows, we discuss the need of local adaptation in each
case, and compare the performance of OrdSS and AdaSS.
Simulation Study.
The Doppler and HeaviSine functions are often taken as the benchmark examples
for testing adaptive smoothing methods. The global smoothing by the OrdSS (2.2)
may fail to capture the heterogeneous smoothness, and as a consequence, over-smooth
the region where the signal is relatively smooth and under-smooth where the signal is
highly oscillating. For the Doppler case in Figure 2.3, the underlying signal increases
its smoothness with time, but the OrdSS estimate shows lots of wiggles for t > 0.5.
In the HeaviSine case, both jumps at t = 0.3 and t = 0.72 could be captured by the
OrdSS, but the smoothness elsewhere is sacrificed as a compromise.
The AdaSS fitting results with piecewise-constant $\rho^{-1}(t)$ are shown in Figure 2.4.
Figure 2.3: OrdSS curve fitting with m = 2 (shown by solid lines). The dashed lines
represent 95% confidence intervals. In the credit-risk case, the log loss
rates are considered as the responses, and the time-dependent weights are
specified proportional to the number of replicates.
Initially, we fitted the OrdSS with cross-validated tuning parameter λ, then estimated
the 2nd-order derivatives by (2.24). To estimate the piecewise-constant local adapta-
tion $\rho^{-1}(t)$, we pre-specified 6 equally spaced knots (including 0 and 1) for the Doppler
function, and pre-specified the knots (0, 0.295, 0.305, 0.5, 0.715, 0.725, 1) for the Heavi-
Sine function. (Note that such knots can be determined graphically based on the plots
of the squared derivative estimates.) Given the knots, the function $\rho^{-1}(t)$ is estimated by constrained least squares (2.25) conditional on the power transform parameter $\gamma$, where the best $\gamma^*$ is found jointly with $\lambda$ by minimizing the score of cross-validation
(2.21). In our implementation, we assumed that both tuning parameters are bounded
such that log λ ∈ [−40, 10] and γ ∈ [1, 5], then performed the bounded nonlinear op-
timization. Table 2.1 gives the numerical results in both simulation cases, including
the cross-validated $(\lambda, \gamma)$, the estimated variance of noise (cf. true $\sigma^2 = 1$), and the
Figure 2.4: Simulation study of Doppler and HeaviSine functions: OrdSS (blue),
AdaSS (red) and the heterogeneous truth (light background).
Table 2.1: AdaSS parameters in Doppler and HeaviSine simulation study.
Simulation    n      snr   λ           γ      σ̂²       dof      CV
Doppler       2048   7     1.21e-012   2.13   0.9662   164.19   2180.6
HeaviSine     2048   7     1.86e-010   4.95   1.0175   35.07    2123.9
equivalent degrees of freedom.
Figure 2.4 (bottom panel) shows the improvement of AdaSS over OrdSS after
zooming in. The middle panel shows the data-driven estimates of the adaptation
function ρ(t) plotted at log scale. For the Doppler function, the imposed penalty
ρ(t) in the first interval [0, 0.2) is much less than that in the other four intervals,
resulting in heterogeneous curve fitting. For the HeaviSine function, low penalty is
assigned to where the jump occurs, while the relatively same high level of penalty is
assigned to other regions. Both simulation examples demonstrate the advantage of
using the AdaSS. A similar simulation study of the Doppler function can be found in
Pintore, et al.(2006) who used a small sample size (n = 128) and high-dimensional
optimization techniques.
Motorcycle-Accident Data.
Silverman (1985) originally used the motorcycle-accident experimental data to
test the non-stationary OrdSS curve fitting; see the confidence intervals in Figure 2.3.
Besides error non-stationarity, smoothness non-homogeneity is another important fea-
ture of this data set. To fit such data, the treatment from a heterogeneous smoothness
point of view is more appropriate than using only the non-stationary error treatment.
Recall that the data consists of acceleration measurements (subject to noise) of the
head of a motorcycle rider during the course of a simulated crash experiment. Upon
the crash occurrence, it decelerates to about negative 120 gal (a unit gal is 1 centimeter
per second squared). The data indicates clearly that the acceleration rebounds well
above its original level before setting back. Let us focus on the period prior to the
crash happening at t ≈ 0.24 (after time re-scaling), for which it is reasonable to
Figure 2.5: Non-stationary OrdSS and AdaSS for Motorcycle-Accident Data.
Figure 2.6: OrdSS, non-stationary OrdSS and AdaSS: performance in comparison.
assume the acceleration stays constant. However, the OrdSS estimate in Figure 2.3
shows a slight bump around t = 0.2, a false discovery.
In the presence of non-stationary errors $N(0, \sigma^2/w(t_i))$ with unknown weights $w(t_i)$, a common approach is to begin with an unweighted OrdSS and estimate $w^{-1}(t_i)$ from the local residual sums of squares; see Silverman (1985). The local moving averaging (e.g. loess) is usually performed; see the upper-left panel in Figure 2.5. Then, the smoothed weight estimates are plugged into (2.2) to run the non-stationary OrdSS. To perform AdaSS (2.6) with the same weights, we estimated the local adaptation $\rho^{-1}(t)$
from the derivatives of non-stationary OrdSS. Both results are shown in Figure 2.5.
Compared to Figure 2.3, both non-stationary spline models result in tighter estimates
of 95% confidence intervals than the unweighted OrdSS.
Figure 2.6 compares the performances of stationary OrdSS, non-stationary OrdSS
and non-stationary AdaSS. After zooming in around t ∈ [0.1, 0.28], it is clear that
AdaSS performs better than both versions of OrdSS in modeling the constant accel-
eration before the crash happening. For the setting-back process after attaining the
maximum acceleration at t ≈ 0.55, AdaSS gives a more smooth estimate than OrdSS.
Besides, they differ in estimating the minimum of the acceleration curve at t ≈ 0.36,
where the non-stationary OrdSS seems to have underestimated the magnitude of
deceleration, while the estimate by the non-stationary AdaSS looks reasonable.
Credit-Risk Data.
The last plot in Figure 2.1 represents a tweaked sample in retail credit risk man-
agement. It is of our interest to model the growth or the maturation curve of the log
of loss rates in lifetime (month-on-book). Consider the sequence of vintages booked in
order from 2000 to 2006. Their monthly loss rates are observed up to December 2006
(right truncation). Aligning them to the same origin, we have the observations accu-
mulated more in smaller month-on-book and less in larger month-on-book, i.e., the
sample size decreases in lifetime. As a consequence, the simple pointwise or moving
Figure 2.7: AdaSS estimate of maturation curve for Credit-Risk sample: piecewise-constant $\rho^{-1}(t)$ (upper panel) and piecewise-linear $\rho^{-1}(t)$ (lower panel).
averages have increasing variability. Nevertheless, our experiences tell that the mat-
uration curve is expected to have increasing degrees of smoothness once the lifetime
exceeds certain month-on-book. This is also a desired property of the maturation
curve for the purpose of extrapolation.
By (2.4), spline smoothing may work on the pointwise averages subject to non-
stationary errors $N(0, \sigma^2/w(t_i))$ with weights determined by the number of replicates.
The non-stationary OrdSS estimate is shown in Figure 2.3, which however does not
smooth out the maturation curve for large month-on-book. The reasons for the instability in the rightmost region could be (a) OrdSS has an intrinsic boundary bias for especially small $w(t_i)$, and (b) the raw inputs for training purposes are contaminated
and have abnormal behavior. The exogenous cause of contamination in (b) will be
explained in Chapter III.
Alternatively, we applied the AdaSS with judgemental prior on heterogeneous
smoothness of the target function. For demonstration, we purposefully specified
$\rho^{-1}(t)$ to take the forms of Figure 2.7, including (a) piecewise-constant and (b)
piecewise-linear. Both specifications are non-decreasing in lifetime and they span
about the same range. The curve fitting results for both $\rho^{-1}(t)$ are nearly identical
upon visual inspection. Compared with the OrdSS performance in the large month-
on-book region, the AdaSS yields smoother estimate of maturation curve, as well
as tighter confidence intervals based on judgemental prior. The AdaSS estimate of
maturation curve, transformed to the original scale, is plotted by the red solid line in
Figure 2.1.
This credit-risk sample will appear again in Chapter III of vintage data analysis.
The really interesting story is yet to unfold.
2.5 Summary
In this chapter we have studied the adaptive smoothing spline for modeling het-
erogeneously ‘smooth’ functions including possible jumps. In particular, we studied
the AdaSS with piecewise adaptation of local penalty and derived the closed-form re-
producing kernels. To determine the local adaptation function, we proposed a shape
estimation procedure based on the function derivatives, together with the method of
cross-validation for parameter tuning. Four different examples, each having a specific
feature of heterogeneous smoothness, were used to demonstrate the improvement of
AdaSS over OrdSS.
Future works include the AdaSS properties w.r.t. (c) and (d) listed in Section 2.3.
Its connection with a time-warping smoothing spline is also under our investigation.
However, it is not clear how the AdaSS can be extended to multivariate smoothing
in a tensor-product space, as an analogy of the smoothing spline ANOVA models in
Gu (2002). In the next chapter, we will consider the application of AdaSS in a dual-time
domain, where the notion of smoothing spline will be generalized via other Gaussian
processes.
2.6 Technical Proofs
Proof of Proposition 2.1: by minor modification of Wahba (1990; §1.2–1.3).
By Taylor's theorem with remainder, any function in $\mathcal{W}_m$ (2.3) can be expressed by
$$f(t) = \sum_{j=0}^{m-1}\frac{t^j}{j!}f^{(j)}(0) + \int_0^1 G_m(t,u)\,f^{(m)}(u)\,du \qquad (2.30)$$
where $G_m(t,u) = \frac{(t-u)_+^{m-1}}{(m-1)!}$ is the Green's function for solving the boundary value problem: if $f_1^{(m)} = g$ with $f_1 \in \mathcal{B}_m = \{f : f^{(k)}(0) = 0, \text{ for } k = 0, 1, \ldots, m-1\}$, then $f_1(t) = \int_0^1 G_m(t,u)\,g(u)\,du$. The remainder functions $f_1(t) = \int_0^1 G_m(t,u)\,f^{(m)}(u)\,du$ lie in the subspace $\mathcal{H} = \mathcal{W}_m \cap \mathcal{B}_m$. Given the bounded integrable function $\rho(t) > 0$ on $t \in [0,1]$, associate $\mathcal{H}$ with the inner product
$$\langle f_1, g_1\rangle_{\mathcal{H}} = \int_0^1 f_1^{(m)}(u)\,g_1^{(m)}(u)\,\rho(u)\,du, \qquad (2.31)$$
and the induced squared norm $\|f_1\|^2_{\mathcal{H}} = \int_0^1 \rho(u)\big[f_1^{(m)}(u)\big]^2\,du$. By similar arguments to Wahba (1990; §1.2), one may verify that $\mathcal{H}$ is an RKHS with r.k. (2.7), where the
reproducing property
$$\langle f_1(\cdot), K(t,\cdot)\rangle_{\mathcal{H}} = f_1(t), \quad \forall f_1 \in \mathcal{H} \qquad (2.32)$$
follows from the fact that $\frac{\partial^m}{\partial t^m}K(t,s) = \rho^{-1}(t)\,G_m(s,t)$.
Then, using the arguments of Wahba (1990; §1.3) would prove the finite-dimensional
expression (2.8) that minimizes the AdaSS objective functional (2.6).
Proof of Proposition 2.2: The r.k. with piecewise $\rho^{-1}(t)$ equals
$$K(t,s) = \sum_{k=1}^{\kappa_{t\wedge s}} \int_{\tau_{k-1}}^{t\wedge s\wedge\tau_k^-} (a_k + b_k u)\,\frac{(u-t)^{m-1}}{(m-1)!}\,\frac{(u-s)^{m-1}}{(m-1)!}\,du.$$
For any $\tau_{k-1} \le t_1 \le t_2 < \tau_k$, using successive integration by parts,
$$\int_{t_1}^{t_2}(a_k + b_k u)\frac{(u-t)^{m-1}}{(m-1)!}\frac{(u-s)^{m-1}}{(m-1)!}\,du = \int_{t_1}^{t_2}(a_k + b_k u)\frac{(u-t)^{m-1}}{(m-1)!}\,d\,\frac{(u-s)^{m}}{m!}$$
$$= (a_k + b_k u)\frac{(u-t)^{m-1}}{(m-1)!}\frac{(u-s)^{m}}{m!}\bigg|_{t_1}^{t_2} - \int_{t_1}^{t_2}(a_k + b_k u)\frac{(u-t)^{m-2}}{(m-2)!}\frac{(u-s)^{m}}{m!}\,du - b_k\int_{t_1}^{t_2}\frac{(u-t)^{m-1}}{(m-1)!}\frac{(u-s)^{m}}{m!}\,du = \cdots,$$
it is straightforward (but tedious) to derive that
$$\int_{t_1}^{t_2}(a_k + b_k u)\frac{(u-t)^{m-1}}{(m-1)!}\frac{(u-s)^{m-1}}{(m-1)!}\,du = \sum_{j=0}^{m-1}(-1)^j (a_k + b_k u)\frac{(u-t)^{m-1-j}}{(m-1-j)!}\frac{(u-s)^{m+j}}{(m+j)!}\bigg|_{t_1}^{t_2} - b_k\sum_{j=0}^{m-1}(-1)^j T_j$$
in which $T_j$ denotes the integral evaluations for $j = 0, 1, \ldots, m-1$:
$$T_j \triangleq \int_{t_1}^{t_2}\frac{(u-t)^{m-1-j}}{(m-1-j)!}\frac{(u-s)^{m+j}}{(m+j)!}\,du = \sum_{l=0}^{m-1-j}(-1)^l\frac{(u-t)^{m-1-j-l}(u-s)^{m+1+j+l}}{(m-1-j-l)!\,(m+1+j+l)!}\bigg|_{t_1}^{t_2}.$$
Choosing $t_1 = \tau_{k-1}$, $t_2 = t\wedge s\wedge\tau_k^-$ for $k = 1, \ldots, \kappa_{t\wedge s}$ and adding them all gives the final r.k. evaluation by
$$K(t,s) = \sum_{k=1}^{\kappa_{t\wedge s}} \sum_{j=0}^{m-1}(-1)^j\Bigg[(a_k + b_k u)\frac{(u-t)^{m-1-j}(u-s)^{m+j}}{(m-1-j)!\,(m+j)!} - b_k\sum_{l=0}^{m-1-j}(-1)^l\frac{(u-t)^{m-1-j-l}(u-s)^{m+1+j+l}}{(m-1-j-l)!\,(m+1+j+l)!}\Bigg]\Bigg|_{\tau_{k-1}}^{t\wedge s\wedge\tau_k^-}.$$
CHAPTER III
Vintage Data Analysis
3.1 Introduction
The vintage data represents a special class of functional, longitudinal or panel data
with dual-time characteristics. It shares the cross-sectional time series feature such
that there are $J$ subjects, each observed at calendar time points $\{t_{jl},\ l = 1, \ldots, L_j\}$, where $L_j$ is the total number of observations on the $j$-th subject. Unlike the purely longitudinal setup, each subject $j$ we consider corresponds to a vintage originated at time $v_j$ such that $v_1 < v_2 < \ldots < v_J$, and the elapsed time $m_{jl} = t_{jl} - v_j$ measures the lifetime or age of the subject. To this end, we denote the vintage data by
$$\big\{y(m_{jl}, t_{jl}; v_j),\ l = 1, \ldots, L_j,\ j = 1, \ldots, J\big\}, \qquad (3.1)$$
where the dual-time points are usually sampled from the regular grids such that, for each vintage $j$, $m_{jl} = 0, 1, 2, \ldots$ and $t_{jl} = v_j, v_j + 1, v_j + 2, \ldots$. We call the grids of such $(m_{jl}, t_{jl}; v_j)$ a vintage diagram. Also observed on the vintage diagram could be
the covariate vectors $x(m_{jl}, t_{jl}; v_j) \in \mathbb{R}^p$. In situations where more than one segment of dual-time observations are available, we may denote the (static) segmentation variables by $z_k \in \mathbb{R}^q$, and denote the vintage data by $\{y(m_{kjl}, t_{kjl}; v_{kj}),\ x(m_{kjl}, t_{kjl}; v_{kj})\}$ for segment $k = 1, \ldots, K$. See Figure 1.3 for an illustration of the vintage diagram and
dual-time collection of observations, as well as the hierarchical structure of multi-
segment credit portfolios.
The vintage diagram could be truncated in either age or time, and several in-
teresting prototypes are illustrated in Figure 3.1. The triangular prototype on the
top-left panel is the most typical as the time series data are often subject to the
systematic truncation from right (say, observations up to today). Each vintage series
evolves in the 45° direction such that, for the same vintage $v_j$, the calendar time $t_j$ and the lifetime $m_j$ move at the same speed along the diagonal; therefore, the vintage diagram corresponds to the hyperplane $\{(t, m, v) : v = t - m\}$ in the 3D space. In practice, there exist other prototypes of vintage data truncation. The rect-
angular diagram is obtained after truncation in both lifetime and calendar time. The
corporate default rates released by Moody’s, shown in Table 1.1, are truncated in
lifetime to be 20 years maximum. The other trapezoidal prototype corresponds to
the time-snapshot selection common to credit risk modeling exercises. Besides, the
U.S. Treasury rates released for multiple maturity dates serve as an example of the
parallelogram prototype.
The age, time and vintage dimensions are all of potential interest in financial risk
management. The empirical evidences tell that the age effects of self-maturation na-
ture are often subject to smoothness in variation. The time effects are often dynamic
and volatile due to macroeconomic environment. Meanwhile, the vintage effects are
conceived as the forward-looking measure of performance since origination, and they
could be treated as random effects upon appropriate structural assumption. Bearing
in mind these practical considerations, we aim to develop a formal statistical frame-
work for vintage data analysis (VDA), by integrating ideas from functional data
analysis (FDA), longitudinal data analysis (LDA)¹ and time series analysis (TSA).
To achieve this, we will take a Gaussian process modeling approach, which is flexible
¹Despite the overlap between FDA and LDA, we try to differentiate them methodologically by
tagging FDA with nonparametric smoothing and tagging LDA with random-effect modeling.
Figure 3.1: Vintage diagram upon truncation and exemplified prototypes.
enough to cover a diversity of scenarios.
This chapter develops the VDA framework for continuous type of responses (usu-
ally, rates). It is organized as follows. In Section 3.2, we discuss the maturation-
exogenous-vintage (MEV) decomposition in general. The MEV models based on
Gaussian processes are presented in Section 3.3, where we discuss Kriging, smoothing
spline and kernel methods. An efficient backfitting algorithm is provided for model
estimation. Section 3.4 covers the semiparametric regression given the dual-time
covariates and segmentation variables. Computational results are presented in Sec-
tion 3.5, including a simulation study for assessing the MEV decomposition technique,
and real applications in both corporate and retail credit risk cases. We conclude the
chapter in Section 3.6 and give technical proofs in the last section.
3.2 MEV Decomposition Framework
Let $\mathcal{V} = \{v_j : 0 \equiv v_1 < v_2 < \ldots < v_J < \infty\}$ be a discrete set of origination times of $J$ vintages, and define the dual-time domain
$$\Omega = \big\{(m, t; v) : m \ge 0,\ t = v + m,\ v \in \mathcal{V}\big\},$$
where $m$ denotes the lifetime and $t$ denotes the calendar time. Unlike a usual 2D $(m, t)$ space, the dual-time domain $\Omega$ consists of isolated 45° lines $\{(m, t) : t - m = v_j,\ m \ge 0\}$ for each fixed vintage $v_j \in \mathcal{V}$. The triangular vintage diagram in Figure 3.1 is the discrete counterpart of $\Omega$ upon right truncation in calendar time.
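For intuition, the triangular vintage diagram is simply the set of (m, t, v) triples with t = v + m, truncated at the latest observation time. A small sketch is given below; the monthly grid is an arbitrary choice made for illustration, not a data set from the thesis.

```python
import numpy as np

def triangular_vintage_diagram(vintages, t_max):
    """Enumerate the dual-time points (m, t; v) with t = v + m and t <= t_max,
    i.e. the triangular prototype of the vintage diagram."""
    points = []
    for v in vintages:
        for m in range(0, t_max - v + 1):       # monthly ages 0, 1, 2, ...
            points.append((m, v + m, v))         # (lifetime, calendar time, vintage)
    return np.array(points)

# e.g. 84 monthly vintages observed up to month 84 (hypothetical grid)
diagram = triangular_vintage_diagram(vintages=range(0, 84), t_max=84)
print(diagram.shape)   # each row is (m_jl, t_jl, v_j)
```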
Let Ω be the support of the vintage data (3.1), for which we propose the MEV
(maturation-exogenous-vintage) decomposition modeling framework
η(y(m, t; v)) = f(m) + g(t) + h(v) + ε, (3.2)
where η is a pre-specified transform function, ε is the random error, and the three
MEV components have the following practical interpretations:
a) f(m) represents the maturation curve characterizing the endogenous growth,
b) g(t) represents the exogenous influence by macroeconomic conditions,
c) h(v) represents the vintage heterogeneity measuring the origination quality.
For the time being, let f, g, h be open to flexible choices of parametric, nonparametric
or stochastic process models. We begin with some general remarks about model
identification.
Suppose the maturation curve f(m) in (3.2) absorbs the intercept effect, and the
constraints Eg = Eh = 0 are set as usual:
$$\mathbb{E}g = \int_{t_{\min}}^{t_{\max}} g(t)\,dt = \mathbb{E}h = \int_{\mathcal{V}} h(v)\,dv = 0. \qquad (3.3)$$
The question is, do these constraints suffice to ensure the model identifiability? It
turns out for the dual-time domain Ω, all the MEV decomposition models are subject
to the linear dependency, as stated formally in the following lemma.
Lemma 3.1 (Identification). For the MEV models (3.2) defined on the dual-time
domain $\Omega$, write $f = f_0 + f_1$ with the linear part $f_0$ and the nonlinear part $f_1$, and similarly write $g = g_0 + g_1$, $h = h_0 + h_1$. Then, one of $f_0, g_0, h_0$ is not identifiable.
The identification (or collinearity) problem in Lemma 3.1 is incurred by the hy-
perplane nature of the dual-time domain $\Omega$ in the 3D space: $\{(t, m, v) : v = t - m\}$
(see Figure 3.1 top right panel). We need to break up such linear dependency in order
to make the MEV models estimable. A practical way is to perform binning in age,
time or vintage direction. For example, given the vintage data of triangular proto-
type, one may group the large m region, the small t region and the large v region, as
these regions have few observations. A commonly used binning procedure for h(v) is
to assume the piecewise-constant vintage effects by grouping monthly vintages every
quarter or every year.
An alternative strategy for breaking up the linear dependency is to directly remove
the trend for one of f, g, h functions. Indeed, this strategy is simple to implement,
provided that f, g, h all could be expanded by certain basis functions. Then, one only
needs to fix the linear bases for any two of f, g, h. By default, we remove the trend
of h and let it measure only the nonlinear heterogeneity. A word of caution is in
order in situations where there does exist an underlying trend $h_0(v)$; then it would be absorbed by $f(m)$ and $g(t)$. One may retrieve $h_0(v) = \gamma v$ by simultaneously adjusting $f(m) \to f(m) + \gamma m$ and $g(t) \to g(t) - \gamma t$. To determine the slope $\gamma$ requires extra
knowledge about the trend of at least one of f, g, h. The trend removal for h(v) is
needed only when there is no such extra knowledge. The MEV decomposition with
zero-trend vintage effects may be referred to as the ad hoc approach.
In what follows we review several existing approaches that are related to MEV
decomposition. The first example is a conventional age-period-cohort (APC) model
in social sciences of demography and epidemiology. The second example is the gener-
alized additive model (GAM) in nonparametric statistics, where the smoothing spline
is used as the scatterplot smoother. The third example is an industry GAMEED ap-
proach that may reduce to a sequential type of MEV decomposition. The last example
is another industrial DtD technique involving the interactions of MEV components.
They all can be viewed as one or another type of vintage data analysis.
The APC Model.
Suppose there are I distinct m-values and L distinct t-values in the vintage data.
By (3.2), setting $f(m) = \mu + \sum_{i=1}^{I}\alpha_i I(m = m_i)$, $g(t) = \sum_{l=1}^{L}\beta_l I(t = t_l)$ and $h(v) = \sum_{j=1}^{J}\gamma_j I(v = v_j)$ gives us the APC (age-period-cohort) model
$$\eta\big(y(m_i, t_l; v_j)\big) = \mu + \alpha_i + \beta_l + \gamma_j + \varepsilon, \qquad (3.4)$$
where the vintage effects $\{\gamma_j\}$ used to be called the cohort effects; see Mason, et al. (1973) and Mason and Wolfinger (2002). Assume $\sum_i\alpha_i = \sum_l\beta_l = \sum_j\gamma_j = 0$, as usual; then one may reformulate the APC model as $\mathbf{y} = X\boldsymbol{\theta} + \boldsymbol{\varepsilon}$, where $\mathbf{y}$ is the vector consisting of $\eta(y(m_i, t_l; v_j))$ and
$$X = \Big[\mathbf{1},\ x^{(\alpha)}_1, \ldots, x^{(\alpha)}_{I-1},\ x^{(\beta)}_1, \ldots, x^{(\beta)}_{L-1},\ x^{(\gamma)}_1, \ldots, x^{(\gamma)}_{J-1}\Big]$$
with dummy variables. For example, $x^{(\alpha)}_i$ is defined by $x^{(\alpha)}_i(m) = I(m = m_i) - I(m = m_I)$ for $i = 1, \ldots, I-1$. Without loss of generality, let us select the factorial levels $(m_I, t_L, v_J)$ to satisfy $m_I - t_L + v_J = 0$ (after reshuffling the indices of $v$).
The identification problem of the APC model (3.4) follows Lemma 3.1, since the
regression matrix is linearly dependent through
$$\sum_{i=1}^{I-1}(m_i - \bar{m})\,x^{(\alpha)}_i - \sum_{l=1}^{L-1}(t_l - \bar{t}\,)\,x^{(\beta)}_l + \sum_{j=1}^{J-1}(v_j - \bar{v})\,x^{(\gamma)}_j = \bar{m} - \bar{t} + \bar{v}, \qquad (3.5)$$
where $\bar{m} = \sum_{i=1}^{I} m_i/I$, $\bar{t} = \sum_{l=1}^{L} t_l/L$ and $\bar{v} = \sum_{j=1}^{J} v_j/J$. So, the ordinary least
squares (OLS) cannot be used for fitting y = Xθ + ε. This is a key observation by
Kupper (1985) that generated long-term discussions and debates on the estimability
of the APC model. Among others, Fu (2000) suggested to estimate the APC model
by ridge regression, whose limiting case leads to a so-called intrinsic estimator $\hat{\boldsymbol{\theta}} = (X^T X)^{+} X^T\mathbf{y}$, where $A^{+}$ is the Moore-Penrose generalized inverse of $A$ satisfying $AA^{+}A = A$.
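Computationally, the intrinsic estimator is a pseudo-inverse least-squares fit on the rank-deficient APC design matrix. Below is a minimal sketch; the dummy-coding helper is a simplification of the construction above and is introduced here only for illustration.

```python
import numpy as np

def apc_dummies(levels, values):
    """Sum-to-zero dummy coding: x_i(value) = I(value = level_i) - I(value = level_last)."""
    cols = [(values == lev).astype(float) - (values == levels[-1]).astype(float)
            for lev in levels[:-1]]
    return np.column_stack(cols)

def intrinsic_estimator(m, t, v, y):
    """APC intrinsic estimator theta-hat = (X^T X)^+ X^T y, cf. Fu (2000)."""
    X = np.column_stack([
        np.ones(len(y)),
        apc_dummies(np.unique(m), m),
        apc_dummies(np.unique(t), t),
        apc_dummies(np.unique(v), v),
    ])
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```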
The discrete APC model tends to massively identify the effects at the grid level and
often ends up with unstable estimates. Since the vintage data are always unbalanced
in m, t or v projection, there would be very limited observations for estimating
certain factorial effects in (3.4). The APC model is formulated over-flexibly in the
sense that it does not regularize the grid effect by neighbors. Fu (2008) suggested to
use the spline model for h(v); however, such spline regularization would not bypass
the identification problem unless it does not involve a linear basis.
The Additive Splines.
If f, g, h are all assumed to be unspecified smooth functions, the MEV models
would become the generalized additive models (GAM); see Hastie and Tibshirani
(1990). For example, using the cubic smoothing spline as the scatterplot smoother,
we have the additive splines model formulated by
$$\arg\min_{f,g,h}\ \Bigg\{\sum_{j=1}^{J}\sum_{l=1}^{L_j}\Big[\eta(y_{jl}) - f(m_{jl}) - g(t_{jl}) - h(v_j)\Big]^2 + \lambda_f\!\int\!\big[f^{(2)}(m)\big]^2 dm + \lambda_g\!\int\!\big[g^{(2)}(t)\big]^2 dt + \lambda_h\!\int\!\big[h^{(2)}(v)\big]^2 dv\Bigg\}, \qquad (3.6)$$
where $\lambda_f, \lambda_g, \lambda_h > 0$ are the tuning parameters separately controlling the smoothness of each target function; see (2.2).
Usually, the optimization (3.6) is solved by the backfitting algorithm that iterates
$$\hat{f} \leftarrow S_f\Big[\big\{\eta(y_{jl}) - \hat{g}(t_{jl}) - \hat{h}(v_j)\big\}_{j,l}\Big]$$
$$\hat{g} \leftarrow S_g\Big[\big\{\eta(y_{jl}) - \hat{f}(m_{jl}) - \hat{h}(v_j)\big\}_{j,l}\Big] \quad \text{(centered)} \qquad (3.7)$$
$$\hat{h} \leftarrow S_h\Big[\big\{\eta(y_{jl}) - \hat{f}(m_{jl}) - \hat{g}(t_{jl})\big\}_{j,l}\Big] \quad \text{(centered)}$$
after initializing $\hat{g} = \hat{h} = 0$. See Chapter II about the formation of the smoothing matrices $S_f, S_g, S_h$, for which the corresponding smoothing parameters can be selected by the method of generalized cross-validation. However, the linear parts of $f, g, h$ are not identifiable, by Lemma 3.1. To get around the identification problem, we may take
the ad hoc approach to remove the trend of h(v). The trend-removal can be achieved
by appending to every iteration of (3.7):
$$\hat{h}(v) \leftarrow \hat{h}(v) - v\,\bigg(\sum_{j=1}^{J} v_j^2\bigg)^{-1}\sum_{j=1}^{J} v_j\,\hat{h}(v_j).$$
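The backfitting loop (3.7) with the trend-removal step is straightforward to code once smoothers for each direction are available. The sketch below treats the three smoothers as generic callables (for instance, spline smoothers built as in Chapter II); it is an illustrative outline under that assumption, not the exact estimation routine of the thesis.

```python
import numpy as np

def backfit_mev(y, m, t, v, smooth_f, smooth_g, smooth_h, n_iter=20):
    """Additive backfitting of eta(y) = f(m) + g(t) + h(v) + eps, cf. (3.7),
    with centering of g, h and linear trend removal on h for identifiability."""
    f_hat = np.zeros_like(y); g_hat = np.zeros_like(y); h_hat = np.zeros_like(y)
    for _ in range(n_iter):
        f_hat = smooth_f(m, y - g_hat - h_hat)
        g_hat = smooth_g(t, y - f_hat - h_hat)
        g_hat -= g_hat.mean()                               # centering of g
        h_hat = smooth_h(v, y - f_hat - g_hat)
        h_hat -= h_hat.mean()                               # centering of h
        h_hat -= v * np.sum(v * h_hat) / np.sum(v ** 2)     # remove linear trend in v
    return f_hat, g_hat, h_hat
```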
We provide more details on (3.6) to shed light on its further development in the
next section. By Proposition 2.1 in Chapter II, each of f, g, h can be expressed by
the basis expansion of the form (2.8). Taking into account the identifiability of both
the intercept and the trend, we may express f, g, h by
$$f(m) = \mu_0 + \mu_1 m + \sum_{i=1}^{I}\alpha_i K(m, m_i)$$
$$g(t) = \mu_2 t + \sum_{l=1}^{L}\beta_l K(t, t_l) \qquad (3.8)$$
$$h(v) = \sum_{j=1}^{J}\gamma_j K(v, v_j).$$
Denote $\boldsymbol{\mu} = (\mu_0, \mu_1, \mu_2)^T$, $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_I)^T$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_L)^T$, and $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_J)^T$.
We may equivalently write (3.6) as the regularized least squares problem
$$\min\ \bigg\{\frac{1}{n}\big\|\mathbf{y} - B\boldsymbol{\mu} - \bar{\Sigma}_f\boldsymbol{\alpha} - \bar{\Sigma}_g\boldsymbol{\beta} - \bar{\Sigma}_h\boldsymbol{\gamma}\big\|^2 + \lambda_f\big\|\boldsymbol{\alpha}\big\|^2_{\Sigma_f} + \lambda_g\big\|\boldsymbol{\beta}\big\|^2_{\Sigma_g} + \lambda_h\big\|\boldsymbol{\gamma}\big\|^2_{\Sigma_h}\bigg\}, \qquad (3.9)$$
where $\mathbf{y}$ is the vector of $n = \sum_{j=1}^{J} L_j$ observations $\{\eta(y_{jl})\}$, and $B$, $\bar{\Sigma}_f$, $\Sigma_f$ are formed by
$$B = \Big[(1, m_{jl}, t_{jl})\Big]_{n\times 3}, \quad \bar{\Sigma}_f = \Big[K(m_{jl}, m_i)\Big]_{n\times I}, \quad \Sigma_f = \Big[K(m_i, m_{i'})\Big]_{I\times I},$$
and $\bar{\Sigma}_g, \Sigma_g, \bar{\Sigma}_h, \Sigma_h$ are formed similarly.
The GAMEED Approach.
The GAMEED refers to an industry approach, namely the generalized additive
maturation and exogenous effects decomposition, adopted by Enterprise Quantitative
Risk Management, Bank of America, N.A. (Sudjianto, et al., 2006). It has also been
documented as part of the patented work “Risk and Reward Assessment Mechanism”
(USPTO: 20090063361). Essentially, it has two practical steps:
1. Estimate the maturation curve $\hat{f}(m)$ and the exogenous effect $\hat{g}(t)$ from
$$\eta\big(y(m, t; v)\big) = f(m; \boldsymbol{\alpha}) + g(t; \boldsymbol{\beta}) + \varepsilon, \qquad \mathbb{E}g = 0,$$
where f(m; α) can either follow a spline model or take a parametric form sup-
ported by prior knowledge, and $g(t; \boldsymbol{\beta})$ can be specified like in (3.4) upon necessary time binning while retaining the flexibility in modeling the macroeconomic
influence. If necessary, one may further model ˆ g(t; β) by a time-series model in
order to capture the autocorrelation and possible jumps.
2. Estimate the vintage-specific sensitivities to $\hat{f}(m)$ and $\hat{g}(t)$ by performing the regression fitting
$$\eta\big(y(m_{jl}, t_{jl}; v_j)\big) = \gamma_j + \gamma^{(f)}_j\hat{f}(m_{jl}) + \gamma^{(g)}_j\hat{g}(t_{jl}) + \varepsilon$$
for each vintage $j$ (upon necessary bucketing).
The first step of GAMEED is a reduced type of MEV decomposition without suffering the identification problem. The estimates of $f(m)$ and $g(t)$ can be viewed as the common factors to all vintages. In the second step, the vintage effects are measured through not only the main (intercept) effects, but also the interactions with $\hat{f}(m)$ and $\hat{g}(t)$. Clearly, when the interaction sensitivities $\gamma^{(f)}_j = \gamma^{(g)}_j = 1$, the industry GAMEED approach becomes a sequential type of MEV decomposition.
The DtD Technique.
The DtD refers to the dual-time dynamics, adopted by Strategic Analytics Inc.,
for consumer behavior modeling and delinquency forecasting. It was brought to the
public domain by Breeden (2007). Using our notations, the DtD model takes the
form
$$\eta\big(y(m_{jl}, t_{jl}; v_j)\big) = \gamma^{(f)}_j f(m_{jl}) + \gamma^{(g)}_j g(t_{jl}) + \varepsilon \qquad (3.10)$$
and $\eta^{-1}(\cdot)$ corresponds to the so-called superposition function in Breeden (2007). Clearly, the DtD model involves the interactions between the vintage effects and $(f(m), g(t))$. Unlike the sequential GAMEED above, Breeden (2007) suggested to iteratively fit $f(m)$, $g(t)$ and $\gamma^{(f)}_j, \gamma^{(g)}_j$ in a simultaneous manner.
However, the identification problem could as well happen to the DtD technique,
unless it imposes structural assumptions that could break up the trend dependency.
To see this, one may use Taylor expansion for each nonparametric function in (3.10)
then apply Lemma 3.1 to the linear terms. Unfortunately, such identification problem
was not addressed by Breeden (2007), and the stability of the iterative estimation
algorithm might be an issue.
3.3 Gaussian Process Models
In MEV decomposition framework (3.2), the maturation curve f(m), the exoge-
nous influence g(t) and the vintage heterogeneity h(v) are postulated to capture the
endogenous, exogenous and origination effects on the dual-time responses. In this
section, we propose a general approach to MEV decomposition modeling based on
Gaussian processes and provide an efficient algorithm for model estimation.
3.3.1 Covariance Kernels
Gaussian processes are rather flexible in modeling a function Z(x) through a mean
function $\mu(\cdot)$ and a covariance kernel $K(\cdot,\cdot)$,
$$\mathbb{E}Z(x) = \mu(x), \qquad \mathbb{E}\big[Z(x) - \mu(x)\big]\big[Z(x') - \mu(x')\big] = \sigma^2 K(x, x'), \qquad (3.11)$$
where $\sigma^2$ is the variance scale. For ease of discussion, we will write $\mu(x)$ separately
and only consider the zero-mean Gaussian process $Z(x)$. Then, any finite-dimensional collection of random variables $\mathbf{Z} = (Z(x_1), \ldots, Z(x_n))$ follows a multivariate normal distribution $\mathbf{Z} \sim N_n(\mathbf{0}, \sigma^2\Sigma)$ with $\Sigma = [K(x_i, x_j)]_{n\times n}$.
The covariance kernel $K(\cdot,\cdot)$ plays a key role in constructing Gaussian processes. It is symmetric, $K(x, x') = K(x', x)$, and positive definite: $\int\!\!\int K(x, x')f(x)f(x')\,dx\,dx' > 0$ for any non-zero $f \in L^2$. The kernel is said to be stationary if $K(x, x')$ depends only on $|x_i - x'_i|$, $i = 1, \ldots, d$ for $x \in \mathbb{R}^d$. It is further said to be isotropic or radial if $K(x, x')$ depends only on the distance $\|x - x'\|$. For $x \in \mathbb{R}$, a stationary kernel is also isotropic. There is a parsimonious approach to using stationary kernels $K(\cdot,\cdot)$ to model certain nonstationary Gaussian processes such that $\mathrm{Cov}[Z(x), Z(x')] = \sigma(x)\sigma(x')K(x, x')$, where the variance scale $\sigma^2(x)$ requires a separate model of either parametric or nonparametric type. For example, Fan, Huang and Li (2007) used the local polynomial approach to smoothing $\sigma^2(x)$.
Several popular classes of covariance kernels are listed below, including the adap-
tive smoothing spline (AdaSS) kernels discussed in Chapter II.
$$\text{Exponential family:}\quad K(x, x') = \exp\Big\{-\big(|x - x'|/\phi\big)^{\kappa}\Big\}, \quad 0 < \kappa \le 2,\ \phi > 0$$
$$\text{Matérn family:}\quad K(x, x') = \frac{2^{1-\kappa}}{\Gamma(\kappa)}\Big(\frac{|x - x'|}{\phi}\Big)^{\kappa} K_{\kappa}\Big(\frac{|x - x'|}{\phi}\Big), \quad \kappa, \phi > 0 \qquad (3.12)$$
$$\text{AdaSS r.k. family:}\quad K(x, x') = \int_{\mathcal{X}} \phi^{-1}(u)\,G_{\kappa}(x, u)\,G_{\kappa}(x', u)\,du, \quad \phi(x) > 0.$$
More examples of covariance kernels can be referred to Stein (1999) and Rasmussen and Williams (2006).
Both the exponential and Matérn families consist of stationary kernels, for which $\phi$ is the scale parameter and $\kappa$ can be viewed as the shape parameter. In the Matérn family, $K_{\kappa}(\cdot)$ denotes the modified Bessel function of order $\kappa$; see Abramowitz and Stegun (1972). Often used are the half-integer orders $\kappa = 1/2, 3/2, 5/2$, and the corresponding Matérn covariance kernels take the following forms
$$K(x, x') = e^{-\frac{|x-x'|}{\phi}}, \quad \kappa = 1/2$$
$$K(x, x') = e^{-\frac{|x-x'|}{\phi}}\Big(1 + \frac{|x-x'|}{\phi}\Big), \quad \kappa = 3/2 \qquad (3.13)$$
$$K(x, x') = e^{-\frac{|x-x'|}{\phi}}\Big(1 + \frac{|x-x'|}{\phi} + \frac{|x-x'|^2}{3\phi^2}\Big), \quad \kappa = 5/2.$$
It is interesting that these half-integer Matérn kernels correspond to autocorrelated Gaussian processes of order $\kappa + 1/2$; see Stein (1999; p. 31). In particular, the $\kappa = 1/2$ Matérn kernel, which is equivalent to the exponential kernel with $\kappa = 1$, is the covariance function of the Ornstein-Uhlenbeck (OU) process.
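The half-integer Matérn kernels (3.13) are simple to evaluate directly, as in the sketch below (a generic helper written for this exposition, not code from the thesis).

```python
import numpy as np

def matern_half_integer(x, x2, phi=1.0, kappa=1.5):
    """Half-integer Matern kernels of (3.13); kappa in {0.5, 1.5, 2.5}."""
    d = np.abs(np.subtract.outer(x, x2)) / phi
    if kappa == 0.5:
        return np.exp(-d)                                  # OU / exponential kernel
    if kappa == 1.5:
        return np.exp(-d) * (1.0 + d)
    if kappa == 2.5:
        return np.exp(-d) * (1.0 + d + d**2 / 3.0)
    raise ValueError("kappa must be 1/2, 3/2 or 5/2")

# e.g. covariance matrix for the exogenous effect g(t) on a monthly grid
t = np.arange(0, 60, dtype=float)
K_g = matern_half_integer(t, t, phi=6.0, kappa=1.5)
```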
The AdaSS family of reproducing kernels (r.k.'s) is based on the Green's function $G_{\kappa}(x, u) = (x - u)_+^{\kappa-1}/(\kappa - 1)!$ with order $\kappa$ and local adaptation function $\phi(x) > 0$. In Chapter 2.2.1, we have derived the explicit r.k. expressions with piecewise $\phi^{-1}(x)$. They are nonstationary covariance kernels, with the corresponding Gaussian processes discussed in Chapter 2.3. In this chapter, we concentrate on a particular class of AdaSS kernels
$$K(x, x') = \sum_{k=1}^{\min\{k:\,\tau_k > x\wedge x'\}} \phi_k\big(x\wedge x' - \tau_{k-1}\big) \qquad (3.14)$$
with piecewise-constant adaptation $\phi^{-1}(x) = \phi_k$ for $x \in [\tau_{k-1}, \tau_k)$ and knots $x^-_{\min} \equiv \tau_0 < \tau_1 < \cdots < \tau_K \equiv x^+_{\max}$. By (2.29), the Gaussian process with kernel (3.14) generalizes the standard Wiener process with nonstationary volatility, which is useful for modeling the heterogeneous behavior of an MEV component, e.g. $g(t)$.
3.3.2 Kriging, Spline and Kernel Methods
Using Gaussian processes to model the component functions in (3.2) would give
us a class of Gaussian process MEV models. Among others, we consider
$$\eta\big(y(m, t; v)\big) = \mu(m, t; v) + Z_f(m) + Z_g(t) + Z_h(v) + \varepsilon \qquad (3.15)$$
where $\varepsilon \sim N(0, \sigma^2)$ is the random error, and $Z_f(m), Z_g(t), Z_h(v)$ are three zero-mean Gaussian processes with covariance kernels $\sigma^2_f K_f$, $\sigma^2_g K_g$, $\sigma^2_h K_h$, respectively. To bypass the identification problem in Lemma 3.1, the overall mean function is postulated to be $\mu(m, t; v) \in \mathcal{H}_0$ on the hyperplane of the vintage diagram, where
$$\mathcal{H}_0 \subseteq \mathrm{span}\{1, m, t\} \cap \mathrm{null}\{Z_f(m), Z_g(t), Z_h(v)\}, \qquad (3.16)$$
among other choices. For example, choosing cubic spline kernels for all $f, g, h$ allows for $\mathcal{H}_0 = \mathrm{span}\{1, m, t\}$, then $\mu(m, t; v) = \mu_0 + \mu_1 m + \mu_2 t$. In other situations, the functional space $\mathcal{H}_0$ might shrink to $\mathrm{span}\{1\}$ in order to satisfy (3.16), then $\mu(m, t; v) = \mu_0$. In general, let us denote $\mu(m, t; v) = \boldsymbol{\mu}^T\mathbf{b}(m, t; v)$, for $\boldsymbol{\mu} \in \mathbb{R}^d$.
For simplicity, let us assume that $Z_f(m), Z_g(t), Z_h(v)$ are mutually independent for $x \equiv (m, t, v) \in \mathbb{R}^3$. Consider the sum of univariate Gaussian processes
$$Z(x) = Z_f(m) + Z_g(t) + Z_h(v), \qquad (3.17)$$
with mean zero and covariance $\mathrm{Cov}[Z(x), Z(x')] = \sigma^2 K(x, x')$. By the independence assumption,
$$K(x, x') = \lambda_f^{-1} K_f(m, m') + \lambda_g^{-1} K_g(t, t') + \lambda_h^{-1} K_h(v, v'), \qquad (3.18)$$
with $\lambda_f = \sigma^2/\sigma^2_f$, $\lambda_g = \sigma^2/\sigma^2_g$ and $\lambda_h = \sigma^2/\sigma^2_h$. Then, given the vintage data (3.1) with $n = \sum_{j=1}^{J} L_j$ observations, we may write (3.15) in the vector-matrix form
$$\mathbf{y} = B\boldsymbol{\mu} + \bar{\mathbf{z}}, \qquad \mathrm{Cov}[\bar{\mathbf{z}}] = \sigma^2\big(\Sigma + I\big), \qquad (3.19)$$
where $\mathbf{y} = \big[\eta(y_{jl})\big]_{n\times 1}$, $B = \big[\mathbf{b}^T_{jl}\big]_{n\times d}$, $\bar{\mathbf{z}} = \big[Z(x_{jl}) + \varepsilon\big]_{n\times 1}$ and $\Sigma = \big[K(x_{jl}, x_{j'l'})\big]_{n\times n}$.
By either GLS (generalized least squares) or MLE (maximum likelihood estimation), $\boldsymbol{\mu}$ is estimated to be
´ µ =
_
B
T
_
Σ+I
¸
−1
B
_
−1
B
T
_
Σ+I
¸
−1
y. (3.20)
where the invertibility of Σ + I is guaranteed by large values of λ
f
, λ
g
, λ
h
as in the
context of ridge regression.
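For concreteness, here is a minimal numerical sketch of (3.18)–(3.20) in Python (NumPy only). All names are ours, and the squared exponential kernel is used merely as a placeholder for K_f, K_g, K_h; it is an illustration under those assumptions rather than the thesis code.

```python
import numpy as np

def sq_exp(u, v, phi):
    """Squared exponential kernel, a placeholder for K_f, K_g, K_h."""
    return np.exp(-np.subtract.outer(u, v) ** 2 / (2.0 * phi ** 2))

def aggregated_kernel(m, t, v, lam, phi):
    """Eq. (3.18): K = K_f/lam_f + K_g/lam_g + K_h/lam_h on the observed points."""
    return (sq_exp(m, m, phi[0]) / lam[0]
            + sq_exp(t, t, phi[1]) / lam[1]
            + sq_exp(v, v, phi[2]) / lam[2])

def gls_estimate(y, B, Sigma):
    """Eq. (3.20): mu_hat = (B'(Sigma+I)^{-1}B)^{-1} B'(Sigma+I)^{-1} y."""
    A = Sigma + np.eye(len(y))
    Ainv_B = np.linalg.solve(A, B)
    Ainv_y = np.linalg.solve(A, y)
    return np.linalg.solve(B.T @ Ainv_B, B.T @ Ainv_y)

# Toy vintage-style data on the dual-time hyperplane t = v + m
rng = np.random.default_rng(0)
v = np.repeat(np.arange(0.0, 12.0), 6)        # vintage origination times
m = np.tile(np.arange(1.0, 7.0), 12)          # months-on-book
t = v + m                                     # calendar time
y = 0.5 + 0.02 * m + 0.01 * t + 0.1 * rng.standard_normal(len(m))
B = np.column_stack([np.ones_like(m), m, t])  # H_0 = span{1, m, t}
Sigma = aggregated_kernel(m, t, v, lam=(10.0, 10.0, 10.0), phi=(3.0, 3.0, 3.0))
print("GLS estimate of mu:", np.round(gls_estimate(y, B, Sigma), 3))
```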
In the Gaussian process model there are additional unknown parameters for defining the covariance kernel (3.18), namely (λ, θ) = (λ_f, λ_g, λ_h, θ_f, θ_g, θ_h), as well as the unknown variance parameter σ² (a.k.a. the nugget effect). The λ_f, λ_g, λ_h, as the ratios of σ² to the variance scales of Z_f, Z_g, Z_h, are usually referred to as the smoothing parameters. The θ_f, θ_g, θ_h denote the structural parameters, e.g. the scale parameter φ and shape parameter κ in the exponential or Matérn family of kernels, or the local adaptation φ(·) in the AdaSS kernel. By (3.19) and after dropping the constant term, the log-likelihood function is given by

ℓ(µ, λ, θ, σ²) = −(n/2) log(σ²) − (1/2) log|Σ + I| − (1/(2σ²)) (y − Bµ)^T (Σ + I)^{−1} (y − Bµ),

in which Σ depends on (λ, θ). Whenever (λ, θ) is fixed, the µ estimate is given by (3.20), and one may then estimate the variance parameter by

σ̂² = (1/(n − p)) (y − Bµ̂)^T (Σ + I)^{−1} (y − Bµ̂),     (3.21)

where p = rank(B) is plugged in to make σ̂² an unbiased estimator.

Usually, numerical procedures like the Newton-Raphson algorithm are needed to estimate (λ, θ) iteratively, since they admit no closed-form solutions; see Fang, Li and Sudjianto (2006, §5). Such a standard maximum likelihood estimation method is computationally intensive. In the next subsection, we discuss how to estimate the parameters iteratively by an efficient backfitting algorithm.
MEV Kriging means prediction based on the Gaussian process MEV model (3.15), where the name 'Kriging' follows the geostatistical convention; see Cressie (1993) in spatial statistics. See also Fang, Li and Sudjianto (2006) for the use of Kriging in the context of computer experiments. Kriging can be treated equally well by multivariate normal distribution theory or by a Bayesian approach, since the conditional expectation of a multivariate normal distribution equals the Bayesian posterior mean. The former approach is taken here for making predictions from the noisy observations, for which Kriging behaves as a smoother rather than an interpolator.
Let us denote the prediction of η(ŷ(x)) at x = (m, t; v) by ζ(x) = µ̂^T b(x) + E[Z_f(m) + Z_g(t) + Z_h(v) + ε | z̄], where E[Z | z̄] denotes the conditional expectation of Z given z̄ = y − Bµ̂. By the property of the multivariate normal distribution,

ζ̂(x) = µ̂^T b(x) + ξ_n^T(x) (Σ + I)^{−1} (y − Bµ̂)     (3.22)

where ξ_n(x) denotes the vector [K(x, x_jl)]_{n×1} evaluated between x and each sampling point.
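Continuing the sketch above, the Kriging predictor (3.22) takes only a few lines; this is again a hedged illustration with our own helper names, and the toy inputs below are made up solely to give the right shapes.

```python
import numpy as np

def kriging_predict(y, B, Sigma, mu_hat, b_new, xi_new):
    """Eq. (3.22): zeta(x) = b(x)' mu_hat + xi_n(x)' (Sigma + I)^{-1} (y - B mu_hat).

    b_new  : basis vector b(x) at the new point (length d)
    xi_new : vector [K(x, x_jl)] of covariances to the n training points
    """
    A = Sigma + np.eye(len(y))
    resid = y - B @ mu_hat
    return float(b_new @ mu_hat + xi_new @ np.linalg.solve(A, resid))

# Toy usage with made-up quantities of compatible shapes (n = 8, d = 3)
rng = np.random.default_rng(1)
n, d = 8, 3
Sigma = 0.5 * np.eye(n)
B = rng.standard_normal((n, d))
y = rng.standard_normal(n)
mu_hat = np.zeros(d)
print(kriging_predict(y, B, Sigma, mu_hat, b_new=np.ones(d), xi_new=np.full(n, 0.1)))
```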
The Kriging predictor (3.22) is known to be a best linear unbiased predictor. It also has a smoothing spline reformulation that covers the additive cubic smoothing splines model (3.6) as a special case. Formally, we have the following proposition.
Proposition 3.2 (MEV Kriging as Spline Estimator). The MEV Kriging (3.22) can be formulated as a smoothing spline estimator

arg min_{ζ ∈ H}  Σ_{j=1}^J Σ_{l=1}^{L_j} [η(y_jl) − ζ(x_jl)]² + ‖ζ‖²_{H_1}     (3.23)

where H = H_0 ⊕ H_1 with H_0 ⊆ span{1, m, t}, and H_1 is the reproducing kernel Hilbert space induced by (3.18) such that H_0 ⊥ H_1.
The spline solution to (3.23) can be written as the finite-dimensional expression

ζ(x) = µ^T b(x) + Σ_{j=1}^J Σ_{l=1}^{L_j} c_jl K(x, x_jl)     (3.24)

with coefficients determined by the regularized least squares

min  ‖y − Bµ − Σc‖² + ‖c‖²_Σ.     (3.25)
Furthermore, we have the following simplification.

Proposition 3.3 (Additive Separability). The MEV Kriging (3.22) can be expressed additively as

ζ(x) = µ^T b(x) + Σ_{i=1}^I α_i K_f(m, m_i) + Σ_{l=1}^L β_l K_g(t, t_l) + Σ_{j=1}^J γ_j K_h(v, v_j)     (3.26)

based on the I distinct m-values, L distinct t-values and J distinct v-values in (3.1). The coefficients can be estimated by

min  ‖y − Bµ − Σ̄_f α − Σ̄_g β − Σ̄_h γ‖² + λ_f ‖α‖²_{Σ_f} + λ_g ‖β‖²_{Σ_g} + λ_h ‖γ‖²_{Σ_h},     (3.27)

where the matrix notations are the same as in (3.9) up to mild modifications.
Proposition 3.3 converts the MEV Kriging (3.22) to a kernelized additive model of the form (3.26). It connects to the kernel methods in machine learning, which advocate the use of covariance kernels for mapping the original inputs to a high-dimensional feature space. Based on our previous discussion, such a kernelized feature space corresponds to the family of Gaussian processes, or the RKHS; see e.g. Rasmussen and Williams (2006) for more details.

Proposition 3.3 breaks down the high-dimensional parameter estimation and makes possible the development of a backfitting algorithm that handles the three univariate Gaussian processes separately. Such a backfitting procedure is more efficient than the standard method of maximum likelihood estimation. Besides, it can easily deal with the collinearity problem among Z_f(m), Z_g(t), Z_h(v) caused by the hyperplane dependency of the vintage diagram.
3.3.3 MEV Backfitting Algorithm
For given (λ, θ), the covariance kernels in (3.18) are defined explicitly, so one may find the optimal solution to (3.27) by setting the partial derivatives to zero and solving the resulting linear system of equations. There are, however, complications in determining (λ, θ). In this section we propose a backfitting algorithm to minimize the regularized least squares criterion (3.27), together with a generalized cross-validation (GCV) procedure for selecting the smoothing parameters λ_f, λ_g, λ_h as well as the structural parameters θ_f, θ_g, θ_h used for defining the covariance kernels. Here, the GCV criterion follows our discussion in Chapter 2.2.2; see also Golub, Heath and Wahba (1979) in the context of ridge regression.

Denote the complete set of parameters by Ξ = {µ; α, λ_f, θ_f; β, λ_g, θ_g; γ, λ_h, θ_h; σ²}, where the λ's are smoothing parameters and the θ's are structural parameters for defining the covariance kernels. Given the vintage data (3.1) with the notations y, B in (3.6):
Step 1: Set the initial value of µ to be the ordinary least squares (OLS) estimate (B^T B)^{−1} B^T y; set the initial values of β, γ to zero.

Step 2a: Given Ξ with (α, λ_f, θ_f) unknown, form the pseudo data ȳ = y − Bµ − Σ̄_g β − Σ̄_h γ. For each input (λ_f, θ_f), estimate α by the ridge regression

α̂ = (Σ̄_f^T Σ̄_f + λ_f Σ_f)^{−1} Σ̄_f^T ȳ     (3.28)

where Σ̄_f = [K_f(m_jl, m_i; θ_f)]_{n×I} and Σ_f = [K_f(m_i, m_j; θ_f)]_{I×I}. Determine the best choice of (λ_f, θ_f) by minimizing

GCV(λ_f, θ_f) = (1/n) ‖(I − S̄(λ_f, θ_f)) ȳ‖² / [1 − (1/n) Trace(S̄(λ_f, θ_f))]²,     (3.29)

where S̄(λ_f, θ_f) = Σ̄_f (Σ̄_f^T Σ̄_f + λ_f Σ_f)^{−1} Σ̄_f^T. (A minimal numerical sketch of this update is given after Step 3.)

Step 2b: Run Step 2a with the unknown parameters replaced by (β, λ_g, θ_g) and ȳ = y − Bµ − Σ̄_f α − Σ̄_h γ. Form Σ̄_g, Σ_g and S̄(λ_g, θ_g). Select (λ_g, θ_g) by GCV, then estimate β by ridge regression.

Step 2c: Run Step 2a with the unknown parameters replaced by (γ, λ_h, θ_h) and ȳ = y − Bµ − Σ̄_f α − Σ̄_g β. (For the ad hoc approach, remove the trend of ȳ in the vintage direction.) Form Σ̄_h, Σ_h and S̄(λ_h, θ_h). Select (λ_h, θ_h) by GCV. Estimate γ by ridge regression.

Step 2d: Re-estimate µ by GLS (3.20) for the given (λ_f, λ_g, λ_h, θ_f, θ_g, θ_h).

Step 3: Repeat Steps 2a–2d until convergence, say, when the estimates of µ, α, β, γ change by less than a pre-specified tolerance. Obtain σ̂² by (3.21).
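The following minimal Python sketch shows the Step 2a update, i.e. the ridge regression (3.28) together with the GCV score (3.29). The kernel, data and names are illustrative placeholders of ours, not the thesis implementation.

```python
import numpy as np

def ridge_gcv_step(y_bar, Sigma_bar, Sigma_knot, lam):
    """One Step-2a update: alpha_hat by (3.28) and the GCV score by (3.29)."""
    n = len(y_bar)
    A = Sigma_bar.T @ Sigma_bar + lam * Sigma_knot
    alpha = np.linalg.solve(A, Sigma_bar.T @ y_bar)
    S = Sigma_bar @ np.linalg.solve(A, Sigma_bar.T)     # smoother matrix S_bar
    resid = y_bar - S @ y_bar
    gcv = (resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2
    return alpha, gcv

# Toy example: choose lambda_f on a small grid for a made-up exponential kernel
rng = np.random.default_rng(2)
m_obs = np.repeat(np.arange(1.0, 13.0), 3)              # replicated months-on-book
m_knots = np.unique(m_obs)
K = lambda a, b: np.exp(-np.abs(np.subtract.outer(a, b)) / 3.0)
Sigma_bar, Sigma_knot = K(m_obs, m_knots), K(m_knots, m_knots)
y_bar = np.log(1.0 + m_obs) + 0.05 * rng.standard_normal(len(m_obs))
for lam in (0.1, 1.0, 10.0):
    _, gcv = ridge_gcv_step(y_bar, Sigma_bar, Sigma_knot, lam)
    print(f"lambda = {lam:5.1f}   GCV = {gcv:.5f}")
```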
For future reference, the above iterative procedure is named the MEV Backfitting algorithm. It is efficient and automatically takes care of the selection of both smoothing and structural parameters. After obtaining the parameter estimates, we may reconstruct the vintage data by ŷ = Bµ̂ + Σ̄_f α̂ + Σ̄_g β̂ + Σ̄_h γ̂. The following are several remarks for practical implementation:

1. In order to capture a diversity of important features, a combination of different covariance kernels can be included in the same MEV model. The exponential, Matérn and AdaSS r.k. families listed in (3.12), (3.13), (3.14) are of primary interest here, whereas other types of kernels can be readily employed. When the underlying function is heterogeneously smooth, e.g. involving jumps, one may choose the AdaSS kernel (3.14).
2. For the ridge regression (3.28), the matrix Σ̄_f contains row-wise replicates and can be reduced by weighted least squares, in which case the response vector ȳ is reduced to the pointwise averages; see (2.4). The evaluation of the GCV criterion (3.29) can be modified accordingly; see (2.22).

3. When the mean function µ(m, t; v) is specified to be the intercept effect, i.e. µ(m, t; v) = µ_0, one may replace µ_0 by the mean of y in Step 1 and Step 2d. In this case, we may write the MEV Kriging as ζ(x) = f̂(m) + ĝ(t) + ĥ(v) with

f̂(m) = µ̂_0 + Σ_{i=1}^I α̂_i K_f(m, m_i),   ĝ(t) = Σ_{l=1}^L β̂_l K_g(t, t_l),   ĥ(v) = Σ_{j=1}^J γ̂_j K_h(v, v_j).

4. In principle, both ĝ(t) and ĥ(v) in Step 2b and Step 2c have mean zero, but in practice they may need a centering adjustment due to machine rounding.
3.4 Semiparametric Regression
Having discussed the MEV decomposition models based on Gaussian processes, we consider in this section the inclusion of covariates in the following two scenarios:

1. x(m_jl, t_jl; v_j) ∈ R^p (or x_jl in short) for a fixed segment, where x_jl denotes the covariates of the j-th vintage at time t_jl and age m_jl = t_jl − v_j;

2. x(m_kjl, t_kjl; v_kj) ∈ R^p and z_k ∈ R^q for multiple segments that share the same vintage diagram, where z_k represents the static segmentation variables.

Of interest are the semiparametric MEV regression models with both nonparametric f(m), g(t), h(v) and parametric covariate effects.

After a brief discussion of the first situation with a single segment, we will spend more time on the second situation with multiple segments. For simplicity, the covariates x_jl or x_kjl are assumed to be deterministic, i.e. there is no error in variables.
3.4.1 Single Segment
Given a single segment with dual-time observations {y(m_jl, t_jl; v_j), x(m_jl, t_jl; v_j)} for j = 1, . . . , J and l = 1, . . . , L_j, we consider the partial linear model that adds linear covariate effects to the MEV decomposition (3.2),

η(y_jl) = f(m_jl) + g(t_jl) + h(v_j) + π^T x_jl + ε     (3.30)
        = µ^T b_jl + π^T x_jl + Z_f(m_jl) + Z_g(t_jl) + Z_h(v_j) + ε

where x_jl ≡ x(m_jl, t_jl; v_j), (µ, π) are the parameters for the mean function, and f, g, h are nonparametric and modeled by Gaussian processes.
The semiparametric regression model (3.30) corresponds to linear mixed-effect modeling, and the covariate effects π can be estimated in the same way we estimated µ in (3.20). That is, conditional on Σ,

(µ̂; π̂) = (B̄^T (Σ + I)^{−1} B̄)^{−1} B̄^T (Σ + I)^{−1} y,     (3.31)

where B̄ = [B ⋮ X] with X = [x_jl]_{n×p}. This GLS estimator can be incorporated into the MEV Backfitting algorithm established in Section 3.3.3. Only Step 1 and Step 2d need to be modified:

Step 1: Set (µ, π) initially to the OLS estimate, i.e. (3.31) with Σ = 0.

Step 2d: Re-estimate (µ; π) by (3.31) for the given (λ_f, λ_g, λ_h, θ_f, θ_g, θ_h).

Then, the whole set of parameters in (3.30) can be estimated iteratively by the modified backfitting algorithm. There is a word of caution about multicollinearity, since the orthogonality condition in Proposition 3.2 may not be satisfied between the covariate space span{x} and the Gaussian processes. In this case, instead of starting from the fully nonparametric approach, we may assume that the f, g, h components in (3.30) can be approximated by the kernel basis expansion (3.26) and then run the backfitting. Fortunately, the smoothing parameters λ_f, λ_g, λ_h may guard against possible multicollinearity and give a shrinkage estimate of the covariate effects π. As a trade-off, part of the Gaussian process components might be over-smoothed.
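In code, the only change relative to the earlier GLS sketch is augmenting the design matrix with the covariates, as in (3.31). A minimal illustration (with our own names and toy data) follows.

```python
import numpy as np

def gls_with_covariates(y, B, X, Sigma):
    """Eq. (3.31): joint GLS estimate of (mu; pi) with the augmented design [B X]."""
    B_bar = np.hstack([B, X])
    A = Sigma + np.eye(len(y))
    Ainv_Bbar = np.linalg.solve(A, B_bar)
    Ainv_y = np.linalg.solve(A, y)
    coef = np.linalg.solve(B_bar.T @ Ainv_Bbar, B_bar.T @ Ainv_y)
    d = B.shape[1]
    return coef[:d], coef[d:]            # (mu_hat, pi_hat)

# Toy usage: intercept-only mean function plus two covariates; Sigma = 0 gives the OLS of Step 1
rng = np.random.default_rng(3)
n = 50
B = np.ones((n, 1))
X = rng.standard_normal((n, 2))
y = 0.3 + X @ np.array([0.5, -0.2]) + 0.1 * rng.standard_normal(n)
mu_hat, pi_hat = gls_with_covariates(y, B, X, Sigma=np.zeros((n, n)))
print("mu:", np.round(mu_hat, 3), "pi:", np.round(pi_hat, 3))
```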
3.4.2 Multiple Segments
Suppose we are given K segments of data that share the same vintage diagram,

{y(m_kjl, t_kjl; v_kj), x(m_kjl, t_kjl; v_kj), z_k},     (3.32)

for k = 1, . . . , K, j = 1, . . . , J and l = 1, . . . , L_j, where x_kjl ∈ R^p and z_k ∈ R^q. Denote by N the total number of observations across the K segments. There are various approaches to modeling cross-segment responses, e.g.

a. η(y_kjl) = f(m_kjl) + g(t_kjl) + h(v_kj) + π^T x_kjl + ω^T z_k + ε
b. η(y_kjl) = f_k(m_kjl) + g_k(t_kjl) + h_k(v_kj) + π_k^T x_kjl + ε     (3.33)
c. η(y_kjl) = u_f^(k) f(m_kjl) + u_g^(k) g(t_kjl) + u_h^(k) h(v_kj) + π_k^T x_kjl + W(z_k) + ε.

Here, the first approach assumes that all segments share the same MEV components and covariate effect π, and differ only through ω^T z_k. The second approach takes the other extreme assumption that each segment behaves marginally differently in terms of f, g, h and covariate effects, and is therefore equivalent to K separate single-segment models. Clearly, both of these approaches can be treated by the single-segment modeling technique discussed in the last section.
Of interest here is the third approach, which allows multiple segments to have different sensitivities u_f^(k), u_g^(k), u_h^(k) ∈ R to the same underlying f(m), g(t), h(v). In other words, the functions f(m), g(t), h(v) represent the common factors across segments, while the coefficients u_f^(k), u_g^(k), u_h^(k) represent the idiosyncratic multipliers. Besides, it allows for the segment-specific add-on effects W(z_k) and segment-specific covariate effects π_k. Thus, model (c) is postulated as a balance between the over-stringent model (a) and the over-flexible model (b).
Different strategies can be applied to model the segment-specific effects W(z_k). Among others, we consider (c1) the linear modeling W(z_k) = ω^T z_k with parameter ω ∈ R^q, and (c2) the Gaussian process modeling

E W(z_k) = 0,   E W(z_k) W(z_k′) = λ_W K_W(z_k, z_k′)     (3.34)

for the smoothing parameter λ_W > 0 and some pre-specified family of covariance kernels K_W. For example, we may use the squared exponential covariance kernel in (3.12). The estimation of these two types of segment effects is detailed as follows.
Linear Segment Effects.

Given the multi-segment vintage data (3.32), consider

η(y_kjl) = u_f^(k) f(m_kjl) + u_g^(k) g(t_kjl) + u_h^(k) h(v_kj) + π_k^T x_kjl + ω^T z_k + ε     (3.35)

with the basis approximation

f(m) = µ_0 + Σ_{i=1}^I α_i K_f(m, m_i),   g(t) = Σ_{l=1}^L β_l K_g(t, t_l),   h(v) = Σ_{j=1}^J γ_j K_h(v, v_j)
through the exponential or Matérn kernels. (The smoothing spline kernels can also be used after appropriate modification of the mean function.) Using matrix notation, the systematic part of the model takes the form

u_f ∘ (Bµ + Σ̄_f α) + u_g ∘ (Σ̄_g β) + u_h ∘ (Σ̄_h γ) + Xπ + Zω,

where ∘ denotes the elementwise product, the underscored vectors and matrices are formed from the multi-segment observations, (u_f, u_g, u_h) are the extended vectors of segment-wise multipliers, X has K·p columns corresponding to the vectorized coefficients π = (π_1; . . . ; π_K), and Z is as usual. When both the kernel parameters (λ, θ) for K_f, K_g, K_h and the (condensed K-vector) multipliers u_f, u_g, u_h are fixed, regularized least squares can be used for parameter estimation, using the same penalty terms as in (3.27).
Iterative procedures can be used to estimate all the parameters simultaneously, including {µ, α, β, γ, λ, θ} for the underlying f, g, h functions and u_f, u_g, u_h, π, ω across segments. The following algorithm is constructed by using the MEV Backfitting algorithm as an intermediate routine.

Step 1: Set the initial values of (π, ω) to be the OLS estimates from fitting y = Xπ + Zω + ε based on all N observations across the K segments.

Step 2: Run the MEV Backfitting algorithm on the pseudo data y − Xπ̂ − Zω̂ to obtain the estimates of f̂(m), ĝ(t), ĥ(v).

Step 3: Update (u_f, u_g, u_h) and (π, ω) by the OLS estimates from fitting

y = Fu_f + Gu_g + Hu_h + Xπ + Zω + ε

where F, G, H are all N × K regression sub-matrices constructed from f̂(m_kjl), ĝ(t_kjl), ĥ(v_kj), respectively.

Step 4: Repeat Step 2 and Step 3 until convergence.
Note that in Step 2, the common factors f̂(m), ĝ(t) and ĥ(v) are estimated first by assuming u_k = E u_k = 1 for each of f, g, h. In Step 3, the segment-wise multipliers u_k are updated by regressing over f̂, ĝ and ĥ. One reason for proceeding in this way is the concern for stability. Disregarding the segment variability in Step 2 is also reasonable, since f(m), g(t), h(v) are by definition the common factors in the overall sense. Then, including the segment variability in Step 3 improves both the estimation of (π, ω) and the estimation of f, g, h in the next iteration of Step 2. Such a fitting procedure is practically interesting and computationally tractable; a minimal sketch of the Step 3 update is given below.
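As a minimal sketch of the Step 3 update (our own illustration, not the thesis code): given the fitted common factors evaluated at each observation, the segment-wise multipliers and the remaining linear effects are refreshed by a single ordinary least squares fit.

```python
import numpy as np

def update_multipliers(y, f_hat, g_hat, h_hat, X, Z, seg):
    """Step 3: OLS fit of y on segment-scaled common factors plus covariates.

    f_hat, g_hat, h_hat : common-factor fits evaluated at each observation (length N)
    seg                 : integer segment label k = 0..K-1 per observation
    """
    N, K = len(y), seg.max() + 1
    F, G, H = np.zeros((N, K)), np.zeros((N, K)), np.zeros((N, K))
    rows = np.arange(N)
    F[rows, seg], G[rows, seg], H[rows, seg] = f_hat, g_hat, h_hat   # N x K sub-matrices
    D = np.hstack([F, G, H, X, Z])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    u_f, u_g, u_h = coef[:K], coef[K:2 * K], coef[2 * K:3 * K]
    return u_f, u_g, u_h, coef[3 * K:]

# Toy usage with two segments
rng = np.random.default_rng(4)
N = 40
seg = np.repeat([0, 1], N // 2)
f_hat, g_hat, h_hat = rng.random(N), rng.random(N), rng.random(N)
X, Z = rng.standard_normal((N, 2)), np.eye(2)[seg]
y = (1.0 + 0.5 * seg) * f_hat + g_hat + h_hat + 0.05 * rng.standard_normal(N)
u_f, u_g, u_h, _ = update_multipliers(y, f_hat, g_hat, h_hat, X, Z, seg)
print("u_f per segment:", np.round(u_f, 2))
```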
Spatial Segment Heterogeneity.

We now consider modeling the segment effect by a Gaussian process,

η(y_kjl) = u_f^(k) f(m_kjl) + u_g^(k) g(t_kjl) + u_h^(k) h(v_kj) + W(z_k) + π_k^T x_kjl + ε     (3.36)

where f, g, h are approximated in the same way as in (3.35) and W(z_k) is assumed to follow (3.34). The orthogonality constraint between W(z) and {f(m), g(t), h(v)} can be justified from a tensor-product space point of view, but orthogonality is not guaranteed between W(z_k) and the span of x_kjl. Despite the orthogonality constraint, let us directly approximate W(z) by the basis expansion W(z) = Σ_{k=1}^K ω_k K_W(z, z_k), associated with the unknown scale parameter (denoted by θ_W). Then, we have the following two iterative stages of model estimation.
Stage 1: for given π_k and u_k ≡ 1, k = 1, . . . , K, estimate the common-factor functions f(m), g(t), h(v) and the spatial segment effect W(z_k) in

η(y_kjl) − π_k^T x_kjl = µ_0 + Σ_{i=1}^I α_i K_f(m_kjl, m_i) + Σ_{i=1}^L β_i K_g(t_kjl, t_i) + Σ_{i=1}^J γ_i K_h(v_kj, v_i) + Σ_{i=1}^K ω_i K_W(z_k, z_i) + ε

by the regularized least squares

min  ‖ȳ − Bµ − Σ̄_f α − Σ̄_g β − Σ̄_h γ − Σ̄_W ω‖² + λ_f ‖α‖²_{Σ_f} + λ_g ‖β‖²_{Σ_g} + λ_h ‖γ‖²_{Σ_h} + λ_W ‖ω‖²_{Σ_W},     (3.37)

where ȳ is the pseudo response vector of η(y_kjl) − π_k^T x_kjl for all (k, j, l). The regression sub-matrices B, Σ̄_f, Σ̄_g, Σ̄_h, Σ̄_W are constructed accordingly, all having N rows. The matrices used for regularization are of size I × I, L × L, J × J and K × K, respectively. Both the smoothing parameters λ_f, λ_g, λ_h, λ_W > 0 and the structural parameters θ_f, θ_g, θ_h, θ_W need to be determined.
Stage 2: for given f, g, h, estimate the parameters (u_f, u_g, u_h) and π based on

y = Fu_f + Gu_g + Hu_h + Xπ + w̄,   Cov[w̄] = σ² (λ_W Σ_W + I)

where Σ_W = [K_W(z_k, z_k′)]_{N×N} is obtained by evaluating every pair of observations either within or between segments. The parameter estimates can be obtained by generalized least squares.
Obviously, the regularized least squares in Stage 1 extends the MEV Backfitting algorithm to accommodate a fourth kernel component K_W. Let us call the extended algorithm MEVS Backfitting, where the added letter "S" stands for the segment effect. Iterating Stage 1 and Stage 2 until convergence, we obtain the following list of effect estimates (as a summary):

1. Common-factor maturation curve: f̂(m) = µ̂_0 + Σ_{i=1}^I α̂_i K_f(m, m_i)
2. Common-factor exogenous influence: ĝ(t) = Σ_{l=1}^L β̂_l K_g(t, t_l)
3. Common-factor vintage heterogeneity: ĥ(v) = Σ_{j=1}^J γ̂_j K_h(v, v_j)
4. Idiosyncratic multipliers to (f̂, ĝ, ĥ): u_f^(k), u_g^(k), u_h^(k) for k = 1, . . . , K
5. Segment-specific covariate effects: π_k, for k = 1, . . . , K
6. Spatial segment heterogeneity: Ŵ(z) = Σ_{k=1}^K ω̂_k K_W(z, z_k).
3.5 Applications in Credit Risk Modeling
This section presents applications of the MEV decomposition methodology to the real data of (a) Moody's speculative-grade corporate default rates shown in Table 1.1, and (b) the tweaked sample of retail loan loss rates discussed in Chapter II. Both examples demonstrate the rising challenges in the analysis of credit risk data with dual-time coordinates, to which we make only an initial attempt based on the MEV Backfitting algorithm. Our focus is on understanding (or reconstructing) the marginal effects first, and then performing data smoothing on the dual-time domain. To test the MEV decomposition modeling, we begin with a simulation study based on synthetic vintage data.
3.5.1 Simulation Study
Let us synthesize the year 2000 to 2008 monthly vintages according to the diagram shown in Figure 1.3. Assume each vintage is originated at the very beginning of the month, and its follow-up performance is observed at each month end. Let the horizontal observation window be left truncated at the beginning of 2005 (time 0) and right truncated at the end of 2008 (time 48). Vertically, all the vintages are observed up to 60 months-on-book. This results in the rectangular diagram shown in Figure 3.1, consisting of vintages j = −60, . . . , −1 originated before 2005 and j = 0, 1, . . . , 47 originated after 2005. For each vintage j with origination time v_j, the dual-time responses are simulated by

log(y_jl) = f_0(m_jl) + g_0(t_jl) + h_0(v_j) + ε,   ε ∼ N(0, σ²)     (3.38)

where t_jl runs from max{v_j, 0} to 48 and m_jl = t_jl − v_j. Disregarding the noise, the response y_jl is the product of three marginal effects, exp(f_0(m)) · exp(g_0(t)) · exp(h_0(v)). In our simulation setting, the underlying truth exp(f_0(m)) takes the smooth form of the inverse Gaussian hazard rate function (1.13) with parameters c = 6 and b = −0.02, the underlying exp(g_0(t)) is volatile with exponential growth and a jump at t = 23, and the underlying h_0(v) is assumed to be a dominating sine wave within [0, 40] and relatively flat elsewhere. See their plots in Figure 3.2 (top panel). Adding noise with σ = 0.1, we obtain the synthetic vintage data, whose projection views in lifetime, calendar and vintage origination time are shown in the second panel of Figure 3.2. One may check the pointwise averages (or medians in the vintage box-plots) in each projection view and compare them with the underlying truths in order to see the contamination effects by the other marginals.
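To make the setup concrete, here is a minimal Python sketch that generates synthetic vintage data following (3.38). The specific functional forms used for f_0, g_0, h_0 below are illustrative stand-ins of ours (the thesis uses an inverse Gaussian hazard shape for exp(f_0), exponential growth with a jump at t = 23 for exp(g_0), and a sine wave on [0, 40] for h_0), so the numbers will not reproduce Figure 3.2.

```python
import numpy as np

rng = np.random.default_rng(2009)

# Illustrative stand-ins for the true marginal effects (not the thesis forms).
f0 = lambda m: np.log(0.002 + 0.01 * (m / 24.0) * np.exp(-m / 24.0))   # smooth hump
g0 = lambda t: 0.01 * t + 0.3 * (t >= 23)                              # growth + jump
h0 = lambda v: np.where((v >= 0) & (v <= 40),
                        0.2 * np.sin(np.pi * v / 20.0), 0.0)           # sine wave

records = []   # (vintage j, month-on-book m, calendar time t, response y)
for j in range(-60, 48):                        # monthly vintages 2000-2008
    t_grid = np.arange(max(j, 0) + 1, 49)       # observed calendar months
    m_grid = t_grid - j                         # months-on-book
    keep = m_grid <= 60                         # lifetime truncation at 60
    m_grid, t_grid = m_grid[keep], t_grid[keep]
    eps = 0.1 * rng.standard_normal(len(m_grid))
    log_y = f0(m_grid) + g0(t_grid) + h0(np.full(len(m_grid), j)) + eps  # eq. (3.38)
    records += list(zip(np.full(len(m_grid), j), m_grid, t_grid, np.exp(log_y)))

print(len(records), "vintage-month observations; first record:", records[0])
```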
This synthetic data set can be used to demonstrate the flexibility of the MEV decomposition technique in that a combination of different kernels can be employed. We manually choose the order-3/2 Matérn kernel (3.13) for f(m), the AdaSS r.k. (3.14) for g(t), and the squared exponential kernel (3.12) for h(v), then run the MEV Backfitting algorithm on the log-transformed data. Since our data have very few observations at both ends of the vintage direction, we performed binning for v ≥ 40, as well as binning for v ≤ 0 based on the prior knowledge that pre-2005 vintages can be treated as a single bucket. The MEV Backfitting algorithm then outputs the estimated marginal effects plotted in Figure 3.2 (bottom panel). They match the underlying truth except for slight boundary bias.

The GCV-selected structural and smoothing parameters are tabulated in Table 3.1. In fitting g(t) with AdaSS, we have specified the knots τ_k to handle the jump, by diagnosis from the raw marginal plots; for simplicity we have enforced φ_k to be the same for regions without a jump. It is interesting to check the ordering of the cross-validated smoothing parameters. By (3.18), the smoothing parameters correspond to the reciprocal ratios of the variance scales to the nugget effect σ²; therefore, the smaller the smoothing parameter, the larger the variation of the corresponding Gaussian process. Thus, the ordering of the cross-validated λ_f > λ_h > λ_g conforms with the smoothness assumption for f_0, g_0, h_0 in our simulation setting.
Combining the three marginal estimates and taking the inverse transform gives the fitted rates; see Figure 3.3 for the projection views. The noise-removed reconstruction can reveal the prominent marginal features. It is clear that the MEV decomposition modeling performs like a smoother on the dual-time domain. One may also compare the level plots of the raw data versus the reconstructed data in Figure 3.3 to see the dual-time smoothing effect on the vintage diagram.

Figure 3.2: Synthetic vintage data analysis: (top) underlying true marginal effects; (2nd) simulation with noise; (3rd) MEV Backfitting algorithm upon convergence; (bottom) estimation compared to the underlying truth.

Table 3.1: Synthetic vintage data analysis: MEV modeling exercise with GCV-selected structural and smoothing parameters.

MEV    Kernel choice            Structural parameter                           Smoothing parameter
f(m)   Matérn (κ = 3/2)         φ = 2.872                                      20.086
g(t)   AdaSS r.k. Eq. (3.14)    τ_k = 21, 25, 47; φ_k = 0.288, 8.650, 0.288    1.182
h(v)   Squared exponential      φ = 3.000                                      7.389

Figure 3.3: Synthetic vintage data analysis: (top) projection views of fitted values; (bottom) data smoothing on vintage diagram.
3.5.2 Corporate Default Rates
In Figure 1.4 of Chapter I we previewed the annual corporate default rates (in percentage) of Moody's speculative-grade cohorts, 1970–2008. Each yearly cohort is regarded as a vintage. For vintages originated earlier than 1988, the default rates are observed up to a maximum of 20 years in lifetime; for vintages originated after 1988, the default rates are observed up to the end of year 2008. Such truncation in both lifetime and calendar time corresponds to the trapezoidal prototype of vintage diagram illustrated in Figure 3.1.

Figure 1.4 gives the projection views of the empirical default rates in lifetime, calendar and vintage origination time. By checking the marginal averages or medians plotted on top of each view, it is evident that the exogenous influence of the macroeconomic environment fluctuates more than both the maturation and vintage effects. However, such marginal plots are contaminated by each other; for example, many of the spikes shown in the lifetime projection actually correspond to exogenous effects. It is our purpose to separate these marginal effects.

We performed slight data preprocessing. First, the transform function is pre-specified to be η(y) = log(y/100), where y is the raw percentage measurement and zero responses are treated as missing values (one may also match them to a small positive value). Second, since there are very few observations at small calendar times, the exogenous effects g(t) for t = 2, 3 are merged into t = 4, and the leftmost observation at t = 1 is removed because of its jumpy behavior. Besides, the vintage effects are merged for large v = 30, . . . , 39.
Figure 3.4: Results of the MEV Backfitting algorithm upon convergence (right panel) based on the squared exponential kernels. Shown in the left panel is the GCV selection of smoothing and structural parameters.

Figure 3.5: MEV fitted values: Moody's-rated corporate default rates.
To run the MEV Backfitting algorithm, K_f, K_g, K_h are all chosen to be squared exponential covariance kernels, for which the mean function is fixed to be the intercept effect. For each univariate Gaussian process fitting, say Z_f(m), the GCV criterion selects the smoothing parameter λ_f as well as the structural parameter θ_f (i.e. the scale parameter φ in (3.12)). For demonstration, we used a grid search on the log scale of λ with only three different choices of φ = 1, 2, 3 (φ = 3 is set to be the largest scale for stability, since otherwise the reciprocal condition number of the ridged regression matrix would vanish). The backfitting results upon convergence are shown in Figure 3.4, where GCV selects the largest scale parameter in each iteration. An interesting result is the GCV-selected smoothing parameter λ, which turns out to be 9.025, 0.368 and 6.050 for Z_f, Z_g, Z_h, respectively. By (3.18), this implies that the exogenous influence has the largest variation, followed by the vintage heterogeneity, while the maturation curve has the least variation. Furthermore, the changes in the estimated exogenous effects in Figure 3.4 follow the NBER recession dates shown in Figure 1.1.

The inverse transform η^{−1}(·) maps the fitted values back to the original percentage scale. Figure 3.5 plots the fitted values in both lifetime and calendar views. Compared to the raw plots in Figure 1.4, the MEV modeling retains most of the interesting features, while smoothing out others.
3.5.3 Retail Loan Loss Rates
Recall the tweaked credit-risk sample discussed in Chapter II. It was used as a motivating example for the development of the adaptive smoothing spline in fitting the pointwise averages that are supposed to be smooth in lifetime (month-on-book, in this case). However, the raw responses behave rather abnormally at large month-on-book, partly because of the small number of replicates, and partly because of the contamination by the exogenous macroeconomic influence. In this section, the same tweaked sample is analyzed on the dual-time domain, for which we model not only the maturation curve, but also the exogenous influence and the vintage heterogeneity effects.

The tweaked sample of dual-time observations is supported on the triangular vintage diagram illustrated in Figure 3.1, which is most typical in retail risk management. The observations measure the loss rates in retail revolving exposures from January 2000 to December 2006. Figure 3.6 (top panel) shows the projection views in lifetime, calendar and vintage origination time, where the new vintages originated in 2006 are excluded since they have too short a window of performance measurements. Some of the interesting features of such vintage data are listed below:

1. The rates are fixed to be zero for small month-on-book m ≤ 4.
2. There are fewer and fewer observations as the month-on-book m increases.
3. There are fewer and fewer observations as the calendar month t decreases.
4. The rates are relatively smooth in m, from the lifetime view.
5. The rates are rather dynamic and volatile in t, from the calendar view.
6. The rates show heterogeneity across vintages, from the vintage view.
To model the loss rates, timescale binning was performed first for large month-on-book m ≥ 72 and small calendar time t ≤ 12. We pre-specified the log transform for the positive responses (upon removal of zero responses). The MEV decomposition model takes the form log(y_jl) = f(m_jl) + g(t_jl) + h(v_j) + ε, subject to Eg = Eh = 0. In this way, the original responses are decomposed multiplicatively into e^{f(m)}, e^{g(t)} and e^{h(v)}, where the exponentiated maturation curve can be regarded as the baseline, together with the exponentiated exogenous and vintage heterogeneity multipliers.

Figure 3.6: Vintage data analysis of retail loan loss rates: (top) projection views of empirical loss rates in lifetime m, calendar t and vintage origination time v; (bottom) MEV decomposition effects f̂(m), ĝ(t) and ĥ(v) (on the log scale).

In this MEV modeling exercise, the Matérn kernels were chosen to run the MEV Backfitting algorithm, and the GCV criterion was used to select the smoothing and structural parameters. The fitting results are presented in Figure 3.6 (bottom panel), which shows the estimated f̂(m), ĝ(t), ĥ(v) on the log response scale. Each of these estimated functions follows the general dynamic pattern of the corresponding marginal
averages in the empirical plots above. The differences in local regions reflect the
improvements in each marginal direction after removing the contamination of other
marginals. Compared to both the exogenous and vintage heterogeneity effects, the
maturation curve is rather smooth. It is also worth pointing out that the exogenous spike could be captured relatively well by the Matérn kernel.
3.6 Discussion
For the vintage data that emerge in credit risk management, we propose an MEV
decomposition framework to estimate the maturation curve, exogenous influence and
vintage heterogeneity on the dual-time domain. One difficulty associated with the
vintage data is the intrinsic identification problem due to their hyperplane nature,
which also appears in the conventional cohort analysis under the age-period-cohort
model. To regularize the three-way marginal effects, we have studied nonparametric
smoothing based on Gaussian processes, and demonstrated its flexibility through
choosing covariance kernels, structural and smoothing parameters. An efficient MEV
Backfitting algorithm is provided for model estimation. The new technique is then
tested through a simulation study and applied to analyze both examples of corporate
default rates and retail loss rates, for which some preliminary results are presented.
Beyond our initial attempt made in this chapter, there are other open problems
associated with the vintage data analysis (VDA) worthy of future investigation.
1. Dual-time model assessment and validation remains an issue. In the MEV Backfitting algorithm we have used leave-one-out cross-validation for fitting each marginal component function. This, however, does not directly perform cross-validation on the dual-time domain, which would be an interesting subject of future study.
2. Model forecasting is often a practical need in financial risk management, as it concerns future uncertainty. To make predictions within the MEV modeling framework, one needs to extrapolate f(m), g(t), h(v) in each marginal direction in order to know their behaviors beyond the training bounds of m, t, v. By Kriging theory, for any x = (m, t; v) on the vintage diagram, the formula (3.22) provides the conditional expectation given the historical performance. Furthermore, it is straightforward to simulate random paths for a sequence of inputs (x_1, . . . , x_p) based on the multivariate conditional normal distribution N_p(µ^(c), Σ^(c)), where the conditional mean and covariance are given by

µ^(c) = Bµ̂ + Υ (Σ + I)^{−1} (y − Bµ̂),   Σ^(c) = Σ_p − Υ (Σ + I)^{−1} Υ^T,
with Υ = [K(x_i, x_jl)]_{p×n},   Σ_p = [K(x_i, x_j)]_{p×p},     (3.39)

using the aggregated kernel given by (3.18) and the notations in (3.19); here B is evaluated at the new inputs and Σ_p denotes the p × p covariance among them. (A minimal sampling sketch is given after this list.) However, there is a word of caution about predicting the future from the historical training data. In the MEV modeling framework, extrapolating the maturation curve f(m) in lifetime is relatively safer than extrapolating the exogenous and vintage effects g(t) and h(v), since f(m) is of a self-maturation nature, while g(t) can be interfered with by future macroeconomic changes and h(v) is also subject to future changes of vintage origination policy. Before making forecasts in terms of calendar time, the out-of-sample test is recommended.
3. For extrapolation of the MEV component functions, a parametric modeling approach is sometimes more tractable than using Gaussian processes. For example, given the prior knowledge that the maturation curve f(m) follows the shape of the inverse Gaussian hazard rate (1.13), as in our synthetic case in Section 3.5, one may estimate a parametric f(m) with nonlinear regression techniques. As for the exogenous effect g(t), which is usually dynamic and volatile, one may use parametric time series models; see e.g. Fan and Yao (2003). Such parametric approaches would benefit loss forecasting as they become more and more validated through experience.
4. The MEV decomposition we have considered so far is simplified to exclude interaction effects involving two or all of the age m, time t and vintage v arguments. In formulating the models by Gaussian processes, mutual independence among Z_f(m), Z_g(t), Z_h(v) is also postulated. One idea for taking the interaction effects into account is to cross-correlate Z_f(m), Z_g(t), Z_h(v). Consider the combined Gaussian process (3.17) with covariance

Cov[Z_f(m) + Z_g(t) + Z_h(v), Z_f(m′) + Z_g(t′) + Z_h(v′)]
  = E Z_f(m)Z_f(m′) + E Z_f(m)Z_g(t′) + E Z_f(m)Z_h(v′)
  + E Z_g(t)Z_f(m′) + E Z_g(t)Z_g(t′) + E Z_g(t)Z_h(v′)
  + E Z_h(v)Z_f(m′) + E Z_h(v)Z_g(t′) + E Z_h(v)Z_h(v′),

which reduces to (3.18) when the cross-correlations between Z_f(m), Z_g(t) and Z_h(v) are all zero. In general, it is straightforward to incorporate a cross-correlation structure and follow the MEV Kriging procedure immediately. For example, one of our works in progress is to assume that

E Z_g(t)Z_h(v) = (σ²/√(λ_g λ_h)) K_gh(t, v),   where K_gh(t, v) = ρ I(t ≥ v), ρ ≥ 0,

in order to study the interaction effect between the vintage heterogeneity and the exogenous influence. Results will be presented in a future report.
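Below is a minimal Python sketch of sampling conditional paths from N_p(µ^(c), Σ^(c)) as in (3.39), as referred to above. All names and the toy matrices are ours; in practice the cross-covariances Υ and Σ_p would be built from the aggregated kernel (3.18).

```python
import numpy as np

def conditional_paths(y, B, mu_hat, Sigma, B_new, Upsilon, Sigma_new,
                      n_paths=3, seed=0):
    """Draw random paths from N(mu_c, Sigma_c) with mu_c, Sigma_c as in (3.39).

    Upsilon   : p x n cross-covariances K(x_i, x_jl) between new and training points
    Sigma     : n x n training covariance; Sigma_new : p x p covariance at new points
    """
    A = Sigma + np.eye(len(y))
    mu_c = B_new @ mu_hat + Upsilon @ np.linalg.solve(A, y - B @ mu_hat)
    Sigma_c = Sigma_new - Upsilon @ np.linalg.solve(A, Upsilon.T)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu_c, Sigma_c, size=n_paths)

# Toy usage with made-up matrices of compatible shapes (n = 6 training, p = 4 new)
rng = np.random.default_rng(5)
n, p, d = 6, 4, 2
Sigma, Sigma_new = 0.5 * np.eye(n), 0.5 * np.eye(p)
Upsilon = 0.1 * np.ones((p, n))
B, B_new = np.ones((n, d)), np.ones((p, d))
y, mu_hat = rng.standard_normal(n), np.zeros(d)
print(np.round(conditional_paths(y, B, mu_hat, Sigma, B_new, Upsilon, Sigma_new), 2))
```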
Besides the open problems listed above, the scope of VDA can also be generalized to study non-continuous types of observations. For the examples we considered in Section 3.5, the Moody's default rates were actually calculated from raw binary indicators (of default or not), while the loss rates in the synthetic retail data can be viewed as the ratio of loss units over total units. Thus, if we are given the raw binary observations or unit counts, one may consider generalized MEV models upon the specification of a link function, as an analogy to generalized linear models (McCullagh and Nelder, 1989). The transform η in (3.2) plays a similar role to such a link function, except that it operates on rates that are assumed to be directly observed.

One more extension of VDA is to analyze the time-to-default data on the vintage
diagram, which is deferred to the next chapter where we will develop the dual-time
survival analysis.
3.7 Technical Proofs
Proof of Lemma 3.1: Since m − t + v = 0 on the vintage diagram Ω, at least one of the linear parts of f_0, g_0, h_0 is not identifiable, since otherwise f_0(m) − m, g_0(t) + t, h_0(v) − v would always satisfy (3.2).
Proof of Proposition 3.2: By applying the general spline theorem to (3.23), the target function has a finite-dimensional representation ζ(x) = µ_0 + µ_1 m + µ_2 t + c^T ξ_n, where c^T ξ_n = Σ_{j=1}^J Σ_{l=1}^{L_j} c_jl K(x, x_jl). Substituting it back into (3.23) and using the reproducing property ⟨K(·, x_jl), K(·, x_j′l′)⟩_{H_1} = K(x_jl, x_j′l′), we get

min  ‖y − Bµ − Σc‖² + c^T Σ c

using the notations in (3.19). Taking the partial derivatives with respect to µ and c and setting them to zero gives the solution

µ̂ = (B^T (Σ + I)^{−1} B)^{−1} B^T (Σ + I)^{−1} y
ĉ = (Σ + I)^{−1} (y − Bµ̂),

which leads ζ(x) to have the same form as the Kriging predictor (3.22).
Proof of Proposition 3.3: By (3.18) and comparing (3.24) with (3.26), it is not hard to derive the relationships between the two sets of kernel coefficients,

α_i = λ_f^{−1} Σ_{j,l} c_jl I(m_jl = m_i),   β_l = λ_g^{−1} Σ_{j,l} c_jl I(t_jl = t_l),   γ_j = λ_h^{−1} Σ_{l=1}^{L_j} c_jl,     (3.40)

which aggregate the c_jl over the replicated kernel bases for every m_i, t_l and v_j, respectively. Then, using the new coefficient vectors α = (α_1, . . . , α_I)^T, β = (β_1, . . . , β_L)^T and γ = (γ_1, . . . , γ_J)^T, we have that

Σc = Σ̄_f α + Σ̄_g β + Σ̄_h γ,
c^T Σ c = λ_f α^T Σ_f α + λ_g β^T Σ_g β + λ_h γ^T Σ_h γ.

Plugging these into (3.25) leads to (3.27), as asserted.
CHAPTER IV
Dual-time Survival Analysis
4.1 Introduction
Consider in this chapter the default events naturally observed on the dual-time
domain, with lifetime (or age) in one dimension and calendar time in the other. Our interest is in the time-to-default from multiple origins, such that events happening at the same calendar time may differ in age. Age could be the lifetime of a person in the common sense, or the elapsed time since initiation of a treatment. In credit risk modeling, age refers to the duration of a credit account since its origination (either a corporate bond or a retail loan). Accounts having the same origin share the same age at any follow-up calendar time, and group together as a single vintage.
Refer to Chapter I for the basics of survival analysis in the lifetime scale, where the notion of hazard rate is also referred to as the default intensity in the credit risk context. As introduced in Chapter 1.4, dual-time observations of survival data subject to truncation and censoring can be graphically represented by the Lexis diagram (Keiding, 1990). To begin with, let us simulate a Lexis diagram of dual-time-to-default observations:

(V_j, U_ji, M_ji, Δ_ji),   i = 1, . . . , n_j,   j = 1, . . . , J     (4.1)
where V_j denotes the origination time of the j-th vintage, which consists of n_j accounts. For each account, U_ji denotes the time of left truncation, M_ji denotes the lifetime at event termination, and Δ_ji indicates the status of termination such that Δ_ji = 1 if the default event occurs prior to censoring and 0 otherwise. Let us fix the rectangular vintage diagram with age m ∈ [0, 60], calendar time t ∈ [0, 48] and vintage origination V_j = −60, . . . , −1, 0, 1, . . . , 47, the same as for the synthetic vintage data in Chapter 3.5. Thus, the observations are left truncated at time 0. For the censoring mechanism, besides the systematic censoring with m_max = 60 and t_max = 48, an independent random censoring with constant hazard rate λ_att is used to represent account attrition (i.e., a memoryless exponential lifetime distribution).
Denote by λ_v(m, t) or λ_j(m, t) the underlying dual-time hazard rate of vintage v = V_j at lifetime m and calendar time t. Then, one may simulate the dual-time-to-default data (4.1) according to:

τ_ji = inf{ m : ∫_0^m λ_j(s, V_j + s) ds ≥ −log X },  X ∼ Unif(0, 1]     (default)
C_ji = inf{ m : λ_att m ≥ −log X },  X ∼ Unif(0, 1]     (attrition)
U_ji = max{Ū_ji, V_j} − V_j,  with Ū_ji ≡ 0     (left truncation)
Book (V_j, U_ji, M_ji, Δ_ji) if U_ji < min{τ_ji, C_ji}:     (4.2)
M_ji = min{τ_ji, C_ji, m_max, t_max − V_j}     (termination time)
Δ_ji = I(τ_ji ≤ min{C_ji, m_max, t_max − V_j})     (default indicator)

for each given V_j = −60, . . . , −1, 0, 1, . . . , 47. For each vintage j, one may book n_j origination accounts by repeatedly running (4.2) n_j times independently. Let us begin with n_j ≡ 1000 and specify the attrition rate λ_att = 50bp (a basis point is 0.01%). Among other types of dual-time default intensities, assume the multiplicative form

λ_v(m, t) = λ_f(m) λ_g(t) λ_h(v) = exp{ f(m) + g(t) + h(v) }     (4.3)

where the maturation curve f(m), the exogenous influence g(t) and the unobserved heterogeneity h(v) are interpreted in the sense of Chapter III and are assumed to follow Figure 3.2 (top panel). For simulating the left-truncated survival accounts, we assume zero exogenous influence on their historical hazards.

Figure 4.1: Dual-time-to-default: (left) Lexis diagram of sub-sampled simulations; (right) empirical hazard rates in either lifetime or calendar time.
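A minimal Python sketch of the booking scheme (4.2) is given below. The intensity used here is a placeholder multiplicative form of ours (not the f, g, h of Figure 3.2), the default time is obtained by inverting the cumulative hazard on a fine grid, and one account per vintage is drawn for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)
M_MAX, T_MAX, LAM_ATT = 60.0, 48.0, 0.005     # systematic censoring and 50bp attrition

def hazard(m, t, v):
    """Placeholder multiplicative intensity lambda_v(m, t) = exp{f(m) + g(t) + h(v)}."""
    f = -4.5 + 0.5 * np.log1p(m)                       # smooth maturation (illustrative)
    g = 0.01 * t                                       # mild exogenous drift (illustrative)
    h = 0.2 * np.sin(np.pi * np.clip(v, 0.0, 40.0) / 20.0)
    return np.exp(f + g + h)

def simulate_account(v, dm=0.1):
    """One account of vintage v, booked according to scheme (4.2)."""
    grid = np.arange(dm, M_MAX + dm, dm)
    cum = np.cumsum(hazard(grid, v + grid, v)) * dm    # integral of lambda_j(s, v + s)
    e = -np.log(rng.uniform())
    tau = grid[np.searchsorted(cum, e)] if cum[-1] > e else np.inf   # default time
    c_att = rng.exponential(1.0 / LAM_ATT)             # attrition, constant hazard
    u = max(0.0, v) - v                                # left truncation at calendar time 0
    if u >= min(tau, c_att):
        return None                                    # terminated before observation: not booked
    m_end = min(tau, c_att, M_MAX, T_MAX - v)
    delta = int(tau <= min(c_att, M_MAX, T_MAX - v))
    return (v, u, m_end, delta)

accounts = [a for v in range(-60, 48) for a in [simulate_account(float(v))] if a]
print(len(accounts), "booked accounts; first record (V, U, M, Delta):", accounts[0])
```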
Figure 4.1 (left) is an illustration of the Lexis diagram for one random simulation
per vintage, where each asterisk at the end of a line segment denotes a default event
and the circle denotes the censorship. To see default behavior in dual-time coordi-
nates, we calculated the empirical hazard rates by (4.14) in lifetime and (4.15) in
calendar time, respectively, and the results are plotted in Figure 4.1 (right). Each
marginal view of the hazard rates shows clearly the pattern matching our underlying
assumption, namely
1. The endogenous (or maturation) performance in lifetime is relatively smooth;
2. The exogenous performance in calendar time is relatively dynamic and volatile.
This is typically the case in credit risk modeling, where the default risk is greatly
affected by macroeconomic conditions. It is our purpose to develop the dual-time
survival analysis in not only lifetime, but also simultaneously in calendar time.
Despite that the Lexis diagram has appeared since Lexis (1875), statistical litera-
ture of survival analysis has dominantly focused on the lifetime only rather than the
dual-time scales; see e.g. Andersen, et al. (1993) with nine to one chapter coverage.
In another word, lifetime has been taken for granted as the basic timescale of sur-
vival, with very few exceptions that take either calendar time as the primary scale
(e.g. Arjas (1986)) or both time scales (see the survey by Keiding (1990)). There
is also discussion on selection of the appropriate timescale in the context of multiple
timescale reduction; see e.g. Farewell and Cox (1979) about a rotation technique
based on a naive Cox proportional hazards (CoxPH) model involving only the lin-
earized time-covariate effect. Such timescale dimension reduction approach can be
also referred to Oakes (1995) and Duchesne and Lawless (2000), given the multiple
dimensions that may include usage scales like mileage of a vehicle.
The literature of regression models based on dual-time-to-default data (a.k.a. sur-
vival data with staggered entry) has also concentrated on the use of lifetime as the
basic timescale together with time-dependent covariates. Mostly used is the CoxPH
regression model with an arbitrary baseline hazard function in lifetime. Some weak
convergence results can be referred to Sellke and Siegmund (1983) by martingale ap-
proximation and Billias, Gu and Ying (1997) by empirical process theory; see also the
related works of Slud (1984) and Gu and Lai (1991) on two-sample tests. However,
such one-way baseline model specification is unable to capture the baseline hazard
in calendar time. Then, it comes to Efron (2002) with symmetric treatment of dual
timescales under a two-way proportional hazards model
λ
ji
(m, t) = λ
f
(m)λ
g
(t) exp¦θ
T
z
ji
(m)¦, i = 1, . . . , n
j
, j = 1, . . . , J. (4.4)
91
Note that Efron (2002) considered only the special case of (4.4) with cubic polynomial expansions for log λ_f and log λ_g, as well as time-independent covariates. In general, we may allow arbitrary (non-negative) λ_f(m) and λ_g(t), and hence we name (4.4) a two-way Cox regression model.
In credit risk, there seems to be no formal literature on dual-time-to-default risk
modeling. It is our objective of this chapter to develop the dual-time survival analysis
(DtSA) for credit risk modeling, including the methods of nonparametric estimation,
structural parameterization, intensity-based semiparametric regression, and frailty
specification accounting for vintage heterogeneity effects. The aforementioned dual-
time Cox model (4.4) will be one of such developments, whereas the Efron’s approach
with polynomial baselines will fail to capture the rather different behaviors of endoge-
nous and exogenous hazards illustrated in Figure 4.1.
This chapter is organized as follows. In Section 4.2 we start with the generic form
of likelihood function based on the dual-time-to-default data (4.1), then consider the
one-way, two-way and three-way nonparametric estimators on Lexis diagram. The
simulation data described above will be used to assess the performance of nonparamet-
ric estimators. In Section 4.3 we discuss the dual-time feasibility of structural models
based on the first-passage-time triggering system, which is defined by an endogenous
distance-to-default process associated with the exogenous time-transformation. Sec-
tion 4.4 is devoted to the development of dual-time semiparametric Cox regression
with both endogenous and exogenous baseline hazards and covariate effects, for which
the method of partial likelihood estimation plays a key role. Also discussed is the
frailty type of vintage heterogeneity by a random-effect formulation. In Section 4.5,
we demonstrate the applications of both nonparametric and semiparametric dual-
time survival analysis to credit card and mortgage risk modeling, based on our real
data-analytic experiences in retail banking. We conclude the chapter in Section 4.6
and provide some supplementary materials at the end.
4.2 Nonparametric Methods
Denote by λ_ji(m) ≡ λ_ji(m, t), with t ≡ V_j + m, the hazard rates of the lifetime-to-default τ_ji on the Lexis diagram. Given the dual-time-to-default data (4.1) with independent left-truncation and right-censoring, consider the joint likelihood under either the continuous or the discrete setting:
∏_{j=1}^J ∏_{i=1}^{n_j} [p_ji(M_ji)/S_ji(U_ji^+)]^{Δ_ji} [S_ji(M_ji^+)/S_ji(U_ji^+)]^{1−Δ_ji}     (4.5)

where p_ji(m) denotes the density and S_ji(m) the survival function of τ_ji. On the continuous domain, m^+ = m and S_ji(m) = e^{−Λ_ji(m)} with the cumulative hazard Λ_ji(m) = ∫_0^m λ_ji(s) ds, so the log-likelihood function has the generic form

ℓ = Σ_{j=1}^J Σ_{i=1}^{n_j} { Δ_ji log λ_ji(M_ji) + log S_ji(M_ji) − log S_ji(U_ji) }
  = Σ_{j=1}^J Σ_{i=1}^{n_j} { Δ_ji log λ_ji(M_ji) − ∫_{U_ji}^{M_ji} λ_ji(m) dm }.     (4.6)
On the discrete domain of m, S_ji(m) = ∏_{s<m} [1 − λ_ji(s)] and the likelihood function (4.5) can be rewritten as

∏_{j=1}^J ∏_{i=1}^{n_j} [λ_ji(M_ji)]^{Δ_ji} [1 − λ_ji(M_ji)]^{1−Δ_ji} ∏_{m∈(U_ji, M_ji)} [1 − λ_ji(m)]     (4.7)

and the log-likelihood function is given by

ℓ = Σ_{j=1}^J Σ_{i=1}^{n_j} { Δ_ji log[λ_ji(M_ji)/(1 − λ_ji(M_ji))] + Σ_{m∈(U_ji, M_ji]} log[1 − λ_ji(m)] }.     (4.8)

In what follows we discuss one-way, two-way and three-way nonparametric approaches to the maximum likelihood estimation (MLE) of the underlying hazard rates.
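As a small numerical illustration of the discrete-time log-likelihood (4.8), the Python sketch below evaluates ℓ for given pointwise hazards; the data structure and names are ours.

```python
import numpy as np

def discrete_loglik(lam, records):
    """Evaluate the discrete-time log-likelihood (4.8).

    lam     : dict mapping integer month-on-book m -> hazard lambda(m) (shared here)
    records : iterable of (U, M, Delta) with integer truncation/termination months
    """
    ll = 0.0
    for U, M, Delta in records:
        if Delta:                                   # default observed at M
            ll += np.log(lam[M] / (1.0 - lam[M]))
        for m in range(U + 1, M + 1):               # survival over the interval (U, M]
            ll += np.log(1.0 - lam[m])
    return ll

# Toy usage: a flat 2% monthly hazard over 12 months and three accounts
lam = {m: 0.02 for m in range(1, 13)}
records = [(0, 12, 0), (0, 5, 1), (3, 9, 0)]        # (U, M, Delta)
print(round(discrete_loglik(lam, records), 4))
```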
4.2.1 Empirical Hazards
Prior to discussing one-way nonparametric estimation, let us review the classical approach to the empirical hazards vintage by vintage. For each fixed V_j, let λ_j(m) be the underlying hazard rate in lifetime. Given the data (4.1), define

number of defaults:   nevent_j(m) = Σ_{i=1}^{n_j} I(M_ji = m, Δ_ji = 1)
number at risk:       nrisk_j(m) = Σ_{i=1}^{n_j} I(U_ji < m ≤ M_ji)     (4.9)
for a finite number of lifetime points m with nevent_j(m) ≥ 1. Then, the j-th vintage contribution to the log-likelihood (4.8) can be rewritten as

ℓ_j = Σ_m { nevent_j(m) log λ_j(m) + [nrisk_j(m) − nevent_j(m)] log[1 − λ_j(m)] }     (4.10)

on the discrete domain of m. It is easy to show that the MLE of λ_j(m) is given by the following empirical hazard rates:

λ̂_j(m) = nevent_j(m) / nrisk_j(m)   if nevent_j(m) ≥ 1, and NaN otherwise.     (4.11)

They can be used to express the well-known nonparametric estimators of the cumulative hazard and the survival function,

(Nelson-Aalen estimator)   Λ̂_j(m) = Σ_{m_i ≤ m} λ̂_j(m_i);     (4.12)
(Kaplan-Meier estimator)   Ŝ_j(m) = ∏_{m_i ≤ m} [1 − λ̂_j(m_i)]     (4.13)

for any m ≥ 0, under either the continuous or the discrete setting. Clearly, the Kaplan-Meier and Nelson-Aalen estimators are related empirically by Ŝ_j(m) = ∏_{m_i ≤ m} [1 − ΔΛ̂(m_i)], where the empirical hazard rates λ̂(m_i) ≡ ΔΛ̂(m_i) are also called the Nelson-Aalen increments.
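The following minimal Python sketch computes (4.9)–(4.13) for a single vintage; the toy data and function name are ours.

```python
import numpy as np

def empirical_hazard(U, M, Delta):
    """Eqs. (4.9)-(4.13): empirical hazard, Nelson-Aalen and Kaplan-Meier for one vintage."""
    U, M, Delta = map(np.asarray, (U, M, Delta))
    out, na, km = [], 0.0, 1.0
    for m in np.unique(M[Delta == 1]):              # lifetimes with at least one default
        nevent = np.sum((M == m) & (Delta == 1))    # (4.9) number of defaults at m
        nrisk = np.sum((U < m) & (m <= M))          # (4.9) number at risk at m
        lam = nevent / nrisk                        # (4.11) empirical hazard
        na += lam                                   # (4.12) Nelson-Aalen
        km *= 1.0 - lam                             # (4.13) Kaplan-Meier
        out.append((m, lam, na, km))
    return out

# Toy usage: five accounts of one vintage, as (left truncation U, termination M, default flag)
U = [0, 0, 0, 2, 2]
M = [3, 6, 6, 5, 6]
Delta = [1, 0, 1, 1, 0]
for m, lam, na, km in empirical_hazard(U, M, Delta):
    print(f"m={m}  hazard={lam:.3f}  Nelson-Aalen={na:.3f}  Kaplan-Meier={km:.3f}")
```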
Now we are ready to present the one-way nonparametric estimation based on the complete dual-time-to-default data (4.1). In the lifetime dimension (vertical on the Lexis diagram), assume λ_ji(m, t) = λ(m) for all (j, i); then we may immediately extend (4.11) by simple aggregation to obtain the one-way estimator in lifetime:

λ̂(m) = Σ_{j=1}^J nevent_j(m) / Σ_{j=1}^J nrisk_j(m),     (4.14)

whenever the numerator is positive. Alternatively, in the calendar dimension (horizontal on the Lexis diagram), assume λ_ji(m, t) = λ(t) for all (j, i); then we obtain the one-way nonparametric estimator in calendar time:

λ̂(t) = Σ_{j=1}^J nevent_j(t − V_j) / Σ_{j=1}^J nrisk_j(t − V_j),     (4.15)

whenever the numerator is positive. Accordingly, one may obtain the empirical cumulative hazard and survival function by the Nelson-Aalen estimator (4.12) and the Kaplan-Meier estimator (4.13) in each marginal dimension.

For example, consider the simulation data introduced in Section 4.1. Both one-way nonparametric estimates are plotted in the right panel of Figure 4.1 (upon linear interpolation), and they capture roughly the endogenous and exogenous hazards. However, they may be biased in other cases, as we shall discuss next.
4.2.2 DtBreslow Estimator
The two-way nonparametric approach is to estimate the lifetime hazards λ_f(m) and the calendar hazards λ_g(t) simultaneously under the multiplicative model

λ_ji(m, t) = λ_f(m) λ_g(t) = exp{f(m) + g(t)},   for all (j, i).     (4.16)

For identifiability, assume that E log λ_g = Eg = 0.
Following the one-way empirical hazards (4.11) above, we may restrict attention to the pointwise parametrization

f̂(m) = Σ_{l=1}^{L_m} α_l I(m = m_[l]),   ĝ(t) = Σ_{l=1}^{L_t} β_l I(t = t_[l])     (4.17)

based on the L_m distinct lifetimes m_[1] < ··· < m_[L_m] and the L_t distinct calendar times t_[1] < ··· < t_[L_t] such that there is at least one default event observed at the corresponding time. One may find the MLE of the coefficients α, β in (4.17) by maximizing (4.6) or (4.8). Rather than crude optimization, we give an iterative DtBreslow estimator based on Breslow (1972), treating (4.16) as the product of a baseline hazard function and a relative-risk multiplier.
function and a relative-risk multiplier.
Given the dual-time-to-default data (4.1) and any fixed λ
g
=
´
λ
g
, the Breslow
estimator of λ
f
(m) is given by
´
λ
f
(m) ←

J
j=1
nevent
j
(m)

J
j=1
´
λ
g
(V
j
+ m) nrisk
j
(m)
, m = m
[1]
, . . . , m
[Lm]
(4.18)
and similarly given any fixed λ
f
=
´
λ
f
,
´
λ
g
(t) ←

J
j=1
nevent
j
(t −V
j
)

J
j=1
´
λ
f
(t −V
j
) nrisk
j
(t −V
j
)
, t = t
[1]
, . . . , t
[Lt]
. (4.19)
Both marginal Breslow estimators are known to be the nonparametric MLE of the
baseline hazards conditional on the relative-risk terms. To take into the identifiability
constraint Elog
g
(t) = 0, we may update λ
g
(t) by
λ
g
(t) ← exp
_
log
´
λ
g
(t) −mean
_
log
´
λ
g
_
_
. (4.20)
By iterating (4.18) to (4.20) until convergence, we obtain the DtBreslow estimator.
It could be viewed as a natural extension of the one-way nonparametric estimators
96
(4.14) and (4.15). In (4.18), it is usually appropriate to set the initial exogenous
multipliers
´
λ
g
≡ 1, then the algorithm could converge in a few iterations.
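A minimal Python sketch of the DtBreslow iteration (4.18)–(4.20) follows; it works directly on vintage-by-month count arrays and uses our own names and toy counts.

```python
import numpy as np

def dt_breslow(nevent, nrisk, V, n_iter=20):
    """Iterate (4.18)-(4.20) for lambda(m, t) = lambda_f(m) * lambda_g(t).

    nevent, nrisk : (J, M_max) arrays of vintage-by-month counts as in (4.9)
    V             : integer origination times, so calendar time t = V[j] + m
    """
    J, Mmax = nevent.shape
    t_all = V[:, None] + np.arange(1, Mmax + 1)[None, :]    # calendar time per cell
    Tmax = int(t_all.max())
    lam_g = np.ones(Tmax + 1)
    for _ in range(n_iter):
        # (4.18): lifetime baseline given the exogenous multipliers
        denom_f = np.sum(lam_g[t_all] * nrisk, axis=0)
        lam_f = np.where(denom_f > 0, nevent.sum(axis=0) / np.maximum(denom_f, 1e-12), 0.0)
        # (4.19): calendar multipliers given the lifetime baseline
        lam_g_new, updated = np.ones_like(lam_g), np.zeros(Tmax + 1, dtype=bool)
        for t in range(1, Tmax + 1):
            mask = (t_all == t)
            num, den = nevent[mask].sum(), (lam_f[None, :] * nrisk)[mask].sum()
            if num > 0 and den > 0:
                lam_g_new[t], updated[t] = num / den, True
        # (4.20): centering so that mean(log lambda_g) = 0 over the updated times
        shift = np.exp(np.mean(np.log(lam_g_new[updated])))
        lam_g = np.where(updated, lam_g_new / shift, 1.0)
    return lam_f, lam_g

# Toy usage: 3 vintages, 6 months-on-book, made-up counts
nevent = np.array([[1, 2, 1, 0, 1, 0], [0, 1, 2, 1, 0, 1], [1, 0, 1, 2, 1, 0]])
nrisk = np.full((3, 6), 100)
lam_f, lam_g = dt_breslow(nevent, nrisk, V=np.array([0, 1, 2]))
print(np.round(lam_f, 4), np.round(lam_g[1:], 3))
```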
We carry out a simulation study to assess the performance of the DtBreslow estimator. Using the simulation data generated by (4.2), the two-way estimates of λ_f(m) and λ_g(t) are shown in Figure 4.2 (top), together with the one-way nonparametric estimates and the underlying truth. For this rectangular diagram of survival data, the one-way and DtBreslow estimates have comparable performance. However, if we use a non-rectangular Lexis diagram (e.g. either the pre-2005 vintages or the post-2005 vintages only), the DtBreslow estimator demonstrates a significant improvement over the one-way estimation; see Figure 4.2 (middle and bottom). The one-way nonparametric methods suffer from potential bias in both cases. Comparing (4.16) to (4.14) and (4.15), it is clear that the lifting power of the DtBreslow estimator (e.g. in calibrating λ_f(m)) depends not only on the exogenous scaling factor λ̂_g(t), but also on the vintage-specific weights nrisk_j(m).
4.2.3 MEV Decomposition
Lastly, we consider a three-way nonparametric procedure that is consistent with the MEV (maturation-exogenous-vintage) decomposition framework we developed for vintage data analysis in Chapter III. For each vintage with accounts (V_j, U_ji, M_ji, Δ_ji), i = 1, . . . , n_j, assume the underlying hazards to be

λ_ji(m, t) = λ_f(m) λ_g(t) λ_h(V_j) = exp{ f(m) + g(t) + h(V_j) }.     (4.21)

For identifiability, let Eg = Eh = 0. Such an MEV hazards model can be viewed as the extension of the dual-time nonparametric model (4.16) with vintage heterogeneity.

Figure 4.2: DtBreslow estimator vs. one-way nonparametric estimator using the data of (top) both pre-2005 and post-2005 vintages; (middle) pre-2005 vintages only; (bottom) post-2005 vintages only.

Due to the hyperplane nature of the vintage diagram, with t ≡ V_j + m in (4.21), the linear trends of f, g, h are not identifiable; see Lemma 3.1. Among other approaches to break up the linear dependency, we consider
1. Binning in vintage, lifetime or calendar time, especially for the marginal regions with sparse observations;
2. Removing the marginal trend in the vintage dimension h(·).
Note that the second strategy assumes the zero trend in vintage heterogeneity, which
is an ad hoc approach when we have no prior knowledge about the trend of f, g, h.
Later in Section 4.4 we discuss also a frailty type of vintage effect.
The nonparametric MLE for the MEV hazard model (4.21) follows the DtBreslow estimator discussed above. Specifically, we may estimate the three conditional marginal effects iteratively by:

a) Maturation effect, for fixed λ̂_g(t) and λ̂_h(v):

λ̂_f(m) ← Σ_{j=1}^J nevent_j(m) / Σ_{j=1}^J λ̂_h(V_j) λ̂_g(V_j + m) nrisk_j(m)     (4.22)

for m = m_[1], . . . , m_[L_m].

b) Exogenous effect, for fixed λ̂_f(m) and λ̂_h(v):

λ̂_g(t) ← Σ_{j=1}^J nevent_j(t − V_j) / Σ_{j=1}^J λ̂_h(V_j) λ̂_f(t − V_j) nrisk_j(t − V_j)
λ̂_g(·) ← exp{ ĝ − mean[ĝ] },   with ĝ = log λ̂_g     (4.23)

for t = t_[1], . . . , t_[L_t].
c) Vintage effect, for fixed λ̂_f(m) and λ̂_g(t):

λ̂_h(V_j) ← Σ_{i=1}^{n_j} I(Δ_ji = 1) / Σ_{l=1}^{L_m} λ̂_f(m_[l]) λ̂_g(V_j + m_[l]) nrisk_j(m_[l])
λ̂_h(·) ← exp{ ĥ − mean[ĥ] },   with ĥ = log λ̂_h     (4.24)

for j = 1, . . . , J (upon necessary binning). In (4.24), one may also perform the trend removal λ̂_h(·) ← exp{ ĥ − mean⊕trend[ĥ] } if a zero trend can be assumed for the vintage heterogeneity.
Setting the initial values λ̂_g ≡ 1 and λ̂_h ≡ 1 and iterating the above three steps until convergence gives the nonparametric estimates of the three-way marginal effects. Consider again the simulation data with both pre-2005 and post-2005 vintages observed on the rectangular Lexis diagram of Figure 4.1. Since the sample sizes are limited for very old and very new vintages, the vintages with V_j ≤ −36 are merged and the vintages with V_j ≥ 40 are cut off. Besides, given the prior knowledge that pre-2005 vintages have relatively flat heterogeneity, binning is performed in every half year. Then, running the above MEV iterative procedure, we obtain the estimates of the maturation hazard λ_f(m), the exogenous hazard λ_g(t) and the vintage heterogeneity λ_h(v), which are plotted in Figure 4.3 (top). They are consistent with the underlying true hazard functions pre-specified in (4.3). Furthermore, we have also tested the MEV nonparametric estimation for the dual-time data with either pre-2005 vintages only (trapezoidal diagram) or post-2005 vintages only (triangular diagram). The fitting results are also presented in Figure 4.3, which implies that our MEV decomposition is quite robust to the irregularity of the underlying Lexis diagram.
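To make the iterations concrete, below is a minimal R sketch of the procedure (not the exact code used for the reported results), assuming the Lexis cells have been tabulated into a data frame cells with one row per (vintage, lifetime) cell and columns j (vintage id), m (integer-valued lifetime), t (= V_j + m, integer-valued), nevent and nrisk, and assuming every m and t grid point carries at least one event, as in the text. All names, the initialization and the convergence check are illustrative; the DtBreslow estimator corresponds to skipping the vintage step, i.e. keeping λ̂_h ≡ 1.

    mev.fit <- function(cells, maxit = 50, tol = 1e-6) {
      mu <- sort(unique(cells$m)); tu <- sort(unique(cells$t)); ju <- sort(unique(cells$j))
      lam.f <- setNames(rep(0, length(mu)), mu)
      lam.g <- setNames(rep(1, length(tu)), tu)    # initial exogenous multipliers = 1
      lam.h <- setNames(rep(1, length(ju)), ju)    # initial vintage heterogeneity = 1
      for (it in seq_len(maxit)) {
        old <- c(lam.f, lam.g, lam.h)
        ## (a) maturation effect, cf. (4.22)
        w <- lam.g[as.character(cells$t)] * lam.h[as.character(cells$j)] * cells$nrisk
        lam.f <- tapply(cells$nevent, cells$m, sum) / tapply(w, cells$m, sum)
        ## (b) exogenous effect, cf. (4.23), then mean-centering of log(lam.g)
        w <- lam.f[as.character(cells$m)] * lam.h[as.character(cells$j)] * cells$nrisk
        lam.g <- tapply(cells$nevent, cells$t, sum) / tapply(w, cells$t, sum)
        lam.g <- exp(log(lam.g) - mean(log(lam.g)))
        ## (c) vintage effect, cf. (4.24), then mean-centering of log(lam.h)
        w <- lam.f[as.character(cells$m)] * lam.g[as.character(cells$t)] * cells$nrisk
        lam.h <- tapply(cells$nevent, cells$j, sum) / tapply(w, cells$j, sum)
        lam.h <- exp(log(lam.h) - mean(log(lam.h)))
        if (max(abs(c(lam.f, lam.g, lam.h) - old)) < tol) break
      }
      list(lambda.f = lam.f, lambda.g = lam.g, lambda.h = lam.h)
    }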
Remarks: before ending this section, we make several remarks about the nonpara-
metric methods for dual-time survival analysis:
1. First of all, the dual-time-to-default data have an essential difference from the usual bivariate lifetime data, which are associated with either two separate failure modes per subject or a single mode for a pair of subjects. For this reason, it is questionable whether the nonparametric methods developed for bivariate survival are applicable to the Lexis diagram; see Keiding (1990). Our developments of the DtBreslow and MEV nonparametric estimators for default intensities correspond to the age-period or age-period-cohort model in cohort or vintage data analysis of loss rates; see (3.4) in the previous chapter. For either default intensities or loss rates on the Lexis diagram, one needs to be careful about the intrinsic identification problem, for which we have discussed practical solutions in both of these chapters.

Figure 4.3: MEV modeling of empirical hazard rates, based on the dual-time-to-default data of (top) both pre-2005 and post-2005 vintages; (middle) pre-2005 vintages only; (bottom) post-2005 vintages only.
2. The nonparametric models (4.16) and (4.21) underlying the DtBreslow and MEV estimators are special cases of the dual-time Cox models in Section 4.4. In the finite-sample (or discrete) case, we have assumed in (4.17) the endogenous and exogenous hazard rates to be pointwise, or equivalently assumed the cumulative hazards to be step functions. In this case, the statistical properties of the DtBreslow estimates λ̂_f(m) and λ̂_g(t) can be studied by the standard likelihood theory for the finite-dimensional parameters α or β. However, on the continuous domain, the large-sample properties of the DtBreslow estimators are worthy of future study when the dimensions of both α and β increase to infinity at essentially the same speed.
3. Smoothing techniques can be applied to one-way, two-way or three-way nonparametric estimates, where the smoothness assumptions follow our discussion in the previous chapter; see (3.2). Consider e.g. the DtBreslow estimator with output of two-way empirical hazard rates λ̂_f(m) and λ̂_g(t), i.e. Nelson–Aalen increments on discrete time points; see (4.12). One may post-process them by kernel smoothing
\[
\lambda_f(m) = \frac{1}{b_f}\sum_{l=1}^{L_m} K_f\Big(\frac{m-m_{[l]}}{b_f}\Big)\hat\lambda_f(m_{[l]}), \qquad
\lambda_g(t) = \frac{1}{b_g}\sum_{l=1}^{L_t} K_g\Big(\frac{t-t_{[l]}}{b_g}\Big)\hat\lambda_g(t_{[l]})
\]
for m, t on the continuous domains, where K_f(·) and K_g(·), together with the bandwidths b_f and b_g, are kernel functions that are bounded and vanish outside [−1, 1]; see Ramlau-Hansen (1983). (A small computational sketch of this kernel smoothing is given after these remarks.) Alternatively, one may use the penalized likelihood formulation via smoothing splines; see Gu (2002; §7). The challenging issues in these smoothing methods are known to be the selection of bandwidths or smoothing parameters, as well as the correction of boundary effects. More details can be found in Wang (2005) and references therein.
4. An alternative way of dual-time hazard smoothing can be based on the vintage data analysis (VDA) developed in the previous chapter. For large-sample observations, one may first calculate the empirical hazard rates λ̂_j(m) per vintage by (4.11), use them as the pseudo-responses of the vintage data structure as in (3.1), then perform the Gaussian process modeling discussed in Section 3.3. In this case, the variance of λ̂_j(m) can be estimated by the tie-corrected formula
\[
\hat\sigma_j^2(m) = \frac{\big[\mathrm{nrisk}_j(m)-\mathrm{nevent}_j(m)\big]\,\mathrm{nevent}_j(m)}{\mathrm{nrisk}_j^3(m)}
\]
and the λ̂_j(m) are uncorrelated in m; see e.g. Aalen, et al. (2008; p.85). Then, we may write λ̂_j(m) = λ_j(m)(1 + ε_j(m)) with Eε_j(m) = 0 and
\[
\mathrm{E}\varepsilon_j^2(m) \approx \frac{\hat\sigma_j^2(m)}{\hat\lambda_j^2(m)} = \frac{1}{\mathrm{nevent}_j(m)} - \frac{1}{\mathrm{nrisk}_j(m)} \;\to\; 0, \quad \text{as } n_j\to\infty
\]
on the discrete time domain; by Taylor expansion, log λ̂_j(m) ≈ log λ_j(m) + ε_j(m), corresponding to the MEV model (3.15) with non-stationary noise.
5. An important message delivered by Figure 4.2 (middle and bottom) is that the nonparametric estimation of λ_f(m) in conventional lifetime-only survival analysis might be misleading when there exists an exogenous hazard λ_g(t) in the calendar dimension. Both the DtBreslow and MEV nonparametric methods may correct the bias of lifetime-only estimation, as demonstrated in Figure 4.2 and Figure 4.3 with either pre-2005 only or post-2005 only vintages.
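As referenced in Remark 3 above, here is a small computational sketch of the kernel post-smoothing, using the Epanechnikov kernel (bounded and vanishing outside [−1, 1]); the inputs m.pts (the discrete lifetime points), lam.hat (the corresponding Nelson–Aalen-type increments) and the bandwidth bf are illustrative assumptions.

    epan <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)   # Epanechnikov kernel
    smooth.haz <- function(m.grid, m.pts, lam.hat, bf)
      sapply(m.grid, function(m) sum(epan((m - m.pts) / bf) * lam.hat) / bf)
    ## e.g. lam.f.smooth <- smooth.haz(seq(0, 48, by = 0.25), m.pts, lam.f.hat, bf = 3)

The same smoother applies to the exogenous estimates λ̂_g(t) with its own bandwidth; the bandwidth choice and boundary correction remain the delicate parts, as noted above.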
4.3 Structural Models
We introduced in Section 1.2.1 the structural approach to credit risk modeling, which is popular in that default events are triggered by a latent stochastic process hitting the default boundary. For a quick review, one may refer to (1.6) for the latent distance-to-default process w.r.t. the zero boundary, (1.9) for the notion of first-passage-time, and (1.13) for the concrete expression of the hazard rate based on the inverse Gaussian lifetime distribution. In this section, we consider the extension of the structural approach to dual-time-to-default analysis.
4.3.1 First-passage-time Parameterization
Consider the Lexis diagram of dual-time default events for vintages j = 1, ..., J. Let X_j(m) be the endogenous distance-to-default (DD) process in lifetime and assume it follows a drifted Wiener process
\[
X_j(m) = c_j + b_j m + W_j(m), \qquad m\ge 0 \qquad (4.25)
\]
where X_j(0) ≡ c_j ∈ R_+ denotes the parameter of initial distance-to-default, b_j ∈ R denotes the parameter of the trend of credit deterioration, and W_j(m) is a Wiener process. Without getting complicated, let us assume that W_j and W_{j′} are uncorrelated whenever j ≠ j′. Recall the hazard rate (or, default intensity) of the first-passage-time of X_j(m) crossing zero:
\[
\lambda_j(m; c_j, b_j) = \frac{\dfrac{c_j}{\sqrt{2\pi m^3}}\exp\Big\{-\dfrac{(c_j+b_j m)^2}{2m}\Big\}}{\Phi\Big(\dfrac{c_j+b_j m}{\sqrt{m}}\Big) - e^{-2b_j c_j}\,\Phi\Big(\dfrac{-c_j+b_j m}{\sqrt{m}}\Big)}, \qquad (4.26)
\]
whose numerator and denominator correspond to the density and survival functions; see (1.10)–(1.13). By (4.6), the MLE of (c_j, b_j) can be obtained by maximizing the log-likelihood of the j-th vintage observations
\[
\ell_j = \sum_{i=1}^{n_j}\Delta_{ji}\log\lambda_j(M_{ji}; c_j, b_j) + \log S_j(M_{ji}; c_j, b_j) - \log S_j(U_{ji}; c_j, b_j), \qquad (4.27)
\]
where the maximization can be carried out by nonlinear optimization. We supplement
the gradients of the log-likelihood function at the end of the chapter.
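As a small computational sketch (distinct from the gradient-based routine described in the supplementary material), the hazard (4.26), its survival function and the log-likelihood (4.27) can be coded directly and handed to a derivative-free optimizer in R; the data layout (vectors U, M, delta for one vintage) and the starting values are assumptions for illustration only.

    ig.surv <- function(m, c, b)            # survival function: denominator of (4.26)
      pnorm((c + b * m) / sqrt(m)) - exp(-2 * b * c) * pnorm((-c + b * m) / sqrt(m))
    ig.haz <- function(m, c, b) {            # first-passage-time hazard (4.26)
      dens <- c / sqrt(2 * pi * m^3) * exp(-(c + b * m)^2 / (2 * m))
      dens / ig.surv(m, c, b)
    }
    negloglik <- function(par, U, M, delta) {    # minus the vintage log-likelihood (4.27)
      c <- exp(par[1]); b <- par[2]              # c > 0 enforced via log-parameterization
      -sum(delta * log(ig.haz(M, c, b)) +
           log(ig.surv(M, c, b)) - log(ig.surv(U, c, b)))
    }
    ## e.g. optim(c(log(5), -0.02), negloglik, U = U, M = M, delta = delta)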
As their literal names indicate, both the initial distance-to-default c_j and the trend of credit deterioration b_j have interesting structural interpretations. With reference to Aalen and Gjessing (2001) or Aalen, et al. (2008; §10):

1. The c-parameter represents the initial value of the DD process. When c becomes smaller, a default event tends to happen sooner, hence moving the concentration of defaults towards the left. In theory, the hazard function (4.26) for m ∈ (0, ∞) is always first-increasing-then-decreasing, with maximum at m* such that the derivative λ′(m*) = 0. In practice, let us focus on the bounded interval m ∈ [ε, m_max] for some ε > 0. Then, as c gradually approaches zero, the hazard rate changes shape from (a) monotonically increasing, to (b) first-increasing-then-decreasing, and finally to (c) monotonically decreasing. See Figure 1.2 for the illustration with fixed b and varying c values.

2. The b-parameter represents the trend of the DD process, and it determines the level of the stabilized hazard rate as m → ∞. In Aalen, et al. (2008; §10), let φ_m(x) be the conditional density of X(m) given that min_{0<s≤m} X(s) > 0; it is said to be quasi-stationary if φ_m(x) is m-invariant. In our case, the quasi-stationary distribution is φ(x) = b²xe^{−bx} for x ≥ 0 and the limiting hazard rate is given by
\[
\lim_{m\to\infty}\lambda(m) = \frac{1}{2}\,\frac{\partial\phi(x)}{\partial x}\bigg|_{x=0} = \frac{b^2}{2},
\]
which does not depend on the c-parameter. In Figure 1.2 with b = −0.02, all the hazard rate curves will converge to the constant 0.0002 after a long run.
The above first-passage-time parameterization concerns only the endogenous or maturation hazard in lifetime. Next, we extend the first-passage-time structure to take into account the exogenous effect in calendar time. To achieve that, we model the dual-time DD process for each vintage V_j by a subordinated process
\[
X^*_j(m) = X_j(\Psi_j(m)), \qquad \text{with } \Psi_j(m) = \int_0^m \psi(V_j+s)\,ds \qquad (4.28)
\]
where X_j(m) is the endogenous DD process (4.25) and Ψ_j(m) is an increasing time-transformation function such that all vintages share the same integrand ψ(t) > 0 in calendar time. We set the constraint E log ψ = 0 on the exogenous effect, in the sense of letting the 'average' DD process be captured endogenously. Such a time-transformation approach can be viewed as a natural generalization of the log-location-scale transformation (1.16) with the correspondence σ_i = 1 and ψ(·) = e^{−µ_i}. It has been used to extend the accelerated failure time (AFT) models with time-varying covariates; see e.g. Lawless (2003; §6.4.3).
The exogenous function ψ(t) rescales the original lifetime increment ∆m to ψ(V_j + m)∆m, and therefore transforms the diffusion dX_j(m) = b_j dm + dW_j(m) to
\[
dX^*_j(m) = b_j\,\psi(V_j+m)\,dm + \sqrt{\psi(V_j+m)}\,dW_j(m) \qquad (4.29)
\]
w.r.t. the zero default boundary. Alternatively, one may rewrite it as
\[
dX^*_j(m) = \sqrt{\psi(V_j+m)}\,dX_j(m) - b_j\Big[\sqrt{\psi(V_j+m)} - \psi(V_j+m)\Big]dm. \qquad (4.30)
\]
Let us introduce a new process X̄_j(m) such that dX̄_j(m) = √(ψ(V_j+m)) dX_j(m), and introduce a time-varying default boundary
\[
\bar D_j(m) = b_j\int_0^m\Big[\sqrt{\psi(V_j+s)} - \psi(V_j+s)\Big]ds.
\]
Then, by (4.30), it is clear that the time-to-default τ*_j = inf{m ≥ 0 : X*_j(m) ≤ 0} = inf{m ≥ 0 : X̄_j(m) ≤ D̄_j(m)}.
By the time-transformation (4.28), it is straightforward to obtain the survival function of the first-passage-time τ*_j as
\[
S^*_j(m) = P\big(X^*_j(s) > 0 : s\in[0,m]\big) = P\big(X_j(s) > 0 : s\in[0,\Psi_j(m)]\big) = S_j(\Psi_j(m)) = S_j\Big(\int_0^m\psi(V_j+s)\,ds\Big) \qquad (4.31)
\]
where the benchmark survival S_j(·) is given by the denominator of (4.26). Then, the hazard rate of τ*_j is obtained by
\[
\lambda^*_j(m) = -\frac{d\log S^*_j(m)}{dm} = -\frac{d\log S_j(\Psi_j(m))}{d\Psi_j(m)}\,\frac{d\Psi_j(m)}{dm}
= \lambda_j(\Psi_j(m))\,\Psi'_j(m) = \lambda_j\Big(\int_0^m\psi(V_j+s)\,ds\Big)\,\psi(V_j+m) \qquad (4.32)
\]
with λ_j(·) given in (4.26). From (4.32), the exogenous function ψ(t) not only maps the benchmark hazard rate to the rescaled time Ψ_j(m), but also scales it by ψ(t). This is the same as in the AFT regression model (1.21).
Now consider the estimation of both the vintage-specific structural parameters (c_j, b_j) and the exogenous time-scaling function ψ(t) from the dual-time-to-default data (4.1). By (4.6), the joint log-likelihood is given by ℓ = Σ_{j=1}^J ℓ_j, with ℓ_j given by (4.27) based on λ*_j(m; c_j, b_j) and S*_j(m; c_j, b_j). The complication lies in the arbitrary form of ψ(t) involved in the time-transformation function. Nonparametric techniques can be used if extra properties like smoothness can be imposed on the function ψ(t). In our case the exogenous time-scaling effects are often dynamic and volatile, for which we recommend a piecewise estimation procedure based on the discretized Lexis diagram.
1. Discretize the Lexis diagram with ∆m = ∆t such that t = t_min + l∆t for l = 0, 1, ..., L, where t_min is set to be the left boundary and t_min + L∆t corresponds to the right boundary. Specify ψ(t) to be piecewise constant:
\[
\psi(t) = \psi_l, \qquad \text{for } t - t_{\min}\in\big((l-1)\Delta t,\, l\Delta t\big], \;\; l = 1, 2, \ldots, L
\]
and ψ(t) ≡ 1 for t ≤ t_min. Then, the time-transformation for the vintage with origination time V_j can be expressed as
\[
\bar\Psi_j(m) = \sum_{l=(V_j-t_{\min})/\Delta t}^{(V_j+m-t_{\min})/\Delta t}\psi(t_{\min}+l\Delta t)\,\Delta t, \qquad \text{for } V_j+m\ge t_{\min},
\]
which reduces to t_min − V_j + Σ_{l=1}^{(V_j+m−t_min)/∆t} ψ_l ∆t in the case of left truncation.
2. For each l = 1, 2, ..., collect the corresponding likelihood contribution:
\[
\ell_{[l]} = \sum_{(j,i)\in G_{[l]}}\Delta_{ji}\big[\log\lambda_j(\bar\Psi_j(M_{ji})) + \log\psi_l\big] + \log S_j(\bar\Psi_j(M_{ji})) - \log S_j(\bar\Psi_j(U_{ji}))
\]
where G_{[l]} = {(j, i) : V_j + M_ji = t_min + l∆t}, and Ψ̄_j(U_ji) = max{0, t_min − V_j} if t_min is the systematic left truncation in calendar time.
3. Run nonlinear optimization for estimating (c_j, b_j) and ψ_l iteratively:

(a) For fixed ψ and each fixed vintage j = 1, ..., J, obtain the MLE (ĉ_j, b̂_j) by maximizing (4.27) with M_ji replaced by Ψ̄_j(M_ji).

(b) For all fixed (c_j, b_j), obtain the MLE of (ψ_1, ..., ψ_L) by maximizing Σ_{l=1}^L ℓ_{[l]} subject to the constraint Σ_{l=1}^L log ψ_l = 0.

In Step (3.b), one may perform further binning of (ψ_1, ..., ψ_L) for the purpose of reducing the dimension of the parameter space in the constrained optimization. One may also apply multi-resolution wavelet bases to represent {ψ_1, ..., ψ_L} in a time-frequency perspective. Such ideas will be investigated in our future work.
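As a small illustration of Step 1, the discretized time transformation Ψ̄_j(m) for a left-truncated vintage (V_j ≤ t_min) can be computed from a vector of piecewise-constant multipliers; here psi[l] denotes ψ_l on the l-th calendar bin after t_min, the monthly spacing dt = 1 is an assumption, and V_j + m is assumed to lie within the observation window.

    Psi.bar <- function(m, Vj, psi, t.min, dt = 1) {
      L.m <- floor((Vj + m - t.min) / dt)    # number of post-t.min bins reached
      if (L.m <= 0) return(m)                # entirely before t.min, where psi == 1
      (t.min - Vj) + sum(psi[seq_len(L.m)]) * dt
    }
    ## used inside the alternating optimization of Step 3, replacing M_ji by Psi.bar(M_ji, ...)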
4.3.2 Incorporation of Covariate Effects
It is straightforward to extend the above structural parameterization to incorporate subject-specific covariates. For each account i with origination V_j, let z̃_ji be the static covariates observed upon origination (including the intercept and otherwise centered predictors) and z_ji(m) for m ∈ (U_ji, M_ji] be the dynamic covariates observed since left truncation. Let us specify
\[
c_{ji} = \exp\{\eta^T \tilde z_{ji}\} \qquad (4.33)
\]
\[
\Psi_{ji}(m) = U_{ji} - V_j + \int_{U_{ji}}^{m}\exp\{\theta^T z_{ji}(s)\}\,ds, \qquad m\ge U_{ji} \qquad (4.34)
\]
subject to the unknown parameters (η, θ), and suppose the subject-specific DD process
\[
X^*_{ji}(m) = X_{ji}(\Psi_{ji}(m)), \qquad X_{ji}(m) = c_{ji} + b_j m + W_{ji}(m) \qquad (4.35)
\]
with an independent Wiener process W_ji(m), the initial value c_ji and the vintage-specific trend of credit deterioration b_j. Then, the default occurs at τ*_ji = inf{m > 0 : X*_ji(m) ≤ 0}. Note that when the covariates in (4.34) are not time-varying, we may specify Ψ_ji(m) = m exp{θ^T z_ji} and obtain the AFT regression model log τ*_ji = −θ^T z_ji + log τ_ji with inverse Gaussian distributed τ_ji. In the presence of calendar-time state variables x(t) such that z_ji(m) = x(V_j + m) for all (j, i), the time-transformation (4.34) becomes a parametric version of (4.28).
By the same arguments as in (4.31) and (4.32), we obtain the hazard rate λ*_ji(m) = λ_ji(Ψ_ji(m)) Ψ′_ji(m), or
\[
\lambda^*_{ji}(m) = \frac{\dfrac{c_{ji}}{\sqrt{2\pi\Psi^3_{ji}(m)}}\exp\Big\{-\dfrac{\big(c_{ji}+b_j\Psi_{ji}(m)\big)^2}{2\Psi_{ji}(m)}\Big\}\,\Psi'_{ji}(m)}{\Phi\Big(\dfrac{c_{ji}+b_j\Psi_{ji}(m)}{\sqrt{\Psi_{ji}(m)}}\Big) - e^{-2b_j c_{ji}}\,\Phi\Big(\dfrac{-c_{ji}+b_j\Psi_{ji}(m)}{\sqrt{\Psi_{ji}(m)}}\Big)} \qquad (4.36)
\]
and the survival function S*_ji(m) = S_ji(Ψ_ji(m)) given by the denominator in (4.36).
The parameters of interest include (η, θ) and b_j for j = 1, ..., J, and they can be estimated by MLE. Given the dual-time-to-default data (4.1), the log-likelihood function follows the generic form (4.6). Standard nonlinear optimization programs can be employed, e.g. the "optim" algorithm of R (stats library) and the "fminunc" algorithm of MATLAB (Optimization Toolbox).
Remarks:
1. There is no unique way to incorporate the static covariates in the first-passage-time parameterization. For example, we considered including them through the initial distance-to-default parameter (4.33), while they may also be incorporated into other structural parameters; see Lee and Whitmore (2006). However, there has been limited discussion on the incorporation of dynamic covariates in the first-passage-time approach, for which we make an attempt to use the time-transformation technique (4.28) under the nonparametric setting and (4.34) under the parametric setting.

2. The endogenous hazard rate (4.26) is based on the fixed effects of the initial distance-to-default and the trend of deterioration, while these structural parameters can be random effects due to the incomplete information available (Giesecke, 2006). One may refer to Aalen, et al. (2008; §10) for the random-effect extension. For example, when the trend parameter b ∼ N(µ, σ²) and the initial value c is fixed, the endogenous hazard rate is given by p(m)/S(m) with
\[
p(m) = \frac{c}{\sqrt{2\pi(m^3+\sigma^2 m^4)}}\exp\Big\{-\frac{(c+\mu m)^2}{2(m+\sigma^2 m^2)}\Big\}
\]
\[
S(m) = \Phi\Big(\frac{c+\mu m}{\sqrt{m+\sigma^2 m^2}}\Big) - e^{-2\mu c + 2\sigma^2 c^2}\,\Phi\Big(\frac{-c+\mu m - 2\sigma^2 c m}{\sqrt{m+\sigma^2 m^2}}\Big).
\]

3. In the first-passage-time parameterization with the inverse Gaussian lifetime distribution, the log-likelihood functions (4.27) and (4.6) with the plugged-in (4.26) or (4.36) are rather messy, especially when we attempt to evaluate the derivatives of the log-likelihood function in order to run a gradient search; see the supplementary material. Without gradients provided, one may rely on crude nonlinear optimization algorithms, and the parameter estimation follows the standard MLE procedure. Our discussion in this section mainly illustrates the feasibility of the dual-time structural approach.
4.4 Dual-time Cox Regression
The semiparametric Cox models are popular for their effectiveness in estimating covariate effects. Suppose that we are given the dual-time-to-default data (4.1) together with the covariates z_ji(m) observed for m ∈ (U_ji, M_ji], which may include the static covariates z̃_ji(m) ≡ z̃_ji. Using the traditional CoxPH modeling approach, one may either adopt an arbitrary baseline in lifetime
\[
\lambda_{ji}(m) = \lambda_0(m)\exp\{\theta^T z_{ji}(m)\}, \qquad (4.37)
\]
or adopt the stratified Cox model with arbitrary vintage-specific baselines
\[
\lambda_{ji}(m) = \lambda_{j0}(m)\exp\{\theta^T z_{ji}(m)\} \qquad (4.38)
\]
for i = 1, ..., n_j and j = 1, ..., J. However, these two Cox models may not be suitable for (4.1) observed on the Lexis diagram, in the following sense:

1. (4.37) is too stringent to capture the variation of vintage-specific baselines;
2. (4.38) is too relaxed to capture the interrelation of vintage-specific baselines.

For example, we have simulated a Lexis diagram in Figure 4.1 whose vintage-specific hazards are interrelated not only in lifetime m but also in calendar time t.
4.4.1 Dual-time Cox Models
The dual-time Cox models considered in this section take the following multiplicative hazards form in general,
\[
\lambda_{ji}(m,t) = \lambda_0(m,t)\exp\{\theta^T z_{ji}(m)\} \qquad (4.39)
\]
where λ_0(m, t) is a bivariate base hazard function on the Lexis diagram. Upon the specification of λ_0(m, t), the traditional models (4.37) and (4.38) can be included as special cases. In order to balance between (4.37) and (4.38), we assume the MEV decomposition framework of the base hazard, λ_0(m, t) = λ_f(m) λ_g(t) λ_h(t − m), i.e. the nonparametric model (4.21). Thus, we restrict ourselves to the dual-time Cox models with MEV baselines and relative-risk multipliers
\[
\lambda_{ji}(m,t) = \lambda_f(m)\,\lambda_g(t)\,\lambda_h(V_j)\exp\{\theta^T z_{ji}(m)\}
= \exp\{f(m) + g(t) + h(V_j) + \theta^T z_{ji}(m)\} \qquad (4.40)
\]
where λ_f(m) = exp{f(m)} represents the arbitrary maturation/endogenous baseline, while λ_g(t) = exp{g(t)} and λ_h(v) = exp{h(v)} represent the exogenous multiplier and the vintage heterogeneity subject to Eg = Eh = 0. Some necessary binning or trend-removal needs to be imposed on g, h to ensure the model estimability; see Lemma 3.1.

Given the finite-sample observations (4.1) on the Lexis diagram, one may always find L_m distinct lifetime points m_[1] < · · · < m_[Lm] and L_t distinct calendar time points t_[1] < · · · < t_[Lt] such that there exists at least one default event observed at each corresponding marginal time. Following the least-information assumption of Breslow (1972), let us treat the endogenous baseline hazard λ_f(m) and the exogenous baseline hazard λ_g(t) as pointwise functions of the form (4.17) subject to Σ_{l=1}^{Lt} β_l = 0. For the discrete set of vintage originations, the vintage heterogeneity effects can be regarded
essentially as categorical covariate effects. We may either assume the pointwise function of the form h(v) = Σ_{j=1}^J γ_j I(v = V_j) subject to constraints on both mean and trend (i.e. the ad hoc approach), or assume a piecewise-constant functional form upon appropriate binning of the vintage originations
\[
h(V_j) = \gamma_\kappa, \qquad \text{if } j \in G_\kappa, \;\; \kappa = 1, \ldots, K \qquad (4.41)
\]
where the vintage buckets G_κ = {j : V_j ∈ (ν_{κ−1}, ν_κ]} are pre-specified based on V_1 ≡ ν_0 < · · · < ν_K ≡ V_J (where K < J). Then, the semiparametric model (4.40) is parametrized with dummy variables to be
\[
\lambda_{ji}(m,t) = \exp\Big\{\sum_{l=1}^{L_m}\alpha_l I(m=m_{[l]}) + \sum_{l=2}^{L_t}\beta_l I(t=t_{[l]}) + \sum_{\kappa=2}^{K}\gamma_\kappa I(j\in G_\kappa) + \theta^T z_{ji}(m)\Big\} \qquad (4.42)
\]
where β_1 ≡ 0 and γ_1 ≡ 0 are excluded from the model due to the identifiability constraints. Plugging it into the joint likelihood (4.6), one may then find the MLE of (α, β, γ, θ). (Note that it is rather straightforward to convert the resulting parameter estimates α̂, β̂ and γ̂ to satisfy the zero-sum constraints for β and γ; the same holds for (4.45) to be studied below.)

As the sample size tends to infinity, both the endogenous and exogenous baseline hazards λ_f(m) and λ_g(t) defined on the continuous dual-time scales (m, t) tend to be infinite-dimensional, hence making the above pointwise parameterization intractable. The semiparametric estimation problem in this case is challenging, as the dual-time Cox model (4.40) involves two nonparametric components that need to be profiled out for the purpose of estimating the parametric effects. In what follows, we take an "intermediate approach" to semiparametric estimation via exogenous timescale discretization, to which the theory of partial likelihood applies.
4.4.2 Partial Likelihood Estimation

Consider an intermediate reformulation of the dual-time Cox models (4.40) by exogenous timescale discretization:
\[
t_l = t_{\min} + l\,\Delta t, \qquad \text{for } l = 0, 1, \ldots, L \qquad (4.43)
\]
for some time increment ∆t bounded away from zero, where t_min corresponds to the left boundary and t_min + L∆t corresponds to the right boundary of the calendar time under consideration. (If the Lexis diagram is discrete itself, ∆t is automatically defined. Irregular time spacing is feasible, too.) On each sub-interval, assume the exogenous baseline λ_g(t) is defined on the L cutting points (and zero otherwise):
\[
g(t) = \sum_{l=1}^{L}\beta_l I(t = t_{\min} + l\,\Delta t), \qquad (4.44)
\]
subject to Σ_{l=1}^L β_l = 0. Then, consider the one-way CoxPH reformulation of (4.40):
\[
\lambda_{ji}(m) = \lambda_f(m)\exp\Big\{\sum_{l=2}^{L}\beta_l I\big(V_j+m\in(t_{l-1},t_l]\big) + \sum_{\kappa=2}^{K}\gamma_\kappa I(j\in G_\kappa) + \theta^T z_{ji}(m)\Big\} \qquad (4.45)
\]
where λ_f(m) is an arbitrary baseline hazard, g(t) for any t ∈ (t_{l−1}, t_l] in (4.40) is shifted to g(t_l) in (4.44), and the vintage effect is bucketed by (4.41). It is easy to see that if the underlying Lexis diagram is discrete, the reformulated CoxPH model (4.45) is equivalent to the fully parametrized version (4.42).
We consider the partial likelihood estimation of the parameters (β, γ, θ) ≡ ξ. For convenience of discussion, write for each (j, i, m) the (L−1)-vector {I(V_j + m ∈ (t_{l−1}, t_l])}_{l=2}^L, the (K−1)-vector {I(V_j ∈ (ν_{κ−1}, ν_κ])}_{κ=2}^K, and z_ji(m), joined together as a long vector x_ji(m). Then, by Cox (1972, 1975), the parameter ξ can be estimated by maximizing the partial likelihood
\[
PL(\xi) = \prod_{j=1}^{J}\prod_{i=1}^{n_j}\left[\frac{\exp\{\xi^T x_{ji}(M_{ji})\}}{\sum_{(j',i')\in R_{ji}}\exp\{\xi^T x_{j'i'}(M_{ji})\}}\right]^{\Delta_{ji}} \qquad (4.46)
\]
where R_ji = {(j′, i′) : U_{j′i′} < M_ji ≤ M_{j′i′}} is the at-risk set at each observed M_ji.
Given the maximum partial likelihood estimator ξ̂, the endogenous cumulative baseline hazard Λ_f(m) = ∫_0^m λ_f(s) ds can be estimated by the Breslow estimator
\[
\hat\Lambda_f(m) = \sum_{j=1}^{J}\sum_{i=1}^{n_j}\frac{I(m\ge M_{ji})\,\Delta_{ji}}{\sum_{(j',i')\in R_{ji}}\exp\{\hat\xi^T x_{j'i'}(M_{ji})\}} \qquad (4.47)
\]
which is a step function with increments at m_[1] < · · · < m_[Lm]. It is well known in the CoxPH context that both estimators ξ̂ and Λ̂_f(·) are consistent and asymptotically normal; see Andersen and Gill (1982), based on counting process martingale theory. Specifically, √N(ξ̂ − ξ) converges to a zero-mean multivariate normal distribution with covariance matrix that can be consistently estimated by {I(ξ̂)/N}^{−1}, where I(ξ) = −∂² log PL(ξ)/∂ξ∂ξ^T. For the nonparametric baseline, √N(Λ̂_f(m) − Λ_f(m)) converges to a zero-mean Gaussian process; see also Andersen, et al. (1993).
The DtBreslow and MEV estimators we have discussed in Section 4.2 can be studied by the theory of partial likelihood above, since the underlying nonparametric models are special cases of (4.45) with θ ≡ 0. Without the account-level covariates, the hazard rates λ_ji(m) roll up to λ_j(m), under which the partial likelihood function (4.46) can be expressed through the counting notations introduced in (4.9),
\[
PL(g,h) = \prod_{j=1}^{J}\prod_{m}\left[\frac{\exp\{g(V_j+m)+h(V_j)\}}{\sum_{j'=1}^{J}\mathrm{nrisk}_{j'}(m)\exp\{g(V_{j'}+m)+h(V_{j'})\}}\right]^{\mathrm{nevent}_j(m)} \qquad (4.48)
\]
where g(t) is parameterized by (4.44) and h(v) by (4.41). Let λ̂_g(t) = exp{ĝ(t)} and λ̂_h(v) = exp{ĥ(v)} upon maximization of PL(g, h) and centering in the sense of
Eg = Eh = 0. Then, by (4.47), we have the Breslow estimator of the cumulative version of the endogenous baseline hazard
\[
\hat\Lambda_f(m) = \int_0^m \frac{\sum_{j=1}^{J}\mathrm{nevent}_j(s)\,ds}{\sum_{j=1}^{J}\mathrm{nrisk}_j(s)\,\hat\lambda_g(V_j+s)\,\hat\lambda_h(V_j)}. \qquad (4.49)
\]
It is clear that the increments of Λ̂_f(m) correspond to the maturation effect (4.22) of the MEV estimator. They also reduce to (4.18) of the DtBreslow estimator in the absence of vintage heterogeneity.
The implementation of dual-time Cox modeling based on (4.45) can be carried out by maximizing the partial likelihood (4.46) to find θ̂ and using (4.47) to estimate the baseline hazard. Standard software with a survival package or toolbox can be used to perform the parameter estimation. In the case when the coxph procedure in these survival packages (e.g. R:survival) requires time-independent covariates, we may convert (4.1) to an enlarged counting-process format with a piecewise-constant approximation of the time-varying covariates x_ji(m); see the supplementary material in the last section of the chapter.
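To make the implementation concrete, below is a minimal R sketch of such a fit (not the exact code used for the results reported next), assuming the data have already been expanded into counting-process rows (one row per loan per lifetime segment) with illustrative column names start, stop, event, cal.bin (calendar-time bin index l), vin.bucket (vintage bucket κ) and covariates z1, z2.

    library(survival)
    ## dual-time CoxPH reformulation (4.45): lifetime is the baseline time scale,
    ## calendar bins and vintage buckets enter as dummy-coded factor terms
    fit <- coxph(Surv(start, stop, event) ~ factor(cal.bin) + factor(vin.bucket) + z1 + z2,
                 data = dtd)
    ## Breslow-type cumulative baseline in lifetime, cf. (4.47)
    base <- basehaz(fit, centered = FALSE)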
For example, we have tested the partial likelihood estimation using coxph of the R:survival package and the simulation data in Figure 4.1, where the raw data (4.1) are converted to the counting-process format upon the same binning preprocessing as in Section 4.2.3. The numerical results of λ̂_g, λ̂_h by maximizing (4.48) and λ̂_f by (4.49) are nearly identical to the plots shown in Figure 4.3.
4.4.3 Frailty-type Vintage Effects
Recall the frailty models introduced in Section 1.3 and their double roles in characterizing both the unobserved heterogeneity and the dependence of correlated defaults. They provide an alternative approach that models the vintage effects as random block effects. Let Z_κ for κ = 1, ..., K be a random sample from some frailty distribution, where Z_κ represents the shared frailty level of the vintage bucket G_κ according to (4.41). Consider the dual-time Cox frailty models
\[
\lambda_{ji}(m,t\,|\,Z_\kappa) = \lambda_f(m)\,\lambda_g(t)\,Z_\kappa \exp\{\theta^T z_{ji}(m)\}, \qquad \text{if } j \in G_\kappa \qquad (4.50)
\]
for i = 1, ..., n_j and j ∈ G_κ for κ = 1, ..., K. We also write λ_ji(m, t | Z_κ) as λ_ji(m | Z_κ) with t ≡ V_j + m. It is assumed that given the frailty Z_κ, the observations for all vintages in G_κ are independent. In this frailty model setting, the between-bucket heterogeneity is captured by realizations of the random variate Z, while the within-bucket dependence is captured by the shared realization Z_κ.
Similar to the intermediate approach taken by the partial likelihood (4.46) under the exogenously discretized dual-time model (4.45), let us write ξ̃ = (β, θ) and let x̃_ji(m) join the vectors {I(V_j + m ∈ (t_{l−1}, t_l])}_{l=2}^L and z_ji(m). Then, we may write (4.50) as λ_ji(m | Z_κ) = λ_f(m) Z_κ exp{ξ̃^T x̃_ji(m)}. The log-likelihood based on the dual-time-to-default data (4.1) is given by
\[
\ell = \sum_{j=1}^{J}\sum_{i=1}^{n_j}\Delta_{ji}\big[\log\lambda_f(M_{ji}) + \tilde\xi^T \tilde x_{ji}(M_{ji})\big] + \sum_{\kappa=1}^{K}\log\big[(-1)^{d_\kappa} L^{(d_\kappa)}(A_\kappa)\big], \qquad (4.51)
\]
where L(x) = E_Z[exp{−xZ}] is the Laplace transform of Z, L^{(d)}(x) is its d-th derivative, d_κ = Σ_{j∈G_κ} Σ_{i=1}^{n_j} ∆_ji and A_κ = Σ_{j∈G_κ} Σ_{i=1}^{n_j} ∫_{U_ji}^{M_ji} λ_f(s) exp{ξ̃^T x̃_ji(s)} ds; see the supplementary material at the end of the chapter.
The gamma distribution Gamma(δ^{−1}, δ) with mean 1 and variance δ is the most frequently used frailty due to its simplicity. It has the Laplace transform L(x) = (1 + δx)^{−1/δ}, whose d-th derivative is given by L^{(d)}(x) = (−δ)^d (1 + δx)^{−1/δ−d} Π_{i=1}^d (1/δ + i − 1). For a fixed δ, the EM (expectation-maximization) algorithm is often used for estimating (λ_f(·), ξ̃); see Nielsen, et al. (1992) or the supplementary material.
Alternatively, the gamma frailty models can be estimated by the penalized likelihood approach, treating the second term of (4.51) as a kind of regularization. Therneau and Grambsch (2000; §9.6) justified that, for each fixed δ, the EM algorithmic solution coincides with the maximizer of the following penalized log-likelihood
\[
\sum_{j=1}^{J}\sum_{i=1}^{n_j}\Delta_{ji}\big[\log\lambda_f(M_{ji}) + \xi^T x_{ji}(M_{ji})\big] + \frac{1}{\delta}\sum_{\kappa=1}^{K}\big[\gamma_\kappa - \exp\{\gamma_\kappa\}\big] \qquad (4.52)
\]
where ξ = (β, γ, θ) and x_ji(m) are the same as in (4.46). They also provide the programs (in the R:survival package) with a Newton-Raphson iterative solution as an inner loop and the selection of δ in an outer loop. Fortunately, these programs can be used to fit the dual-time Cox frailty models (4.50) up to certain modifications.
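As a rough illustration, the following R sketch fits a gamma-frailty version of the vintage effect with the penalized frailty term of the survival package; the column names follow the earlier counting-process layout and are assumptions, and whether one's installed version of coxph accepts the frailty term together with left-truncated (start, stop] data should be verified before use.

    library(survival)
    fit.frailty <- coxph(Surv(start, stop, event) ~ factor(cal.bin) + z1 + z2 +
                           frailty(vin.bucket, distribution = "gamma"),
                         data = dtd)
    print(fit.frailty)   # the printed summary reports the estimated frailty variance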
Remarks:
1. The dual-time Cox regression considered in this section is reduced to standard CoxPH problems upon exogenous timescale discretization, such that the existing theory of partial likelihood and the asymptotic results can be applied. We have not studied the dual-time asymptotic behavior when both lifetime and calendar time are considered to be absolutely continuous. In practice with finite-sample observations, the time-discretization trick works perfectly.

2. The dual-time Cox model (4.40) includes only the endogenous covariates z_ji(m) that vary across individuals. In practice, it is also interesting to see the effects of exogenous covariates z̄(t) that are invariant across all individuals at the same calendar time. Such z̄(t) might be entertained by the traditional lifetime Cox models (4.37) upon the time shift; however, they are not estimable if included as a log-linear term in (4.40), since exp{θ^T z̄(t)} would be absorbed by the exogenous baseline λ_g(t). Therefore, in order to investigate the exogenous covariate effects of z̄(t), we recommend a two-stage procedure: the first stage fits the dual-time Cox models (4.40), and the second stage models the correlation between z̄(t) and the estimated λ̂_g(t) based on time series techniques.
4.5 Applications in Retail Credit Risk Modeling
Retail credit risk modeling has become increasingly important in recent years; see, among others, the special issues edited by Berlin and Mester (Journal of Banking and Finance, 2004) and Wallace (Real Estate Economics, 2005). In the quantitative understanding of the risk determinants of retail credit portfolios, there have been large gaps among academic researchers, banks' practitioners and governmental regulators. It turns out that many existing models for corporate credit risk (e.g. the Merton model) are not directly applicable to retail credit.
In this section we demonstrate the application of dual-time survival analysis
to credit card and mortgage portfolios in retail banking. Based on our real data-
analytic experiences, some tweaked samples are considered here, only for the purpose
of methodological illustration. Specifically, we employ the pooled credit card data
for illustrating the dual-time nonparametric methods, and use the mortgage loan-
level data to illustrate the dual-time Cox regression modeling. Before applying these
DtSA techniques, we have assessed them under the simulation mechanism (4.2); see
Figure 4.2 and Figure 4.3.
4.5.1 Credit Card Portfolios
As we have introduced in Section 1.4, credit card portfolios vary by product type, acquisition channel, geographical region and the like. These attributes may either be treated as endogenous covariates, or used to define multiple segments. Here, we demonstrate the credit card risk modeling with both segmentation
and covariates. For simplicity, we consider two generic segments and a categorical
covariate with three levels, where the categorical variable is defined by the credit score
buckets upon loan origination: Low if FICO < 640, Medium if 640 ≤ FICO < 740
and High if FICO ≥ 740. Other variables like the credit line and utilization rate are
not considered here.
Table 4.1: Illustrative data format of pooled credit card loans

    SegID  FICO    V       U  M  D  W
    1      Low     200501  0  3  0  988
    1      Low     200501  0  3  1  2
    1      Low     200501  3  4  0  993
    1      Low     200501  3  4  1  5
    1      Low     200501  4  5  0  979
    1      Low     200501  4  5  1  11
    1      Low     200501  .  .  .  .
    1      Low     200502  .  .  .  .
    1      Medium  .       .  .  .  .
    2      .       .       .  .  .  .

V: vintage. [U, M]: lifetime interval. D: status. W: counts.
In this simple case, one may aggregate the observations at the vintage level for
each FICO bucket in each segment. See Table 4.1 for an illustration of the counting
format of the input data, where each single vintage at each lifetime interval [U, M]
has two records of weight W, namely the number of survived loans (D = 0) and
the number of defaulted loans (D = 1). Note that the last column of Table 4.1
also illustrates that in the lifetime interval [3, 4], the 200501 vintage has 3 attrition
accounts (i.e. 993 −979 −11) which are examples of being randomly censored.
Let us consider the card vintages originated from year 2001 to 2008 with obser-
vations (a) left truncated at the beginning of 2005, (b) right censored by the end
of 2008, and (c) up censored by 48 months-on-book maximum. Refer to Figure 1.3
about the vintage diagram of dual-time data collection, which resembles the rectan-
gular Lexis diagram of Figure 4.1. First of all, we explore the empirical hazards of
the two hypothetical segments, regardless of the FICO variable.
We applied the one-way, two-way and three-way nonparametric methods devel-
oped in Section 4.2 for each segment of dual-time-to-default data. Note that given
the pooled data with weights W for D = 0, 1, it is rather straightforward to modify
Figure 4.4: Nonparametric analysis of credit card risk: (top) one-way empirical hazards, (middle) DtBreslow estimation, (bottom) MEV decomposition.
the iterative nonparametric estimators in (4.14) – (4.24), and one may also use the
weighted partial likelihood estimation based on our discussion in Section 4.4. The
numeric results for both segments are shown in Figure 4.4, where we may compare
the performances of two segments as follows.
1. The top panel shows the empirical hazard rates in both lifetime and calendar time. By such one-way estimates, Segment 1 has shown better lifetime performance (i.e. lower hazards) than Segment 2, and also better calendar-time performance than Segment 2 except for the latest couple of months.

2. The middle panel shows the results of the DtBreslow estimator that simultaneously calibrates the endogenous seasoning λ_f(m) and the exogenous cycle effects λ_g(t) under the two-way hazard model (4.16). Compared to the one-way calibration above, the seasoning effects are similar, but the exogenous cycle effects show a dramatic difference. Conditional on the endogenous baseline, the credit cycle of Segment 2 grows relatively steadily, while Segment 1 shows a sharp exogenous trend after t = 30 (mid-2007) and exceeds Segment 2 after about t = 40.

3. The bottom panel takes into account the vintage heterogeneity, where we consider only the yearly vintages for simplicity. The estimates of λ_f(m) and λ_g(t) are both affected after adding the vintage effects, since the more heterogeneity is explained by the vintage originations, the less variation is attributed to the endogenous and exogenous baselines. Specifically, the vintage performance in both segments has deteriorated from pre-2005 to 2007, and made a turn when entering 2008. Note that Segment 2 has a dramatic improvement of vintage originations in 2008.
Next, consider the three FICO buckets for each segment, where we still take the nonparametric approach for each of the 2 × 3 segment-by-bucket combinations. The MEV decompositions of the empirical hazards are shown in Figure 4.5. It is found that in lifetime the low, medium and high FICO buckets have decreasing levels of endogenous baseline hazard rates, where for Segment 2 the hazard rates are evidently non-proportional. The exogenous cycles of the low, medium and high FICO buckets co-move (more or less) within each segment. As for the vintage heterogeneity, it is interesting to note that for Segment 1 the high FICO vintages keep deteriorating from pre-2005 to 2008, which is a risky signal. The vintage heterogeneity in the other FICO buckets is mostly consistent with the overall pattern in Figure 4.4.

Figure 4.5: MEV decomposition of credit card risk for low, medium and high FICO buckets: Segment 1 (top panel); Segment 2 (bottom panel).

This example mainly illustrates the effectiveness of the dual-time nonparametric methods when there are limited or no covariates provided. With the next example of mortgage loans, we shall demonstrate the dual-time Cox regression modeling with various covariate effects on competing risks.
4.5.2 Mortgage Competing Risks
The U.S. residential mortgage market is huge. As of March 2008, the first-lien mortgage debt was estimated to be about $10 trillion (for 54.7 million loans), distributed as $1.2 trillion of subprime mortgages, $8.2 trillion of prime and near-prime combined, and the remainder government guaranteed; see the U.S. Federal Reserve report by Frame, Lehnert and Prescott (2008). In brief, prime mortgages generally target borrowers with good credit histories and normal down payments, while subprime mortgages are made to borrowers with poor credit and high leverage.
The rise in mortgage defaults is unprecedented. Mayer, Pence and Sherlund (2009)
described the rising defaults in nonprime mortgages in terms of underwriting stan-
dards and macroeconomic factors. Quantitatively, Sherlund (2008) considered sub-
prime mortgages by proportional hazards models (4.37) with presumed fixed lifetime
baselines (therefore, a parametric setting). One may refer to Gerardi, et al. (2008)
and Goodman, et al. (2008) for some further details of the subprime issue. In
retrospect, the 2007-08 collapse of mortgage market has rather complicated causes,
including the loose underwriting standards, the increased leverage, the originate-then-
securitize incentives (a.k.a. moral hazard), the fall of home prices, and the worsening
unemployment rates.
We make an attempt via dual-time survival analysis to understand the risk determinants of mortgage default and prepayment. Considered here is a tweaked sample of mortgage loans originated from year 2001 to 2007, with a Lexis diagram of observations (a) left truncated at the beginning of 2005, (b) right censored by October 2008, and (c) up censored by a 60 months-on-book maximum. These loans resemble the nonconforming mortgages in California, where the regional macroeconomic conditions have kept worsening since mid-2006; see Figure 4.6 for the plots of the 2000-08 home price indices and the unemployment rate, where we use the up-to-date data sources from S&P/Case-Shiller and the U.S. Bureau of Labor Statistics.
Figure 4.6: Home prices and unemployment rate of California state: 2000-2008.
Figure 4.7: MEV decomposition of mortgage hazards: default vs. prepayment
For a mortgage loan with competing risks caused by default and prepayment, the
observed event time corresponds to time to the first event or the censoring time.
One may refer to Lawless (2003; §9) for the modeling issues on survival analysis of
competing risks. One practically simple approach is to reduce the multiple events to
individual survival analysis by treating alternative events as censored. Then, we may
fit two marginal hazard rate models for default and prepayment, respectively. One
may refer to Therneau and Grambsch (2000; ¸8) about robust variance estimation
for the marginal Cox models of competing risks.
We begin with nonparametric exploration of the endogenous and exogenous hazards, as well as the vintage heterogeneity effects. Similar to the credit card risk modeling above, the monthly vintages are grouped yearly into pre-2005, 2005, 2006 and 2007 originations. The fitting results of the MEV hazard decomposition are shown in Figure 4.7 for both default and prepayment. Conditional on the estimated endogenous baseline hazards of default or prepayment, we find that

1. The exogenous cycle of the default risk had been low (with exogenous multiplier less than 1) prior to t = 20 (August 2006), then rose sharply to about 50 times riskier at about t = 43 (July 2008). In contrast, the exogenous cycle of the prepayment risk has shown a modest down-turn trend during the same calendar time period.

2. The vintage heterogeneity plot implies that the 2005-06 originations have a much higher probability of default than the mortgage originations in other years. In contrast, the vintage heterogeneity of prepayment keeps decreasing from pre-2005 to 2007. These findings match the empirical evidence presented recently by Gerardi, et al. (2008).
Compare the estimated exogenous effects on default hazards to the macroeconomic
conditions in Figure 4.6. It is obvious that the rise in mortgage defaults follows closely
the fall of home prices and the worsening unemployment rate (i.e., defaults are negatively correlated with the home price level and positively correlated with the unemployment rate). However, based on our demonstration data, it is not possible
to quantify what percentage of exogenous effects should be attributed to either house
prices or unemployment, because (a) the house prices and unemployment in California
are themselves highly correlated from 2005 to 2008, and (b) the exogenous effects on
default hazards could be also caused by other macroeconomic indicators. One may
of course perform multivariate time series analysis in order to study the correlation
between exogenous hazards and macroeconomic indicators, which is beyond the scope
of the present thesis.
Next, we try to decode the vintage effect of unobserved heterogeneity by adding the loan-level covariates. For demonstration purposes, we consider only the few mortgage covariates listed in Table 4.2, where CLTV stands for the combined loan-to-value leverage ratio, FICO is the same measurement of credit score as in the credit card case, and the other underwriting covariates upon loan origination can be referred to Goodman, et al. (2008; §1) or Wikipedia. We also include the mortgage note rate, either fixed or floating (i.e., adjustable), to demonstrate the capability of Cox modeling with time-varying covariates. Note that the categorical variable settings are simplified in Table 4.2, and they could be much more complicated in real situations. The mean, standard deviation and range statistics presented for the continuous covariates CLTV, FICO and NoteRate are calculated from our tweaked sample.
The dual-time Cox regression models considered here take the form of (4.4) with both endogenous and exogenous baselines. After incorporating the underwriting and dynamic covariates in Table 4.2, we have
\[
\lambda^{(q)}_{ji}(m,t) = \lambda^{(q)}_f(m)\,\lambda^{(q)}_g(t)\exp\big\{\theta^{(q)}_1\mathrm{CLTV}_{ji} + \theta^{(q)}_2\mathrm{FICO}_{ji} + \theta^{(q)}_3\mathrm{NoteRate}_{ji}(m) + \theta^{(q)}_4\mathrm{DocFull}_{ji} + \theta^{(q)}_5\mathrm{IO}_{ji} + \theta^{(q)}_{6.p}\mathrm{Purpose.purchase}_{ji} + \theta^{(q)}_{6.r}\mathrm{Purpose.refi}_{ji}\big\}, \qquad (4.53)
\]
where the superscript (q) indicates the competing-risk hazards of default or prepayment, the continuous covariates FICO, CLTV and NoteRate are all scaled to have zero mean and unit variance, and the 3-level Purpose covariate is broken into 2 dummy variables (Purpose.p and Purpose.r against the reference level of cash-out refi). For simplicity, we have assumed log-linear effects of all 3 continuous covariates on both default and prepayment hazards, while in practice one may need to strive to find the appropriate functional forms or break them into multiple buckets.

Table 4.2: Loan-level covariates considered in mortgage credit risk modeling, where NoteRate could be dynamic and others are static upon origination.

    CLTV           µ = 78, σ = 13 and range [10, 125]
    FICO           µ = 706, σ = 35 and range [530, 840]
    NoteRate       µ = 6.5, σ = 1.1 and range [1, 11]
    Documentation  full or limited
    IO Indicator   Interest-Only or not
    Loan Purpose   purchase, refi or cash-out refi

Table 4.3: Maximum partial likelihood estimation of mortgage covariate effects in dual-time Cox regression models.

    Covariate   Default   Prepayment
    CLTV          2.841     -0.285
    FICO         -0.500      0.385
    NoteRate      1.944      0.2862
    DocFull      -1.432     -0.091
    IO            1.202      0.141
    Purpose.p    -1.284      0.185
    Purpose.r    -0.656     -0.307

By the partial likelihood estimation discussed in Section 4.4, we obtain the numerical results tabulated in Table 4.3, where the coefficient estimates are all statistically significant based on our demonstration data (some other non-significant covariates were actually screened out and not considered here). From Table 4.3, we find that conditional on the nonparametric dual-time baselines,
1. The default risk is driven up by high CLTV, low FICO, high NoteRate, limited
documentation, Interest-Only, and cash-out refi Purpose.
2. The prepayment risk is driven up by low CLTV, high FICO, high NoteRate,
limited documentation, Interest-Only, and purchase Purpose.
3. The covariates CLTV, FICO and Purpose reveal the competing nature of the default and prepayment risks. It is actually reasonable to claim that a homeowner with good credit and low leverage would be more likely to prepay than to default.
4.6 Summary
This chapter is devoted completely to the survival analysis on Lexis diagram
with coverage of nonparametric estimators, structural parameterization and semi-
parametric Cox regression. For the structural approach based on inverse Gaussian
first-passage-time distribution, we have discussed the feasibility of its dual-time ex-
tension and the calibration procedure, while its implementation is open to future
investigation. Computational results are provided for other dual-time methods devel-
oped in this chapter, including assessment of the methodology through the simulation
study. Finally, we have demonstrated the application of dual-time survival analysis to
the retail credit risk modeling. Some interesting findings in credit card and mortgage
risk analysis are presented, which may shed some light on understanding the ongoing
credit crisis from a new dual-time perspective.
To end this thesis, we conclude that statistical methods play a key role in credit
risk modeling. Developed in this thesis is mainly the dual-time analytics, including
VDA (vintage data analysis) and DtSA (dual-time survival analysis), which can be
applied to model both corporate and retail credit. We have also remarked throughout
the chapters the possible extensions and other open problems of interest. So, the end
is also a new beginning . . .
4.7 Supplementary Materials
(A) Calibration of Inverse Gaussian Hazard Function
Consider the calibration of the two-parameter inverse Gaussian hazard rate function λ(m; c, b) of the form (4.26) from the n observations (U_i, M_i, ∆_i), i = 1, ..., n, each subject to left truncation and right censoring. Knowing the parametric likelihood given by (4.27), the MLE requires nonlinear optimization that can be carried out by gradient search (e.g. the Newton-Raphson algorithm). Provided below are some calculations of the derivatives for the gradient search purpose.
Denote by η = (c, b)^T the structural parameters of the inverse Gaussian hazard. The log-likelihood function is given by
\[
\ell = \sum_{i=1}^{n}\Big\{\Delta_i\log\lambda(M_i;\eta) - \int_{U_i}^{M_i}\lambda(s;\eta)\,ds\Big\}. \qquad (4.54)
\]
Taking the first- and second-order derivatives, we have the score and Hessian functions
\[
\frac{\partial\ell}{\partial\eta} = \sum_{i=1}^{n}\Big\{\Delta_i\frac{\lambda'(M_i;\eta)}{\lambda(M_i;\eta)} - \int_{U_i}^{M_i}\lambda'(s;\eta)\,ds\Big\}
\]
\[
\frac{\partial^2\ell}{\partial\eta\,\partial\eta^T} = \sum_{i=1}^{n}\Big\{\Delta_i\Big[\frac{\lambda''(M_i;\eta)}{\lambda(M_i;\eta)} - \frac{(\lambda'(M_i;\eta))^{\otimes 2}}{\lambda^2(M_i;\eta)}\Big] - \int_{U_i}^{M_i}\lambda''(s;\eta)\,ds\Big\}
\]
where λ′(m; η) = ∂λ(m; η)/∂η, λ″(m; η) = ∂²λ(m; η)/∂η∂η^T, and a^{⊗2} = aa^T. By supplying the gradients, the optimization algorithms in most standard software packages are faster and more reliable. Note that the evaluation of the integral terms above can be carried out by numerical integration methods like the composite trapezoidal or Simpson's rule; see e.g. Press, et al. (2007). Alternatively, one may replace the second term of (4.54) by log S(M_i; η) − log S(U_i; η) and take derivatives based on the explicit expressions below.
The complications lie in the evaluations of ∂λ(m; η)/∂η and ∂²λ(m; η)/∂η∂η^T based on the inverse Gaussian baseline hazard (4.26), which are, messy though, provided below. Denoting
\[
z_1 = \frac{\eta_1+\eta_2 m}{\sqrt{m}}, \qquad z_2 = \frac{-\eta_1+\eta_2 m}{\sqrt{m}}, \qquad \phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2},
\]
evaluate first the derivatives of the survival function (i.e. the denominator of (4.26)):
\[
\frac{\partial S(m;\eta)}{\partial\eta_1} = 2\eta_2 e^{-2\eta_1\eta_2}\Phi(z_2) + \frac{1}{\sqrt{m}}\big[\phi(z_1) + e^{-2\eta_1\eta_2}\phi(z_2)\big]
\]
\[
\frac{\partial S(m;\eta)}{\partial\eta_2} = 2\eta_1 e^{-2\eta_1\eta_2}\Phi(z_2) + \sqrt{m}\big[\phi(z_1) - e^{-2\eta_1\eta_2}\phi(z_2)\big]
\]
\[
\frac{\partial^2 S(m;\eta)}{\partial\eta_1^2} = -4\eta_2^2 e^{-2\eta_1\eta_2}\Phi(z_2) - \frac{4\eta_2}{\sqrt{m}}e^{-2\eta_1\eta_2}\phi(z_2) - \frac{1}{m}\big[\phi(z_1)z_1 - e^{-2\eta_1\eta_2}\phi(z_2)z_2\big]
\]
\[
\frac{\partial^2 S(m;\eta)}{\partial\eta_1\partial\eta_2} = (2-4\eta_1\eta_2)\,e^{-2\eta_1\eta_2}\Phi(z_2) - \big[\phi(z_1)z_1 - e^{-2\eta_1\eta_2}\phi(z_2)z_2\big]
\]
\[
\frac{\partial^2 S(m;\eta)}{\partial\eta_2^2} = -4\eta_1^2 e^{-2\eta_1\eta_2}\Phi(z_2) + 4\eta_1\sqrt{m}\,e^{-2\eta_1\eta_2}\phi(z_2) - m\big[\phi(z_1)z_1 - e^{-2\eta_1\eta_2}\phi(z_2)z_2\big].
\]
Then, the partial derivatives of λ(m; η) can be evaluated explicitly by
\[
\frac{\partial\lambda(m;\eta)}{\partial\eta_1} = -\lambda(m;\eta)\Big[\frac{\eta_1}{m}+\eta_2-\frac{1}{\eta_1}\Big] - \frac{\lambda(m;\eta)}{S(m;\eta)}\frac{\partial S(m;\eta)}{\partial\eta_1}
\]
\[
\frac{\partial\lambda(m;\eta)}{\partial\eta_2} = -\lambda(m;\eta)\,(\eta_1+\eta_2 m) - \frac{\lambda(m;\eta)}{S(m;\eta)}\frac{\partial S(m;\eta)}{\partial\eta_2}
\]
\[
\frac{\partial^2\lambda(m;\eta)}{\partial\eta_1^2} = -\frac{\partial\lambda(m;\eta)}{\partial\eta_1}\Big[\frac{\eta_1}{m}+\eta_2-\frac{1}{\eta_1}\Big] - \lambda(m;\eta)\Big[\frac{1}{\eta_1^2}+\frac{1}{m}\Big] - \frac{\lambda(m;\eta)}{S(m;\eta)}\frac{\partial^2 S(m;\eta)}{\partial\eta_1^2} - \frac{1}{S(m;\eta)}\frac{\partial\lambda(m;\eta)}{\partial\eta_1}\frac{\partial S(m;\eta)}{\partial\eta_1} + \frac{\lambda(m;\eta)}{S^2(m;\eta)}\Big[\frac{\partial S(m;\eta)}{\partial\eta_1}\Big]^2
\]
\[
\frac{\partial^2\lambda(m;\eta)}{\partial\eta_2^2} = -\frac{\partial\lambda(m;\eta)}{\partial\eta_2}\,(\eta_1+\eta_2 m) - \lambda(m;\eta)\,m - \frac{\lambda(m;\eta)}{S(m;\eta)}\frac{\partial^2 S(m;\eta)}{\partial\eta_2^2} - \frac{1}{S(m;\eta)}\frac{\partial\lambda(m;\eta)}{\partial\eta_2}\frac{\partial S(m;\eta)}{\partial\eta_2} + \frac{\lambda(m;\eta)}{S^2(m;\eta)}\Big[\frac{\partial S(m;\eta)}{\partial\eta_2}\Big]^2
\]
\[
\frac{\partial^2\lambda(m;\eta)}{\partial\eta_1\partial\eta_2} = -\frac{\partial\lambda(m;\eta)}{\partial\eta_1}\,(\eta_1+\eta_2 m) - \lambda(m;\eta) - \frac{\lambda(m;\eta)}{S(m;\eta)}\frac{\partial^2 S(m;\eta)}{\partial\eta_1\partial\eta_2} - \frac{1}{S(m;\eta)}\frac{\partial\lambda(m;\eta)}{\partial\eta_1}\frac{\partial S(m;\eta)}{\partial\eta_2} + \frac{\lambda(m;\eta)}{S^2(m;\eta)}\frac{\partial S(m;\eta)}{\partial\eta_1}\frac{\partial S(m;\eta)}{\partial\eta_2}.
\]
(B) Implementation of Cox Modeling with Time-varying Covariates
Consider the Cox regression model with time-varying covariates x(m) ∈ R^p:
\[
\lambda(m; x(m)) = \lambda_0(m)\exp\{\theta^T x(m)\} \qquad (4.55)
\]
where λ_0(m) is an arbitrary baseline hazard function and θ is the p-vector of regression coefficients. Provided below is the implementation of maximum likelihood estimation of θ and λ_0(m) by tricking it into a standard CoxPH modeling with time-independent covariates.

Let τ_i be the event time for each account i = 1, ..., n that is subject to left truncation U_i and right censoring C_i. Given the observations
\[
(U_i, M_i, \Delta_i, x_i(m)),\; m\in[U_i, M_i], \qquad i = 1,\ldots,n, \qquad (4.56)
\]
in which M_i = min{τ_i, C_i} and ∆_i = I(τ_i < C_i), the complete likelihood is given by
\[
L = \prod_{i=1}^{n}\left[\frac{f_i(M_i)}{S_i(U_i)}\right]^{\Delta_i}\left[\frac{S_i(M_i+)}{S_i(U_i)}\right]^{1-\Delta_i} = \prod_{i=1}^{n}\big[\lambda_i(M_i)\big]^{\Delta_i} S_i(M_i+)\,S_i^{-1}(U_i). \qquad (4.57)
\]
Under the dynamic Cox model (4.55), the likelihood function is
\[
L(\theta,\lambda_0) = \prod_{i=1}^{n}\big[\lambda_0(M_i)\exp\{\theta^T x_i(M_i)\}\big]^{\Delta_i}\exp\Big\{-\int_{U_i}^{M_i}\lambda_0(s)\exp\{\theta^T x_i(s)\}\,ds\Big\}. \qquad (4.58)
\]
Taking λ_0 as the nuisance parameter, one may estimate θ by the partial likelihood; see Cox (1972, 1975) or Section 4.4.
The model estimation with time-varying covariates can be implemented by standard software packages that handle only time-independent covariates by default. The trick is to construct little time segments such that x_i(m) can be viewed as constant within each segment. See Figure 4.8 for an illustration with the construction of monthly segments.

Figure 4.8: Split time-varying covariates into little time segments, illustrated.

For each account i, let us partition m ∈ [U_i, M_i] as U_i ≡ U_{i,0} < U_{i,1} < · · · < U_{i,L_i} ≡ M_i such that
\[
x_i(m) = x_{i,l}, \qquad m\in(U_{i,l-1}, U_{i,l}]. \qquad (4.59)
\]
Thus, we have converted the n-sample survival data (4.56) to the following Σ_{i=1}^n L_i independent observations, each with constant covariates:
\[
(U_{i,l-1},\, U_{i,l},\, \Delta_{i,l},\, x_{i,l}), \qquad l = 1,\ldots,L_i,\;\; i = 1,\ldots,n \qquad (4.60)
\]
where ∆_{i,l} = ∆_i for l = L_i and 0 otherwise.

It is straightforward to show that the likelihood (4.57) can be reformulated as
\[
L = \prod_{i=1}^{n}\big[\lambda_i(M_i)\big]^{\Delta_i}\prod_{l=1}^{L_i} S_i(U_{i,l})\,S_i^{-1}(U_{i,l-1}) = \prod_{i=1}^{n}\prod_{l=1}^{L_i}\big[\lambda_i(U_{i,l})\big]^{\Delta_{i,l}} S_i(U_{i,l})\,S_i^{-1}(U_{i,l-1}), \qquad (4.61)
\]
which becomes the likelihood function based on the enlarged data (4.60). Therefore, under the assumption (4.59), the parameter estimation by maximizing (4.58) can be equivalently handled by maximizing
\[
L(\theta,\lambda_0) = \prod_{i=1}^{n}\prod_{l=1}^{L_i}\big[\lambda_0(U_{i,l})\exp\{\theta^T x_{i,l}\}\big]^{\Delta_{i,l}}\exp\Big\{-\exp\{\theta^T x_{i,l}\}\int_{U_{i,l-1}}^{U_{i,l}}\lambda_0(s)\,ds\Big\}. \qquad (4.62)
\]
In R, one may perform the MLE directly by applying the CoxPH procedure (R:survival package) to the transformed data (4.60); see Therneau and Grambsch (2000) for computational details.
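For illustration, a minimal R sketch of the segment construction (4.59)-(4.60) is given below; it expands a single account's record into monthly rows, with the piecewise-constant covariate columns x_{i,l} to be attached afterwards (all names are illustrative assumptions).

    ## expand one account (U, M, delta) into monthly (start, stop] segments
    expand.account <- function(i, U, M, delta) {
      cuts  <- seq(ceiling(U), floor(M))        # candidate monthly cut points
      cuts  <- cuts[cuts > U & cuts < M]        # keep interior cuts only
      start <- c(U, cuts)
      stop  <- c(cuts, M)
      data.frame(id = i, start = start, stop = stop,
                 event = c(rep(0, length(start) - 1), delta))   # Delta_{i,l} as in (4.60)
    }
    ## after merging the covariates x_{i,l}, fit e.g.
    ## coxph(Surv(start, stop, event) ~ x1 + x2, data = expanded)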
Furthermore, one may trick the implementation of MLE into a logistic regression, provided that the underlying hazard rate function (4.55) admits the linear approximation up to the logit link:
\[
\mathrm{logit}(\lambda_i(m)) \approx \eta^T\phi(m) + \theta^T x_i(m), \qquad i = 1,\ldots,n \qquad (4.63)
\]
where logit(x) = log(x/(1 − x)) and φ(m) is a vector of pre-specified basis functions (e.g. splines). Given the survival observations (4.60) with piecewise-constant covariates, it is easy to show that the discrete version of the joint likelihood we have formulated in Section 4.2 can be rewritten as
\[
L = \prod_{i=1}^{n}\big[\lambda_i(M_i)\big]^{\Delta_i}\big[1-\lambda_i(M_i)\big]^{1-\Delta_i}\prod_{m\in(U_i,M_i)}\big[1-\lambda_i(m)\big]
= \prod_{i=1}^{n}\prod_{l=1}^{L_i}\big[\lambda_i(U_{i,l})\big]^{\Delta_{i,l}}\big[1-\lambda_i(U_{i,l})\big]^{1-\Delta_{i,l}} \qquad (4.64)
\]
where ∆_{i,l} are the same as in (4.60). This is clearly the likelihood function of the independent binary data (∆_{i,l}, x_{i,l}) for l = 1, ..., L_i and i = 1, ..., n, where x_{i,l} are constructed by (4.59). Therefore, under the parameterized logistic model (4.63), one may perform the MLE for the parameters (η, θ) by a standard GLM procedure; see e.g. Faraway (2006).
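As a rough sketch under the logistic approximation (4.63), one could fit the expanded binary data of (4.60) with a standard GLM in R; the spline basis ns() stands in for the basis vector φ(m), and the data frame and column names are illustrative assumptions.

    library(splines)
    ## one binary row per account-segment: event in (U_{i,l-1}, U_{i,l}] or not
    fit.glm <- glm(event ~ ns(stop, df = 4) + x1 + x2,
                   family = binomial(link = "logit"), data = expanded)
    ## fitted discrete-time hazards on the same segment grid
    haz.hat <- predict(fit.glm, type = "response")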
(C) Frailty-type Vintage Effects in Section 4.4.3
The likelihood function. By the independence of Z_1, ..., Z_K, it is easy to write down the likelihood as
\[
L = \prod_{\kappa=1}^{K}\mathrm{E}_{Z_\kappa}\bigg[\prod_{j\in G_\kappa}\prod_{i=1}^{n_j}\big[\lambda_{ji}(M_{ji}\,|\,Z_\kappa)\big]^{\Delta_{ji}}\exp\Big\{-\int_{U_{ji}}^{M_{ji}}\lambda_{ji}(s\,|\,Z_\kappa)\,ds\Big\}\bigg]. \qquad (4.65)
\]
Substituting λ_ji(m | Z_κ) = λ_f(m) Z_κ exp{ξ̃^T x̃_ji(m)}, we obtain for each G_κ the likelihood contribution
\[
L_\kappa = \prod_{j\in G_\kappa}\prod_{i=1}^{n_j}\big[\lambda_f(M_{ji})\exp\{\tilde\xi^T\tilde x_{ji}(M_{ji})\}\big]^{\Delta_{ji}}\;\mathrm{E}_Z\big[Z^{d_\kappa}\exp\{-Z A_\kappa\}\big] \qquad (4.66)
\]
in which d_κ = Σ_{j∈G_κ} Σ_{i=1}^{n_j} ∆_ji and A_κ = Σ_{j∈G_κ} Σ_{i=1}^{n_j} ∫_{U_ji}^{M_ji} λ_f(s) exp{ξ̃^T x̃_ji(s)} ds. The expectation term in (4.66) can be derived from the d_κ-th derivative of the Laplace transform of Z, which satisfies E_Z[Z^{d_κ} exp{−Z A_κ}] = (−1)^{d_κ} L^{(d_κ)}(A_κ); see e.g. Aalen, et al. (2008; §7). Thus, we obtain the joint likelihood
\[
L = \prod_{\kappa=1}^{K}\prod_{j\in G_\kappa}\prod_{i=1}^{n_j}\big[\lambda_f(M_{ji})\exp\{\tilde\xi^T\tilde x_{ji}(M_{ji})\}\big]^{\Delta_{ji}}\,(-1)^{d_\kappa}L^{(d_\kappa)}(A_\kappa), \qquad (4.67)
\]
or the log-likelihood function of the form (4.51).
EM algorithm for fraitly model estimation.
1. E-step: given the fixed values of (
¯
ξ, λ
f
), estimate the individual frailty levels
¦Z
κ
¦
K
κ=1
by the conditional expectation
´
Z
κ
= E[Z
κ
[(
κ
] = −
L
(dκ+1)
(A
κ
)
L
(dκ)
(A
κ
)
(4.68)
which corresponds to the empirical Bayes estimate; see Aalen, et al. (2008;
¸7.2.3). For the gamma frailty Gamma(δ
−1
, δ) with L
(d)
(x) = (−δ)
d
(1 +
135
δx)
−1/δ−d

d
i=1
(1/δ + i −1), the frailty level estimates are given by
´
Z
κ
=
1 + δd
κ
1 + δA
κ
, κ = 1, . . . , K. (4.69)
2. M-step: given the frailty levels $\{Z_\kappa\}_{\kappa=1}^{K}$, estimate $(\tilde{\xi}, \lambda_f)$ by partial likelihood maximization and the Breslow estimator, with appropriate modification based on the fixed frailty multipliers:
$$\mathrm{PL}(\tilde{\xi}) = \prod_{j=1}^{J}\prod_{i=1}^{n_j} \Bigg[\frac{\exp\big\{\tilde{\xi}^T \tilde{x}_{ji}(M_{ji})\big\}}{\sum_{(j',i')\in\mathcal{R}_{ji}} Z_{\kappa(j')} \exp\big\{\tilde{\xi}^T \tilde{x}_{j'i'}(M_{ji})\big\}}\Bigg]^{\Delta_{ji}} \qquad (4.70)$$
$$\hat{\Lambda}_f(m) = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \frac{I(m \ge M_{ji})\,\Delta_{ji}}{\sum_{(j',i')\in\mathcal{R}_{ji}} Z_{\kappa(j')} \exp\big\{\hat{\tilde{\xi}}^T \tilde{x}_{j'i'}(M_{ji})\big\}}, \qquad (4.71)$$
where $\kappa(j)$ indicates the bucket $\mathcal{G}_\kappa$ to which $V_j$ belongs. In R, the partial likelihood maximization can be implemented by the standard CoxPH procedure with the fixed log-frailties supplied as offset terms; details omitted.
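To make the two steps concrete, a rough R sketch of the EM iteration for a fixed δ is given below. It is an illustration only, under simplifying assumptions: the data frame dat, its columns (U, M, event, covariates x1, x2, and the vintage-bucket label kappa) and the function name em_frailty are all hypothetical, a fixed number of iterations stands in for a convergence check, and the expected-events output of coxph is used to accumulate the A_κ terms.

    library(survival)

    em_frailty <- function(dat, delta, n_iter = 20) {
      dat$logZ <- 0                                  # start from Z_kappa = 1
      for (iter in seq_len(n_iter)) {
        ## M-step: partial likelihood (4.70) with the log-frailties as offsets
        fit <- coxph(Surv(U, M, event) ~ x1 + x2 + offset(logZ), data = dat)
        ## expected events per record; dividing out Z gives each A contribution
        dat$cumhaz <- predict(fit, type = "expected") * exp(-dat$logZ)
        ## E-step: empirical Bayes frailty estimates (4.69), bucket by bucket
        d_k <- tapply(dat$event,  dat$kappa, sum)
        A_k <- tapply(dat$cumhaz, dat$kappa, sum)
        Z_k <- (1 + delta * d_k) / (1 + delta * A_k)
        dat$logZ <- log(Z_k)[as.character(dat$kappa)]
      }
      list(fit = fit, frailty = Z_k)
    }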
To estimate the frailty parameter $\delta$, one may run the EM algorithm for each $\delta$ on a grid to obtain $(\tilde{\xi}^*(\delta), \lambda_f^*(\cdot\,;\delta))$, and then find the maximizer of the profiled log-likelihood (4.51).
BIBLIOGRAPHY
[1] Aalen, O., Borgan, O. and Gjessing, H. K. (2008). Survival and Event History
Analysis: A Process Point of View. Springer, New York.
[2] Aalen, O. and Gjessing, H. K. (2001). Understanding the shape of the hazard
rate: a process point of view (with discussion). Statistical Science, 16, 1–22.
[3] Abramovich, F. and Steinberg, D. M. (1996). Improved inference in nonparametric regression using L_k-smoothing splines. Journal of Statistical Planning and Inference, 49, 327–341.
[4] Abramowitz, M. and Stegun, I. A. (1972). Handbook of Mathematical Functions
(10th printing). New York: Dover.
[5] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
[6] Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting
processes: a large sample study. Annals of Statistics, 10, 1100–1120.
[7] Arjas, E. (1986). Stanford heart transplantation data revisited: a real time ap-
proach. In: Modern Statistical Methods in Chronic Disease Epidemiology (Mool-
gavkar, S. H. and Prentice, R. L., eds.), pp 65–81. Wiley, New York.
[8] Berlin, M. and Mester, L. J. (eds.) (2004). Special issue of “Retail credit risk
management and measurement”. Journal of Banking and Finance, 28, 721–899.
[9] Bielecki, T. R. and Rutkowski, M. (2004). Credit Risk: Modeling, Valuation and
Hedging. Springer-Verlag, Berlin.
[10] Bilias, Y., Gu, M. and Ying, Z. (1997). Towards a general asymptotic theory for
Cox model with staggered entry. Annals of Statistics, 25, 662–682.
[11] Black, F. and Cox, J. C. (1976). Valuing corporate securities: Some effects of
bond indenture provisions. Journal of Finance, 31, 351–367.
[12] Bluhm, C. and Overbeck, L. (2007) Structured Credit Portfolio Analysis, Baskets
and CDOs. Boca Raton: Chapman and Hall.
[13] Breeden, J. L. (2007). Modeling data with multiple time dimensions. Computa-
tional Statistics & Data Analysis, 51, 4761–4785.
[14] Breslow, N. E. (1972). Discussion of “Regression models and life-tables” by Cox, D. R. Journal of the Royal Statistical Society: Ser. B, 34, 216–217.
[15] Chen, L., Lesmond, D. A. and Wei, J. (2007). Corporate yield spreads and bond
liquidity. Journal of Finance, 62, 119–149.
[16] Chhikara, R. S. and Folks, J. L. (1989). The Inverse Gaussian Distribution:
Theory, Methodology, and Applications. Dekker, New York.
[17] Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Sta-
tistical Society. Series B, 34, 187–220.
[18] Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
[19] Cox, J. C., Ingersoll, J. E. and Ross, S. A. (1985). A theory of the term structure
of interest rates. Econometrica, 53, 385–407.
[20] Cressie, N. A. (1993). Statistics for Spatial Data. New York: Wiley.
[21] Das, S. R., Duffie, D., Kapadia, N. and Saita, L. (2007). Common failings: how
corporate defaults are correlated. Journal of Finance, 62, 93–117.
[22] Deng, Y., Quigley, J. M. and Van Order, R. (2000). Mortgage terminations,
heterogeneity and the exercise of mortgage options. Econometrica, 68, 275–307.
[23] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika, 81 (3), 425-455.
[24] Duchesne, T. and Lawless, J. F. (2000). Alternative time scales and failure time
models. Lifetime Data Analysis, 6, 157–179.
[25] Duffie, D., Pan, J. and Singleton, K. (2000). Transform analysis and option
pricing for affine jump-diffusions. Econometrica, 68, 1343–1376.
[26] Duffie, D. and Singleton, K. J. (2003). Credit Risk: Pricing, Measurement, and
Management. Princeton University Press.
[27] Duffie, D., Saita, L. and Wang, K. (2007). Multi-period corporate default pre-
diction with stochastic covariates. Journal of Financial Economics, 83, 635–665.
[28] Duffie, D., Eckner, A., Horel, G. and Saita, L. (2008). Frailty correlated default.
Journal of Finance, to appear.
[29] Efron, B. (2002). The two-way proportional hazards model. Journal of the Royal
Statistical Society: Ser. B, 64, 899–909.
[30] Esper, J., Cook, E. R. and Schweingruber, F. H. (2002). Low-frequency signals in long tree-ring chronologies for reconstructing past temperature variability. Science, 295, 2250–2253.
[31] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications.
Chapman & Hall, Boca Raton.
[32] Fan, J., Huang, T. and Li, R. (2007). Analysis of longitudinal data with semi-
parametric estimation of covariance function. Journal of the American Statistical
Association, 102, 632–641.
[33] Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Para-
metric Methods. Springer, New York.
[34] Fang, K. T., Li, R. and Sudjianto, A. (2006). Design and Modeling for Computer Experiments. Boca Raton: Chapman and Hall.
[35] Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear,
Mixed Effects and Nonparametric Regression Models. Boca Raton: Chapman and
Hall.
[36] Farewell, V. T. and Cox, D. R. (1979). A note on multiple time scales in life
testing. Appl. Statist., 28, 73–75.
[37] Frame, S., Lehnert, A. and Prescott, N. (2008). A snapshot of mortgage condi-
tions with an emphasis on subprime mortgage performance. U.S. Federal Reserve,
August 2008.
[38] Friedman, J. H. (1991). Multivariate adaptive regression splines (with discus-
sion). Annals of Statistics, 19, 1–141.
[39] Fu, W. J. (2000). Ridge estimator in singular design with applications to age-
period-cohort analysis of disease rates. Communications in Statistics – Theory
and Method, 29, 263–278.
[40] Fu, W. J. (2008). A smoothing cohort model in age-period-cohort analysis with
applications to homicide arrest rates and lung cancer mortality rates. Sociological
Methods and Research, 36, 327–361.
[41] Gerardi, K., Sherlund, S. M., Lehnert, A. and Willen, P. (2008). Making sense of
the subprime crisis. Brookings Papers on Economic Activity, 2008 (2), 69–159.
[42] Giesecke, K. (2006). Default and information. Journal of Economic Dynamics
and Control, 30, 2281–2303.
[43] Golub, G. H., Heath, M. and Wahba G. (1979). Generalized cross-validation as
a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
[44] Goodman, L. S., Li, S., Lucas, D. J., Zimmerman, T. A. and Fabozzi, F. J.
(2008). Subprime Mortgage Credit Derivatives. Wiley, Hoboken NJ.
[45] Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process mod-
els with an application to computer modeling. Journal of the American Statistical
Association, 103, 1119–1130.
[46] Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Gen-
eralized Linear Models. Chapman & Hall, London.
[47] Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.
[48] Gu, M. G. and Lai, T. L. (1991). Weak convergence of time-sequential cen-
sored rank statistics with applications to sequential testing in clinical trials. Ann.
Statist., 19, 1403–1433.
[49] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. New York:
Chapman and Hall.
[50] Hougaard, P. (2000). Analysis of Multivariate Survival Data. Springer, New York.
[51] Jarrow, R. and Turnbull, S. (1995). Pricing derivatives on financial securities
subject to credit risk. Journal of Finance, 50, 53–85.
[52] Keiding, N. (1990). Statistical inference in the Lexis diagram. Phil. Trans. R.
Soc. Lond. A, 332, 487–509.
[53] Kimeldorf, G. and Wahba, G. (1970). A correspondence between Bayesian esti-
mation of stochastic processes and smoothing by splines. Ann. Math. Stat., 41,
495–502.
[54] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline
functions. J. Math. Anal. Appl., 33, 82–95.
[55] Kupper, L. L., Janis, J. M., Karmous, A. and Greenberg, B. G. (1985). Statistical
age-period-cohort analysis: a review and critique. Journal of Chronic Disease,
38, 811–830.
[56] Lando, D. (1998). On Cox processes and credit-risky securities. Review of Deriva-
tives Research 2, 99–120.
[57] Lando, D. (2004). Credit Risk Modeling: Theory and Applications. Princeton
University Press.
[58] Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. Wiley,
Hoboken NJ.
[59] Lee, M.-L. T. and Whitmore, G. A. (2006). Threshold regression for survival
analysis: modeling event times by a stochastic process reaching a boundary.
Statistical Science, 21, 501–513.
[60] Lexis, W. (1875). Einleitung in die Theorie der Bevölkerungsstatistik. Strassburg: Trübner. Pages 5–7 translated to English by Keyfitz, N. and printed in Mathematical Demography (eds. Smith, D. and Keyfitz, N., 1977), Springer, Berlin.
[61] Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines. Journal of the American
Statistical Association, 92, 107–116.
[62] Madan, D. and Unal, H. (1998). Pricing the risks of default. Review of Derivatives
Research, 2, 121–160.
[63] Mason, K. O., Mason, W. H., Winsborough, H. H. and Poole, K. (1973). Some
methodological issues in cohort analysis of archival data. American Sociological
Review, 38, 242–258.
[64] Mason, W. H. and Wolfinger, N. H. (2002). Cohort analysis. In: International
Encyclopedia of the Social and Behavioral Sciences, 2189–2194. New York: Else-
vier.
[65] Marshall, A. W. and Olkin, I. (2007). Life Distributions: Structure of Nonparametric, Semiparametric, and Parametric Families. Springer.
[66] Mayer, C., Pence, K. and Sherlund, S. M. (2009). The rise in mortgage defaults.
Journal of Economic Perspectives, 23, 27–50.
[67] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.).
London: Chapman and Hall.
[68] Merton, R. C. (1974). On the pricing of corporate debt: The risk structure of
interest rates. Journal of Finance, 29, 449–470.
[69] Moody’s Special Comment: Corporate Default and Recovery Rates, 1920-2008.
Data release: February 2009.
[70] Müller, H. G. and Stadtmüller, U. (1987). Variable bandwidth kernel estimators of regression curves. Annals of Statistics, 15, 282–301.
[71] Nelsen, R. B. (2006). An introduction to copulas (2nd ed.). Springer, New York.
[72] Nielsen, G. G., Gill, R. D., Andersen, P. K. and Sørensen, T. I. A. (1992). A
counting process approach to maximum likelihood estimation in frailty models. Scand. J. Statist., 19, 25–43.
[73] Nychka, D. W. (1988). Bayesian confidence intervals for a smoothing spline.
Journal of The American Statistical Association, 83, 1134–1143.
[74] Oakes, D. (1995). Multiple Time Scales in Survival Analysis. Lifetime Data Anal-
ysis, 1, 7–18.
[75] Pintore, A., Speckman, P. and Holmes, C. C. (2006). Spatially adaptive smooth-
ing splines. Biometrika, 93, 113–125.
[76] Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007)
Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge Uni-
versity Press.
[77] Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of
kernel functions. Ann. Statist., 11, 453–466.
[78] Ramsay, J. O. and Li, X. (1998). Curve registration. Journal of the Royal Sta-
tistical Society: Ser. B, 60, 351–363.
[79] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine
Learning. The MIT Press.
[80] Rice, J. and Rosenblatt, M. (1983). Smoothing splines: regression, derivatives
and deconvolution. Annals of Statistics, 11, 141–156.
[81] Ruppert, D. and Carroll, R. J. (2000). Spatially-adaptive penalties for spline
fitting. Aust. N. Z. J. Statist., 42(2), 205–223.
[82] Schrödinger, E. (1915). Zur Theorie der Fall- und Steigversuche an Teilchen mit Brownscher Bewegung. Physikalische Zeitschrift, 16, 289–295.
[83] Sellke, T. and Siegmund, D. (1983). Sequential analysis of the proportional haz-
ards model. Biometrika, 70, 315–326.
[84] Shepp, L. A. (1966). Radon-Nikodym derivatives of Gaussian measures. Annals
of Mathematical Statistics, 37, 321–354.
[85] Sherlund, S. M. (2008). The past, present, and future of subprime mortgages.
Finance and Economics Discussion Series, 2008-63, U.S. Federal Reserve Board.
[86] Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-
parametric curve fitting (with discussion). Journal of the Royal Statistical Soci-
ety: Ser. B, 47, 1–52.
[87] Slud, E. V. (1984). Sequential linear rank tests for two-sample censored survival
data. Ann. Statist., 12, 551–571.
[88] Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging.
New York: Springer-Verlag.
[89] Sudjianto, A., Li, R. and Zhang, A. (2006). GAMEED documentation. Technical
report. Enterprise Quantitative Risk Management, Bank of America, N.A.
[90] Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extend-
ing the Cox Model. Springer, New York.
[91] Tweedie, M. C. K. (1957). Statistical Properties of Inverse Gaussian Distribu-
tions. I and II. Annals of Mathematical Statistics, 28, 362–377 and 696–705.
[92] Vaupel, J. W., Manton, K. G. and Stallard, E. (1979). The impact of heterogene-
ity in individual frailty on the dynamics of mortality. Demography 16, 439–454.
[93] Vasicek, O. (1977). An equilibrium characterization of the term structure. Jour-
nal of Financial Economics, 5, 177–188.
[94] Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley, New York.
[95] Wahba, G. (1983). Bayesian confidence intervals for the cross-validated smooth-
ing spline. Journal of the Royal Statistical Society: Ser. B, 45, 133–150.
[96] Wahba, G. (1990). Spline Models for Observational Data, SSBMS-NSF Regional
Conference Series in Applied Mathematics. Philadelphia: SIAM.
[97] Wahba, G. (1995). Discussion of “Wavelet shrinkage: asymptopia” by Donoho,
et al. Journal of The Royal Statistical Society, Ser. B, 57, 360-361.
[98] Wallace, N. (ed.) (2005). Special issue of “Innovations in mortgage modeling”.
Real Estate Economics, 33, 587–782.
[99] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. Chapman & Hall,
London.
[100] Wang, J.-L. (2005). Smoothing hazard rate. In Arbmitage, P. and Colton, T.
(Eds.): Encyclopedia of Biostatistics (2nd ed.), Volume 7, pp. 4986–4997. Wiley.
[101] Wikipedia, The Free Encyclopedia. URL: http://en.wikipedia.org

1.1

Two Worlds of Credit Risk

The two worlds of credit risk can be simply characterized by the types of default probability, one being actual and the other being implied. The former corresponds to the direct observations of defaults, also known as the physical default probability in finance. The latter refers to the risk-neutral default probability implied from the credit market data, e.g. corporate bond yields. The academic literature of corporate credit risk has been inclined to the study of the implied defaults, which is yet a puzzling world.

1.1.1

Credit Spread Puzzle

The credit spread of a defaultable corporate bond is its excess yield over the default-free Treasury bond of the same time to maturity. Consider a zero-coupon corporate bond with unit face value and maturity date T . The yield-to-maturity at t < T is defined by Y (t, T ) = − log B(t, T ) T −t (1.1)

where B(t, T ) is the bond price of the form
T

B(t, T ) = E exp −
t

(r(s) + λ(s))ds

,

(1.2)

given the independent term structures of the interest rate r(·) and the default rate λ(·). Setting λ(t) ≡ 0 gives the benchmark price B0 (t, T ) and the yield Y0 (t, T ) of Treasury bond. Then, the credit spread can be calculated as log(B(t, T )/B0 (t, T )) log(1 − q(t, T )) =− T −t T −t (1.3)

Spread(t, T ) = Y (t, T ) − Y0 (t, T ) = −

where q(t, T ) = E e−

RT
t

λ(t)ds

is the conditional default probability P[τ ≤ T |τ > t] and

τ denotes the time-to-default, to be detailed in the next section. 2

The credit spread is supposed to co-move with the default rate. For illustration, Figure 1.1 (top panel) plots the Moody’s-rated corporate default rates and BaaAaa bond spreads ranging from 1920-2008. The shaded backgrounds correspond to NBER’s latest announcement of recession dates. Most spikes of the default rates and the credit spreads coincide with the recessions, but it is clear that the movements of two time series differ in both level and change. Such a lack of match is the so-called credit spread puzzle in the latest literature of corporate finance; the actual default rates could be successfully implied from the market data of credit spreads by none of the existing structural credit risk models. The default rates implied from credit spreads mostly overestimate the expected default rates. One factor that helps explain the gap is the liquidity risk – a security cannot be traded quickly enough in the market to prevent the loss. Figure 1.1 (bottom panel) plots, in fine resolution, the Moody’s speculative-grade default rates versus the high-yield credit spreads: 1994-2008. It illustrates the phenomenon where the spread changed in response to liquidity-risk events, while the default rates did not react significantly until quarters later; see the liquidity crises triggered in 1998 (Russian/LTCM shock) and 2007 (Subprime Mortgage meltdown). Besides the liquidity risk, there exist other known (e.g. tax treatments of corporate bonds vs. government bonds) and unknown factors that make incomplete the implied world of credit risk. As of today, we lack a thorough understanding of the credit spread puzzle; see e.g. Chen, Lesmond and Wei (2007) and references therein. The shaky foundation of the default risk implied from market credit spreads without looking at the historical defaults leads to further questions about the credit derivatives, e.g. Credit Default Swap (CDS) and Collateralized Debt Obligation (CDO). The 2007-08 collapse of credit market in Wall Street is partly due to overcomplication in “innovating” such complex financial instruments on the one hand, and over-simplification in quantifying their embedded risks on the other.

3

Figure 1.1: Moody’s-rated corporate default rates, bond spreads and NBERdated recessions. Data sources: a) Moody’s Baa & Aaa corporate bond yields (http://research.stlouisfed.org/fred2/categories/119); b) Moody’s Special Comment on Corporate Default and Recovery Rates, 1920-2008 (http://www.moodys.com/); c) NBER-dated recessions (http://www.nber.org/cycles/).

4

which we believe is largely due to the limited access for an academic researcher to the proprietary internal data of historical defaults. and intends to represent the creditworthiness of a borrower such that he or she will repay the debt. the three major U. Experian and TransUnion. developed by Fair Isaac Corporation. which use Aaa. b) credit scoring in consumer lending.2 Actual Defaults The other world of credit risk is the study of default probability bottom up from the actual credit performance. FICO. credit bureaus: Equifax. credit bureaus often report inconsistent FICO scores based on their own proprietary models. Both credit ratings and scores represent the creditworthiness of individual corporations and consumers. by e. the three major U. the academic literature based on the actual defaults is much smaller. banks and other credit card or mortgage lenders. It ranges from 300 (very poor) to 850 (best).1. It includes the popular industry practices of a) credit rating in corporate finance. Ca. Caa. the three major U. Baa. and Fitch.S. C to represent the likelihood of default from the lowest to the highest. The letters Aaa and Baa in Figure 1.1.S.g. A.S. Let us describe very briefly some rating and scoring basics related to the thesis.1 refers to Ba and the worse ratings. by e. as well as judgement by rating/scoring specialists.1 are examples of Moody’s rating system. Standard & Poor’s. The speculative-grade corporate bonds are sometimes said to be high-yield or junk. B. Compared to either the industry practices mentioned above or the academics of the implied default probability. A few academic 5 . The final evaluations are based on statistical models of the expected default probability. For the same borrower. The speculative grade in Figure 1.S. Ba. rating agencies: Moody’s.g. Aa. is the best-known consumer credit score and it is the most widely used by U.

by It´’s lemma. s). For demonstration. In this thesis.works will be reviewed later in Section 1. σ 2 t .1 Structural Approach In credit risk modeling. 1. 1976).2. structural approach is also known as the firm-value approach since a firm’s inability to meet the contractual debt is assumed to be determined by its asset value. It was inspired by the 1970s Black-Scholes-Merton methodology for financial option pricing. 6 . we make an attempt to develop statistical methods based on the actual credit performance data. Let D be the debt level with maturity date T . The Merton model assumes that the default event occurs at the maturity date of debt if the asset value is less than the debt level. we shall use the synthetic or tweaked samples of retail credit portfolios. Two classic structural models are the Merton model (Merton. 1974) and the first-passage-time model (Black and Cox.5) from which one may evaluate the default probability P(V (T ) ≤ D). 2 ∼ Lognormal (1. Our focus is placed on the probability of default and the hazard rate of time-to-default. o V (t) = exp V (0) 1 µ − σ 2 t + σWt 2 1 µ − σ 2 t.4) with drift µ.2 Credit Risk Models This section reviews the finance literature of credit risk models. Recall that EW (t) = 0.3. EW (t)W (s) = min(t. including both structural and intensity-based approaches. as well as the public release of corporate default rates by Moody’s. (1. and let V (t) be the latent asset value following a geometric Brownian motion dV (t) = µV (t)dt + σV (t)dW (t). 1. volatility σ and the standard Wiener process W (t). Given the initial asset value V (0) > D.

(1.The notion of distance-to-default facilitates the computation of conditional default probability. where we set the parameters b = −0. By (1. it is easy to verify that the conditional probability of default at maturity date T is X(t) + b(T − t) √ T −t P(V (T ) ≤ D|V (t) > D) = P(X(T ) ≤ 0|X(t) > 0) = Φ . σ Then. According to Duffie and Singleton (2003). consider the 7 .e. The first-passage-time model by Black and Cox (1976) extends the Merton model so that the default event could occur as soon as the asset value reaches a pre-specified debt barrier.6) Clearly. X(t) is a drifted Wiener process of the form X(t) = c + bt + W (t). V (t) hits the debt barrier once the distance-to-default process X(t) hits zero. Figure 1. let us define the distance-to-default X(t) by the number of standard deviations such that log Vt exceeds log D. (1. i. Given the sample path of asset values up to t.4) by maximum likelihood method.t. which is treated as default in one model but not the other. a constant debt barrier). Given the initial distance-to-default c ≡ X(0) > 0. The key difference of Merton model and Black-Cox model lies in the green path (the middle realization viewed at T ).7) and (1. one may first estimate the unknown parameters in (1.8) where Φ(·) is the cumulative normal distribution function.7) with b = and c = log V (0)−log D .8). X(t) = (log V (t) − log D)/σ.02 and c = 8 (s.2 (left panel) illustrates the first-passage-time of a drifted Wiener process by simulation. µ−σ 2 /2 σ t≥0 (1.

1957. (1. §13). 1 f (t) P(t ≤ τ < t + ∆t|τ ≥ t) = . (1.Figure 1.2: Simulated drifted Wiener process. first-passage-time τ = inf{t ≥ 0 : X(t) ≤ 0}. is defined by the instantaneous rate of default conditional on the survivorship.11) The hazard rate. It is well known that τ follows the inverse Gaussian distribution (Schr¨dinger. Tweedie. ∆t↓0 ∆t S(t) λ(t) = lim (1. (1. 1915. first-passage-time and hazard rate.12) Using the inverse Gaussian density and survival functions. we obtain the form of the 8 .10) See also other types of parametrization in Marshall and Olkin (2007. The survival function S(t) is defined by P(τ > t) for any t ≥ 0 and is given by c + bt √ t − e−2bc Φ −c + bt √ t S(t) = Φ . t ≥ 0. 1989) with the o density (c + bt)2 c f (t) = √ t−3/2 exp − 2t 2π . or the conditional default rate. Chhikara and Folks.9) where inf ∅ = ∞ as usual.

see the details in Chapter IV or Aalen. Unlike the structural approach that assumes the default to be completely determined by the asset value subject to a barrier. The default is treated as an unexpected event that comes ‘by surprise’. we will discuss the dual-time extension of first-passage-time parameterization with both endogenous and exogenous hazards. Many follow-up papers can be found in Lando (2004). Later in Chapter IV. 8.g.2 Intensity-based Approach The intensity-based approach is also called the reduced-form approach. proposed independently by Jarrow and Turnbull (1995) and Madan and Unal (1998). since in the real world the default event (e.2 (right panel) plots λ(t. (1. Figure 1. b) for b = −0.02 and c = 4. the default event in the reduced-form approach is governed by an externally specified intensity process that may or may not be related to the asset value. 1. Modern developments of structural models based on Merton and Black-Cox models can be referred to Bielecki and Rutkowski (2004. Borgan and Gjessing (2008. c. c. −c + bt −2bc √ −e Φ t 2πt3 c + bt √ t c > 0.first-passage-time hazard rate: √ λ(t. This is a practically appealing feature.13) This is one of the most important forms of hazard function in structural approach to credit risk modeling. Both the trend parameter b and the initial distance-to-default parameter c provide insights to understanding the shape of the hazard rate.2. 6. Bielecki and Rutkowski (2004) and references therein. b) = Φ c exp − (c + bt)2 2t . §3). which resemble the default rates from low to high credit qualities in terms of credit ratings or FICO scores. as well as non-constant default barrier and incomplete information about structural parameters. Year 2001 bankruptcy of Enron Corporation) is 9 . 12. 10. §10).

The choices (1. When S(t) is absolutely continuous with f (t) = d(1 − S(t))/dt.12) and it has roots in statistical reliability and survival analysis of time-to-failure. Thus.often all of a sudden happening without announcement. What matters in modeling defaults is only the first jump of the counting process. t ≥ 0. The term-structure models provide straightforward ways to simulate the future default intensity for the purpose of predicting the conditional default probability. and it covers both the mean-reverting Vasicek and CIR models by setting µ(λ(t)) = κ(θ − λ(t)) and σ(λ(t)) = 2 2 σ0 + σ1 λ(t). The last model involves a pure-jump process Jt . Pan and Singleton (2000). In credit risk modeling. In finance. we have that −dS(t) d[log S(t)] =− . S(t)dt dt t λ(t) = S(t) = exp − 0 λ(s)ds . the intensity-based models are mostly the term-structure models borrowed from the literature of interest-rate modeling. they are ad hoc models lacking fundamental interpretation of the default event. The default intensity corresponds to the hazard rate λ(t) = f (t)/S(t) defined in (1. in which case the default intensity is equivalent to the hazard rate.14) dλ(t) = µ(λ(t))dt + σ(λ(t))dWt + dJt with reference to Vasicek (1977). λ(t) is often treated as stochastic. However. In survival analysis.14) are popular because they could yield closed-form pricing 10 . Below is an incomplete list: Vasicek: Cox-Ingersoll-Roll: Affine jump: dλ(t) = κ(θ − λ(t))dt + σdWt dλ(t) = κ(θ − λ(t))dt + σ λ(t)dWt (1. λ(t) is usually assumed to be a deterministic function in time. Cox process) that refers to a counting process with possibly recurrent events. the default time τ is doubly stochastic. Ingersoll and Roll (1985) and Duffie. Note that Lando (1998) adopted the term “doubly stochastic Poisson process” (or. Cox.

2. while the real-world default intensities deserve more flexible and meaningful forms. Here. e. flexibility in parametrizing the default intensity. the intensity models in (1. In practice. Saita and Wang (2007) and Deng. Duffie. The dependence of default intensity on state variables z(t) (e. since the latter by definition is the analysis of time-to-failure data.2). by correlated Wiener processes.g. This approach essentially presumes a linear dependence in the diffusion components. For instance. Quigley and Van Order (2000).g. In a rough sense. z(t)). 2. which opens another door for approaching credit risk. macroeconomic covariates) is usually treated through a multivariate term-structure model for the joint of (λ(t). 3.3 Survival Models Credit risk modeling in finance is closely related to survival analysis in statistics.g. see e.formulas for (1. the effect of a state variable on the default intensity can be non-linear in many other ways. including both the first-passage-time structural models and the duration type of intensity-based models. the failure refers to the default event in either corporate or retail risk exposures. effectiveness in modeling the credit portfolios. default risk models based on the actual credit performance data exclusively belong to survival analysis. We find that the survival models have at least four-fold advantages in approach to credit risk modeling: 1. flexibility in incorporating various types of covariates.14) cannot be used to model the endogenous shapes of first-passage-time hazards illustrated in Figure 1. They correspond to the classical survival analysis in statistics. The intensity-based approach also includes the duration models for econometric analysis of actual historical defaults. 1. and 11 .

16) with individual risk parameters (µi . there are a rich set of lifetime distributions for parametrizing the default intensity (i.4. while the inverse Gaussian distribution has a beautiful first-passagetime interpretation. Then. hazard rate) λ(t).3. including Exponential: Weibull: Lognormal: Log-logistic: λ(t) = α λ(t) = αtβ−1 1 φ(log t/α) tα Φ(− log t/α) (β/α)(t/α)β−1 λ(t) = .13). 1 + (t/α)β λ(t) = (1. β > 0 where φ(x) = Φ (x) is the density function of normal distribution. More examples of parametric models can be found in Lawless (2003) and Marshall and Olkin (2007) among many other texts. The log-location-scale transform can be used to model the hazard rates. 1. being straightforward to enrich and extend the family of credit risk models.17) .1 Parametrizing Default Intensity In survival analysis.15) α. The following materials are organized in a way to make these points one by one. Given a latent default time τ0 with baseline hazard rate λ0 (t) and survival function S0 (t) = exp − t 0 λ0 (s)ds . They all correspond to the distributional assumptions of the default time τ . σi > 0 (1. Another example is the inverse Gaussian type of hazard rate function in (1. the firm-specific survival function takes the form Si (t) = S0 t eµi 12 1/σi (1. σi ). −∞ < µi < ∞.e. one may model the firm-specific default time τi by log τi = µi + σi log τ0 .

given the Weibull baseline hazard rate λ0 (t) = αtβ−1 . thus ending up with proportional hazard rates.19) 1. namely 1. The dependence of default intensity on zi can be studied by regression models. one may use the inverse Gaussian hazard rate (1. the log-locationscale transformed hazard rate is given by β βµi α σ −1 t i exp − σi σi λi (t) = . In the context of survival analysis.2 Incorporating Covariates Consider firm-specific (or consumer-specific) covariates zi ∈ Rp that are either static or time-varying.3. there are two popular classes of regression models. λi (t)/λj (t) = exp{θ T ([zi (t) − zj (t)]}. the hazard ratio between any two firms.18) For example. and 2. Note that the covariates zi (t) are sometimes transformed before entering 13 .and the firm-specific hazard rate is d log Si (t) 1 t = dt σi t eµi 1/σi λi (t) = − λ0 t eµi 1/σi . Then. (1. For example.15) as the baseline λ0 (t).13) or pick one from (1. (1. is constant if the covariates difference zi (t)−zj (t) is constant in time.20) where λ0 (t) is a baseline hazard function and r(zi (t)) is a firm-specific relative-risk multiplier (both must be positive). The hazard rate is taken as the modeling basis in the first class of regression models: λi (t) = λ0 (t)r(zi (t)) (1. the multiplicative hazards regression models. The relative risk term is often specified by r(zi (t)) = exp{θ T zi (t)} with parameter θ ∈ Rp . the accelerated failure time (AFT) regression models.

Rather than using the parametric form of baseline hazard. by (1. and ˆ the nonparametric likelihood to estimate λ0 (t). Given τ0 with baseline hazard rate λ0 (t). Lawless (2003.20). we have that λi (t) = λ0 (t) exp{−βθ T zi }. consider (1. For more details about AFT regression. Then. One may use the partial likelihood to estimate θ. an econometric modeler usually takes difference of Gross Domestic Product (GDP) index and uses the GDP growth as the entering covariate. the covariates zi could induce the firm-specific heterogeneity via either a relative-risk multiplier λi (t)/λ0 (t) = eθ eθ T T zi (t) or the default time acceleration τi /τ0 = zi . see e. Suppose that zi is not time-varying. The estimated baseline λ0 (t) can be further smoothed upon parametric. Cox (1972. as a unique property of the Weibull lifetime distribution. the introduction of the multiplicative hazards model (1.16) with the location µi = θ T zi and the scale σi = 1 (for simplicity). For example. §6). In situations where certain hidden covariates are not accessible. 14 .the model. λi (t) = λ0 t exp{−θ T zi } exp{−θ T zi } (1. Thus. it is straightforward to get the survival function and hazard rate by (1. Such semiparametric modeling could reduce the estimation bias of the covariate effects due to baseline misspecification.19). 1975) considered λi (t) = λ0 (t) exp{θ T zi (t)} by leaving the baseline hazard λ0 (t) unspecified. et al. The second class of AFT regression models are based on the aforementioned loglocation-scale transform of default time.e. The frailty models introduced by Vaupel. The latter is among the most popular models in modern survival analysis.17) and (1.18): Si (t) = S0 t exp{−θ T zi } .g.20) is incomplete without introducing the Cox’s proportional hazards (CoxPH) model. the unexplained variation also leads to heterogeneity. i. For a survival analyst. This is equivalent to use the multiplicative hazards model (1. term-structure or data-driven techniques. log τi = θ T zi + log τ0 .21) An interesting phenomenon is when using the Weibull baseline.

§6). δ) with shape δ −1 and scale δ > 0. The default correlation could be effectively characterized by multivariate survival analysis. 2.3 Correlating Credit Defaults The issue of default correlation embedded in credit portfolios has drawn intense discussions in the recent credit risk literature. correlating the default times through the copulas. δ) = (1. 1.3. say. (2007) performed an empirical analysis of default times for U. correlating the default intensities through the common covariates. z>0 Γ(k)δ k p(z. among other works. 15 . et al.g. the proportional frailty model extends (1. there exists two different approaches: 1. the positive correlation of defaults would increase the level of total volatility given the same level of total expectation (cf. Markowitz portfolio theory). (1.(1979) are designed to model such unobserved heterogeneity as a random quantity. Das. Bluhm and Overbeck (2007). δ) distribution has the density z k−1 exp{−z/δ} . see e. which is the next topic.23) for each single obligor i subject to the odd effect Z. the frailty models can be also used to characterize the default correlation among multiple obligors. Z ∼ Gamma(δ −1 . Then. In a broad sense. see Aalen. On the other hand. In eyes of a fund manager. In academics.S. (2008. et al. corporations and provided evidence for the importance of default correlation. in particular along with today’s credit crisis involving problematic basket CDS or CDO financial instruments. k.22) with mean kδ and variance kδ 2 . where a Gamma(k.20) to be λi (t) = Z · λ0 (t)r(zi (t)).

see Nelsen (2006) for details. As announced earlier. . and the house-price appreciation index.g. Φ−1 (un )) CΣ. §11) for some intriguing discussion of dynamic frailty modeled by diffusion and L´vy processes. . . . . The copula approach to default correlation considers the default times as the modeling basis.The first approach is better known as the conditionally independent intensity-based approach. the frailty models (1. un ) = ΘΣ. the Archimedean copula reduces to n i=1 ui (independence copula). The frailty effect Z is usually assumed to be time-invariant. Duffie. e. e In finance. (2008. A copula C : [0. the GDP growth. Here. . the short-term interest rate. as well as a dynamic frailty effect to represent the unobservable macroeconomic covariates. . . 1] is a function used to formulate the multivariate joint distribution based on the marginal distributions. . the default correlation induced by the common covariates is in contrast to the aforementioned default heterogeneity induced by the firm-specific covariates.ν (Θ−1 (u1 ).g. for differentiation from singleobligor modeling purpose. 16 . un ) = ΦΣ (Φ−1 (u1 ). When Ψ(u) = − log u. ΦΣ denotes the multivariate normal distribution. and Ψ is the generator of Archimedean copulas. 1]n → [0. .. They are usually rephrased as shared frailty models.ν (u1 . in the sense that the default times are independent conditional on the common covariates. et al. . See also Aalen et al. ΘΣ. (2008) applied the shared frailty effect across firms. . un ) = Ψ−1 i=1 Ψ(ui ) where Σ ∈ Rn×n .ν denotes the multivariate Student-t distribution with degrees of freedom ν. e. Examples of common covariates include the market-wide variables. .23) can also be used to capture the dependence of multi-obligor default intensities in case of hidden common covariates. One may refer to Hougaard (2000) for a comprehensive treatment of multivariate survival analysis from a frailty point of view.24) CΨ (u1 . . By Sklar’s theorem. . . . . . Gaussian: Student-t: Archimedean: CΣ (u1 . . Θ−1 (un )) ν ν n (1.

et al. Besides. . . For each τi with underlying hazard rates Z ·λi (t). . . . . . . Then. we have that L (x) = (1 + δx)−1/δ . the joint survival distribution is given by n n Sjoint (t1 . so n ti −1/δ Sjoint (t1 . i = 1. which 17 . . . tn ) = 1+δ i=1 0 λi (s)ds . given the Gamma frailty Z ∼ Gamma(δ −1 . tn ) = C(S1 (t1 ). . Interestingly enough. upon an appropriate selection of copula C.25) where L (x) = EZ [exp{−xZ}] is the Laplace transform of Z and Λi (t) is the cumulative hazard. . the joint survival distribution of (τ1 . tn ) = L i=1 Λi (ti ) =L i=1 L −1 (Si (ti )) . the correlation risk in these complex financial instruments is highly contingent upon the macroeconomic conditions and black-swan events. the marginal survival distributions are found by integrating over the distribution of Z. . .26) where the second equality follows (1. . Similarly. there always exists a copula that can link the joint distribution to its univariate marginals.25) and leads to an Archimedean copula. . τn ) can be characterized by Sjoint (t1 . . §7). §13) and recast by Aalen. δ). . Therefore. the shared frailty models correspond to the Archimedean copula.for a multivariate joint distribution. assume λi (t) to be deterministic and Z to be random. . (2008. For example. An obvious counter example is the recent collapse of the basket CDS and CDO market for which the rating agencies have once abused the use of the over-simplified Gaussian copula. (1. . one must first of all check the appropriateness of the assumption behind either the frailty approach or the copula approach. . In practice of modeling credit portfolios. Sn (tn )). n (1. . . as observed by Hougaard (2000. t Si (t) = EZ exp − 0 λi (s)ds = L (Λi (t)). .

see e. speciality.stlouisfed. Lawless (2003) and Aalen.g. banking center. and other distinguishable characteristics (e. we have only cited Hougaard (2000).g. Federal Reserve Bank of St. telephone and internet). but there are numerous texts and monographs on survival analysis and reliability. reward.g. et al. or the new developments that would benefit both fields. direct mail. Generalization The above survey of survival models is too brief to cover the vast literature of survival analysis that deserves potential applications in credit risk modeling. It is straightforward to enrich and extend the family of credit risk models. To be selective. volatile and unpredictable.are rather dynamic. Louis. geographic region where the card holder resides). ALFRED digital archive1 hosted by U. acquisition channels (e. 1 http://alfred. ranging from product types (e. The vintage time series are among the most popular versions of economic and risk management data. (2008). consider for instance the revolving exposures of credit cards.4 Scope of the Thesis Both old and new challenges arising in the context of credit risk modeling call for developments of statistical methodologies. 1. By convention. and we will focus on the dual-time generalization. Let us fist preview the complex data structures in real-world credit risk problems before defining the scope of the thesis. Subject to the market share of the card-issuing bank. the accounts originated from the same month constitute a monthly vintage (quarterly and yearly vintages can be defined in the same fashion). For risk management in retail banking. co-brand and secured cards).org/ 18 . The latter is one of our objectives in this thesis.g. thousands to hundred thousands of new cards are issued monthly.S. by either the existing devices in survival analysis.

the vintage diagram consists of unit-slope trajectories each representing the evolvement of a different vintage. binary or count observations.e. called the vintage diagram. Figure 1. all kinds of data can be collected including the continuous. then the lifetime (i. On a vintage diagram. a vintage time series refers to the longitudinal observations and performance measurements for the cohort of accounts that share the same origination date and therefore the same age. where “vintage” is a borrowed term from wine-making industry to account for the sharing origination. Metropolitan Statistical Area) as a representative of segmentation variables. Each segment consists of multiple vintages that have different origination dates.S. In particular. age. or called the “month-on-book” for a monthly vintage) is calculated by m = t − v for t ≥ v. t.3: Illustration of retail credit portfolios and vintage diagram. The most important feature of the vintage data is its dual-time coordinates. Unlike a usual 2D space.3 provides an illustration of retail credit portfolios and the scheme of vintage data collection. each vintage be denoted by its origination time v. and each vintage consists of varying numbers of accounts. when the dual-time 19 . v) tuples lie in the dual-time domain. where we use only the geographic MSA (U. Such (m.Figure 1. Let the calendar time be denoted by t. with x-axes representing the calendar time t and y-axis representing the lifetime m. Then.

For corporate bond and loan issuers, the observations of default events or default rates share the same vintage data structure. Table 1.1 shows the Moody's speculative-grade default rates for annual cohorts 1970–2008, which are calculated from the cumulative rates released recently by Moody's Global Credit Policy (February 2009). Each cohort is observed by year end of 2008 and truncated at 20 years maximum in lifetime, so the performance window for the cohorts originated in the latest years becomes shorter and shorter. The default rates in Table 1.1 are aligned in lifetime, while they can also be aligned in calendar time. To compare the cohort performances in dual-time coordinates, the marginal plots in both lifetime and calendar time are given in Figure 1.4, together with side-by-side box-plots for checking the vintage (i.e. cohort) heterogeneity. It is clear that the average default rates gradually decrease in lifetime, but are very volatile in calendar time; note that the calendar dynamics follow closely the NBER-dated recessions shown in Figure 1.1. Besides, there seems to be heterogeneity among vintages. These projection views are only an initial exploration, and further analysis of this vintage data will be performed in Chapter III.

The main purpose of the thesis is to develop a dual-time modeling framework, or "dual-time analytics", based on observations on the vintage diagram. Last, it is worth mentioning that the vintage data structure is by no means an exclusive feature of economic and financial data. It also appears in clinical trials with staggered entry, demography, epidemiology, dendrochronology, etc., when the dual-time observations are the individual lives together with birth and death time points, and when these observations are subject to truncation and censoring. In this case, one may draw a line segment for each individual to represent the vintage data graphically; such a graph is known as the Lexis diagram in demography and survival analysis (see Figure 4.1 for an example based on a dual-time-to-default simulation). As an interesting example in dendrochronology, Esper, Cook and Schweingruber (Science, 2002) analyzed the long historical tree-ring records in a dual-time manner in order to reconstruct the

past temperature variability.

Figure 1.4: Moody's speculative-grade default rates for annual cohorts 1970–2008: projection views in lifetime, calendar and vintage origination time.

The road map in Figure 1.5 outlines the scope of the thesis, and we believe that the dual-time analytics developed in the thesis can also be applied to non-financial fields.

In Chapter II, we discuss the adaptive smoothing spline (AdaSS) for heterogeneously smooth function estimation; note that the AdaSS may estimate 'smooth' functions including possible jumps. Two challenging issues are the evaluation of the reproducing kernel and the determination of the local penalty, for which we derive an explicit solution based upon a piecewise type of local adaptation. Four different examples, each having a specific feature of heterogeneous smoothness, are used to demonstrate the new AdaSS technique, and it plays a key role in the subsequent work in the thesis.

In Chapter III, we develop the vintage data analysis (VDA) for continuous types of dual-time responses (e.g. loss rates). Such models are motivated from the practical needs in financial risk management. We propose an MEV decomposition framework based on Gaussian process models, where M stands for the maturation curve, E for the exogenous influence and V for the vintage heterogeneity. The intrinsic identification problem is discussed. Also discussed is the semiparametric extension of MEV models in the presence of dual-time covariates. An efficient MEV Backfitting algorithm is provided for model estimation, and its performance is assessed by a sim-

ulation study. Then, we apply the MEV risk decomposition strategy to analyze both corporate default rates and retail loan loss rates.

In Chapter IV, we study the dual-time survival analysis (DtSA) for time-to-default data observed on the Lexis diagram. It is of particular importance in credit risk modeling, where the default events could be triggered by both endogenous and exogenous hazards. We consider (a) nonparametric estimators under one-way, two-way and three-way underlying hazards models; (b) structural parameterization via the first-passage-time triggering system with an endogenous distance-to-default process associated with an exogenous time-transformation; and (c) dual-time semiparametric Cox regression with both endogenous and exogenous baseline hazards and covariate effects, for which the method of partial likelihood estimation is discussed. Also discussed is the random-effect vintage heterogeneity modeled by the shared frailty. Finally, we demonstrate the application of DtSA to credit card and mortgage risk analysis in retail banking, and shed some light on understanding the ongoing credit crisis from a new dual-time analytic perspective.

Figure 1.5: A road map of thesis developments of statistical methods in credit risk modeling (Chapter I: Introduction — credit risk modeling, survival analysis; Chapter II: AdaSS — smoothing spline, local adaptation; Chapter III: VDA — MEV decomposition, Gaussian process models, semiparametric regression; Chapter IV: DtSA — nonparametric estimation, structural parametrization, dual-time Cox regression).

Table 1.1: Moody's speculative-grade default rates for annual cohorts 1970–2008, tabulated by cohort year (rows) and by years since cohort formation, Year 1 through Year 20 (columns). Data source: Moody's special comment on corporate default and recovery rates, 1920–2008 (http://www.moodys.com/, release: February 2009) and author's calculations. [The full table of rates is not reproduced here.]

CHAPTER II

Adaptive Smoothing Spline

2.1 Introduction

Function estimation from noisy data is a central problem in modern nonparametric statistics, and includes the methods of kernel smoothing [Wand and Jones (1995)], local polynomial smoothing [Fan and Gijbels (1996)] and wavelet shrinkage [Vidakovic (1999)]. In this thesis, we concentrate on the method of smoothing spline, which has an elegant formulation through the mathematical device of the reproducing kernel Hilbert space (RKHS). There is a rich body of literature on smoothing splines since the pioneering work of Kimeldorf and Wahba (1970); see the monographs by Wahba (1990), Green and Silverman (1994) and Gu (2002).

For observational data, statisticians are generally interested in smoothing rather than interpolation. Let the observations be generated by the signal-plus-noise model

    y_i = f(t_i) + ε(t_i),   i = 1, ..., n,    (2.1)

where f is an unknown smooth function with unit-interval support [0, 1] and ε(t_i) is the random error. Ordinarily, the smoothing spline is formulated by the following variational problem

    min_{f ∈ W_m}  (1/n) Σ_{i=1}^n w(t_i)[y_i − f(t_i)]² + λ ∫_0^1 [f^{(m)}(t)]² dt,   λ > 0,    (2.2)

where the target function is an element of the m-th order Sobolev space

    W_m = { f : f, ..., f^{(m−1)} absolutely continuous, f^{(m)} ∈ L_2[0, 1] },

and the tuning parameter λ controls the trade-off between fidelity to the data and the smoothness of the estimate. The objective in (2.2) is a sum of two functionals: the mean squared error MSE = (1/n) Σ_{i=1}^n w(t_i)[y_i − f(t_i)]² and the roughness penalty PEN = ∫_0^1 [f^{(m)}(t)]² dt. Let us call the ordinary smoothing spline (2.2) OrdSS in short.

The weights w(t_i) in the MSE come from the non-stationary error assumption of (2.1), where ε(t) ∼ N(0, σ²/w(t)). They can also be used for aggregating data with replicates. Suppose the data are generated by (2.1) with stationary error but time-varying sampling frequency r_k for k = 1, ..., K distinct points; then

    MSE = (1/n) Σ_{k=1}^K Σ_{j=1}^{r_k} [y_kj − f(t_k)]² = (1/K) Σ_{k=1}^K w(t_k)[ ȳ_k − f(t_k) ]² + Const.,    (2.3)

where

    ȳ_k ∼ N( f(t_k), σ²/r_k ),   k = 1, ..., K,    (2.4)

are the pointwise averages. To train the OrdSS, the raw data can be replaced by the ȳ_k's together with the weights w(t_k) ∝ r_k.

The OrdSS assumes that f(t) is homogeneously smooth (or rough); it excludes target functions with spatially varying degrees of smoothness. In practice there are various types of functions of non-homogeneous smoothness, and four such scenarios are illustrated in Figure 2.1. The first two are simulation examples following exactly the same setting as Donoho and Johnstone (1994), obtained by adding N(0, 1) noise to n = 2048 equally spaced signal responses that are re-scaled to attain signal-to-noise ratio 7. The last two are sampled data from automotive engineering and financial engineering.

1. Doppler function: f(t) = √(t(1 − t)) sin(2π(1 + a)/(t + a)), a = 0.05, t ∈ [0, 1]. It

has both time-varying magnitudes and time-varying frequencies.

2. HeaviSine function: f(t) = 3 sin 4πt − sgn(t − 0.3) − sgn(0.72 − t), t ∈ [0, 1]. It has a downward jump at t = 0.3 and an upward jump at t = 0.72.

3. Motorcycle-Accident experiment: a classic data set from Silverman (1985). It consists of accelerometer readings taken through time from a simulated motorcycle crash experiment, in which the time points are irregularly spaced and the noises vary over time; it illustrates the fact that the acceleration stays relatively constant until the crash impact. In Figure 2.1, a hypothesized smooth shape is overlaid on the observations.

4. Credit-Risk tweaked sample: retail loan loss rates for the revolving credit exposures in retail banking. For academic use, the business background is removed and the sample is tweaked.

Figure 2.1: Heterogeneous smooth functions: (1) Doppler function simulated with noise; (2) HeaviSine function simulated with noise; (3) Motorcycle-Accident experimental data; and (4) Credit-Risk tweaked sample.
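A minimal sketch of the first simulation setting above (added for illustration; the rescaling to a target signal-to-noise ratio follows the usual Donoho–Johnstone convention, which is an assumption here rather than a quote from the text):

    import numpy as np

    def doppler(t, a=0.05):
        """Doppler test signal with time-varying magnitude and frequency."""
        return np.sqrt(t * (1.0 - t)) * np.sin(2.0 * np.pi * (1.0 + a) / (t + a))

    rng = np.random.default_rng(1)
    n, snr = 2048, 7.0
    t = np.arange(1, n + 1) / n
    signal = doppler(t)
    signal *= snr / signal.std()          # rescale so that sd(signal)/sd(noise) = 7
    y = signal + rng.standard_normal(n)   # observations y_i = f(t_i) + eps_i, N(0, 1) noise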

The tweaked sample includes monthly vintages booked in 7 years (2000–2006), where a vintage refers to a group of credit accounts originated from the same month. They are all observed up to the end of 2006, so a vintage booked earlier has a longer observation length. In Figure 2.1, the x-axis represents the months-on-book and the circles are the vintage loss rates; the point-wise averages are plotted as solid dots, which are wiggly at large months-on-book due to decreasing sample sizes. Our purpose is to estimate an underlying shape (illustrated by the solid line), whose degree of smoothness, by assumption, increases upon growth.

The first three examples have often been discussed in the nonparametric analysis literature, where function estimation with non-homogeneous smoothness has been of interest for decades. A variety of adaptive methods has been proposed, such as variable-bandwidth kernel smoothing (Müller and Stadtmüller, 1987), multivariate adaptive regression splines (Friedman, 1991), adaptive wavelet shrinkage (Donoho and Johnstone, 1994), local-penalty regression splines (Ruppert and Carroll, 2000), as well as treed Gaussian process modeling (Gramacy and Lee, 2008). In the context of smoothing splines, Wahba (1995), in a discussion, suggested to use the local penalty

    PEN = ∫_0^1 ρ(t) [f^{(m)}(t)]² dt,   ρ(t) > 0 and ∫_0^1 ρ(t) dt = 1,    (2.5)

where the add-on function ρ(t) is adaptive in response to the local behavior of f(t), so that heavier penalty is imposed on the regions of lower curvature. Luo and Wahba (1997) proposed a hybrid adaptive spline with knot selection, rather than using every sampling point as a knot. Relevant works also include Abramovich and Steinberg (1996) and Pintore, Speckman and Holmes (2006). This local penalty functional is the foundation of the whole chapter.

This chapter is organized as follows. In Section 2.2 we formulate the AdaSS (adap-

tive smoothing spline) through the local penalty (2.5), derive its solution via RKHS and generalized ridge regression, then discuss how to determine both the smoothing parameter and the local penalty adaptation by the method of cross-validation. Section 2.3 covers some basic properties of the AdaSS. The experimental results for the aforementioned examples are presented in Section 2.4. We summarize the chapter with Section 2.5 and give technical proofs in the last section.

2.2 AdaSS: Adaptive Smoothing Spline

The AdaSS extends the OrdSS (2.2) based on the local penalty (2.5). It is formulated as

    min_{f ∈ W_m}  (1/n) Σ_{i=1}^n w(t_i)[ y_i − f(t_i) ]² + λ ∫_0^1 ρ(t) [ f^{(m)}(t) ]² dt,   λ > 0.    (2.6)

The following proposition has its root in Kimeldorf and Wahba (1971); or see Wahba (1990, §1).

Proposition 2.1 (AdaSS). Given a bounded integrable positive function ρ(t), define the r.k.

    K(t, s) = ∫_0^1 ρ^{−1}(u) G_m(t, u) G_m(s, u) du,   t, s ∈ [0, 1],    (2.7)

where G_m(t, u) = (t − u)_+^{m−1}/(m − 1)!. Then, the solution of (2.6) can be expressed as

    f(t) = Σ_{j=0}^{m−1} α_j φ_j(t) + Σ_{i=1}^n c_i K(t, t_i) ≡ αᵀφ(t) + cᵀξ_n(t),    (2.8)

where φ_j(t) = tʲ/j! for j = 0, ..., m − 1, α ∈ R^m and c ∈ R^n.

Since the AdaSS has a finite-dimensional linear expression (2.8), it can be solved by generalized linear regression. Write y = (y_1, ..., y_n)ᵀ, B = (φ(t_1), ..., φ(t_n))ᵀ, W = diag(w(t_1), ..., w(t_n)) and Σ = [K(t_i, t_j)]_{n×n}. Substituting (2.8) into (2.6), we

13) 29 . By (2. Consider the QR-decomposition B = (Q1 . T )T = Q1 R with orthogonal . 2 (2.ρ y = nλW−1 Q2 QT (Σ + nλW−1 )QT 2 2 −1 QT y. Then. . 2 α = R−1 QT y − (Σ + nλW−1 )c . (2.10).Q . so the residuals are given by ˆ e = y − y = nλW−1 c ≡ I − Sλ. or Bα + (Σ + nλW−1 )c = y. α and c to be zero. ΣW Bα + (Σ + nλW−1 )c = ΣWy. (Q1 . 2 )(RT .12) where Sλ.Q Multiplying both sides of y = Bα + (Σ + nλW−1 )c by QT and using the fact 2 that c = Q2 QT c (since BT c = 0).0 .r.11) 1 ˆ The fitted values at the design points are given by y = Bα + Σc.have the regularized least squares problem 1 y − Bα − Σc n 2 W 2 Σ min α.c +λ c .t. 2 )n×n and upper-triangular Rm×m . (2.ρ = I − nλW−1 Q2 QT (Σ + nλW−1 )QT 2 2 matrix in the sense that QT is often called the smoothing 2 ˆ y = Sλ. (2. BT c = 0. simple algebra yields 2 2 c = Q2 QT (Σ + nλW−1 )Q2 2 −1 QT y.9) where x 2 A = xT Ax for A ≥ 0. . we obtain the equations BT W Bα + Σc = BT Wy.10) . we have that QT y = QT (Σ + nλW−1 )c = 2 2 2 QT (Σ + nλW−1 )Q2 QT c. where Q1 is n × m and Q2 is n × (n − m).ρ y −1 (2. By setting partial derivatives w. y = Bα + (Σ + nλW−1 )c.

Let ρ−1 (t) = ak +bk t. Proposition 2. (2. i]. In what follows we present a closed-form expression of K(t. . The AdaSS is analogous to OrdSS in constructing the confidence interval under a Bayesian interpretation by Wahba (1983).ρ ) can be interpreted as the equivalent degrees of freedom. and {F (u)}|τ = F (τ ) − F (τ ). For general ρ(t). of the integral form (2. Then.7) can be explicitly evaluated by κt∧s m−1 K(t.7). For the sampling time points in particular.7). where p = trace(Sλ. . κ1∧1 = K. The estimate of noise variance is given by σ 2 = (I − Sλ.2 is general enough to cover many types of r.1 Reproducing Kernels The AdaSS depends on the evaluation of the r. the AdaSS kernel (2.k.ρ )y 2 W (n − p).ρ [i.k.ρ [i.’s. n (2.We use the notation Sλ.15) τk−1 where κt∧s = min{k : τk > t ∧ s}. bk ) and the knots {0 ≡ τ0 < τ1 < · · · < τK ≡ 1}. . tj )] based on the r. t ∈ [τk−1 .2. τk ) for some fixed (ak . where ρ is needed for the evaluation of the matrix Σ = [K(ti .k. i = 1.2 (Reproducing Kernel). τ Proposition 2. 2. 30 . we may construct the 95% interval pointwise by ˆ f (ti ) ± z0. s) when ρ−1 (t) is piecewise linear.14) where Sλ. we must resort to numerical integration techniques.ρ to denote its explicit dependence on both the smoothing parameter λ and the local adaptation ρ(·). s) = k=1 j=0 (−1)j (ak + bk u) m−1−j (u − t)m−1−j (u − s)m+j (m − 1 − j)!(m + j)! − t∧s∧τk −bk l=0 (u − t)m−1−j−l (u − s)m+1+j+l (−1)l (m − 1 − j − l)!(m + 1 + j + l)! (2. For example. i] represents the i-th diagonal element of the smoothing matrix.025 σ 2 w−1 (ti )Sλ. .

K = 1 and ρ−1 (t) = a + bt for t ∈ [0. and they should be treated from knot to knot if ρ−1 (t) is piecewise. otherwise.19) (m − 1 − j − l)!(m + j)! (2m − 1 − l)! Note that the kernel derivative coincides with (2. set m = 2. the Green’s function Gm (t. (2. et al. s) = − s ). The equa- tion (2. s) (s fixed). ∂t  otherwise. . s) = j=0 (−1)j tm−1−j sm+j (−1)m (s − t)2m−1 + I(t ≤ s). (2.17). t) = m  0. 1].k.16) K(t. m − 1.(2.17) when l = m.   a 2  (3t s − t3 ) + 6  a  (3ts2 − s3 ) + 6 b (2t3 s 12 b (2ts3 12 − t4 ).5). Remarks: Numerical methods are inevitable for approximating (2. . In solving boundary value differential equations. j = 0.setting bk ≡ 0 gives the piecewise-constant results for ρ−1 (t) obtained by Pintore.2) and their derivatives to be used in next section: m−1 K(t. When b = 0. (m − 1 − j)!(m + j)! (2m − 1)! (2.7) when ρ−1 (t) 31 .15) can be obtained by straightforward but tedious calculations.    m−1 (2. are also of interest. Lower-order l-th derivatives (l < m) of (2. u) has the property that f (t) = 1 0 Gm (t.17) The well-defined higher-order derivatives of the kernel can be derived from (2. Here.7) is such an example of applying Gm (t. . ·) to K(t. implying that (aκt + bκt t) (s−t) . a = b/(eb − 1) and b ∈ R. For another example. s) (m−1)! −1 = ρ (t)Gm (s. if t ≤ s ∂ m K(t.16) reduces to the OrdSS kernel for fitting cubic smoothing splines. (By (2. 4 if t ≤ s. .18) ∂ l K(t. s) = ∂tl m−1−l j=0 (−1)j tm−1−j−l sm+j (−1)m+l (s − t)2m−1−l + I(t ≤ s).) Then. u)f (m) (u)du for f ∈ Wm with zero f (j) (0). we present only the OrdSS kernels (by setting ρ(t) ≡ 1 in Proposition 2. The derivatives of the r. (2006).

one may approximate ρ−1 (t) first by a piecewise function with knot selection. n.ρ ). ρ(·)). . . i])2 n where diag(Sλ. and (2) the local roughness of f (·) is quantified by the squared m-th 32 . (1 − n−1 trace(Sλ. by varying log λ on the real line. yi )}i=k for k = 1. .e.ρ ))−2 (I − Sλ.2. i.2. ρ) = ˆ w(ti ) yi − f [i] (ti ) .ρ ))2 (2. .20) For the linear predictor (2.21) (1 − Sλ.takes general forms.ρ )2 y. . consider the rules: (1) choose a smaller value of ρ(t) if f (·) is less smooth at t.22) on its log scale. .ρ ) is the diagonal part of the smoothing matrix. (2. We use the method of crossvalidation based on the leave-one-out scheme. Rather than directly approximating (2.22) For fixed ρ(·). we obtain the so-called generalized crossvalidation score 1 GCV(λ. . i=1 (2. Define the score of cross-validation by 1 n n 2 CV(λ. n.21) or (2. It follows that 1 n n CV(λ. To select ρ(·). .7). ˆ Conditional on (λ. Craven and Wahba (1979) ˆ ˆ ˆ justified that f (ti ) − f [i] (ti ) = Sλ.ρ (yi − f [i] (ti )) for i = 1. then use the closed-form results derived in Proposition 2.ρ [i.13) with smoothing matrix Sλ. ρ) = n n i=1 ˆ w(ti )(yi − fλ (ti ))2 . the best tuning parameter λ is usually found by minimizing (2.ρ . Replacing the elements of diag(Sλ ) by their average n−1 trace(Sλ. and f [k] (t) denote the AdaSS trained on the leave-one-out sample {(ti . 2.2 Local Penalty Adaptation One of the most challenging problems for the AdaSS is selection of smoothing parameter λ and local penalty adaptation function ρ(t). let f (t) denote the AdaSS trained on the complete sample ˆ of size n. ρ) = i=1 ˆ 1 w(ti )(yi − f (ti ))2 = yT W(I − diag(Sλ.

The extra order a = 0. we start with OrdSS upon the improper adaptation ρ(t) ≡ 1.24) tends to under-smooth. In the context of smoothing splines. So. it is reasonable to determine the shape of ρ(t) by ρ−1 (t) ∝ f (m) (t) . Given the finite sample.23) Since the true function f (t) is unknown a priori. If the true f (m) (t) were known. so are its derivatives. 1. u)du. 2 imposed on penalization of derivatives ˜ 1 (m+a) [f (t)]2 dt 0 is to control the smoothness of the m-th derivative estimate. we may adopt a two-step procedure for selection of ρ(t). ˜ Step 2: calculate f (m) (t) and use it to determine the shape of ρ−1 (t). ti ).derivative f (m) (t) . ti ) ∂tm (2. as follows. 2 2 (2. u)Gm+a (s.2 demonstrates the 33 . 1. In Step 1. Step 1: run the OrdSS with m = m + a (a = 0. The goodness of the spline derivative depends largely on the spline estimate itself. a−1 n ˜ f (m) (t) = j=0 αm+j φj (t) + ˜ i=1 ci ˜ ∂m K(t. Using that as an initial estimate of local roughness. 2) and cross-validated λ to ˜ ˜ obtain the initial function estimate: f (t) = where K(t. Rice and Rosenblatt (1983) studied the large-sample properties of spline derivatives (2. the m-th derivative of f (t) can be obtained by differentiating the bases. Figure 2. the ˜ choice of a depends on how smooth the estimated f (m) (t) is desired to be.24) that have slower convergence rates than the spline itself. ti ) ∂tm is calculated according to (2.24) where ∂m K(t. Let us consider the estimation of the derivatives. but it may capture the rough shape of the derivatives upon standardization. s) = 1 0 m+l−1 j=0 αj φj (t) + ˜ n ˜ i=1 ci K(t. Gm+a (t.19) after replacing m by m + a. ˜ In Step 2. we find by simulation study that (2. nonparametric derivative estimation is usually obtained by taking derivatives of the spline estimate. with no prior knowledge.

meeting our needs for estimating ρ(t) below. Despite these downside effects. estimation of the 2nd-order derivative for f (t) = sin(ωt) and f (t) = sin(ωt4 ) using the OrdSS with m = 2. n = 100 and snr = 7. as in Proposition 2. 1]. we restrict ourselves to the piecewise class of functions. the order-2 OrdSS could yield rough estimates of derivative. By (2. 3. both have time-varying frequency. The sin(ωt4 ) signal resembles the Doppler function in Figure 2. ω = 10π.1. κt = min{k : τk ≥ t} (2.2: OrdSS function estimate and its 2nd-order derivative (upon standardization): scaled signal from f (t) = sin(ωt) (upper panel) and f (t) = sin(ωt4 ) (lower panel). the shape of the derivative can be more or less captured. and the order-3 OrdSS could give smooth estimates but large bias on the boundary.23). ˜ In determining the shape of ρ−1 (t) from f (m) (t). In both sine-wave examples. (2) continuity: ak + 34 . we suggest to estimate ρ−1 (t) piecewise by the method of constrained least squares: 2γ ˜ f (m) (t) + ε = a0 ρ−1 (t) = aκt + bκt t.2.25) subject to the constraints (1) positivity: ρ−1 (t) > 0 for t ∈ [0.Figure 2.

bk τk = ak+1 + bk+1 τk for k = 1, . . . , K − 1 and (3) normalization: a0 = a0
K k=1
− τk dt τk−1 ak +bk t

1 0

ρ(t)dt =

> 0, for given γ ≥ 1 and pre-specified knots {0 ≡ τ0 < τ1 < . . . <

− τK−1 < 1} and τK = 1. Clearly, ρ−1 (t) is an example of linear regression spline. If the

continuity condition is relaxed and bk ≡ 0, we obtain the piecewise-constant ρ−1 (t). ˜ The power transform f (m) (t)

˜ , γ ≥ 1 is used to stretch [f (m) (t)]2 , in order to

make up for (a) overestimation of [f (m) (t)]2 in slightly oscillating regions and (b) underestimation of [f (m) (t)]2 in highly oscillating regions, since the initial OrdSS is trained under the improper adaptation ρ(t) ≡ 1. The lower-right panel of Figure 2.2 ˜ is such an example of overestimating the roughness by the OrdSS. The knots {τk }K ∈ [0, 1] can be specified either regularly or irregularly. In most k=1 situations, we may choose equally spaced knots with K to be determined. For the third example in Figure 2.1 that shows flat vs. rugged heterogeneity, one may choose sparse knots for the flat region and dense knots for the rugged region, and may also choose regular knots such that ρ−1 (t) is nearly constant in the flat region. However, for the second example in Figure 2.1 that shows jump-type heterogeneity, one should shrink the size of the interval containing the jump. Given the knots, the parameters (λ, γ) can be jointly determined by the crossvalidation criteria (2.21) or (2.22). Remarks: The selection of the local adaptation ρ−1 (t) is a non-trivial problem, for ˜ which we suggest to use ρ−1 (t) ∝ f (m) (t)

together with piecewise approximation.

This can be viewed as a joint force of Abramovich and Steinberg (1996) and Pintore, et al. (2006), where the former directly executed (2.23) on each sampled ti , and the latter directly impose the piecewise constant ρ−1 (t) without utilizing the fact (2.23). Note that the approach of Abramovich and Steinberg (1996) may have the estimation ˜ bias of f (m) (t)
2

as discussed above, and the r.k. evaluation by approximation based

on a discrete set of ρ−1 (ti ) remains another issue. As for Pintore, et al. (2006), the high-dimensional parametrization of ρ−1 (t) requires nonlinear optimization whose 35

stability is usually of concern. All these issues can be resolved by our approach, which therefore has both analytical and computational advantages.

2.3

AdaSS Properties

Recall that the OrdSS (2.2) has many interesting statistical properties, e.g. (a) it is a natural polynomial spline of degree (2m − 1) with knots placed at t_1, ..., t_n (hence, the OrdSS with m = 2 is usually referred to as the cubic smoothing spline); (b) it has a Bayesian interpretation through the duality between RKHS and Gaussian stochastic processes (Wahba, 1990, §1.4-1.5); (c) its large-sample properties of consistency and rate of convergence depend on the smoothing parameter (Wahba, 1990, §4 and references therein); (d) it has an equivalent kernel smoother with variable bandwidth (Silverman, 1984). Similar properties are desired to be established for the AdaSS. We leave (c) and (d) to future investigation.

For (a), the AdaSS properties depend on the r.k. (2.7) and the structure of the local adaptation ρ^{−1}(t). Pintore, et al. (2006) investigated the piecewise-constant case of ρ^{−1}(t) and argued that the AdaSS is also a piecewise polynomial spline of degree (2m − 1) with knots on t_1, ..., t_n, while it allows for more flexible lower-order derivatives than the OrdSS. The piecewise-linear case of ρ^{−1}(t) can be analyzed in a similar manner. For the AdaSS (2.8) with the r.k. (2.15),

    f(t) = Σ_{j=0}^{m−1} α_j φ_j(t) + Σ_{i: t_i ≤ t} c_i K(t, t_i) + Σ_{i: t_i > t} c_i K(t, t_i),

with degree (m − 1) in the first two terms, and degree 2m in the third term (after one degree increase due to the trend of ρ^{−1}(t)).


For (b), the AdaSS based on the r.k. (2.7) corresponds to the following zero-mean Gaussian process

    Z(t) = ∫_0^1 ρ^{−1/2}(u) G_m(t, u) dW(u),   t ∈ [0, 1],    (2.26)

where W(t) denotes the standard Wiener process. The covariance kernel is given by E[Z(t)Z(s)] = K(t, s). When ρ(t) ≡ 1, the Gaussian process (2.26) is the (m − 1)-fold integrated Wiener process studied by Shepp (1966). Following Wahba (1990, §1.5), consider a random-effect model F(t) = αᵀφ(t) + b^{1/2} Z(t); then the best linear unbiased estimate of F(t) from noisy data Y(t) = F(t) + ε corresponds to the AdaSS solution (2.8), provided that the smoothing parameter λ = σ²/nb. Moreover, one may set an improper prior on the coefficients α, derive the posterior of F(t), then obtain a Bayesian type of confidence interval estimate; see Wahba (1990, §5). Such confidence intervals on the sampling points are given in (2.14).

The inverse function ρ^{−1}(t) in (2.26) can be viewed as the measurement of the non-stationary volatility (or variance) of the m-th derivative process, in the sense that

    d^m Z(t)/dt^m = ρ^{−1/2}(t) dW(t)/dt.    (2.27)

By the duality between Gaussian processes and RKHS, d^m Z(t)/dt^m corresponds to f^{(m)}(t) in (2.6). This provides an alternative reasoning for (2.23), in the sense that ρ^{−1}(t) can be determined locally by the second-order moments of d^m Z(t)/dt^m.

To illustrate the connection between the AdaSS and the Gaussian process with non-stationary volatility, consider the case m = 1 and piecewise-constant ρ^{−1}(t) = a_k, t ∈ [τ_{k−1}, τ_k), for some a_k > 0 and pre-specified knots 0 ≡ τ_0 < τ_1 < · · · < τ_K ≡ 1. By (2.15), the r.k. is given by

    K(t, s) = Σ_{k=1}^{κ_{t∧s}} a_k ( t ∧ s ∧ τ_k − τ_{k−1} ),   t, s ∈ [0, 1].    (2.28)


By (2.26) and (2.27), the corresponding Gaussian process satisfies

    dZ(t) = √(a_k) dW(t),   for t ∈ [τ_{k−1}, τ_k), k = 1, ..., K,    (2.29)

where a_k is the local volatility from knot to knot. Clearly, Z(t) reduces to the standard Wiener process W(t) when a_k ≡ 1. As one of the most appealing features, the construction (2.29) takes into account the jump diffusion, which occurs in [τ_{k−1}, τ_k) when the interval shrinks and the volatility a_k blows up.
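As a toy illustration of (2.29) (an added sketch, not from the original text; the knots and local volatilities below are arbitrary choices), one can simulate such a non-stationary Wiener process on a fine grid by scaling the increments interval by interval:

    import numpy as np

    def simulate_nonstationary_wiener(knots, a, n=1000, seed=0):
        """Simulate Z(t) with dZ(t) = sqrt(a_k) dW(t) on [tau_{k-1}, tau_k), as in (2.29)."""
        rng = np.random.default_rng(seed)
        t = np.linspace(0.0, 1.0, n + 1)
        dt = 1.0 / n
        # attach the local variance a_k to each increment by locating its left endpoint
        k = np.searchsorted(knots, t[:-1], side="right") - 1
        k = np.clip(k, 0, len(a) - 1)
        dZ = np.sqrt(np.asarray(a)[k] * dt) * rng.standard_normal(n)
        return t, np.concatenate([[0.0], np.cumsum(dZ)])

    t, Z = simulate_nonstationary_wiener(knots=[0.0, 0.4, 0.45, 1.0], a=[1.0, 25.0, 1.0])
    # the short middle interval with a_k = 25 produces a burst of volatility (a jump-like move)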

2.4

Experimental Results

The implementation of OrdSS and AdaSS differs only in the formation of Σ through the r.k. K(t, s). It is most crucial for the AdaSS to select the local adaptation ρ−1 (t). Consider the four examples in Figure 2.1 with various scenarios of heterogeneous smoothness. The curve fitting results by the OrdSS with m = 2 (i.e. cubic smoothing spline) are shown in Figure 2.3, as well as the 95% confidence intervals for the last two cases. In what follows, we discuss the need of local adaptation in each case, and compare the performance of OrdSS and AdaSS. Simulation Study. The Doppler and HeaviSine functions are often taken as the benchmark examples for testing adaptive smoothing methods. The global smoothing by the OrdSS (2.2) may fail to capture the heterogeneous smoothness, and as a consequence, over-smooth the region where the signal is relatively smooth and under-smooth where the signal is highly oscillating. For the Doppler case in Figure 2.3, the underlying signal increases its smoothness with time, but the OrdSS estimate shows lots of wiggles for t > 0.5. In the HeaviSine case, both jumps at t = 0.3 and t = 0.72 could be captured by the OrdSS, but the smoothness elsewhere is sacrificed as a compromise. The AdaSS fitting results with piecewise-constant ρ−1 (t) are shown in Figure 2.4. 38

Figure 2.3: OrdSS curve fitting with m = 2 (shown by solid lines). The dashed lines represent 95% confidence intervals.

In the credit-risk case, the log loss rates are considered as the responses, and the time-dependent weights are specified proportional to the number of replicates.

To estimate the piecewise-constant local adaptation ρ^{−1}(t), we pre-specified 6 equally spaced knots (including 0 and 1) for the Doppler function, and the knots (0, 0.295, 0.305, 0.715, 0.725, 1) for the HeaviSine function. (Note that such knots can be determined graphically based on the plots of the squared derivative estimates.) Initially, we fitted the OrdSS with cross-validated tuning parameter λ, then estimated the 2nd-order derivatives by (2.24). Given the knots, the function ρ^{−1}(t) is estimated by constrained least squares (2.25) conditional on the power transform parameter γ, where the best γ* is found jointly with λ by minimizing the score of cross-validation (2.21). In our implementation, we assumed that both tuning parameters are bounded such that log λ ∈ [−40, 10] and γ ∈ [1, 5], then performed the bounded nonlinear optimization. Table 2.1 gives the numerical results in both simulation cases, including the cross-validated (λ, γ), the estimated variance of noise (cf. true σ² = 1), and the

Figure 2.4: Simulation study of the Doppler and HeaviSine functions: OrdSS (blue), AdaSS (red) and the heterogeneous truth (light background).

6 7 1.Table 2.95 1. Upon the crash occurrence. For the Doppler function.19 2180. see the confidence intervals in Figure 2. Recall that the data consists of acceleration measurements (subject to noise) of the head of a motorcycle rider during the course of a simulated crash experiment. smoothness non-homogeneity is another important feature of this data set. Silverman (1985) originally used the motorcycle-accident experimental data to test the non-stationary OrdSS curve fitting. 0.(2006) who used a small sample size (n = 128) and high-dimensional optimization techniques. Simulation n snr λ γ σ2 ˆ dof CV Doppler 2048 7 1.07 2123. Both simulation examples demonstrate the advantage of using the AdaSS. Besides error non-stationarity. et al. Let us focus on the period prior to the crash happening at t ≈ 0. resulting in heterogenous curve fitting. The data indicates clearly that the acceleration rebounds well above its original level before setting back.21e-012 2. for which it is reasonable to 41 . while the relatively same high level of penalty is assigned to other regions.1: AdaSS parameters in Doppler and HeaviSine simulation study.13 0. For the HeaviSine function. Figure 2. the imposed penalty ρ(t) in the first interval [0.9 HeaviSine 2048 equivalent degrees of freedom.0175 35.4 (bottom panel) shows the improvement of AdaSS over OrdSS after zooming in. The middle panel shows the data-driven estimates of the adaptation function ρ(t) plotted at log scale. it decelerates to about negative 120 gal (a unit gal is 1 centimeter per second squared). low penalty is assigned to where the jump occurs.24 (after time re-scaling). Motorcycle-Accident Data.86e-010 4. A similar simulation study of the Doppler function can be found in Pintore.3. the treatment from a heterogenous smoothness point of view is more appropriate than using only the non-stationary error treatment.9662 164.2) is much less than that in the other four intervals. To fit such data.

Figure 2. non-stationary OrdSS and AdaSS: performance in comparison.6: OrdSS.Figure 2.5: Non-stationary OrdSS and AdaSS for Motorcycle-Accident Data. 42 .

the simple pointwise or moving 43 . Compared to Figure 2. The last plot in Figure 2. Both results are shown in Figure 2. we estimated the local adaptation ρ−1 (t) from the derivatives of non-stationary OrdSS. it is clear that AdaSS performs better than both versions of OrdSS in modeling the constant acceleration before the crash happening. Credit-Risk Data.2. loess) is usually performed. the OrdSS estimate in Figure 2. The local moving averaging (e.6 compares the performances of stationary OrdSS. see the upper-left panel in Figure 2. Then. while the estimate by the non-stationary AdaSS looks reasonable. non-stationary OrdSS and non-stationary AdaSS. AdaSS gives a more smooth estimate than OrdSS. Figure 2. the smoothed weight estimates are plugged into (2.28].e. To perform AdaSS (2.3. It is of our interest to model the growth or the maturation curve of the log of loss rates in lifetime (month-on-book).6) with the same weights.3 shows a slight bump around t = 0. the sample size decreases in lifetime. Their monthly loss rates are observed up to December 2006 (right truncation). see Silverman (1985). However. Besides. σ 2 /w(ti )) with unknown weights w(ti ).assume the acceleration stays constant. Aligning them to the same origin. 0. we have the observations accumulated more in smaller month-on-book and less in larger month-on-book. Consider the sequence of vintages booked in order from 2000 to 2006.1 represents a tweaked sample in retail credit risk management.g.5.2) to run the non-stationary OrdSS.5.55. As a consequence. a common approach is to begin with an unweighted OrdSS and estimate w−1 (ti ) from the local residual sums of squares. i. After zooming in around t ∈ [0. For the setting-back process after attaining the maximum acceleration at t ≈ 0. both non-stationary spline models result in tighter estimates of 95% confidence intervals than the unweighted OrdSS. a false discovery. they differ in estimating the minimum of the acceleration curve at t ≈ 0.36. where the non-stationary OrdSS seems to have underestimated the magnitude of deceleration.. In presence of non-stationary errors N (0.1.

This is also a desired property of the maturation curve for the purpose of extrapolation. which however does not smooth out the maturation curve for large month-on-book. averages have increasing variability. Alternatively.4). By (2. we applied the AdaSS with judgemental prior on heterogeneous smoothness of the target function. spline smoothing may work on the pointwise averages subject to nonstationary errors N (0. our experiences tell that the maturation curve is expected to have increasing degrees of smoothness once the lifetime exceeds certain month-on-book. and (b) the raw inputs for training purpose are contaminated and have abnormal behavior. For demonstration. σ 2 /w(ti )) with weights determined by the number of replicates.3. The reasons of the unstability in the rightmost region could be (a) OrdSS has the intrinsic boundary bias for especially small w(ti ). The exogenous cause of contamination in (b) will be explained in Chapter III. The non-stationary OrdSS estimate is shown in Figure 2. Nevertheless.7: AdaSS estimate of maturation curve for Credit-Risk sample: piecewiseconstant ρ−1 (t) (upper panel) and piecewise-linear ρ−1 (t) (lower panel). we purposefully specified 44 .Figure 2.

45 . Four different examples. it is not clear how the AdaSS can be extended to multivariate smoothing in a tensor-product space. each having a specific feature of heterogeneous smoothness. 2. we will consider the application of AdaSS in a dual-time domain. we proposed a shape estimation procedure based the function derivatives. Compared with the OrdSS performance in the large monthon-book region. transformed to the original scale. as an analogy of the smoothing spline ANOVA models in Gu (2002). In particular.7. Both specifications are non-decreasing in lifetime and they span about the same range. (c) and (d) listed in Section 2. This credit-risk sample will appear again in Chapter III of vintage data analysis. as well as tighter confidence intervals based on judgemental prior. we studied the AdaSS with piecewise adaptation of local penalty and derived the closed-form reproducing kernels. the AdaSS yields smoother estimate of maturation curve. The curve fitting results for both ρ−1 (t) are nearly identical upon visual inspection. Its connection with a time-warping smoothing spline is also under our investigation. To determine the local adaptation function.ρ−1 (t) to take the forms of Figure 2.5 Summary In this chapter we have studied the adaptive smoothing spline for modeling heterogeneously ‘smooth’ functions including possible jumps. where the notion of smoothing spline will be generalized via other Gaussian processes. The AdaSS estimate of maturation curve.t. In next chapter.3. However. The really interesting story is yet to unfold. were used to demonstrate the improvement of AdaSS over OrdSS.r. including (a) piecewise-constant and (b) piecewise-linear. Future works include the AdaSS properties w.1. together with the method of cross-validation for parameter tuning. is plotted by the red solid line in Figure 2.

6 Technical Proofs Proof to Proposition 2. t). .2. g1 H = 0 f1 (u)g1 (u)λ(u)du. (m − 1)! (m − 1)! 46 . K(t. 2 H 1 0 (m) (2. for k = 0. then 1 0 Gm (t. . Then. 1. associate H with the inner product 1 (m) (m) f1 .k.2). By Taylor’s theorem with remainder. The remainder functions f1 (t) = Gm (t.k. 1]. u)g(u)du. with piecewise ρ−1 (t) equals κt∧s − t∧s∧τk K(t. . m − 1}. By similar arguments to Wahba (1990.3) would prove the finite-dimensional expression (2.3).1: by minor modification on Wahba (1990. (2. s) = k=1 τk−1 (ak + bk u) (u − t)m−1 (u − s)m−1 du. u)f (m) (u)du lie in the subspace H = Wm ∩ Bm . any function in Wm (2. . ∀f1 ∈ H (2. Proof to Proposition 2. using the arguments of Wahba (1990. §1. where the reproducing property f1 (·). ·) ∂m K(t. u) = problem: if f1 f1 (t) = 1 0 (m) (t−u)+ (m−1)! (m−1) is the Green’s function for solving the boundary value = g with f1 ∈ Bm = {f : f (k) (0) = 0. Given the bounded integrable function ρ(t) > 0 on t ∈ [0.2: The r.8) that minimizes the AdaSS objective functional (2. §1. §1.7).32) follows that = ρ(t)−1 Gm (s.2-1. s) ∂tm H = f1 (t). u)f (m) (u)du 0 (2.31) and the induced square norm f1 = λ(u)[f1 (u)]2 du.3) can be expressed by m−1 j f (t) = j=0 t (j) f (0) + j! 1 Gm (t.6). one may verify that H is an RKHS with r.30) where Gm (t.

. . . . evaluation by κt∧s m−1 K(t. using the successive integration by parts.For any τk−1 ≤ t1 ≤ t2 < τk . m − 1 t2 Tj ≡ t1 (u − t)m−1−j (u − s)m+j du = (m − 1 − j)! (m + j)! m−1−j (−1)l l=0 (u − t)m−1−j−l (u − s)m+1+j+l (m − 1 − j − l)!(m + 1 + j + l)! t2 . . t2 (u − t)m−1 (u − s)m (u − t)m−1 (u − s)m−1 (ak + bk u) du = d (m − 1)! (m − 1)! (m − 1)! m! t1 t1 t2 (u − t)m−2 (u − s)m (u − t)m−1 (u − s)m t2 (ak + bk u) = (ak + bk u) − du (m − 1)! m! (m − 2)! m! t1 t1 t2 (u − t)m−1 (u − s)m −bk du = · · · . 1. τ2 = t ∧ s ∧ τk for k = 1. κt∧s and adding them all gives the final r. . (m − 1)! m! t1 t2 (ak + bk u) it is straightforward (but tedious) to derive that t2 (ak + bk u) t1 m−1 (u − t)m−1 (u − s)m−1 du (m − 1)! (m − 1)! j t2 t1 m−1 (u − t)m−1−j (u − s)m+j = (−1) (ak + bk u) (m − 1 − j)! (m + j)! j=0 − bk j=0 (−1)j Tj in which Tj denotes the integral evaluations for j = 0. . s) = k=1 j=0 (−1)j (ak + bk u) m−1−j (u − t)m−1−j (u − s)m+j (m − 1 − j)!(m + j)! − t∧s∧τk −bk l=0 (u − t)m−1−j−l (u − s)m+1+j+l (−1)l (m − 1 − j − l)!(m + 1 + j + l)! .k. . τk−1 47 . t1 − Choosing τ1 = τk−1 .

vj ). vkj )} for segment k = 1.CHAPTER III Vintage Data Analysis 3. longitudinal or panel data with dual-time characteristics. x(mkjl . . 1. . each subject j we consider corresponds to a vintage originated at time vj such that v1 < v2 < . . tkjl . where Lj is the total number of observations on the j-th subject. tjl . vj ) a vintage diagram. . In situations where more than one segment of dual-time observations are available.3 for an illustration of vintage diagram and 48 . . vj ) ∈ Rp . . . Lj . See Figure 1. . . To this end. Also observed on the vintage diagram could be the covariate vectors x(mjl . J .1 Introduction The vintage data represents a special class of functional. (3. 2. . . . l = 1. . . j = 1. and denote the vintage data by {y(mkjl . . . < vJ and the elapsed time mjl = tjl − vj measures the lifetime or age of the subject. Unlike purely longitudinal setup. . . . mjl = 0. It shares the cross-sectional time series feature such that there are J subjects each observed at calendar time points {tjl . vj + 1. . we denote the vintage data by y(mjl . We call the grids of such (mjl . vj + 2.1) where the dual-time points are usually sampled from the regular grids such that for each vintage j. vkj ). tjl . l = 1. tjl . we may denote the (static) segmentation variables by zk ∈ Rq . and tjl = vj . K. . tkjl . .. Lj }. . .

To achieve this. and several interesting prototypes are illustrated in Figure 3.1. by integrating ideas from functional data analysis (FDA). the U.S. m. Meanwhile.dual-time collection of observations. which is flexible Despite the overlap between FDA and LDA. time and vintage dimensions are all of potential interest in financial risk management. the vintage diagram corresponds to the hyperplane {(t. shown in Table 1. In practice. The vintage diagram could be truncated in either age or time. Each vintage series evolves in the 45◦ direction such that for the same vintage vj . we aim to develop a formal statistical framework for vintage data analysis (VDA). v) : v = t − m} in the 3D space. The time effects are often dynamic and volatile due to macroeconomic environment. Bearing in mind these practical considerations. Besides. the calendar time tj and the lifetime mj move at the same speed along the diagonal — therefore. The empirical evidences tell that the age effects of self-maturation nature are often subject to smoothness in variation. observations up to today). there exist other prototypes of vintage data truncation. The triangular prototype on the top-left panel is the most typical as the time series data are often subject to the systematic truncation from right (say. are truncated in lifetime to be 20 years maximum.1. we will take a Gaussian process modeling approach. as well as the hierarchical structure of multisegment credit portfolios. longitudinal data analysis (LDA)1 and time series analysis (TSA). 1 49 . and they could be treated as random effects upon appropriate structural assumption. the vintage effects are conceived as the forward-looking measure of performance since origination. The age. Treasury rates released for multiple maturity dates serve as an example of the parallelogram prototype. The corporate default rates released by Moody’s. The other trapezoidal prototype corresponds to the time-snapshot selection common to credit risk modeling exercises. we try to differentiate them methodologically by tagging FDA with nonparametric smoothing and tagging LDA with random-effect modeling. The rectangular diagram is obtained after truncation in both lifetime and calendar time.

Figure 3.1: Vintage diagram upon truncation and exemplified prototypes.

where we discuss Kriging. 3. the dual-time domain Ω consists of isolated 45◦ lines {(m.4 covers the semiparametric regression given the dual-time covariates and segmentation variables. This chapter develops the VDA framework for continuous type of responses (usually.1 is the discrete counterpart of Ω upon right truncation in calendar time.2 MEV Decomposition Framework Let V = {vj : 0 ≡ v1 < v2 < . and real applications in both corporate and retail credit risk cases. Computational results are presented in Section 3. It is organized as follows. t) space. v) : m ≥ 0. Section 3. for which we propose the MEV (maturation-exogenous-vintage) decomposition modeling framework η(y(m. . (3. t = v + m. The triangular vintage diagram in Figure 3. Let Ω be the support of the vintage data (3. m ≥ 0} for each fixed vintage vj ∈ V. rates). . where m denotes the lifetime and t denotes the calendar time. t. An efficient backfitting algorithm is provided for model estimation. t. Unlike a usual 2D (m. and define the dual-time domain Ω = (m. v)) = f (m) + g(t) + h(v) + ε. v ∈ V .1). smoothing spline and kernel methods.enough to cover a diversity of scenarios.5. In Section 3. we discuss the maturationexogenous-vintage (MEV) decomposition in general. The MEV models based on Gaussian processes are presented in Section 3.3. t) : t − m = vj .6 and give technical proofs in the last section.2. < vJ < ∞} be a discrete set of origination times of J vintages. We conclude the chapter in Section 3.2) 51 . including a simulation study for assessing the MEV decomposition technique.
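To make the MEV decomposition (3.2) concrete, here is a small synthetic sketch (added for illustration; the particular choices of f, g, h below are invented for the demo) that generates dual-time responses on a triangular vintage diagram as η(y) = f(m) + g(t) + h(v) + ε; it also exhibits the intrinsic identification problem, since adding γ·m to f, subtracting γ·t from g and adding γ·v to h leaves the fitted surface unchanged on the hyperplane t = v + m:

    import numpy as np

    rng = np.random.default_rng(3)
    V, T = 24, 36                                     # monthly vintages v = 0..23, calendar t = 0..35

    f = lambda m: np.log1p(m) / 3.0                   # maturation curve (invented)
    g = lambda t: 0.2 * np.sin(2 * np.pi * t / 12.0)  # exogenous (macro) effect (invented)
    h = lambda v: 0.05 * ((v % 6) - 2.5)              # vintage heterogeneity (invented)

    rows = []
    for v in range(V):
        for t in range(v, T):
            m = t - v
            y = f(m) + g(t) + h(v) + 0.02 * rng.standard_normal()
            rows.append((m, t, v, y))
    m_, t_, v_, y_ = np.array(rows).T

    # Identification: shifting a linear trend gamma between the components changes nothing,
    # because t = v + m everywhere on the vintage diagram.
    gamma = 0.1
    y_alt = (f(m_) + gamma * m_) + (g(t_) - gamma * t_) + (h(v_) + gamma * v_)
    assert np.allclose(y_alt, f(m_) + g(t_) + h(v_), atol=1e-12)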

as stated formally in the following lemma.2) defined on the dual-time domain Ω. write f = f0 + f1 with the linear part f0 and the nonlinear part f1 and similarly write g = g0 + g1 . g. h = h0 + h1 . g0 . The identification (or collinearity) problem in Lemma 3. and the three MEV components have the following practical interpretations: a) f (m) represents the maturation curve characterizing the endogenous growth. c) h(v) represents the vintage heterogeneity measuring the origination quality. We begin with some general remarks about model identification. time or vintage direction. v) : v = t − m} (see Figure 3.where η is a pre-specified transform function. one may group the large m region. the small t region and the large v region. given the vintage data of triangular prototype. as 52 .3) The question is. ε is the random error. m. nonparametric or stochastic process models. Suppose the maturation curve f (m) in (3.1 top right panel). Then. do these constraints suffice to ensure the model identifiability? It turns out for the dual-time domain Ω. (3.2) absorbs the intercept effect. h be open to flexible choices of parametric. We need to break up such linear dependency in order to make the MEV models estimable. For the MEV models (3. and the constraints Eg = Eh = 0 are set as usual: tmax E tmin g(t)dt = E V h(v)dv = 0. one of f0 .1 (Identification). h0 is not identifiable. For the time being.1 is incurred by the hyperplane nature of the dual-time domain Ω in the 3D space: {(t. b) g(t) represents the exogenous influence by macroeconomic conditions. all the MEV decomposition models are subject to the linear dependency. For example. A practical way is to perform binning in age. Lemma 3. let f.

The second example is the generalized additive model (GAM) in nonparametric statistics. The MEV decomposition with zero-trend vintage effects may be referred to as the ad hoc approach. h. One may retrieve h0 (v) = γv by simultaneously adjusting f (m) → f (m) + γm and g(t) → g(t) − γt. Suppose there are I distinct m-values and L distinct t-values in the vintage data. g. this strategy is simple to implement. The trend removal for h(v) is needed only when there is no such extra knowledge. An alternative strategy for breaking up the linear dependency is to directly remove the trend for one of f. The APC Model. g. h functions. provided that f. g. then it would be absorbed by f (m) and g(t). The third example is an industry GAMEED approach that may reduce to a sequential type of MEV decomposition. A commonly used binning procedure for h(v) is to assume the piecewise-constant vintage effects by grouping monthly vintages every quarter or every year. The first example is a conventional age-period-cohort (APC) model in social sciences of demography and epidemiology. where the smoothing spline is used as the scatterplot smoother. setting f (m) = µ + I i=1 αi I(m = mi ). By default. h. we remove the trend of h and let it measure only the nonlinear heterogeneity. By (3. They all can be viewed as one or another type of vintage data analysis. In what follows we review several existing approaches that are related to MEV decomposition. Indeed. To determine the slope γ requires extra knowledge about the trend of at least one of f. Then. A word of caution is in order in situations where there does exist an underlying trend h0 (v). The last example is another industrial DtD technique involving the interactions of MEV components. one only needs to fix the linear bases for any two of f. g(t) = L l=1 βl I(t = tl ) and h(v) = 53 . g.these regions have few observations. h all could be expanded by certain basis functions.2).

. the ordinary least squares (OLS) cannot be used for fitting y = Xθ + ε.J j=1 γj I(v = vj ) gives us the the APC (age-period-cohort) model η(y(mi . vJ ) to satisfy mI − tL + vJ = 0 (after reshuffling the indices of v). This is a key observation by Kupper (1985) that generated long-term discussions and debates on the estimability of the APC model. (1973) and Mason and Wolfinger (2002). The identification problem of the APC model (3. . . x1 . Without loss of generality.1. et al. see Mason. whose limiting case leads to a so-called intrinsic estimator θ = (XT X)+ XT y. . let us select the factorial levels (mI . Among others. tl . t = L l=1 tl /L and v = ¯ J j=1 vj /J. where y is the the vector consisting of η(y(mi . then one may reformulate the APC model as y = Xθ + ε. where A+ is the Moore-Penrose generalized inverse of A satisfying that AA+ A = A. tL . . . xi is defined by xi (m) = I(m = mi )−I(m = mI ) for i = 1.4) follows Lemma 3. xI−1 . . since the regression matrix is linearly dependent through I−1 L−1 J−1 (mi − i=1 (α) m)xi ¯ − l=1 (tl − ¯ (β) t )xl + j=1 ¯ ¯ ¯ (vj − v )xj = m − t + v . as usual. . (3. So. xL−1 . xJ−1 (α) (α) with dummy variables. . Assume i αi = l βl = j vj = 0. ¯ (γ) (3. Since the vintage data are always unbalanced 54 . . . Fu (2000) suggested to estimate the APC model by ridge regression. vj )) = µ + αi + βl + γj + ε. I − 1. For example. .4) where the vintage effects {γj s} used to be called the cohort effects. . x1 . . .5) where m = ¯ I i=1 ¯ mi /I. vj )) and (α) (α) (β) (β) (γ) (γ) X = 1. tl . x1 . The discrete APC model tends to massively identify the effects at the grid level and often ends up with unstable estimates. .

in m, t or v projection, there would be very limited observations for estimating certain factorial effects in (3.4). The APC model is formulated over-flexibly in the sense that it does not regularize the grid effect by neighbors. Fu (2008) suggested to use the spline model for h(v); however, such spline regularization would not bypass the identification problem unless it does not involve a linear basis. The Additive Splines. If f, g, h are all assumed to be unspecified smooth functions, the MEV models would become the generalized additive models (GAM); see Hastie and Tibshirani (1990). For example, using the cubic smoothing spline as the scatterplot smoother, we have the additive splines model formulated by
J Lj 2 2

arg min
f,g,h j=1 l=1

η(yjl ) − f (mjl ) − g(tjl ) − h(vj )
2

+ λf

f (2) (m) dm h(2) (v) dv ,
2

+λg

g (2) (t) dt + λh

(3.6)

where λf , λg , λh > 0 are the tuning parameters separately controlling the smoothness of each target function; see (2.2). Usually, the optimization (3.6) is solved by the backfitting algorithm that iterates

ˆ f ← Sf g ← Sg ˆ ˆ h ← Sh

ˆ η(yjl ) − g (tjl ) − h(vj ) ˆ ˆ ˆ η(yjl ) − f (mjl ) − h(vj )

j,l j,l j,l

(centered) (centered)

(3.7)

ˆ η(yjl ) − f (mjl ) − g (tjl ) ˆ

after initializing g = h = 0. See Chapter II about the formation of smoothing matrices ˆ ˆ Sf , Sg , Sh , for which the corresponding smoothing parameters can be selected by the method of generalized cross-validation. However, the linear parts of f, g, h are not identifiable, by Lemma 3.1. To get around the identification problem, we may take

55

the ad hoc approach to remove the trend of h(v). The trend-removal can be achieved by appending to every iteration of (3.7):
J J 2 −1 vj j=1 j=1

ˆ ˆ h(v) ← h(v) − v

ˆ vj h(vj ).

We provide more details on (3.6) to shed light on its further development in the next section. By Proposition 2.1 in Chapter II, each of f, g, h can be expressed by the basis expansion of the form (2.8). Taking into account the identifiability of both the intercept and the trend, we may express f, g, h by
\[
\begin{aligned}
f(m) &= \mu_0 + \mu_1 m + \sum_{i=1}^{I} \alpha_i K(m, m_i), \\
g(t) &= \mu_2 t + \sum_{l=1}^{L} \beta_l K(t, t_l), \\
h(v) &= \sum_{j=1}^{J} \gamma_j K(v, v_j).
\end{aligned}
\tag{3.8}
\]

Denote $\mu = (\mu_0, \mu_1, \mu_2)^T$, $\alpha = (\alpha_1, \ldots, \alpha_I)^T$, $\beta = (\beta_1, \ldots, \beta_L)^T$, and $\gamma = (\gamma_1, \ldots, \gamma_J)^T$. We may equivalently write (3.6) as the regularized least squares problem

\[ \min \;\; \frac{1}{n}\big\| y - B\mu - \Sigma_f\alpha - \Sigma_g\beta - \Sigma_h\gamma \big\|^2 + \lambda_f \|\alpha\|^2_{\widetilde{\Sigma}_f} + \lambda_g \|\beta\|^2_{\widetilde{\Sigma}_g} + \lambda_h \|\gamma\|^2_{\widetilde{\Sigma}_h}, \tag{3.9} \]

where $y$ is the vector of $n = \sum_{j=1}^{J} L_j$ observations $\{\eta(y_{jl})\}$, $\|\alpha\|^2_{\widetilde{\Sigma}_f} = \alpha^T\widetilde{\Sigma}_f\alpha$ (and similarly for $\beta$ and $\gamma$), and $B, \Sigma_f, \widetilde{\Sigma}_f$ are formed by

\[ B = \big[(1, m_{jl}, t_{jl})\big]_{j,l;\; n\times 3}, \qquad \Sigma_f = \big[K(m_{jl}, m_i)\big]_{n\times I}, \qquad \widetilde{\Sigma}_f = \big[K(m_i, m_{i'})\big]_{I\times I}, \]

and $\Sigma_g, \widetilde{\Sigma}_g, \Sigma_h, \widetilde{\Sigma}_h$ are formed similarly.

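Since (3.9) is quadratic in $(\mu, \alpha, \beta, \gamma)$, it can be solved in closed form from one penalized normal-equation system. The following Python sketch assembles that system for user-supplied kernels and fixed smoothing parameters; the kernel callables and variable names are illustrative assumptions rather than part of the thesis notation.

```python
import numpy as np
from scipy.linalg import block_diag

def solve_mev_rls(eta_y, m, t, v, Kf, Kg, Kh, lam_f, lam_g, lam_h):
    """Closed-form solution of the regularized least squares problem (3.9)."""
    n = len(eta_y)
    mk, tk, vk = np.unique(m), np.unique(t), np.unique(v)              # distinct knots
    B = np.column_stack([np.ones(n), m, t])                            # n x 3 mean-function design
    Sf, Sg, Sh = Kf(m[:, None], mk), Kg(t[:, None], tk), Kh(v[:, None], vk)
    Pf, Pg, Ph = Kf(mk[:, None], mk), Kg(tk[:, None], tk), Kh(vk[:, None], vk)  # penalty Gram matrices
    X = np.hstack([B, Sf, Sg, Sh])
    # No penalty on mu; n*lambda-weighted Gram penalties on alpha, beta, gamma (the 1/n in (3.9))
    P = block_diag(np.zeros((3, 3)), n * lam_f * Pf, n * lam_g * Pg, n * lam_h * Ph)
    theta = np.linalg.solve(X.T @ X + P, X.T @ eta_y)
    mu, alpha = theta[:3], theta[3:3 + len(mk)]
    beta = theta[3 + len(mk):3 + len(mk) + len(tk)]
    gamma = theta[3 + len(mk) + len(tk):]
    return mu, alpha, beta, gamma

# Example kernel choice (an assumption for illustration):
# Kf = lambda a, b: np.exp(-np.abs(a - b) / 10.0)
```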
The GAMEED Approach. The GAMEED refers to an industry approach, namely the generalized additive maturation and exogenous effects decomposition, adopted by Enterprise Quantitative Risk Management, Bank of America, N.A. (Sudjianto, et al., 2006). It has also been

documented as part of the patented work "Risk and Reward Assessment Mechanism" (USPTO: 20090063361). Essentially, it has two practical steps:

1. Estimate the maturation curve $\hat{f}(m)$ and the exogenous effect $\hat{g}(t)$ from

\[ \eta(y(m, t; v)) = f(m; \alpha) + g(t; \beta) + \varepsilon, \qquad \mathrm{E}g = 0, \]

where $f(m; \alpha)$ can either follow a spline model or take a parametric form supported by prior knowledge, and $g(t; \beta)$ can be specified like in (3.4) upon necessary time binding while retaining the flexibility in modeling the macroeconomic influence. If necessary, one may further model $\hat{g}(t; \beta)$ by a time-series model in order to capture the autocorrelation and possible jumps.

2. Estimate the vintage-specific sensitivities to $\hat{f}(m)$ and $\hat{g}(t)$ by performing the regression fitting
\[ \eta(y(m_{jl}, t_{jl}; v_j)) = \gamma_j + \gamma_j^{(f)}\hat{f}(m_{jl}) + \gamma_j^{(g)}\hat{g}(t_{jl}) + \varepsilon \]

for each vintage $j$ (upon necessary bucketing).

The first step of GAMEED is a reduced type of MEV decomposition that does not suffer from the identification problem. The estimates of $f(m)$ and $g(t)$ can be viewed as the common factors to all vintages. In the second step, the vintage effects are measured through not only the main (intercept) effects, but also the interactions with $\hat{f}(m)$ and $\hat{g}(t)$. Clearly, when the interaction sensitivities $\gamma_j^{(f)} = \gamma_j^{(g)} = 1$, the

industry GAMEED approach becomes a sequential type of MEV decomposition. The DtD Technique. The DtD refers to the dual-time dynamics, adopted by Strategic Analytics Inc., for consumer behavior modeling and delinquency forecasting. It was brought to the


public domain by Breeden (2007). Using our notations, the DtD model takes the form

\[ \eta(y(m_{jl}, t_{jl}, v_j)) = \gamma_j^{(f)} f(m_{jl}) + \gamma_j^{(g)} g(t_{jl}) + \varepsilon, \tag{3.10} \]

and $\eta^{-1}(\cdot)$ corresponds to the so-called superposition function in Breeden (2007). Unlike the sequential GAMEED above, the DtD model involves the interactions between the vintage effects and $(f(m), g(t))$; Breeden (2007) suggested to iteratively fit $f(m)$, $g(t)$ and $\gamma_j^{(f)}, \gamma_j^{(g)}$ in a simultaneous manner. Clearly, the identification problem could as well happen to the DtD technique, unless it imposes structural assumptions that could break up the trend dependency. To see this, one may use a Taylor expansion for each nonparametric function in (3.10), then apply Lemma 3.1 to the linear terms. Unfortunately, such an identification problem was not addressed by Breeden (2007), and the stability of the iterative estimation algorithm might be an issue.

3.3 Gaussian Process Models

In the MEV decomposition framework (3.2), the maturation curve $f(m)$, the exogenous influence $g(t)$ and the vintage heterogeneity $h(v)$ are postulated to capture the endogenous, exogenous and origination effects on the dual-time responses. In this section, we propose a general approach to MEV decomposition modeling based on Gaussian processes and provide an efficient algorithm for model estimation.

3.3.1 Covariance Kernels

Gaussian processes are rather flexible in modeling a function $Z(x)$ through a mean function $\mu(\cdot)$ and a covariance kernel $K(\cdot,\cdot)$:

\[ \mathrm{E}Z(x) = \mu(x), \qquad \mathrm{E}\big[Z(x) - \mu(x)\big]\big[Z(x') - \mu(x')\big] = \sigma^2 K(x, x'), \tag{3.11} \]

where $\sigma^2$ is the variance scale. The covariance kernel $K(\cdot,\cdot)$ plays a key role in constructing Gaussian processes. It is symmetric, $K(x, x') = K(x', x)$, and positive definite: $\int\!\!\int K(x, x') f(x) f(x')\, dx\, dx' > 0$ for any non-zero $f \in L_2$. For ease of discussion, we will write $\mu(x)$ separately and only consider the zero-mean Gaussian process $Z(x)$. Then, any finite-dimensional collection of random variables $Z = (Z(x_1), \ldots, Z(x_n))$ follows a multivariate normal distribution $Z \sim N_n(0, \sigma^2\Sigma)$ with $\Sigma = [K(x_i, x_j)]_{n\times n}$. The kernel is said to be stationary if $K(x, x')$ depends only on $|x_i - x_i'|$, $i = 1, \ldots, d$ for $x \in \mathbb{R}^d$, and further isotropic or radial if $K(x, x')$ depends only on the distance $\|x - x'\|$; for $x \in \mathbb{R}$, a stationary kernel is also isotropic. There is a parsimonious approach to using stationary kernels $K(\cdot,\cdot)$ to model certain nonstationary Gaussian processes such that $\mathrm{Cov}[Z(x), Z(x')] = \sigma(x)\sigma(x')K(x, x')$, where the variance scale $\sigma^2(x)$ requires a separate model of either parametric or nonparametric type; for example, Fan, Huang and Li (2007) used the local polynomial approach to smoothing $\sigma^2(x)$. Several popular classes of covariance kernels are listed below:

\[
\begin{aligned}
\text{Exponential family:}\quad & K(x, x') = \exp\Big\{-\Big(\tfrac{|x - x'|}{\phi}\Big)^{\kappa}\Big\}, && 0 < \kappa \le 2,\; \phi > 0 \\
\text{Mat\'ern family:}\quad & K(x, x') = \tfrac{2^{1-\kappa}}{\Gamma(\kappa)} \Big(\tfrac{|x - x'|}{\phi}\Big)^{\kappa} \mathcal{K}_{\kappa}\Big(\tfrac{|x - x'|}{\phi}\Big), && \kappa, \phi > 0 \\
\text{AdaSS r.k.\ family:}\quad & K(x, x') = \int \phi^{-1}(u)\, G_{\kappa}(x, u)\, G_{\kappa}(x', u)\, du, && \phi(x) > 0
\end{aligned}
\tag{3.12}
\]

Both the exponential and Mat\'ern families consist of stationary kernels. In the Mat\'ern family, $\mathcal{K}_{\kappa}(\cdot)$ denotes the modified Bessel function of order $\kappa$ (see Abramowitz and Stegun, 1972), for which $\phi$ is the scale parameter and $\kappa$ can be viewed as the shape parameter. More examples of covariance kernels can be referred to Stein (1999) and Rasmussen and Williams (2006). Often used are the half-integer orders $\kappa = 1/2, 3/2, 5/2$, and the corresponding Mat\'ern covariance kernels take the following forms:

\[
\begin{aligned}
K(x, x') &= e^{-|x - x'|/\phi}, && \kappa = 1/2 \\
K(x, x') &= \Big(1 + \tfrac{|x - x'|}{\phi}\Big) e^{-|x - x'|/\phi}, && \kappa = 3/2 \\
K(x, x') &= \Big(1 + \tfrac{|x - x'|}{\phi} + \tfrac{|x - x'|^2}{3\phi^2}\Big) e^{-|x - x'|/\phi}, && \kappa = 5/2
\end{aligned}
\tag{3.13}
\]

It is interesting that these half-integer Mat\'ern kernels correspond to autocorrelated Gaussian processes of order $\kappa + 1/2$; see Stein (1999, p. 31). In particular, the $\kappa = 1/2$ Mat\'ern kernel, which is equivalent to the exponential kernel with $\kappa = 1$, is the covariance function of the Ornstein-Uhlenbeck (OU) process.

The AdaSS family of reproducing kernels (r.k.'s) is based on the Green's function $G_{\kappa}(x, u) = (x - u)_{+}^{\kappa-1}/(\kappa-1)!$ with order $\kappa$ and local adaptation function $\phi(x) > 0$. They are nonstationary covariance kernels, with the corresponding Gaussian processes discussed in Chapter 2, where we have derived the explicit r.k. expressions with piecewise $\phi^{-1}(x)$. In this chapter, we concentrate on a particular class of AdaSS kernels with piecewise-constant adaptation $\phi^{-1}(x) = \phi_k$ for $x \in [\tau_{k-1}, \tau_k)$ and knots $x_{\min} \equiv \tau_0 < \tau_1 < \cdots < \tau_K \equiv x_{\max}$,

\[ K(x, x') = \sum_{k=1}^{\min\{k:\; \tau_k > x \wedge x'\}} \phi_k \big[(x \wedge x') \wedge \tau_k - \tau_{k-1}\big]. \tag{3.14} \]

Among others, the Gaussian process with kernel (3.14) generalizes the standard Wiener process with nonstationary volatility, which is useful for modeling the heterogeneous behavior of an MEV component.

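For reference, a small Python sketch of the kernels in (3.12)-(3.14) is given below (NumPy only). The function names and default parameters are illustrative, and only the listed half-integer Mat\'ern orders are implemented.

```python
import numpy as np

def exponential_kernel(x, xp, phi=1.0, kappa=1.0):
    """Powered-exponential family in (3.12); kappa = 2 gives the squared exponential kernel."""
    return np.exp(-(np.abs(x - xp) / phi) ** kappa)

def matern_kernel(x, xp, phi=1.0, kappa=1.5):
    """Half-integer Matern kernels (3.13) for kappa in {1/2, 3/2, 5/2}."""
    d = np.abs(x - xp) / phi
    if kappa == 0.5:
        return np.exp(-d)
    if kappa == 1.5:
        return (1.0 + d) * np.exp(-d)
    if kappa == 2.5:
        return (1.0 + d + d ** 2 / 3.0) * np.exp(-d)
    raise ValueError("only half-integer orders 1/2, 3/2, 5/2 are implemented here")

def adass_kernel(x, xp, knots, phi):
    """Order-1 AdaSS kernel (3.14), evaluated pointwise, with phi_k constant on [tau_{k-1}, tau_k)."""
    s = min(x, xp)
    out, lo = 0.0, knots[0]
    for k in range(1, len(knots)):
        hi = knots[k]
        out += phi[k - 1] * max(min(s, hi) - lo, 0.0)
        lo = hi
    return out
```

In the MEV models below, each of $K_f$, $K_g$, $K_h$ may be taken from any of these families.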
3.3.2 Kriging, Spline and Kernel Methods

Using Gaussian processes to model the component functions in (3.2) would give us a class of Gaussian process MEV models. In particular, we consider

\[ \eta(y(m, t, v)) = \mu(m, t, v) + Z_f(m) + Z_g(t) + Z_h(v) + \varepsilon, \tag{3.15} \]

where $\varepsilon \sim N(0, \sigma^2)$ is the random error. To bypass the identification problem in Lemma 3.1, the overall mean function is postulated to be $\mu(m, t, v) \in \mathcal{H}_0$ on the hyperplane of the vintage diagram, where

\[ \mathcal{H}_0 \subseteq \mathrm{span}\{1, m, t\} \cap \mathrm{null}\{Z_f(m), Z_g(t), Z_h(v)\}, \tag{3.16} \]

and $Z_f(m), Z_g(t), Z_h(v)$ are three zero-mean Gaussian processes with covariance kernels $\sigma_f^2 K_f$, $\sigma_g^2 K_g$, $\sigma_h^2 K_h$, respectively. In general, $\mu(m, t, v) = \mu^T b(m, t, v)$ for $\mu \in \mathbb{R}^d$. For example, choosing cubic spline kernels for all $f, g, h$ allows for $\mathcal{H}_0 = \mathrm{span}\{1, m, t\}$, then $\mu(m, t, v) = \mu_0 + \mu_1 m + \mu_2 t$ with $b(m, t, v) = (1, m, t)^T \in \mathbb{R}^3$; in other situations, the functional space $\mathcal{H}_0$ might shrink to be $\mathrm{span}\{1\}$ in order to satisfy (3.16), then $\mu(m, t, v) = \mu_0$. For simplicity, let us assume that $Z_f(m), Z_g(t), Z_h(v)$ are mutually independent for $x \equiv (m, t, v)$, among other choices. Consider the sum of univariate Gaussian processes

\[ Z(x) = Z_f(m) + Z_g(t) + Z_h(v), \tag{3.17} \]

with mean zero and covariance $\mathrm{Cov}[Z(x), Z(x')] = \sigma^2 K(x, x')$. By the independence assumption,

\[ K(x, x') = \lambda_f^{-1} K_f(m, m') + \lambda_g^{-1} K_g(t, t') + \lambda_h^{-1} K_h(v, v'), \tag{3.18} \]

with $\lambda_f = \sigma^2/\sigma_f^2$, $\lambda_g = \sigma^2/\sigma_g^2$ and $\lambda_h = \sigma^2/\sigma_h^2$. Then, given the vintage data (3.1) with $n = \sum_{j=1}^{J} L_j$ observations, we may write (3.15) in the vector-matrix form

\[ y = B\mu + z, \qquad \mathrm{Cov}[z] = \sigma^2\big(\Sigma + I\big), \tag{3.19} \]

where $y = [\eta(y_{jl})]_{n\times 1}$, $B = [b_{jl}^T]_{n\times d}$, $z = [Z(x_{jl}) + \varepsilon]_{n\times 1}$ and $\Sigma = [K(x_{jl}, x_{j'l'})]_{n\times n}$. By either GLS (generalized least squares) or MLE (maximum likelihood estimation),

$\mu$ is estimated to be

\[ \hat{\mu} = \Big[ B^T \big(\Sigma + I\big)^{-1} B \Big]^{-1} B^T \big(\Sigma + I\big)^{-1} y, \tag{3.20} \]

where the invertibility of $\Sigma + I$ is always guaranteed. In the Gaussian process models there are additional unknown parameters for defining the covariance kernel (3.18), namely the smoothing parameters $\lambda = (\lambda_f, \lambda_g, \lambda_h)$, defined as the ratios of $\sigma^2$ to the variance scales of $Z_f, Z_g, Z_h$ and playing the same role as in the context of ridge regression; the structural parameters $\theta = (\theta_f, \theta_g, \theta_h)$, e.g. the scale parameter $\phi$ and shape parameter $\kappa$ in the exponential or Mat\'ern family of kernels, or the local adaptation $\phi(\cdot)$ in the AdaSS kernel; as well as the unknown variance parameter $\sigma^2$ (a.k.a. the nugget effect). By (3.19), and after dropping the constant term, the log-likelihood function is given by

\[ \ell(\mu, \lambda, \theta, \sigma^2) = -\frac{n}{2}\log(\sigma^2) - \frac{1}{2}\log\big|\Sigma + I\big| - \frac{1}{2\sigma^2}(y - B\mu)^T\big(\Sigma + I\big)^{-1}(y - B\mu), \]

in which $\Sigma$ depends on $(\lambda, \theta)$. Whenever $(\lambda, \theta)$ are fixed, the $\mu$ estimate is given by (3.20), and one may estimate the variance parameter to be

\[ \hat{\sigma}^2 = \frac{1}{n - p}\,(y - B\hat{\mu})^T\big(\Sigma + I\big)^{-1}(y - B\hat{\mu}), \tag{3.21} \]

where $p = \mathrm{rank}(B)$ is plugged in order to make $\hat{\sigma}^2$ an unbiased estimator. Otherwise, $(\lambda, \theta)$ admit no closed-form solutions and numerical procedures like the Newton-Raphson algorithm are needed to estimate them iteratively; see Fang, Li and Sudjianto (2006, §5). Such a standard maximum likelihood estimation method is computationally intensive. In the next subsection, we will discuss how to iteratively estimate the parameters by an efficient backfitting algorithm.

MEV Kriging means the prediction based on the Gaussian process MEV model (3.15), where the naming of `Kriging' follows the geostatistical convention; see Cressie (1993) in spatial statistics, and see also Fang, Li and Sudjianto (2006) for the use of Kriging in the context of computer experiments. Kriging can be equally treated by either multivariate normal distribution theory or the Bayesian approach, since the conditional expectation of the multivariate normal distribution equals the Bayesian posterior mean. The former approach is taken here for making prediction from the noisy observations, for which the Kriging behaves as a smoother rather than an interpolator. Let us denote the prediction of $\eta(\hat{y}(x))$ at $x = (m, t, v)$ by

\[ \zeta(x) = \mu^T b(x) + \mathrm{E}\big[ Z_f(m) + Z_g(t) + Z_h(v) + \varepsilon \,\big|\, z \big], \]

where $\mathrm{E}[Z\,|\,z]$ denotes the conditional expectation of $Z$ given $z = y - B\mu$. By the property of the multivariate normal distribution,

\[ \hat{\zeta}(x) = \mu^T b(x) + \xi_n^T(x)\big(\Sigma + I\big)^{-1}\big(y - B\mu\big), \tag{3.22} \]

where $\xi_n(x)$ denotes the $n\times 1$ vector of $K(x, x_{jl})$ evaluated between $x$ and each sampling point. The Kriging (3.22) is known to be a best linear unbiased predictor. It also has a smoothing spline reformulation that covers the additive cubic smoothing splines model (3.6) as a special case. Formally, we have the following proposition.

Proposition 3.2 (MEV Kriging as Spline Estimator). The MEV Kriging (3.22) can be formulated as a smoothing spline estimator

\[ \mathop{\arg\min}_{\zeta\in\mathcal{H}} \; \sum_{j=1}^{J}\sum_{l=1}^{L_j} \Big[ \eta(y_{jl}) - \zeta(x_{jl}) \Big]^2 + \|\zeta\|^2_{\mathcal{H}_1}, \tag{3.23} \]

where $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$ with $\mathcal{H}_0 \subseteq \mathrm{span}\{1, m, t\}$, and $\mathcal{H}_1$ is the reproducing kernel Hilbert space induced by (3.18) such that $\mathcal{H}_0 \perp \mathcal{H}_1$.

It connects the kernel methods in machine learning that advocate the use of covariance kernels for mapping the original inputs to some high-dimensional feature space. Based on our previous discussions, such a kernelized feature space corresponds to the family of Gaussian processes, or the RKHS; see e.g. Rasmussen and Williams (2006) for more details. The spline solution to (3.23) can be written as a finite-dimensional expression

\[ \zeta(x) = \mu^T b(x) + \sum_{j=1}^{J}\sum_{l=1}^{L_j} c_{jl} K(x, x_{jl}), \tag{3.24} \]

with coefficients determined by the regularized least squares

\[ \min \; \big\| y - B\mu - \Sigma c \big\|^2 + \|c\|^2_{\Sigma}. \tag{3.25} \]

Furthermore, we have the following simplification.

Proposition 3.3 (Additive Separability). The MEV Kriging (3.22) can be expressed additively as

\[ \zeta(x) = \mu^T b(x) + \sum_{i=1}^{I}\alpha_i K_f(m, m_i) + \sum_{l=1}^{L}\beta_l K_g(t, t_l) + \sum_{j=1}^{J}\gamma_j K_h(v, v_j), \tag{3.26} \]

based on the $I$ distinct $m$-values, $L$ distinct $t$-values and $J$ distinct $v$-values in (3.1). The coefficients can be estimated by

\[ \min \; \big\| y - B\mu - \Sigma_f\alpha - \Sigma_g\beta - \Sigma_h\gamma \big\|^2 + \lambda_f \|\alpha\|^2_{\Sigma_f} + \lambda_g \|\beta\|^2_{\Sigma_g} + \lambda_h \|\gamma\|^2_{\Sigma_h}, \tag{3.27} \]

where the matrix notations are the same as in (3.9) up to mild modifications. Proposition 3.3 converts the MEV Kriging (3.22) to a kernelized additive model of the form (3.26); it breaks down the high-dimensional parameter estimation and makes possible the development of a backfitting algorithm to deal with three univariate Gaussian processes separately.

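To tie (3.18)-(3.22) together, here is a minimal Python sketch of the MEV Kriging predictor under the additive kernel, assuming known smoothing parameters; the GLS step follows (3.20) and the predictor follows (3.22). Function and variable names are illustrative assumptions.

```python
import numpy as np

def mev_kriging(eta_y, X, Kf, Kg, Kh, lam, x_new):
    """MEV Kriging: GLS mean estimate (3.20) plus the conditional-expectation predictor (3.22).

    X     : (n, 3) array of observed (m, t, v) points
    x_new : (p, 3) array of prediction points
    lam   : (lam_f, lam_g, lam_h) smoothing parameters
    """
    def K(A, C):
        # Aggregated additive kernel (3.18) between point sets A and C
        return (Kf(A[:, 0:1], C[:, 0]) / lam[0]
                + Kg(A[:, 1:2], C[:, 1]) / lam[1]
                + Kh(A[:, 2:3], C[:, 2]) / lam[2])

    n = len(eta_y)
    B = np.column_stack([np.ones(n), X[:, 0], X[:, 1]])      # mean basis b(m, t, v) = (1, m, t)
    S_inv = np.linalg.inv(K(X, X) + np.eye(n))               # (Sigma + I)^{-1}
    mu = np.linalg.solve(B.T @ S_inv @ B, B.T @ S_inv @ eta_y)   # GLS estimate (3.20)
    resid = eta_y - B @ mu
    B_new = np.column_stack([np.ones(len(x_new)), x_new[:, 0], x_new[:, 1]])
    return B_new @ mu + K(x_new, X) @ S_inv @ resid           # predictor (3.22)
```

For example, one may pass Kf = lambda a, b: matern_kernel(a, b, phi=2.0, kappa=1.5) using the kernel sketches above.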
3.3.3 MEV Backfitting Algorithm

For given $(\lambda, \theta)$, the covariance kernels in (3.18) are defined explicitly, so one may find the optimal solution to (3.27) by setting the partial derivatives to zero and solving the linear system of algebraic equations. There are, however, complications in determining $(\lambda, \theta)$, i.e. the smoothing parameters $\lambda_f, \lambda_g, \lambda_h$ as well as the structural parameters $\theta_f, \theta_g, \theta_h$ used for defining the covariance kernels. In this section we propose a backfitting algorithm to minimize the regularized least squares criterion (3.27), together with a generalized cross-validation (GCV) procedure for selecting the smoothing parameters $\lambda_f, \lambda_g, \lambda_h$ and the structural parameters $\theta_f, \theta_g, \theta_h$. Such a backfitting procedure is more efficient than the standard method of maximum likelihood estimation; besides, it can easily deal with the collinearity problem among $Z_f(m), Z_g(t), Z_h(v)$ caused by the hyperplane dependency of the vintage diagram. Here, the GCV criterion follows our discussion in Chapter 2; see also Golub, Heath and Wahba (1979) in the context of ridge regression.

Denote the complete set of parameters by $\Xi = \{\mu, \alpha, \beta, \gamma, \lambda_f, \lambda_g, \lambda_h, \theta_f, \theta_g, \theta_h, \sigma^2\}$, where the $\lambda$'s are smoothing parameters and the $\theta$'s are structural parameters for defining the covariance kernels. Given the vintage data (3.1) with the notations $y, B$ in (3.19), the algorithm proceeds as follows:

Step 1: Set the initial value of $\mu$ to be the ordinary least squares (OLS) estimate $(B^TB)^{-1}B^Ty$. Set the initial values of $\beta, \gamma$ to zero.

Step 2a: Given $\Xi$ with $(\alpha, \lambda_f, \theta_f)$ unknown, get the pseudo data $\tilde{y} = y - B\mu - \Sigma_g\beta - \Sigma_h\gamma$. For each input $(\lambda_f, \theta_f)$, estimate $\alpha$ by ridge regression

\[ \alpha = \Big( \Sigma_f^T\Sigma_f + \lambda_f \widetilde{\Sigma}_f \Big)^{-1} \Sigma_f^T \tilde{y}, \tag{3.28} \]

where $\Sigma_f = \big[K_f(m_{jl}, m_i; \theta_f)\big]_{n\times I}$ and $\widetilde{\Sigma}_f = \big[K_f(m_i, m_{i'}; \theta_f)\big]_{I\times I}$. Determine the

best choice of $(\lambda_f, \theta_f)$ by minimizing the GCV criterion

\[ \mathrm{GCV}(\lambda_f, \theta_f) = \frac{ \tfrac{1}{n}\big\| \big(I - S(\lambda_f, \theta_f)\big)\tilde{y} \big\|^2 }{ \Big[ 1 - \tfrac{1}{n}\mathrm{Trace}\, S(\lambda_f, \theta_f) \Big]^2 }, \qquad S(\lambda_f, \theta_f) = \Sigma_f\Big(\Sigma_f^T\Sigma_f + \lambda_f\widetilde{\Sigma}_f\Big)^{-1}\Sigma_f^T. \tag{3.29} \]

Step 2b: Run Step 2a with the unknown parameters replaced by $(\beta, \lambda_g, \theta_g)$ and $\tilde{y} = y - B\mu - \Sigma_f\alpha - \Sigma_h\gamma$. Form $\Sigma_g$, $\widetilde{\Sigma}_g$ and $S(\lambda_g, \theta_g)$. Select $(\lambda_g, \theta_g)$ by GCV, then estimate $\beta$ by ridge regression.

Step 2c: Run Step 2a with the unknown parameters replaced by $(\gamma, \lambda_h, \theta_h)$ and $\tilde{y} = y - B\mu - \Sigma_f\alpha - \Sigma_g\beta$. (For the ad hoc approach, remove the trend of $\tilde{y}$ in the vintage direction.) Form $\Sigma_h$, $\widetilde{\Sigma}_h$ and $S(\lambda_h, \theta_h)$. Select $(\lambda_h, \theta_h)$ by GCV. Estimate $\gamma$ by ridge regression.

Step 2d: Re-estimate $\mu$ by GLS (3.20) for given $(\lambda_f, \lambda_g, \lambda_h, \theta_f, \theta_g, \theta_h)$. Obtain $\hat{\sigma}^2$ by (3.21).

Step 3: Repeat Steps 2a-2d until convergence, say, when the estimates of $\mu, \alpha, \beta, \gamma$ change less than a pre-specified tolerance.

For future reference, the above iterative procedure is named the MEV Backfitting algorithm. It is efficient and can automatically take into account selection of both smoothing and structural parameters. After obtaining the parameter estimates, we may reconstruct the vintage data by $\hat{y} = B\mu + \Sigma_f\alpha + \Sigma_g\beta + \Sigma_h\gamma$. The exponential, Mat\'ern and AdaSS r.k. families listed in (3.12), (3.13), (3.14) are of our primary interest here, whereas other types of kernels can be readily employed. In order to capture a diversity of important features, a combination of different covariance kernels can be included in the same MEV model. The following are several remarks for practical implementation:

1. When the underlying function is heterogeneously smooth, e.g. involving jumps, one may choose the AdaSS kernel (3.14) for

g(t). 67 .28). tjl . t. v) is specified to be the intercept effect. For simplicity. Of our interest are the the semiparametric MEV regression models with both nonparametric f (m).4 Semiparametric Regression Having discussed the MEV decomposition models based on Gaussian processes. The evaluation of the GCV criterion (3. they may need the centering adjustment due to machine rounding. h(v) and parametric covariate effects. 3. where xjl denotes the covariates of j-th vintage at time tjl and age mjl = tjl − vj . vj ). the covariates xjl or xkjl are assumed to be deterministic. we consider in this section the inclusion of covariates in the following two scenarios: 1. one may replace µ0 by the mean of y in step 1 and step 2d. in which case the response vector y is reduced to the pointwise averages. tl ). we will spend more time in the second situation with multiple segments. v) = µ0 .e. In principal. For the ridge regression (3.4). we may write the MEV Kriging as ζ(x) = f (m) + g (t) + h(v) with ˆ I L J ˆ f (m) = µ0 + ˆ i=1 αi Kf (m. h(v) = j=1 γj Kh (v. i.22). i. vj ) ∈ Rp (or xjl in short) for a fixed segment. mi ). see (2. In ˆ ˆ this case. but in ˆ practice. t. the matrix Σf contains row-wise replicates and can be reduced by weighted least squares. no error in variables. 3. After a brief discussion of the first situation with a single segment. 2. µ(m.2. where zk represents the static segmentation variables. see (2. g (t) = ˆ ˆ l=1 ˆ ˆ βl Kg (t.29) can be modified accordingly.e. tkjl . vkj ) ∈ Rp and zk ∈ Rq for multiple segments that share the same vintage diagram. When the mean function µ(m. x(mjl . both g (t) and h(v) in step 2b and step 2c have mean zero. ˆ ˆ 4. x(mkjl .

20).30) where xjl ≡ x(mjl .30) can be iteratively estimated by the modified backfitting algorithm. π) by (3.1 Single Segment Given a single segment with dual-time observations {y(mjl .3. Then. where B = [B. g. θh ). vj )} for j = 1. and f. θf . .30) can be approximated by the kernel basis expansion through (3. J and l = 1. That is. g.2). π) = BT Σ + I B BT Σ + I y. . .e.4. tjl . This GLS estimator can be incorporated with . . π) initially to be the OLS estimate.30) corresponds to the linear mixed-effect modeling. instead of starting from the fully nonparametric approach.X] the established MEV Backfitting algorithm discussed in Section 3.31) . vj ). we consider the partial linear model that adds the linear covariate effects to the MEV decomposition (3. tjl . with X = [xjl ]n×p . . η(yjl ) = f (mjl ) + g(tjl ) + h(vj ) + π T xjl + ε = µT bjl + π T xjl + Zf (mjl ) + Zg (tjl ) + Zh (vj ) + ε (3.31) with zero Σ.31) for given (λf . tjl . −1 −1 −1 (µ. Only step 1 and step 2d need modified to be: Step 1: Set (µ. λh . i. . (3. Step 2d: re-estimate (µ. we may assume the f.2 may be not satisfied between the covariate space span{x} and the Gaussian processes.26). λg . then run 68 . since the orthogonality condition in Proposition 3. x(mjl . (3. vj ). h are nonparametric and modeled by Gaussian processes. the whole set of parameters in (3. (µ. θg . In this case. There is a word of caution about multicollinearity. h components in (3. The semiparametric regression model (3.3. . . π) are the parameters for the mean function. conditional on Σ.3. Lj . and the covariate effects π can be simply estimated in the same way we estimated µ in (3.

therefore equivalent to K separate single-segment models. g(t). x(mkjl . h(v). while the coefficients uf . 3. the functions f (m). Fortunately. tkjl . b. vkj ). c. . η(ykjl ) = f (mkjl ) + g(tkjl ) + h(vkj ) + π T xkjl + ω T zk + ε η(ykjl ) = fk (mkjl ) + gk (tkjl ) + hk (vkj ) + π T xkjl + ε k (k) (k) (3. .2 Multiple Segments Suppose we are given K segments of data that share the same vintage diagram. . part of Gaussian process components might be over-smoothed. uh represent the idiosyncratic multipliers. zk . the first approach assumes that the segments all perform the same MEV components and covariate effect π. Besides. vkj ). h and covariate effects. a. e. the smoothing parameters λf . h(v) represent the common factors across segments. . ug . There are various approaches to modeling cross-segment responses. where xkjl ∈ Rp and z ∈ Rq . ug . g(t). . λh may guard against possible multicollinearity and give the shrinkage estimate of the covariate effects π. As a trade-off.g. tkjl . both of these approaches can be treated by the single-segment modeling technique discussed in the last section. y(mkjl . Denote by N the total number of observations across K segments.backfitting. j = 1. uh ∈ R to the same underlying f (m). . J and l = 1. . Clearly.33) (k) η(ykjl ) = uf f (mkjl ) + ug g(tkjl ) + uh h(vkj ) + π T xkjl + W (zk ) + ε. it allows for the segment-specific add-on effects W (zk ) and segment-specific 69 (k) (k) (k) (k) (k) (k) . In another word. (3. The second approach takes the other extreme assumption such that each segment performs marginally differently in terms of f. . Lj . K.4.32) for k = 1. Of our interest of study is the third approach that allows for multiple segments to have different sensitivities uf . . . k Here. λg . and differ only through ω T zk . g. . .

(The smoothing spline kernels can be e also used after appropriate modification of the mean function. EW (zk )W (zk ) = λW KW (zk . the model (c) is postulated as a balance between the over-stringent model (a) and the over-flexible model (b). X has 70 .12). h(v) = j=1 γj Kg (v. Σg β + uh . Linear Segment Effects. tl ). (uf . The estimation of these two types of segment effects are detailed as follows.) Using the matrix notations.35) with basis approximation I L J f (m) = µ0 + i=1 αi Kf (m. consider (k) (k) η(ykjl ) = uf f (mkjl ) + u(k) g(tkjl ) + uh h(vkj ) + π T xkjl + ω T zk + ε g k (3. Σh γ + Xπ + Zω where the underscored vectors and matrices are formed based on multi-segment observations. mi ). uh ) are the extended vector of segment-wise multipliers.34) for the smoothing parameter λW > 0 and some pre-specificed family of covariance kernels KW . the model takes the form uf .32). we consider (c1) the linear modeling W (zk ) = ω T zk with parameter ω ∈ Rq and (c2) the Gaussian process modeling EW (zk ) = 0. Different strategies can be applied to model the segment-specific effects W (zk ). Thus. ug .covariate effect π k . g(t) = l=1 βl Kg (t. For example. vj ) through the exponential or Mat´rn kernels. zk ) (3. Among others. we may use the squared exponential covariance kernel in (3. Given the multi-segment vintage data (3. Bµ + Σf α + ug .

g and h. ˆ ˆ Note that in Step 2. The following algorithm is constructed by utilizing the MEV Backfitting algorithm as an intermediate routine. . uh are fixed. h(v) are by definition the common factors in the overall sense. In Step 3. since f (m). ug . α. θ} for the underlying f. g. λ. . When both the kernel parameters (λ. the regularized least squares can be used for parameter estimation. γ. Kg . ug . the concern of stability is one reason. g(t). ug . the segment-wise multipliers ˆ ˆ ˆ uk ’s are updated by regressing over f . π K ). using the same penalty terms as in (3. In so doing. . h functions and uf . g (t) and h(v) are estimated first ˆ by assuming uk = Euk = 1 for each of f. h(v) ˆ Step 3: Update (uf . ω) by OLS estimates for fitting y = Fuf + Gug + Huh + Xπ + Zω + ε where F. π. β. . g. respectively. including the segment variability in Step 3 improves both the estimation of (π. ω) to be the OLS estimates for fitting y = Xπ + Zω + ε based on all N observations across K segments.27). h(vkj ). including {µ. and Z is as usual. Step 2: Run the MEV Backfitting algorithm for the pseudo data y − Xπ − Zω ˆ ˆ for obtaining the estimates of f (m).K × p columns corresponding to the vectorized coefficients π = (π 1 . the common factors f (m). g (t). G. g (tkjl ). θ) for Kf . ω) 71 . Then. Kh and the (condensed K-vector) multipliers uf . h. ˆ Step 4: Repeat step 2 and step 3 until convergence. H are all N × K regression sub-matrices constructed based on ˆ ˆ f (mkjl ). Step 1: Set the initial values of (π. Iterative procedures can be used to estimate all the parameters simultaneously. ω across segments. Disregarding the segment variability in Step 2 is also reasonable. uh . uh ) and (π.

zk ) associated with the unknown scale parameter (denoted by θW ). ti ) + i=1 γi Kg (vkj . g. we have the following two iterative stages of model estimation.and the estimation of f. Spatial Segment Heterogeneity. ΣW are constructed accordingly. h are approximated in the same way as in (3.36) where f. The k regression sub-matrices B. Σg . vi ) + i=1 ωi KW (zk . Such fitting procedure is practically interesting and computationally tractable. all hav72 . let us directly approximate W (z) by the basis expansion W (z) = K k=1 ωk KW (z. g(t). . zi ) + ε by the regularized least squares 2 min y − Bµ − Σf α − Σg β − Σh γ − ΣW ω +λf α 2 Σf + λg β 2 Σg + λh γ 2 Σh + λW ω 2 ΣW . Σf .35) and W (zk ) is assumed to follow (3. estimate the common-factor functions f (m). Then.37) where y is the pseudo response vector of η(ykjl ) − π T xkjl for all (k. Σh . j. l). mi ) + i=1 K βi Kg (tkjl . h in the next iteration of Step 2. g(t).34). Despite of the or- thogonality constraint. . . Stage 1: for given π k and uk ≡ 1. K. k = 1. g. but the orthogonality is not guaranteed between W (zk ) and the span of xkjl . (3. h(v)} can be justified from a tensor-product space point of view. h(v) and the spatial segment effect W (zk ) in I L η(ykjl ) − π T xkjl = µ0 + k i=1 J αi Kf (mkjl . We move to consider the segment effect modeling by a Gaussian process (k) (k) η(ykjl ) = uf f (mkjl ) + u(k) g(tkjl ) + uh h(vkj ) + W (zk ) + π T xkjl + ε g k (3. . The orthogonality constraint between W (z) and {f (m).

Stage 2: for given f. The matrices used for regularization are of size I × I. . The parameter estimation can be obtained by the generalized least squares. . Obviously. θW need to be determined. K 5. h. Common-factor exogenous influence: g (t) = ˆ ˆ 3. θg . .ing N rows. ug . λW > 0 and the structural parameters θf . Segment-specific covariate effects: π k . . ˆ 3. θh . Let us call the extended algorithm MEVS Backfitting. J × J and K × K. λh . K 6. λg . tl ) γj Kh (v. vj ) ˆ J j=1 (k) (k) (k) ˆˆ ˆ 4. respectively.5 Applications in Credit Risk Modeling This section presents the applications of MEV decomposition methodology to the real data of (a) Moody’s speculative-grade corporate default rates shown in Ta73 . uh for k = 1. g . . where ΣW = KW (zk . h): uf . where the added letter “S” means the segment effect. Both the smoothing parameters λf . mi ) ˆ ˆ βl Kg (t. Iterate Stage 1 and Stage 2 until convergence. . . for k = 1. zk ). Idiosyncratic multipliers to (f . the regularized least squares in Stage 1 extends the MEV Backfitting algorithm to entertain a fourth kernel component KW . Common-factor vintage heterogeneity: h(v) = L l=1 I i=1 αi Kf (m. zk ) N ×N is obtained by evaluating every pair of obser- vations either within or between segments. Spatial segment heterogeneity: W (z) = K k=1 ωk KW (z. uh ) and π based on Cov[w] = σ 2 λW ΣW + I y = Fuf + Gug + Huh + Xπ + w. L × L. g. Common-factor maturation curve: f (m) = µ0 + ˆ 2. estimate the parameters (uf . then we obtain the following list of effect estimates (as a summary): ˆ 1. . ug .

and its follow-up performance is observed at each month end. the responses yjl is the product of three marginal effects. In our simulation setting. all the vintages are observed up to 60 months-on-book. To test the MEV decomposition modeling. . 0} to 48 and mjl = tjl − vj .1.ble 1. For each vintage j with origination time vj . and (b) the tweaked sample of retail loan loss rates discussed in Chapter II. the underlying truth exp(f0 (m)) takes the smooth form of inverse Gaussian hazard rate function (1. . we begin with a simulation study based on the synthetic vintage data.1 Simulation Study Let us synthetize the year 2000 to 2008 monthly vintages according to the diagram shown in Figure 1. Our focus is on understanding (or reconstruction) of the marginal effects first. . . Assume each vintage is originated at the very beginning of the month. the underlying exp(g0 (t)) is volatile with exponential growth and a jump at t = 23. 47 originated after 2005. Regardless of the noise. Both examples demonstrate the rising challenges for analysis of the credit risk data with dual-time coordinates. the dual-time responses are simulated by ε ∼ N (0. Let the horizontal observation window be left truncated from the beginning of 2005 (time 0) and right truncated by the end of 2008 (time 48). exp(f0 (m)) · exp(g0 (t)) · exp(h0 (v)). then perform data smoothing on the dual-time domain. (3.1. Vertically. .38) where tjl runs through max{vj . . to which we make only an initial attempt based on the MEV Backfitting algorithm. σ 2 ) log(yjl ) = f0 (mjl ) + g0 (tjl ) + h0 (vj ) + ε. 1. . consisting of vintages j = −60. 3. −1 originated before 2005 and j = 0. .02. and the underlying h0 (v) is assumed to be a dominating sine 74 .3. It thus ends with the rectangular diagram shown in Figure 3.13) with parameter c = 6 and b = −0.5.

and taking the inverse transform gives the fitted rates.2 (bottom panel). we obtain the synthetic vintage data. g0 . for simplicity we have enforced φk to be the same for regions without jump.14) e for g(t). They match the underlying truth except for slight boundary bias. the larger variation the corresponding Gaussian process holds.wave within [0. Adding the noise with σ = 0. the smaller the smoothing parameter is. calendar and vintage origination time are shown in the second panel of Figure 3. By (3.k. The noise-removed recon- 75 . therefore. the ordering of cross-validated λf > λh > λg confirms with the smoothness assumption for f0 . as well as binning for v ≤ 0 based on the prior knowledge that pre-2005 vintages can be treated as a single bucket. whose projection views in lifetime.18). Thus. The GCV-selected structural and smoothing parameters are tabulated in Table 3. We manually choose the order-3/2 Mat´rn kernel (3. then run the MEV Backfitting algorithm based on the log-transformed data. Since our data has very few observations in both ends of the vintage direction. the AdaSS r.2. compare them with the underlying truths in order to see the contamination effects by other marginals. See their plots in Figure 3. In fitting g(t) with AdaSS. and the squared exponential kernel (3. we performed binning for v ≥ 40. It is interesting to check the ordering of cross-validated smoothing parameters. see Figure 3. (3. h0 in our simulation setting.3 for the projection views. Then.1. This synthetic data can be used to demonstrate the flexibility of MEV decomposition technique such that a combination of different kernels can be employed. One may check the pointwise averages (or medians in the vintage box-plots) in each projection view.2 (top panel). 40] and relatively flat elsewhere.1.13) for f (m).12) for h(v). the MEV Backtting algorithm outputs the estimated marginal effects plotted in Figure 3. the smoothing parameters correspond to the reciprocal ratios of their variance scales to the nugget effect σ 2 . we have specified the knots τk to handle the jump by diagnostic from the raw marginal plots. Combining the three marginal estimates.

Figure 3.2: Synthetic vintage data analysis: (top) underlying true marginal effects; (2nd) simulation with noise; (3rd) MEV Backfitting algorithm upon convergence; (bottom) estimation compared to the underlying truth.

Table 3.1: Synthetic vintage data analysis: MEV modeling exercise with GCV-selected structural and smoothing parameters (kernel choices: Mat\'ern with $\kappa = 3/2$ for $f(m)$; AdaSS r.k. of Eq.(3.14) with knots $\tau_k$ for $g(t)$; squared exponential for $h(v)$).

Figure 3.3: Synthetic vintage data analysis: (top) projection views of fitted values; (bottom) data smoothing on the vintage diagram.

the vintage effects are merged for large v = 30. 39. the many spikes shown in the lifetime projection actually correspond to the exogenous effects.5. the exogenous effects g(t) for t = 2.3 to see the dual-time smoothing effect on the vintage diagram. it is evident that exogenous influence by macroeconomic environment is more fluctuating than both the maturation and vintage effects. Such truncation in both lifetime and calendar time corresponds to the trapezoidal prototype of vintage diagram illustrated in Figure 3. and the leftmost observation at t = 1 is removed because of its jumpy behavior. It is our purpose to separate these marginal effects. One may also compare the level plots of the raw data versus the reconstructed data in Figure 3. see e. Second.4 of Chapter 1 we have previewed the annual corporate default rates (in percentage) of Moody’s speculative-grade cohorts: 1970-2008. such marginal plots are contaminated by each other. Besides.4 gives the projection views of the empirical default rates in lifetime. First. 78 . . We made slight data preprocessing. since there are very limited number of observations for small calendar time. .struction can reveal the prominent marginal features.1. By checking the marginal averages or medians plotted on top of each view. 3. for vintages originated after 1988. the transform function is pre-specified to be η(y) = log(y/100) where y is the raw percentage measurements and zero responses are treated as missing values (one may also match them to a small positive value). 3 are merged to t = 4. . Each yearly cohort is regarded as a vintage.2 Corporate Default Rates In the Figure 1. the default rates are observed up to 20 years maximum in lifetime. Figure 1. It is clear that the MEV decomposition modeling performs like a smoother on the dual-time domain.g. calendar and vintage origination time. For vintages originated earlier than 1988. However. . the default rates are observed up to the end of year 2008.

Figure 3.4: Results of MEV Backfitting algorithm upon convergence (right panel) based on the squared exponential kernels. Shown in the left panel is the GCV selection of smoothing and structural parameters.

Figure 3.5: MEV fitted values: Moody's-rated corporate default rates.

Kf . say Zf (m).4. partly because of the small size of replicates. respectively. it is implied that the exogenous influence shares the largest variation. 6. The backfitting results upon convergence are shown in Figure 3.3 Retail Loan Loss Rates Recall the tweaked credit-risk sample discussed in Chapter II.1.e. Compared to the raw plots in Figure 1. for which the mean function is fixed to be the intercept effect. For each univariate Gaussian process fitting. Zh . Zg . the same tweaked sample 80 . the changes of estimated exogenous effects in Figure 3. otherwise the reciprocal conditional number of the ridged regression matrix would vanish). we used the grid search on the log scale of λ with only 3 different choices of φ = 1. It was used as a motivating example for the development of adaptive smoothing spline in fitting the pointwise averages that suppose to smooth out in lifetime (month-on-book. where GCV selects the largest scale parameter in each iteration. By (3. Kh are all chosen to be the squared exponential covariance kernels. while the maturation curve has the least variation. In this section. 2.12)).To run the MEV Backfitting algorithm. the MEV modeling could retain some of most interesting features. For demonstration. 3. the scale parameter φ in (3. followed by the vintage heterogeneity. the GCV criterion selects the smoothing parameter λf . The inverse transform η −1 (·) maps the fitted values to the original percentage scale.050 for Zf . and partly because of the contamination by the exogenous macroeconomic influence.5 plots the fitted values in both lifetime and calendar views. Furthermore. while smoothing out others.4 follow the NBER recession dates shown in Figure 1. as well as the structural parameter θf (i.5.025. Figure 3. Kg . An interesting result is the GCV-selected smoothing parameter λ.4. which turns out to be 9. However. 3 (φ = 3 is set to be the largest scale for stability.18). the raw responses behave rather abnormally in large month-on-book. in this case). 0.368.

5. together with the exponentiated exogenous and vintage heterogeneity multipliers. The rates behave rather dynamic and volatile in t. The rates show heterogeneity across vintages.is analyzed on the dual-time domain. from the vintage view. To model the loss rates. The rates behave relatively smooth in m. calendar and vintage origination time. subject to Eg = Eh = 0. 2. from the calendar view. They measure the loss rates in retail revolving exposures from January 2000 to December 2006. eg(t) and eh(v) . Figure 3. where the exponentiated maturation curve could be regarded as the baseline. 6. timescale binding was performed first for large monthon-book m ≥ 72 and small calendar time t ≤ 12. which is mostly typical in retail risk management. 4. Some of the interesting features for such vintage data are listed below: 1.1. The rates are fixed to be zero for small month-on-book m ≤ 4. 3. In this MEV modeling exercise.6 (bottom panel). The fitting results are presented in Figure 3. There are less and less observations when the month-on-book m increases. but also the exogenous influence and the vintage heterogeneity effects. for which we model not only the maturation curve. where the new vintages originated in 2006 are excluded since they have too short window of performance measurements. The tweaked sample of dual-time observations are supported on the triangular vintage diagram illustrated in Figure 3. and the GCV criterion was used to select the smoothing and structural parameters. the original responses are decomposed multiplicatively to be ef (m) . from the lifetime view. 81 . There are less and less observations when the calendar month t decreases.6 (top panel) shows the projection views in lifetime. the Mat´rn kernels were chosen to run the MEV e Backfitting algorithm. The MEV decomposition model takes the form of log(yjl ) = f (mjl ) + g(tjl ) + h(vj ) + ε. In this way. We pre-specified the log transform for the positive responses (upon removal of zero responses).

Each of these esˆ timated functions follows the general dynamic pattern of the corresponding marginal averages in the empirical plots above. exogenous influence and 82 . g (t). h(v) in the log response scale.6 Discussion For the vintage data that emerge in credit risk management. It is also worth pointing out that the exogenous spike could be captured relatively well by the Mat´rn kernel. calendar t and vintage origination time ˆ ˆ v. e 3.6: Vintage data analysis of retail loan loss rates: (top) projection views of emprical loss rates in lifetime m. ˆ ˆ ˆ which shows the estimated f (m). (bottom) MEV decomposition effects f (m). the maturation curve is rather smooth.Figure 3. g (t) and h(v) (at log scale). Compared to both the exogenous and vintage heterogeneity effects. we propose an MEV decomposition framework to estimate the maturation curve. The differences in local regions reflect the improvements in each marginal direction after removing the contamination of other marginals.

we have studied nonparametric smoothing based on Gaussian processes. it is straightforward to simulate the random paths for a sequence of inputs (x1 . . and demonstrated its flexibility through choosing covariance kernels. which also appears in the conventional cohort analysis under the age-period-cohort model. v) on the vintage diagram. It is however not directly performing crossvalidation on the dual-time domain. which would be an interesting subject of future study. To regularize the three-way marginal effects. there are other open problems associated with the vintage data analysis (VDA) worthy of future investigation. g(t). v. The model forecasting is often a practical need in financial risk management. Σ(c) ). . 2. the formula (3. as it is concerning about the future uncertainty. By Kriging theory. The new technique is then tested through a simulation study and applied to analyze both examples of corporate default rates and retail loss rates. 1. where the conditional mean and covariance 83 . h(v) in each marginal direction in order to know about their behaviors out of the training bounds of m. for which some preliminary results are presented. one needs to extrapolate f (m). The dual-time model assessment and validation remains an issue. Beyond our initial attempt made in this chapter. An efficient MEV Backfitting algorithm is provided for model estimation.vintage heterogeneity on the dual-time domain. Furthermore. . . for any x = (m. In the MEV Backfitting algorithm we have used the leave-one-out cross-validation for fitting each marginal component function. To make prediction within the MEV modeling framework. structural and smoothing parameters.22) provides the conditional expectation given the historical performance. t. One difficulty associated with the vintage data is the intrinsic identification problem due to their hyperplane nature. xp ) based on the multivariate conditional normal distribution Np (µ(c) . t.

one may use parametric time series models. The MEV decomposition we have considered so far is simplified to exclude the interaction effects involving two or all of age m. For example. time t and vintage v arguments. the out-of-sample test is recommended. Such parametric approach would benefit the loss forecasting as they become more and more validated through experiences. one may estimate the parametric f (m) with nonlinear regression technique. 4. there is a word of caution about making prediction of the future from the historical training data. In formulating the models by Gaussian processes.13). x )  i j p×p (3. Zh (v). 3. Before making forecasts in terms of calendar time.19).5. xjl ) i p×n with  Σ = K(x . like our synthetic case in Section 3. In the MEV modeling framework.are given by   (c)  µ = Bµ + Υ Σ + I −1 y − Bµ  Σ(c) = Σ − Υ Σ + I −1 ΥT     Υ = K(x . while g(t) can be interfered by future macroeconomic changes and h(v) is also subject to the future changes of vintage origination policy. Zg (t). As for the exogenous effect g(t) that is usually dynamic and volatile. Consider 84 . For extrapolation of MEV component functions. Fan and Yao (2003). parametric modeling approach is sometimes more tractable than using Gaussian processes.39) using the aggregated kernel given by (3.g.18) and the notations in (3. However. see e. Zh (v) is also postulated. Zg (t). extrapolating the maturation curve f (m) in lifetime is relatively safer than extrapolating the exogenous and vintage effects g(t) and h(v). since f (m) is of self-maturation nature. the mutual independence among Zf (m). if given the prior knowledge that the maturation curve f (m) follows the shape of the inverse Gaussian hazard rate (1. One of the ideas for taking into account the interaction effects is to cross-correlate Zf (m).

1989).2) plays a similar role as such a link function. except for that it is functioning on the rates that are assumed to be directly observed. Zf (m ) + Zg (t ) + Zh (v )] = EZf (m)Zf (m ) + EZf (m)Zg (t ) + EZf (m)Zh (v ) + EZg (t)Zf (m ) + EZg (t)Zg (t ) + EZg (t)Zh (v ) + EZh (v)Zf (m ) + EZh (v)Zg (t ) + EZh (v)Zh (v ). λg λh EZg (t)Zh (v) = where Kgh (t. if we are given the raw binary observations or unit counts. One more extension of VDA is to analyze the time-to-default data on the vintage 85 .5. In general. which reduces to (3.17) with covariance Cov[Zf (m) + Zg (t) + Zh (v). For example. while the loss rates in the synthetic retail data can be viewed as the ratio of loss units over the total units. it is straightforward to incorporate the crosscorrelation structure and follow immediately the MEV Kriging procedure. Thus. one may consider the generalized MEV models upon the specification of a link function.18) when the cross-correlations between Zf (m). the scope of VDA can be also generalized to study non-continuous types of observations. Results will be presented by a future report. the Moody’s default rates were actually calculated from some raw binary indicators (of default or not).the combined Gaussian process (3. ρ ≥ 0 in order to study the interaction effect between the vintage heterogeneity and the exogenous influence. Besides the open problems listed above. as an analogy to the generalized linear models (Mccullagh and Nelder. The transform η in (3. Zg (t) and Zh (v) are all zero. For examples we considered in Section 3. v) = ρI(t ≥ v). one of our works in process is to assume that σ2 Kgh (t. v).

2: By applying the general spline theorem to (3. since otherwise f0 (m) − m.22).t.24) and (3.l cjl I(tjl = tl ). 3. vj . which leads ζ(x) to have the same form as the Kriging predictor (3. 86 . g0 . respectively. βl = λ−1 g j. which is deferred to the next chapter where we will develop the dual-time survival analysis. Lj αi = λ−1 f j.l cjl I(mjl = mi ). γj = λ−1 h l=1 cjl (3. xjl ).26).23). Proof of Proposition 3. tl . K(·.40) that aggregate cjl for the replicated kernel bases for every mi . xjl ). xj l ). where cT ξ n = J j=1 Lj l=1 cjl K(x. at least one of the linear parts f0 . g0 (t) + t. µ and c and setting them to zero gives the solution −1 µ = c = BT (Σ + I)−1 B Σ+I −1 BT (Σ + I)−1 y (y − Bµ).2). xj l ) = K(xjl . h0 (v) − v would always satisfy (3. Taking the partial derivatives w.18) and comparing (3. it is not hard to derive the relationships between two sets of kernel coefficients. h0 is not identifiable.19). then we get 2 min y − Bµ − Σc + c Σ using the notations in (3.7 Technical Proofs Proof of Lemma 3.r.3: By (3.1: By that m − t + v = 0 on the vintage diagram Ω.diagram. Substitute it back into (3.23) and use the H1 reproducing property K(·. Proof of Proposition 3. the target function has a finite-dimensional representation ζ(x) = µ0 + µ1 m + µ2 t + cT ξ n .

αI )T . . . we have that Σc = Σf α + Σg β + Σh γ cT Σc = λf αT Σf α + λg β T Σg β + λh γ T Σh γ. using the new vectors of coefficients α = (α1 . .25) leads to (3. . β = (β1 . βL )T and γ = (γ1 . γJ )T . 87 .Then. . .27). as asserted. . . . . Plugging them into (3. . .

CHAPTER IV

Dual-time Survival Analysis

4.1 Introduction

Consider in this chapter the default events naturally observed on the dual-time domain: lifetime (or, age) in one dimension and calendar time in the other. Age could be the lifetime of a person in a common sense, and it could also refer to the elapsed time since initiation of a treatment. In credit risk modeling, age refers to the duration of a credit account since its origination (either a corporate bond or a retail loan). Of our interest is the time-to-default from multiple origins, such that the events happening at the same calendar time may differ in age. For many accounts having the same origin, they share the same age at any follow-up calendar time, and group as a single vintage. Refer to Chapter I about the basics of survival analysis in the lifetime scale, where the notion of hazard rate is also referred to as the default intensity in the credit risk context. As introduced in Chapter I, dual-time observations of survival data subject to truncation and censoring can be graphically represented by the Lexis diagram (Keiding, 1990).

To begin with, let us simulate a Lexis diagram of dual-time-to-default observations:

\[ \big(V_j,\; U_{ji},\; M_{ji},\; \Delta_{ji}\big), \quad i = 1, \ldots, n_j,\;\; j = 1, \ldots, J, \tag{4.1} \]

where $V_j$ denotes the origination time of the $j$th vintage that consists of $n_j$ accounts, $U_{ji}$ denotes the time of left truncation, $M_{ji}$ denotes the lifetime of event termination, and $\Delta_{ji}$ indicates the status of termination such that $\Delta_{ji} = 1$ if the default event occurs prior to getting censored and $0$ otherwise. Let us fix the rectangular vintage diagram with age $m \in [0, 60]$, calendar time $t \in [0, 48]$ and vintage origination $V_j = -60, \ldots, -1, 0, 1, \ldots, 47$, same as the synthetic vintage data in Chapter 3. So, besides the systematic censoring with $m_{\max} = 60$ and $t_{\max} = 48$, the observations are left truncated at time $0$. For the censoring mechanism, an independent random censoring with the constant hazard rate $\lambda_{\mathrm{att}}$ is used to represent the account attrition (i.e. a memoryless exponential lifetime distribution). Denote by $\lambda_v(m, t)$ or $\lambda_j(m, t)$ the underlying dual-time hazard rate of vintage $v = V_j$ at lifetime $m$ and calendar time $t$. Then, one may simulate the dual-time-to-default data (4.1) according to:

\[
\begin{aligned}
\tau_{ji} &= \inf\Big\{ m : \int_0^m \lambda_j(s, V_j + s)\, ds \ge -\log X \Big\}, \quad X \sim \mathrm{Unif}[0, 1] && \text{(default)} \\
C_{ji} &= \inf\big\{ m : \lambda_{\mathrm{att}}\, m \ge -\log X \big\}, \quad X \sim \mathrm{Unif}[0, 1] && \text{(attrition)} \\
\bar{U}_{ji} &= \max\{U_{ji}, V_j\} - V_j, \quad \text{with } U_{ji} \equiv 0 && \text{(left truncation)} \\
\text{Book } &(V_j, \bar{U}_{ji}, M_{ji}, \Delta_{ji}) \text{ if } \bar{U}_{ji} < \min\{\tau_{ji}, C_{ji}\}: \\
& M_{ji} = \min\{\tau_{ji}, C_{ji}, m_{\max}, t_{\max} - V_j\} && \text{(termination time)} \\
& \Delta_{ji} = I\big(\tau_{ji} \le \min\{C_{ji}, m_{\max}, t_{\max} - V_j\}\big) && \text{(default indicator)}
\end{aligned}
\tag{4.2}
\]

for each given $V_j = -60, \ldots, -1, 0, 1, \ldots, 47$. For each vintage $j$, one may book $n_j$ origination accounts by repeatedly running (4.2) $n_j$ times independently. Let us begin with $n_j \equiv 1000$ and specify the attrition rate $\lambda_{\mathrm{att}} = 50$bp (a basis point is 1%). Among other types of dual-time default intensities, assume the multiplicative form

\[ \lambda_v(m, t) = \lambda_f(m)\lambda_g(t)\lambda_h(v) = \exp\big\{ f(m) + g(t) + h(v) \big\}. \tag{4.3} \]
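A compact Python sketch of the bookkeeping recipe (4.2) is given below. The hazard function is a user-supplied callable, and default times are obtained by inverting the cumulative hazard on a monthly grid; the grid-based inversion and the function names are assumptions of this sketch rather than part of the text.

```python
import numpy as np

def simulate_vintage(V, n, hazard_j, lam_att=0.005, m_max=60, t_max=48, dm=1.0, seed=0):
    """Simulate dual-time-to-default records (V_j, U_ji, M_ji, Delta_ji) for one vintage, cf. (4.2).

    hazard_j(m, t): dual-time hazard of this vintage at lifetime m and calendar time t = V + m.
    """
    rng = np.random.default_rng(seed)
    grid = np.arange(0.0, m_max + dm, dm)
    records = []
    for _ in range(n):
        # Default time: first m with integrated hazard >= -log X (inverse transform on the grid)
        cum = np.cumsum([hazard_j(m, V + m) * dm for m in grid])
        e_def = -np.log(rng.uniform())
        tau = grid[np.searchsorted(cum, e_def)] if cum[-1] >= e_def else np.inf
        c = -np.log(rng.uniform()) / lam_att          # attrition (censoring) time, constant hazard
        u = max(0.0, -V)                              # left truncation at calendar time 0, in lifetime scale
        M = min(tau, c, m_max, t_max - V)             # termination time
        delta = int(tau <= min(c, m_max, t_max - V))  # default indicator
        if u < min(tau, c):                           # book only if still at risk at the truncation time
            records.append((V, u, M, delta))
    return records
```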

namely 1.14) in lifetime and (4.1 (left) is an illustration of the Lexis diagram for one random simulation per vintage. 2.Figure 4. Figure 4. where the maturation curve f (m). For simulating the left-truncated survival accounts. Each marginal view of the hazard rates shows clearly the pattern matching our underlying assumption. the exogenous influence g(t) and the unobserved heterogeneity h(v) are interpreted in the sense of Chapter III and they are assumed to follow Figure 3. The exogenous performance in calendar time is relatively dynamic and volatile. respectively.2 (top panel). where each asterisk at the end of a line segment denotes a default event and the circle denotes the censorship. (right) empirical hazard rates in either lifetime or calendar time. To see default behavior in dual-time coordinates. The endogenous (or maturation) performance in lifetime is relatively smooth.1 (right).1: Dual-time-to-default: (left) Lexis diagram of sub-sampled simulations. we calculated the empirical hazard rates by (4.15) in calendar time. we assume the zero exogenous influence on their historical hazards. 90 . and the results are plotted in Figure 4.

. with very few exceptions that take either calendar time as the primary scale (e. There is also discussion on selection of the appropriate timescale in the context of multiple timescale reduction. see e. Arjas (1986)) or both time scales (see the survey by Keiding (1990)). . Gu and Ying (1997) by empirical process theory. such one-way baseline model specification is unable to capture the baseline hazard in calendar time. Despite that the Lexis diagram has appeared since Lexis (1875). Then. Mostly used is the CoxPH regression model with an arbitrary baseline hazard function in lifetime. statistical literature of survival analysis has dominantly focused on the lifetime only rather than the dual-time scales. (1993) with nine to one chapter coverage. it comes to Efron (2002) with symmetric treatment of dual timescales under a two-way proportional hazards model λji (m. j = 1. given the multiple dimensions that may include usage scales like mileage of a vehicle.g. In another word. Such timescale dimension reduction approach can be also referred to Oakes (1995) and Duchesne and Lawless (2000).k. . Farewell and Cox (1979) about a rotation technique based on a naive Cox proportional hazards (CoxPH) model involving only the linearized time-covariate effect. It is our purpose to develop the dual-time survival analysis in not only lifetime.g. survival data with staggered entry) has also concentrated on the use of lifetime as the basic timescale together with time-dependent covariates. . nj . see e. J. i = 1. t) = λf (m)λg (t) exp{θ T zji (m)}. . see also the related works of Slud (1984) and Gu and Lai (1991) on two-sample tests.4) 91 . . where the default risk is greatly affected by macroeconomic conditions. Andersen. . but also simultaneously in calendar time. Some weak convergence results can be referred to Sellke and Siegmund (1983) by martingale approximation and Billias. et al. lifetime has been taken for granted as the basic timescale of survival. . However. The literature of regression models based on dual-time-to-default data (a.g. (4.This is typically the case in credit risk modeling.a.

intensity-based semiparametric regression. hence name (4. which is defined by an endogenous distance-to-default process associated with the exogenous time-transformation. In Section 4. The simulation data described above will be used to assess the performance of nonparametric estimators. The aforementioned dualtime Cox model (4. we may allow for arbitrary (non-negative) λf (m) and λg (t). there seems to be no formal literature on dual-time-to-default risk modeling. Also discussed is the frailty type of vintage heterogeneity by a random-effect formulation.6 and provide some supplementary materials at the end. we demonstrate the applications of both nonparametric and semiparametric dualtime survival analysis to credit card and mortgage risk modeling. structural parameterization.3 we discuss the dual-time feasibility of structural models based on the first-passage-time triggering system. In credit risk.4 is devoted to the development of dual-time semiparametric Cox regression with both endogenous and exogenous baseline hazards and covariate effects. as well as the time-independent covariates.4) to be a two-way Cox regression model. for which the method of partial likelihood estimation plays a key role. based on our real data-analytic experiences in retail banking.Note that Efron (2002) considered only the special case of (4.1.2 we start with the generic form of likelihood function based on the dual-time-to-default data (4.4) with cubic polynomial expansion for log λf and log λg . In Section 4. It is our objective of this chapter to develop the dual-time survival analysis (DtSA) for credit risk modeling.5. and frailty specification accounting for vintage heterogeneity effects. We conclude the chapter in Section 4. This chapter is organized as follows.4) will be one of such developments. including the methods of nonparametric estimation. In general. Section 4. In Section 4. two-way and three-way nonparametric estimators on Lexis diagram. whereas the Efron’s approach with polynomial baselines will fail to capture the rather different behaviors of endogenous and exogenous hazards illustrated in Figure 4.1). then consider the one-way. 92 .

4.2 Nonparametric Methods

Denote by $\lambda_{ji}(m) \equiv \lambda_{ji}(m, t)$ with $t \equiv V_j + m$ the hazard rates of lifetime-to-default $\tau_{ji}$ on the Lexis diagram. Given the dual-time-to-default data (4.1) with independent left-truncation and right-censoring, consider the joint likelihood under either the continuous or the discrete setting:

\[ \prod_{j=1}^{J}\prod_{i=1}^{n_j} \bigg[ \frac{p_{ji}(M_{ji})}{S_{ji}(U_{ji}^{+})} \bigg]^{\Delta_{ji}} \bigg[ \frac{S_{ji}(M_{ji}^{+})}{S_{ji}(U_{ji}^{+})} \bigg]^{1-\Delta_{ji}}, \tag{4.5} \]

where $p_{ji}(m)$ denotes the density and $S_{ji}(m)$ denotes the survival function of $\tau_{ji}$. On the continuous domain, $S_{ji}(m) = e^{-\Lambda_{ji}(m)}$ with the cumulative hazard $\Lambda_{ji}(m) = \int_0^m \lambda_{ji}(s)\, ds$ and $m^{+} = m$, so the log-likelihood function has the generic form

\[ \ell = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \Big[ \Delta_{ji}\log\lambda_{ji}(M_{ji}) + \log S_{ji}(M_{ji}) - \log S_{ji}(U_{ji}) \Big] = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \Big[ \Delta_{ji}\log\lambda_{ji}(M_{ji}) - \int_{U_{ji}}^{M_{ji}} \lambda_{ji}(m)\, dm \Big]. \tag{4.6} \]

On the discrete domain of $m$, $S_{ji}(m) = \prod_{s<m}\big[1 - \lambda_{ji}(s)\big]$ and the likelihood function (4.5) can be rewritten as

\[ \prod_{j=1}^{J}\prod_{i=1}^{n_j} \lambda_{ji}(M_{ji})^{\Delta_{ji}} \big[1 - \lambda_{ji}(M_{ji})\big]^{1-\Delta_{ji}} \prod_{m\in(U_{ji}, M_{ji})} \big[1 - \lambda_{ji}(m)\big], \tag{4.7} \]

and the log-likelihood function is given by

\[ \ell = \sum_{j=1}^{J}\sum_{i=1}^{n_j} \bigg\{ \Delta_{ji}\log\frac{\lambda_{ji}(M_{ji})}{1 - \lambda_{ji}(M_{ji})} + \sum_{m\in(U_{ji}, M_{ji}]} \log\big[1 - \lambda_{ji}(m)\big] \bigg\}. \tag{4.8} \]

In what follows we discuss the one-way, two-way and three-way nonparametric approaches to the maximum likelihood estimation (MLE) of the underlying hazard rates.
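As a sanity check on (4.8), the discrete-domain log-likelihood can be evaluated directly from the booked records; a minimal sketch follows (NumPy, monthly grid, illustrative names), which can consume the output of the simulation sketch above.

```python
import numpy as np

def dual_time_loglik(records, hazard, dm=1.0):
    """Discrete-domain log-likelihood (4.8) for records (V, U, M, Delta) and a hazard(m, t) in (0, 1)."""
    ll = 0.0
    for V, U, M, delta in records:
        grid = np.arange(U + dm, M + dm / 2, dm)     # discrete months in (U, M]
        lam = np.array([hazard(m, V + m) for m in grid])
        ll += np.sum(np.log1p(-lam))                 # sum over (U, M] of log(1 - lambda)
        if delta:                                    # add the default-odds term at M
            lam_M = hazard(M, V + M)
            ll += np.log(lam_M) - np.log1p(-lam_M)
    return ll
```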

4.2.1 Empirical Hazards

Prior to the discussion of one-way nonparametric estimation, let us review the classical approach to the empirical hazards vintage by vintage. For each fixed Vj, let λj(m) be the underlying hazard rate in lifetime, and count, for a finite number of lifetime points m with at least one observed default,

number of defaults:  nevent_j(m) = \sum_{i=1}^{n_j} I(M_{ji} = m, \Delta_{ji} = 1),
number of at-risk's: nrisk_j(m)  = \sum_{i=1}^{n_j} I(U_{ji} < m \le M_{ji}).   (4.9)

Given the data (4.1), the jth-vintage contribution to the log-likelihood (4.8) can be rewritten as

\ell_j = \sum_{m} \Big\{ nevent_j(m)\log\lambda_j(m) + \big[nrisk_j(m) - nevent_j(m)\big]\log\big[1-\lambda_j(m)\big] \Big\}   (4.10)

on the discrete domain of m. It is easy to show that the MLE of λj(m) is given by the following empirical hazard rates:

\hat\lambda_j(m) = nevent_j(m)/nrisk_j(m)  if nevent_j(m) ≥ 1, and NaN otherwise.   (4.11)

They can be used to express the well-known nonparametric estimators of the cumulative hazard and the survival function,

(Nelson-Aalen estimator)  \hat\Lambda_j(m) = \sum_{m_i \le m} \hat\lambda_j(m_i),   (4.12)
(Kaplan-Meier estimator)  \hat S_j(m) = \prod_{m_i \le m-} \big[1 - \hat\lambda_j(m_i)\big],   (4.13)

for any m ≥ 0 under either the continuous or the discrete setting. Clearly, the Kaplan-Meier estimator and the Nelson-Aalen estimator are related empirically by \hat S_j(m) = \prod_{m_i \le m-} [1 - \Delta\hat\Lambda(m_i)], where the empirical hazard rates \hat\lambda(m_i) \equiv \Delta\hat\Lambda(m_i) are also called the Nelson-Aalen increments.
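To make the construction concrete, the following is a minimal sketch (not the author's code) of the vintage-by-vintage empirical hazards (4.9)–(4.13), assuming a data frame `dat` with columns U (left-truncation month), M (event or censoring month) and D (default indicator) for one fixed vintage.

```r
## Empirical hazards, Nelson-Aalen and Kaplan-Meier estimates for one vintage.
empirical_hazard <- function(dat) {
  m.grid <- sort(unique(dat$M[dat$D == 1]))              # lifetime points with >= 1 default
  nevent <- sapply(m.grid, function(m) sum(dat$M == m & dat$D == 1))
  nrisk  <- sapply(m.grid, function(m) sum(dat$U < m & m <= dat$M))
  lambda <- nevent / nrisk                               # empirical hazard (4.11)
  data.frame(m = m.grid,
             lambda      = lambda,
             NelsonAalen = cumsum(lambda),               # cumulative hazard (4.12)
             KaplanMeier = cumprod(1 - lambda))          # survival just after m, cf. (4.13)
}

## toy example:
## dat <- data.frame(U = 0, M = c(3, 4, 4, 5, 6), D = c(1, 0, 1, 1, 0))
## empirical_hazard(dat)
```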

Now, we are ready to present the one-way nonparametric estimation based on the complete dual-time-to-default data (4.1). In the lifetime dimension (vertical axis of the Lexis diagram), assume λji(m, t) = λ(m) for all (j, i); then we may immediately extend (4.11) by simple aggregation to obtain the one-way estimator in lifetime:

\hat\lambda(m) = \frac{\sum_{j=1}^{J} nevent_j(m)}{\sum_{j=1}^{J} nrisk_j(m)},   (4.14)

whenever the numerator is positive. Alternatively, in calendar time (horizontal axis of the Lexis diagram), assume λji(m, t) = λ(t) for all (j, i); then we may obtain the one-way nonparametric estimator in calendar time:

\hat\lambda(t) = \frac{\sum_{j=1}^{J} nevent_j(t - V_j)}{\sum_{j=1}^{J} nrisk_j(t - V_j)},   (4.15)

whenever the numerator is positive. Accordingly, one may obtain the empirical cumulative hazard and survival function by the Nelson-Aalen estimator (4.12) and the Kaplan-Meier estimator (4.13) in each marginal dimension. For example, consider the simulation data introduced in Section 4.1. Both one-way nonparametric estimates are plotted in the right panel of Figure 4.1 (upon linear interpolation), and they may capture roughly the endogenous and exogenous hazards. However, they may be biased in other cases we shall discuss next.

4.2.2 DtBreslow Estimator

The two-way nonparametric approach is to estimate the lifetime hazards λf(m) and the calendar hazards λg(t) simultaneously under the multiplicative model

\lambda_{ji}(m, t) = \lambda_f(m)\,\lambda_g(t) = \exp\{f(m) + g(t)\},  for all (j, i).   (4.16)

For identifiability, assume that E log λg = E g = 0.

Following the one-way empirical hazards (4.11) above, we restrict to the pointwise parametrization

\hat f(m) = \sum_{l=1}^{L_m} \alpha_l I(m = m_{[l]}), \qquad \hat g(t) = \sum_{l=1}^{L_t} \beta_l I(t = t_{[l]}),   (4.17)

based on the Lm distinct lifetimes m[1] < · · · < m[Lm] and the Lt distinct calendar times t[1] < · · · < t[Lt] such that there is at least one default event observed at the corresponding time. One may find the MLE of the coefficients α, β in (4.17) by maximizing (4.6) or (4.8). Rather than such crude optimization, we give an iterative DtBreslow estimator based on Breslow (1972), by treating (4.16) as the product of a baseline hazard function and a relative-risk multiplier. Given the dual-time-to-default data (4.1) and any fixed λg = \hat\lambda_g, the Breslow estimator of λf(m) is given by

\hat\lambda_f(m) \leftarrow \frac{\sum_{j=1}^{J} nevent_j(m)}{\sum_{j=1}^{J} \hat\lambda_g(V_j + m)\, nrisk_j(m)}, \quad m = m_{[1]}, \ldots, m_{[L_m]},   (4.18)

and similarly, given any fixed λf = \hat\lambda_f,

\hat\lambda_g(t) \leftarrow \frac{\sum_{j=1}^{J} nevent_j(t - V_j)}{\sum_{j=1}^{J} \hat\lambda_f(t - V_j)\, nrisk_j(t - V_j)}, \quad t = t_{[1]}, \ldots, t_{[L_t]}.   (4.19)

Both marginal Breslow estimators are known to be the nonparametric MLE of the baseline hazards conditional on the relative-risk terms. To take into account the identifiability constraint E log \hat\lambda_g(t) = 0, we may update \hat\lambda_g(t) by

\hat\lambda_g(t) \leftarrow \exp\big\{ \log\hat\lambda_g(t) - \mathrm{mean}\big[\log\hat\lambda_g\big] \big\}.   (4.20)

By iterating (4.18) to (4.20) until convergence, we obtain the DtBreslow estimator. It can be viewed as a natural extension of the one-way nonparametric estimators (4.14) and (4.15); a sketch of the iteration is given below.
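The following is a rough sketch (not the author's implementation) of the DtBreslow iteration (4.18)–(4.20). It assumes pre-aggregated count matrices `nevent` and `nrisk` whose rows are indexed by vintage j and columns by integer lifetime m = 1, ..., Lm, and a vector `V` of integer vintage origination times, so that calendar time is t = V[j] + m.

```r
dt_breslow <- function(nevent, nrisk, V, tol = 1e-6, max.iter = 100) {
  J <- nrow(nevent); Lm <- ncol(nevent)
  tt <- sort(unique(as.vector(outer(V, 1:Lm, "+"))))     # calendar-time grid
  lambda.g <- setNames(rep(1, length(tt)), tt)           # initial exogenous multipliers
  lambda.f <- rep(NA, Lm)
  for (iter in 1:max.iter) {
    lg.old <- lambda.g
    ## (4.18): lifetime baseline given the exogenous multipliers
    for (m in 1:Lm)
      lambda.f[m] <- sum(nevent[, m]) /
        sum(lambda.g[as.character(V + m)] * nrisk[, m])
    ## (4.19): exogenous multipliers given the lifetime baseline
    for (k in seq_along(tt)) {
      m.idx <- tt[k] - V
      ok    <- m.idx >= 1 & m.idx <= Lm
      num   <- sum(nevent[cbind(which(ok), m.idx[ok])])
      den   <- sum(lambda.f[m.idx[ok]] * nrisk[cbind(which(ok), m.idx[ok])])
      lambda.g[k] <- if (den > 0) num / den else 1
    }
    ## (4.20): re-center log(lambda.g) to mean zero over the positive entries
    lambda.g <- exp(log(lambda.g) - mean(log(lambda.g[lambda.g > 0])))
    if (max(abs(lambda.g - lg.old)) < tol) break
  }
  list(lambda.f = lambda.f, lambda.g = lambda.g, iterations = iter)
}
```

In practice the calendar grid would be restricted to the points with observed defaults and sparse cells binned, as discussed in the text.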

We carry out a simulation study to assess the performance of the DtBreslow estimator. In (4.18), it is usually appropriate to set the initial exogenous multipliers \hat\lambda_g ≡ 1, and then the algorithm could converge in a few iterations. Using the simulation data generated by (4.2), the two-way estimations of λf(m) and λg(t) are shown in Figure 4.2 (top); also plotted are the one-way nonparametric estimation and the underlying truth. For this rectangular diagram of survival data, the one-way and DtBreslow estimations have comparable performance. However, if we use the non-rectangular Lexis diagram (e.g. either pre-2005 vintages or post-2005 vintages only), the DtBreslow estimator demonstrates a significant improvement over the one-way estimation; see Figure 4.2 (middle and bottom). It is demonstrated that the one-way nonparametric methods suffer potential bias in both cases. By comparing (4.18) to (4.14) and (4.15), it is clear that the lifting power of the DtBreslow estimator (e.g. in calibrating λf(m)) depends not only on the exogenous scaling factor λg(t), but also on the vintage-specific weights nrisk_j(m).

4.2.3 MEV Decomposition

Lastly, we consider a three-way nonparametric procedure that is consistent with the MEV (maturation-exogenous-vintage) decomposition framework we have developed for the vintage data analysis in Chapter III. For each vintage with accounts (Uji, Mji, Δji), i = 1, . . . , nj, assume the underlying hazards to be

\lambda_{ji}(m, t) = \lambda_f(m)\,\lambda_g(t)\,\lambda_h(V_j) = \exp\{ f(m) + g(t) + h(V_j) \}.   (4.21)

For identifiability, let E g = E h = 0. Such an MEV hazards model can be viewed as the extension of the dual-time nonparametric model (4.16) with the vintage heterogeneity. Due to the hyperplane nature of the vintage diagram, such that t ≡ Vj + m in (4.21), the linear trends of f, g, h are not identifiable; see Lemma 3.1.

Figure 4.2: DtBreslow estimator vs. one-way nonparametric estimator using the data of (top) both pre-2005 and post-2005 vintages; (middle) pre-2005 vintages only; (bottom) post-2005 vintages only.

Among other approaches to break up the linear dependency, we consider:

1. binning in vintage, lifetime or calendar time, especially for the marginal regions with sparse observations;

2. removing the marginal trend in the vintage dimension h(·), which is an ad hoc approach when we have no prior knowledge about the trend of f, g, h.

Note that the second strategy assumes the zero trend in vintage heterogeneity. Later, in Section 4.4, we discuss also a frailty type of vintage effect.

The nonparametric MLE for the MEV hazard model (4.21) follows the DtBreslow estimator discussed above. Specifically, we may estimate the three conditional marginal effects iteratively by:

a) Maturation effect, for fixed \hat\lambda_g(t) and \hat\lambda_h(v):

\hat\lambda_f(m) \leftarrow \frac{\sum_{j=1}^{J} nevent_j(m)}{\sum_{j=1}^{J} \hat\lambda_h(V_j)\,\hat\lambda_g(V_j + m)\, nrisk_j(m)}, \quad m = m_{[1]}, \ldots, m_{[L_m]};   (4.22)

b) Exogenous effect, for fixed \hat\lambda_f(m) and \hat\lambda_h(v):

\hat\lambda_g(t) \leftarrow \frac{\sum_{j=1}^{J} nevent_j(t - V_j)}{\sum_{j=1}^{J} \hat\lambda_h(V_j)\,\hat\lambda_f(t - V_j)\, nrisk_j(t - V_j)}, \quad t = t_{[1]}, \ldots, t_{[L_t]},
\hat\lambda_g(\cdot) \leftarrow \exp\{\hat g - \mathrm{mean}[\hat g]\}, \ \text{with } \hat g = \log\hat\lambda_g;   (4.23)

c) Vintage effect, for fixed \hat\lambda_f(m) and \hat\lambda_g(t):

\hat\lambda_h(V_j) \leftarrow \frac{\sum_{i=1}^{n_j} I(\Delta_{ji} = 1)}{\sum_{l=1}^{L_m} \hat\lambda_f(m_{[l]})\,\hat\lambda_g(V_j + m_{[l]})\, nrisk_j(m_{[l]})},
\hat\lambda_h(\cdot) \leftarrow \exp\{\hat h - \mathrm{mean}[\hat h]\}, \ \text{with } \hat h = \log\hat\lambda_h,   (4.24)

for j = 1, . . . , J (upon necessary binning). A sketch of this three-way iteration is given right below.
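The following rough sketch (same assumptions and hypothetical inputs as in the DtBreslow sketch above) spells out the three-step MEV iteration (4.22)–(4.24).

```r
mev_decompose <- function(nevent, nrisk, V, tol = 1e-6, max.iter = 100) {
  J <- nrow(nevent); Lm <- ncol(nevent)
  tt <- sort(unique(as.vector(outer(V, 1:Lm, "+"))))
  lambda.g <- setNames(rep(1, length(tt)), tt)          # initial exogenous multipliers
  lambda.h <- setNames(rep(1, J), V)                     # initial vintage multipliers
  lambda.f <- rep(1, Lm)
  for (iter in 1:max.iter) {
    lf.old <- lambda.f
    ## a) maturation effect (4.22)
    for (m in 1:Lm)
      lambda.f[m] <- sum(nevent[, m]) /
        sum(lambda.h * lambda.g[as.character(V + m)] * nrisk[, m])
    ## b) exogenous effect (4.23), then re-center log(lambda.g) to mean zero
    for (k in seq_along(tt)) {
      m.idx <- tt[k] - V; ok <- m.idx >= 1 & m.idx <= Lm
      den <- sum(lambda.h[ok] * lambda.f[m.idx[ok]] * nrisk[cbind(which(ok), m.idx[ok])])
      lambda.g[k] <- if (den > 0) sum(nevent[cbind(which(ok), m.idx[ok])]) / den else 1
    }
    lambda.g <- exp(log(lambda.g) - mean(log(lambda.g[lambda.g > 0])))
    ## c) vintage effect (4.24), then re-center log(lambda.h) to mean zero
    for (j in 1:J)
      lambda.h[j] <- sum(nevent[j, ]) /
        sum(lambda.f * lambda.g[as.character(V[j] + 1:Lm)] * nrisk[j, ])
    lambda.h <- exp(log(lambda.h) - mean(log(lambda.h[lambda.h > 0])))
    if (max(abs(lambda.f - lf.old)) < tol) break
  }
  list(maturation = lambda.f, exogenous = lambda.g, vintage = lambda.h)
}
```

As noted in the text, vintages (and marginal time regions) with few or no defaults should be binned beforehand; otherwise the corresponding multipliers degenerate to zero.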

Setting the initial values \hat\lambda_g ≡ 1 and \hat\lambda_h ≡ 1, and iterating the above three steps until convergence, gives us the nonparametric estimation of the three-way marginal effects. In (4.24), one may also perform the trend-removal \hat\lambda_h(\cdot) \leftarrow \exp\{\hat h - \mathrm{mean{\oplus}trend}[\hat h]\} if a zero trend can be assumed for the vintage heterogeneity.

Consider again the simulation data with both pre-2005 and post-2005 vintages observed on the rectangular Lexis diagram of Figure 4.1. Since there are limited sample sizes for very old and very new vintages, the vintages with Vj ≤ −36 are merged and the vintages with Vj ≥ 40 are cut off; in between, binning is performed in every half year. Given the prior knowledge that the pre-2005 vintages have relatively flat heterogeneity, the zero-trend assumption is reasonable, and for this reason the trend-removal is applied. Then, running the above MEV iterative procedure, we obtain the estimation of the maturation hazard λf(m), the exogenous hazard λg(t) and the vintage heterogeneity λh(v), which are plotted in Figure 4.3 (top). They are consistent with the underlying true hazard functions we have pre-specified in (4.3). Furthermore, we have also tested the MEV nonparametric estimation for the dual-time data with either the pre-2005 vintages only (trapezoidal diagram) or the post-2005 vintages only (triangular diagram). The fitting results are also presented in Figure 4.3, which implies that our MEV decomposition is quite robust to the irregularity of the underlying Lexis diagram.

Remarks: before ending this section, we make several remarks about the nonparametric methods for dual-time survival analysis.

1. First of all, the dual-time-to-default data have an essential difference from the usual bivariate lifetime data that are associated with either two separate failure modes per subject or a single failure mode for a pair of subjects. It is therefore questionable whether the nonparametric methods developed for bivariate survival are applicable to the Lexis diagram; see Keiding (1990).

Figure 4.3: MEV modeling of empirical hazard rates, based on the dual-time-to-default data of (top) both pre-2005 and post-2005 vintages; (middle) pre-2005 vintages only; (bottom) post-2005 vintages only.

Our developments of the DtBreslow and MEV nonparametric estimators for default intensities correspond to the age-period and age-period-cohort models in cohort or vintage data analysis of loss rates; see the previous chapter.

2. For either default intensities or loss rates on the Lexis diagram, one needs to be careful about the intrinsic identification problem, for which we have discussed practical solutions in both of these chapters. The nonparametric models (4.16) and (4.21) underlying the DtBreslow and MEV estimators are special cases of the dual-time Cox models in Section 4.4. In the finite-sample (or discrete) case, the statistical properties of the DtBreslow estimates \hat\lambda_f(m) and \hat\lambda_g(t) can be studied by the standard likelihood theory for the finite-dimensional parameters α or β in (4.17). However, on the continuous domain, the large-sample properties of the DtBreslow estimators are worthy of future study when the dimensions of both α and β increase to infinity at essentially the same speed.

3. We have assumed in (4.17) the endogenous and exogenous hazard rates to be pointwise, i.e. Nelson-Aalen increments on discrete time points, or equivalently assumed the cumulative hazards to be step functions, see (4.12). Smoothing techniques can be applied to the one-way, two-way or three-way nonparametric estimates. Consider e.g. the DtBreslow estimator with output of the two-way empirical hazard rates \hat\lambda_f(m) and \hat\lambda_g(t). One may post-process them by kernel smoothing,

\tilde\lambda_f(m) = \frac{1}{b_f}\sum_{l=1}^{L_m} K_f\Big(\frac{m - m_{[l]}}{b_f}\Big)\hat\lambda_f(m_{[l]}), \qquad \tilde\lambda_g(t) = \frac{1}{b_g}\sum_{l=1}^{L_t} K_g\Big(\frac{t - t_{[l]}}{b_g}\Big)\hat\lambda_g(t_{[l]}),

for m, t on the continuous domains, where K_f(·) and K_g(·), together with the bandwidths b_f and b_g, are kernel functions that are bounded and vanish outside [−1, 1]; see Ramlau-Hansen (1983).

The challenging issues in these smoothing methods are known to be the selection of the bandwidths or smoothing parameters, as well as the correction of boundary effects. Alternatively, one may use the penalized likelihood formulation via smoothing splines, where the smoothness assumptions follow our discussion in the previous chapter; see Gu (2002). More details can be referred to Wang (2005) and the references therein.

4. An alternative way of dual-time hazard smoothing can be based on the vintage data analysis (VDA) developed in the previous chapter. One may first calculate the empirical hazard rates \hat\lambda_j(m) per vintage by (4.11), use them as the pseudo-responses of the vintage data structure as in (3.15) with non-stationary noise, and then perform the Gaussian process modeling corresponding to the MEV model of the previous chapter. On the discrete time domain, the variance of \hat\lambda_j(m) can be estimated by the tie-corrected formula

\hat\sigma_j^2(m) = \frac{\big[nrisk_j(m) - nevent_j(m)\big]\, nevent_j(m)}{nrisk_j^3(m)},

and the \hat\lambda_j(m) are uncorrelated in m; see e.g. Aalen, et al. (2008). Alternatively, we may write \hat\lambda_j(m) = \lambda_j(m)(1 + \varepsilon_j(m)) with E\varepsilon_j(m) = 0 and

E\varepsilon_j^2(m) \approx \frac{\hat\sigma_j^2(m)}{\hat\lambda_j^2(m)} = \frac{1}{nevent_j(m)} - \frac{1}{nrisk_j(m)} \to 0, \quad \text{as } n_j \to \infty,

so that for large-sample observations, by Taylor expansion, \log\hat\lambda_j(m) \approx \log\lambda_j(m) + \varepsilon_j(m).

5. An important message delivered by Figure 4.2 and Figure 4.3 is that the nonparametric estimation of λf(m) in the conventional lifetime-only survival analysis might be misleading when there exists an exogenous hazard λg(t) in the calendar dimension. Both the DtBreslow and MEV nonparametric methods may correct the bias of the lifetime-only estimation, as demonstrated in Figure 4.2 and Figure 4.3 with either pre-2005 only or post-2005 only vintages.
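As a small illustration of the kernel post-smoothing mentioned in remark 3, the sketch below smooths the pointwise DtBreslow estimates on the lifetime scale; the Epanechnikov kernel and the bandwidth value are illustrative choices, not the author's.

```r
smooth_hazard <- function(m.new, m.grid, lambda.hat, bf = 6) {
  K <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)   # bounded, vanishes outside [-1, 1]
  sapply(m.new, function(m) sum(K((m - m.grid) / bf) * lambda.hat) / bf)
}

## e.g. evaluate the smoothed maturation hazard on a finer grid:
## lambda.f.smooth <- smooth_hazard(seq(1, 48, by = 0.5), m.grid, lambda.f.hat)
```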

4.3 Structural Models

We have introduced in Chapter I the structural approach to credit risk modeling, which is popular in the sense that the default events can be triggered by a latent stochastic process when it hits the default boundary. For a quick review, one may refer to (1.6) for the latent distance-to-default process with respect to the zero default boundary, (1.9) for the notion of first-passage-time, and (1.10)–(1.13) for the concrete expression of the hazard rate based on the inverse Gaussian lifetime distribution. In this section, we consider the extension of the structural approach to dual-time-to-default analysis.

4.3.1 First-passage-time Parameterization

Consider the Lexis diagram of dual-time default events for vintages j = 1, . . . , J. Let Xj(m) be the endogenous distance-to-default (DD) process in lifetime and assume it follows a drifted Wiener process

X_j(m) = c_j + b_j m + W_j(m), \quad m \ge 0,   (4.25)

where Xj(0) ≡ cj ∈ R+ denotes the parameter of initial distance-to-default, bj ∈ R denotes the parameter of the trend of credit deterioration, and Wj(m) is a Wiener process. Without getting complicated, let us assume that Wj and Wj′ are uncorrelated whenever j ≠ j′. Recall the hazard rate (or, default intensity) of the first-passage-time of Xj(m) crossing zero:

\lambda_j(m; c_j, b_j) = \frac{\dfrac{c_j}{\sqrt{2\pi m^3}}\exp\Big\{-\dfrac{(c_j + b_j m)^2}{2m}\Big\}}{\Phi\Big(\dfrac{c_j + b_j m}{\sqrt{m}}\Big) - e^{-2 b_j c_j}\,\Phi\Big(\dfrac{-c_j + b_j m}{\sqrt{m}}\Big)},   (4.26)

whose numerator and denominator correspond to the density and survival functions of the inverse Gaussian type of first-passage-time, respectively.
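For reference, here is a small sketch (not the author's code) of the first-passage-time density, survival and hazard functions appearing in (4.26) for the drifted Wiener process hitting the zero boundary.

```r
fpt_density <- function(m, c, b) c / sqrt(2 * pi * m^3) * exp(-(c + b * m)^2 / (2 * m))
fpt_survival <- function(m, c, b)
  pnorm((c + b * m) / sqrt(m)) - exp(-2 * b * c) * pnorm((-c + b * m) / sqrt(m))
fpt_hazard <- function(m, c, b) fpt_density(m, c, b) / fpt_survival(m, c, b)

## e.g. plot the hazard shape for a given initial distance-to-default c and trend b:
## curve(fpt_hazard(x, c = 6, b = -0.02), from = 0.5, to = 120)
```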

In our case. §10): 1. all the hazard rate curves will converge to the constant 0. et al. let us focus on the bounded interval m ∈ [ .02. bj ) + log Sj (Mji . a default event tends to happen sooner.log-likelihood of the j-th vintage observations nj j = i=1 ∆ji log λj (Mji . Then. both the initial distance-to-default cj and the trend of credit deterioration bj have interesting structural interpretations. the quasi-stationary distribution is φ(x) = b2 xe−bx for x ≥ 0 and the limiting hazard rate is given by m→∞ lim λ(m) = 1 ∂φ(x) 2 ∂x = x=0 b2 . with maximum at m∗ such that the derivative λ (m∗ ) = 0. See Figure 1.26) for m ∈ (0. when c gradually approaches zero. let φm (x) be the conditional density of X(m) given that min X(s) > 0 and it is said to 0<s≤m be quasi-stationary if φm (x) is m-invariant. cj . 2. We supplement the gradients of the log-likelihood function at the end of the chapter. et al. bj ) − log Sj (Uji . and it determines the level of stabilized hazard rate as m → ∞. In theory. With reference to Aalen and Gjessing (2001) or Aalen. In Aalen. In Figure 1.2 for the illustration with fixed b and varying c values.2 with b = −0. (4.27) where the maximization can be carried out by nonlinear optimization. bj ). (2008. 2 which does not depend on the c-parameter. (2008. As their literal names indicate. the hazard rate would be shaped from (a) monotonic increasing to (b) first-increasingthen-decreasing and finally to (c) monotonic decreasing.0002 after a long run. hence moving the concentration of the defaults towards left. The b-parameter represents the trend of DD process. When c becomes smaller. ∞) is always first-increasing-then-decreasing. §10). mmax ] for some > 0. cj . 105 . cj . The c-parameter represents the initial value of DD process. In practice. the hazard function (4.

g. To achieve that.28) where Xj (m) is the endogenous DD process (4. §6.29) w.The above first-passage-time parameterization is only on the endogenous or maturation hazard in lifetime. in the sense of letting the ‘average’ DD process be captured endogenously. 106 . see e.25) and Ψj (m) is an increasing timetransformation function such that all vintages share the same integrand ψ(t) > 0 in calendar time. and Dj (m) = bj 0 ψ(Vj + s) − ψ(Vj + s) ds. Next.30) Let us introduce a new process Xj (m) such that dXj (m) = introduce a time-varying default boundary m ψ(Vj + m)dXj (m). one may rewrite it as ∗ dXj (m) = ψ(Vj + m)dXj (m) − bj ψ(Vj + m) − ψ(Vj + m) dm. Alternatively.t. Lawless (2003. the zero default boundary.4. we model the dual-time DD process for each vintage Vj by a subordinated process m ∗ Xj (m) = Xj (Ψj (m)). with Ψj (m) = 0 ψ(Vj + s)ds (4.16) with correspondence σi = 1 and ψ(·) = e−µi . The exogenous function ψ(t) rescales the original lifetime increment ∆m to be ψ(Vj + m)∆m. (4. Such timetransformation approach can be viewed as a natural generalization of the log-locationscale transformation (1. It has been used to extend the accelerated failure time (AFT) models with time-varying covariates. We set the constraint E log ψ = 0 on the exogenous effect. we extend the first-passage-time structure taking into account the exogenous effect in calendar time. and therefore transforms the diffusion dXj (m) = bj dm + dWj (m) to ∗ dXj (m) = bj ψ(Vj + m)dm + ψ(Vj + m)dWj (m) (4.r.3).

1. Then.6).26). In our case the exogenous time-scaling effects are often dynamic and volatile. for which we recommend a piecewise estimation procedure based on the discretized Lexis diagram. m] = P Xj (s) > 0 : s ∈ [0. cj . Nonparametric techniques can be used if extra properties like the smoothness can be imposed onto the function ψ(t). it is straightforward to obtain the survival function of the first-passage-time τj∗ by ∗ ∗ Sj (m) = P Xj (s) > 0 : s ∈ [0. The complication lies in the arbij trary formation of ψ(t) involved in the time-transformation function. By time-transformation (4. with j given ∗ by (4. the exogenous function ψ(t) not only maps the benchmark hazard rate to the rescaled time Ψj (m).28).32). the joint log-likelihood is given by = J j=1 j . Discretize the Lexis diagram with ∆m = ∆t such that t = tmin + l∆t for l = 107 . Now consider the parameter estimation of both vintage-specific structural parameters (cj . by (4.31) where the benchmark survival Sj (·) is given by the denominator of (4. bj ) and the exogenous time-scaling function ψ(t) from the dual-time-to-default data (4. but also scales it by ψ(t). By (4. cj . Ψj (m)] m = Sj (Ψj (m)) = Sj 0 ψ(Vj + s)ds (4.30).21). the hazard rate of τj∗ is obtained by λ∗ (m) = j ∗ − log Sj (m) − log Sj (Ψj (m)) dΨj (m) = = λj (Ψj (m))Ψj (m) dm dΨj (m) dm m = λj 0 ψ(Vj + s)ds ψ(Vj + m) (4. From (4.∗ Then.32) with λj (·) given in (4. it is clear that the time-to-default τj∗ = inf{m ≥ 0 : Xj (m) ≤ 0} = inf{m ≥ 0 : X(m) ≤ Dj (m)}.27) based on λ∗ (m. This is the same as the AFT regression model (1.1).26). bj ) and Sj (m. bj ).

27) with Mji replaced by Ψj (Mji ).0. In Step (3. ψL ) in a time-frequency perspective. . . Specify ψ(t) to be piecewise constant ψ(t) = ψl . 2. 1. Run nonlinear optimization for estimating (ci . . get the corresponding likelihood contribution: [l] = (j. Then. i) : Vj + Mji = tmin + l∆t}. . for Vj + m ≥ tmin which reduces to tmin − Vj + (Vj +m−tmin )/∆t l=1 ψl ∆t in the case of left truncation.. . For each of l = 1. . (b) For all fixed (cj . . where tmin is set to be the left boundary and tmin + L∆t corresponds to the right boundary. . for t − tmin ∈ (l − 1)∆t. . . ψL ) for the purpose of reducing the dimension of parameter space in constrained optimization. . ψL ) by maximizing subject to the constraint L l=1 L l=1 [l] log ψl = 0. 2. . . L. Such ideas will be investigated in our future work. . obtain the MLE (cj . bj ). . One may also apply the multi-resolution wavelet bases to represent {ψ1 . 108 . . 2. . obtain the MLE of (ψ1 .b). . . l∆t . . .i)∈G[l] ∆ji log λj (Ψj (Mji )) + log ψl + log Sj (Ψj (Mji )) − log Sj (Ψj (Uji )) where G[l] = {(j. tmin − Vj } if tmin is the systematic left truncation in calendar time. . and Ψj (Uji ) = max{0. . bi ) and ψl iteratively: (a) For fixed ψ and each fixed vintage j = 1. the time-transformation for the vintage with origination time Vj can be expressed as (Vj +m−tmin )/∆t Ψj (m) = l=(Vj −tmin )/∆t ψ(tmin + l∆t)∆t. one may perform further binning for (ψ1 . 3. . bj ) by maximizing (4. J. . L and ψ(t) ≡ 1 for t ≤ tmin . l = 1. . .

the default occurs at τji = inf{m > ∗ 0 : Xji (m) ≤ 0}.34) are not time-varying.31) and (4. and suppose the subject-specific DD process   ∗  X (m) = Xji (Ψji (m)). we ∗ may specify Ψji (m) = m exp{θT zji } and obtain the AFT regression model log τji = θT zji + log τji with inverse Gaussian distributed τji .32). Mji ] be the dynamic covariates observed since left truncation. ji  X (m) = c + b m + W (m)  ji ji j ji (4.35) with an independent Wiener process Wji (m). we obtain the hazard rate λ∗ (m) = ji λji (Ψji (m))Ψji (m). Note that when the covariates in (4. θ). the initial value cji and the vintage∗ specific trend of credit deterioration bj .3.33) exp{θT zji (s)}ds. or cji + bj Ψji (m) exp − 2Ψji (m) 2πΨ3 (m) ji cji Φ cji + bj Ψji (m) Ψji (m) −e −2bj cji 2 Ψji (m) (4.2 Incorporation of Covariate Effects It is straightforward to extend the above structural parameterization to incorpo˜ rate the subject-specific covariates. the time-transformation (4. let zji be the static covariates observed upon origination (including the intercept and otherwise centered predictors) and zji (m) for m ∈ (Uji .28). i). m ≥ Uji Uji Ψji (m) = Uji − Vj + (4. Let us specify ˜ cji = exp{η T zji } m (4.36) λ∗ (m) = ji Φ −cji + bj Ψji (m) Ψji (m) 109 . Then. By the same arguments as in (4.34) becomes a parametric version of (4. In the presence of calendar-time state variables x(t) such that zji (m) = x(Vj +m) for all (j.4.34) subject to the unknown parameters (η. For each account i with origination Vj .

e. for which we make an attempt to use the time-transformation technique (4.26) is based on the fixed effects of initial distanceto-default and trend of deterioration. The parameters of interest include (η. 2. there has been limited discussion on incorporation of dynamic covariates in the first-passage-time approach. (2008. the endogenous hazard rate is given by p(m)/S(m) with (c + µt)2 exp − p(m) = 2(t + σ 2 t2 ) 2π(m3 + σ 2 m4 ) c + µm −c + µm − 2σ 2 cm 2 2 √ S(m) = Φ √ − e−2µc+2σ c Φ m + σ 2 m2 m + σ 2 m2 c . For example. . One may refer to Aalen. In the first-passage-time parameterization with inverse Gaussian lifetime distri110 . . the log-likelihood function follows the generic form (4. There is no unique way to incorporate the static covariates in the first-passagetime parameterization.28) under nonparametric setting and (4. J. when the trend parameter b ∼ N (µ. However. §10) about the random-effect extension.34) under parameter setting. . .1). Standard nonlinear optimization programs can be employed. The endogenous hazard rate (4. see Lee and Whitmore (2006). while these structural parameters can be random effects due to the incomplete information available (Giesecke. 2006). For example. the “optim” algorithm of R (stats-library) and the “fminunc” algorithm of MATLAB (Optimization Toolbox).g. et al.∗ and the survival function Sji (m) = Sji (Ψji (m)) given by the denominator in (4. 3.6).36). while they may also be incorporated into other structural parameters. Remarks: 1. and they can be estimated by MLE. Given the dual-time-to-default data (4.33). we considered including them by the initial distance-to-default parameter (4. θ) and bj for j = 1. σ 2 ) and the initial value c is fixed.

38) is too relaxed to capture the interrelation of vintage-specific baselines. (4.4 Dual-time Cox Regression The semiparametric Cox models are popular for their effectiveness of estimating the covariate effects. Mji ]. Using the traditional CoxPH modeling approach. Suppose that we are given the dual-time-to-default data (4.26) or (4.6) with the plugged-in (4.37) or adopt the stratified Cox model with arbitrary vintage-specific baselines λji (m) = λj0 (m) exp{θ T zji (m)} (4. (4. one may either adopt an arbitrary baseline in lifetime λji (m) = λ0 (m) exp{θ T zji (m)}. one may rely on the crude nonlinear optimization algorithms. . which may include ˜ ˜ the static covariates zji (m) ≡ zji . . see the supplementary material. these two Cox models may not be suitable for (4.38) for i = 1.1) together with the covariates zji (m) observed for m ∈ (Uji . However.36) are rather messy. 4.bution. . L.1 whose vintage-specific hazards are interrelated not only in lifetime m but also in calendar time t.37) is too stringent to capture the variation of vitnage-specific baselines. . For example. the log-likelihood functions (4. nj and j = 1. 2. Our discussion in this section mainly illustrates the feasibility of dual-time structural approach. 111 . Without gradient provided.27) and (4. . (4.1) observed on the Lexis diagram in the following sense: 1. and the parameter estimation follows the standard MLE procedure. . we have simulated a Lexis diagram in Figure 4. . especially when we attempt to evaluate the derivatives of the log-likelihood function in order to run gradient search. .

21). we restrict ourselves to the dual-time Cox models with MEV baselines and relative-risk multipliers λji (m.37) and (4. t).1 Dual-time Cox Models The dual-time Cox models considered in this section take the following multiplicative hazards form in general.38).4. one may always find Lm distinct lifetime points m[1] < · · · < m[Lm ] and L distinct calendar time points t[1] < · · · < t[Lt ] such that there exist at least one default event observed at the corresponding marginal time. the nonparametric model (4. Some necessary binning or trendremoval need to be imposed on g.39) where λ0 (m. let us treat the endogenous baseline hazard λf (m) and the exogenous baseline hazard λg (t) as pointwise functions of the form (4. Following the least-information assumption of Breslow (1972). t) = λf (m)λg (t)λh (t − v).38) can be included as special cases. λji (m. t) is a bivariate base hazard function on Lexis diagram. For the discrete set of vintage originations.e. we assume the MEV decomposition framework of the base hazard λ0 (m. i. t) = λ0 (m. In order to balance between (4. h to ensure the model estimability.1) on Lexis diagram.17) subject to Lt l=1 βl = 0.1. the traditional models (4. Upon the specification of λ0 (m.4.37) and (4. the vintage heterogeneity effects can be regarded 112 . t) = λf (m)λg (t)λh (Vj ) exp{θ T zji (m)} = exp{f (m) + g(t) + h(Vj ) + θ T zji (m)} (4. Given the finite-sample observations (4.40) where λf (m) = exp{f (m)} represents the arbitrary maturation/endogenous baseline. see Lemma 3. t) exp{θ T zji (m)} (4. Thus. λg (t) = exp{g(t)} and λh (v) = exp{h(v)} represent the exogenous multiplier and the vintage heterogeneity subject to Eg = Eh = 0.

β and γ to satisfy the zero sum constraints for β and γ. In what follows. We may either assume the pointwise function of the form h(v) = J j=1 γj I(v = Vj ) subject to constraints on both mean and trend (i. both endogenous and exogenous baseline hazards λf (m) and λg (t) defined on the continuous dual-time scales (m. . 113 .e. essentially. to which the theory of partial likelihood applies.40) involves two nonparametric components that need to be profiled out for the purpose of estimating the parametric effects. (Note that it is rather straightforward to convert the resulting parameter estimates α. as the dual-time Cox models (4.40) is parametrized with dummy variables to be Lm Lt λji (m. t) tend to be infinite-dimensional. κ = 1.41) where the vintage buckets Gκ = {j : Vj ∈ (νκ−1 .42) + κ=2 γκ I(j ∈ Gκ ) + θ T zji (m) where β1 ≡ 0 and γ1 ≡ 0 are excluded from the model due to the identifiability constraints.45) to be studied.6). if j ∈ Gκ . β. one may then find the MLE of (α. t) = exp l=1 K αl I(m = m[l] ) + l=2 βl I(t = t[l] ) (4. γ. or assume the piecewise-constant functional form upon appropriate binning in vintage originations h(Vj ) = γκ . . the ad hoc approach). Plugging it into the joint likelihood (4. . the semiparametric model (4. θ). Then. The semiparametric estimation problem in this case is challenging.) As the sample size tends to infinity. K (4. νκ ]} are pre-specified based on Vj− ≡ ν0 < · · · < νK ≡ VJ (where K < J). hence making intractable the above pointwise parameterization.as the categorical covariates effects. we take an “intermediate approach” to semiparametric estimation via exogenous timescale discretization. . so is it for (4.

tl ] in (4. 1975).4. (4. tl ])}L . and zji (m) as a long vector κ=2 l=2 xji (m). . assume the exogenous baseline λg (t) is defined on the L cutting points (and zero otherwise) L g(t) = l=1 βl I(t = tmin + l∆t).2 Partial Likelihood Estimation Consider an intermediate reformulation of the dual-time Cox models (4.40): L λji (m) = λf (m) exp l=2 K βl I(Vj + m ∈ (tl−1 . the parameter ξ can be estimated by maximizing 114 . θ) ≡ ξ. and the vintage effect is bucketed by (4. g(t) for any t ∈ (tl−1 . write for each (j. Then. Then.44) subject to L l=1 βl = 0. the (K −1)-vector of {I(Vj ∈ (νκ−1 . (If the Lexis diagram is discrete itself. γ.45) where λf (m) is an arbitrary baseline hazard. the reformulated CoxPH model (4. consider the one-way CoxPH reformulation of (4.40) is shifted to g(tl ) in (4.42).41). L (4.43) for some time increment ∆t bounded away from zero. . We consider the partial likelihood estimation of the parameters (β. i.45) is equivalent to the fully parametrized version (4. The irregular time spacing is feasible. tl ]) γκ I(j ∈ Gκ ) + θ T zji (m) κ=2 + (4. too. by Cox (1972. . for l = 0.44). It is easy to see that if the underlying Lexis diagram is discrete.) On each sub-interval. where tmin corresponds to the left boundary and tmin + L∆t corresponds to the the right boundary of the calendar time under consideration. For convenience of discussion. νκ ])}K .4. m) the (L − 1)-vector of {I(Vj + m ∈ (tl−1 . ∆t is automatically defined. .40) by exogenous timescale discretization: tl = tmin + l∆t. 1.

the endogenous cumulative baseline hazard Λf (m) = m 0 λf (s)ds can be estimated by the Breslow estimator J nj Λf (m) = j=1 i=1 I(m ≥ Mji )∆ji (j .9). It is well-known in the CoxPH context that both estimators ξ and Λf (·) are consistent and asymptotically normal. Given the maximum partial likelihood estimator ξ.the partial likelihood J nj PL(ξ) = j=1 i=1 exp{ξ T xji (Mji )} T (j . Without the account-level covariates. i ) : Uj i < Mji ≤ Mj i } is the at-risk set at each observed Mji . the hazard rates λji (m) roll up to λj (m). N Λf (m) − Λf (m) converges to a zero-mean Gaussian process.2 can be studied by the theory of partial likelihood above.i )∈Rji exp ξ xj i (Mji ) ∆ji (4. h) = j=1 m exp{g(Vj + m) + h(Vj )} J j =1 neventj (m) nriskj (m) exp{g(Vj + m) + h(Vj )} (4.41).48) where g(t) is parameterized by (4.46) where Rji = {(j . √ Specifically. J PL(g.45) with θ ≡ 0. (1993). since the underlying nonparametric models are special cases of (4. h) and centering in the sense of 115 .47) which is a step function with increments at m[1] < · · · < m[Lm ] . For the nonparametric baseline. see also Andersen. see Andersen and Gill (1982) based on counting process martingale theory.44) and h(v) is by (4. under which the partial likelihood function (4. Let λg (t) = exp{g(t)} and λh (v) = exp{h(v)} upon maximization of PL(g. et al. The DtBreslow and MEV estimators we have discussed in Section 4.46) can be expressed through the counting notations introduced in (4. where √ I(ξ) = −∂ 2 log PL(ξ) ∂ξξ T . N (ξ − ξ) converges to a zero-mean multivariate normal distribution with covariance matrix that can be consistently estimated by {I(ξ)/N }−1 .i )∈Rji exp ξ xj i (Mji ) T (4.

49) are nearly identical to the plots shown in Figure 4. where the raw data (4. Standard softwares with survival package or toolbox can be used to perform the parameter estimation. They provide an alternative approach to model the vintage effects as the random block effects.22) of the MEV estimator. we may convert (4. 4. by (4.2. They also reduce to (4.3 Frailty-type Vintage Effects Recall the frailty models introduced in Section 1. see the supplementary material in the last section of the chapter.46) to find θ and using (4.47). .1) are converted to the counting process format upon the same binning preprocessing as in Section 4. .18) of DtBreslow estimator in the absence of vintage heterogeneity.45) can be carried out by maximizing the partial likelihood (4.3.3 about their double roles in charactering both the odd effects of unobserved heterogeneity and the dependence of correlated defaults.48) and λf by (4. Let Zκ for κ = 1. we have tested the partial likelihood estimation using the coxph of R:survival package and the simulation data in Figure 4.g. K be a random sample from 116 .1.1) to an enlarged counting process format with piecewise-constant approximation of time-varying covariates xji (m). In the case when the coxph procedure in these survival packages (e.49) It is clear that the increments of Λf (m) correspond to the maturation effect (4.Eg = Eh = 0.3. R:survival) requires the time-indepenent covariates. (4. we have the Breslow estimator of the cumulative version of endogenous baseline hazards m J j=1 J j=1 Λf (m) = 0 neventj (s)ds nriskj (s)λg (Vj + s)λh (Vj ) . λh by maximizing (4. Then. The implementation of dual-time Cox modeling based on (4. .47) to estimate the baseline hazard. The numerical results of λg . For example.4. .

L (d) (x) is its d-th derivative. The log-likelihood based on the dual-timeto-default data (4. . For a fixed δ. see the supplementary material at the end of the chapter. see Nielsen. nj and j ∈ Gκ for κ = 1. It is assumed that given the frailty Zκ . Alternatively. while the withinbucket dependence is captured by the same realization Zκ . 117 . We also write λji (m. (4. It has the Laplace transform L (x) = (1+ δx)−1/δ whose d-th derivative is given by L (d) (x) = (−δ)d (1 + δx)−1/δ−d d i=1 (1/δ + i − 1). . .45). Consider the dual-time Cox frailty models λji (m. Then. where Zκ represents the shared frailty level of the vintage bucket Gκ according to (4. we may write (4.50) for i = 1. The gamma distribution Gamma(δ −1 .some frailty distribution.41). t|Zκ ) = λf (m)λg (t)Zκ exp{θ T zji (m)}. dκ = j∈Gκ nj i=1 ∆ji and Aκ = j∈Gκ nj i=1 Mji Uji ˜T ˜ λf (s) exp ξ xji (s) ds. if j ∈ Gκ (4. the observations for all vintages in Gκ are independent. . et al. Similar to the intermediate approach taken by the partial likelihood (4. often used is the EM (expectation-maximization) algorithm for estimating (λf (·). t|Zκ ) as λji (m|Zκ ) with t ≡ Vj + m. δ) with mean 1 and variance δ is the most frequently used frailty due to its simplicity.50) l=2 ˜T ˜ as λji (m|Zκ ) = λf (m)Zκ exp ξ xji (m) . K.51) where L (x) = EZ [exp{−xZ} is the Laplace transform of Z. . (1992) or the supplementary material. the gamma frailty models can be estimated by the penalized likelihood approach. In this frailty model setting. .46) under ˜ ˜ exogenously discretized dual-time model (4. . . tl ])}L and zji (m).51) as a kind of regularization. ξ).1) is given by J nj T K = j=1 i=1 ˜ ˜ ∆ji log λf (Mji ) + ξ xji (m) + κ=1 log (−1)dκ L (dκ ) (Aκ ) . let us write ξ = (β. by treating the second term of (4. θ) and let xji (m) join the vectors of {I(Vj + m ∈ (tl−1 . the between-bucket heterogeneity is captured by realizations of the random variate Z.

we recommend a two-stage procedure with the first stage fitting the dual-time Cox models (4. Such z(t) might be entertained by the traditional lifetime Cox models (4.Therneau and Grambsch (2000. and the second stage modeling the correlation between z(t) and the estimated λg (t) based on the time series techniques. Remarks: 1. In practice.6) justified that for each fixed δ.52) where ξ = (β. such that the existing theory of partial likelihood and the asymptotic results can be applied. γ. The dual-time Cox model (4.46). The dual-time Cox regression considered in this section is tricked to the standard CoxPH problems upon exogenous timescale discretization. in order to investigate the exogenous covariate effects of z(t).40) includes only the endogenous covariates zji (m) that vary across individuals. §9. it is also interesting to see the effects of exogenous covariates z(t) that are invariant to all the individuals at the same calendar time. the EM algorithmic solution coincides with the maximizer of the following penalized log-likelihood J nj ∆ji j=1 i=1 1 log λf (Mji ) + ξ xji (m) − δ T K γκ − exp{γκ } κ=1 (4. We have not studied the dual-time asymptotic behavior when both lifetime and calendar time are considered to be absolutely continuous. since exp{θ T z(t)} would be absorbed by the exogenous baseline λg (t). they are not estimable if included as a log-linear term by (4. They also provide the programs (in R:survival package) with Newton-Raphson iterative solution as an inner loop and selection of δ in an outer loop. 2.40). 118 . Fortunately.40). θ) and xji (m) are the same as in (4.50) up to certain modifications. Therefore.37) upon the time shift. however. the time-discretization trick works perfectly. In practice with finite-sample observations. these programs can be used to fit the dual-time Cox frailty models (4.

These attributes may be either treated as the endogenous covariates. acquisition channels. In this section we demonstrate the application of dual-time survival analysis to credit card and mortgage portfolios in retail banking. see Figure 4.4. 4. Other variables like the credit line and utilization rate are not considered here. the Merton model) are not directly applicable to retail credit.g. It turns out that many existing models for corporate credit risk (e. there have been large gaps among academic researchers. Medium if 640 ≤ FICO < 740 and High if FICO ≥ 740. some tweaked samples are considered here.4. and use the mortgage loanlevel data to illustrate the dual-time Cox regression modeling. banks’ practitioners and governmental regulators. or used to define the multiple segments. we demonstrate the credit card risk modeling with both segmentation and covariates.5 Applications in Retail Credit Risk Modeling The retail credit risk modeling has become of increasing importance in the recent years.2). see among others the special issues edited by Berlin and Mester (Journal of Banking and Finance. 2005). Here.1 Credit Card Portfolios As we have introduced in Section 1. we consider two generic segments and a categorical covariate with three levels. we have assessed them under the simulation mechanism (4. Before applying these DtSA techniques. Based on our real dataanalytic experiences. 119 .2 and Figure 4. geographical regions and the like. the credit card portfolios range from product types. we employ the pooled credit card data for illustrating the dual-time nonparametric methods. Specifically.3. In quantitative understanding of the risk determinants of retail credit portfolios. where the categorical variable is defined by the credit score buckets upon loan origination: Low if FICO < 640. For simplicity. 2004) and Wallace (Real Estate Economics.5. only for the purpose of methodological illustration.

Let us consider the card vintages originated from year 2001 to 2008, with observations (a) left truncated at the beginning of 2005, (b) right censored by the end of 2008, and (c) up censored by 48 months-on-book maximum, which resembles the rectangular Lexis diagram of Figure 4.1; refer also to Figure 1.3 about the vintage diagram of dual-time data collection. For each FICO bucket in each segment, one may aggregate the observations at the vintage level, where each single vintage at each lifetime interval (U, M] has two records of weight W, namely the number of survived loans (D = 0) and the number of defaulted loans (D = 1). See Table 4.1 for an illustration of the counting format of the input data.

Table 4.1: Illustrative data format of pooled credit card loans
(V: vintage; (U, M]: lifetime interval; D: status; W: counts)

  SegID  FICO    V       U   M   D   W
  1      Low     200501  0   3   0   988
  1      Low     200501  0   3   1   2
  1      Low     200501  3   4   0   993
  1      Low     200501  3   4   1   5
  1      Low     200501  4   5   0   979
  1      Low     200501  4   5   1   11
  ...    ...     ...     ..  ..  ..  ...
  1      Low     200502  ..  ..  ..  ...
  1      Medium  ...     ..  ..  ..  ...
  2      ...     ...     ..  ..  ..  ...

Note that Table 4.1 also illustrates that in the lifetime interval [3, 4], the 200501 vintage has 3 attrition accounts (i.e. 993 − 979 − 11), which are examples of being randomly censored.

First of all, we explore the empirical hazards of the two hypothetical segments, regardless of the FICO variable. We applied the one-way, two-way and three-way nonparametric methods developed in Section 4.2 to each segment of dual-time-to-default data. Note that given the pooled data with weights W for D = 0, 1, it is rather straightforward to modify the iterative nonparametric estimators of Section 4.2 accordingly, and one may also use the weighted partial likelihood estimation based on our discussion in Section 4.4.

Figure 4.4: Nonparametric analysis of credit card risk: (top) one-way empirical hazards; (middle) DtBreslow estimation; (bottom) MEV decomposition.

but the exogenous cycle effects show dramatic difference. and we still take the nonparametric approach for each of 2 × 3 segments.16). The middle panel shows the results of DtBreslow estimator that simultaneously calibrates the endogenous seasoning λf (m) and the exogenous cycle effects λg (t) under the two-way hazard model (4. the credit cycle of Segment 2 grows relatively steady. the seasoning effects are similar. the vintage performance in both segments have deteriorated from pre-2005 to 2007. where we may compare the performances of two segments as follows. consider the three FICO buckets for each segment. The top panel shows the empirical hazard rates in both lifetime and calendar time.the iterative nonparametric estimators in (4.14) – (4. and made a turn when entering 2008. By such one-way estimates.5. 3. lower hazards) than Segment 2. Conditional on the endogenous baseline. The numeric results for both segments are shown in Figure 4. 2. and it has also better calendar-time performance than Segment 2 except for the latest couple of months. The MEV decomposition of empirical hazards are shown in Figure 4. where we consider only the yearly vintages for simplicity. while Segment 1 has a sharp exogenous trend since t = 30 (middle 2007) and exceeds Segment 1 since about t = 40. It is found that in lifetime the low. Compared to the one-way calibration above.4.4. Next. Specifically. since the more heterogeneity effects are explained by the vintage originations. the less variations are attributed to the endogenous and exogenous baselines. Note that Segment 2 has a dramatic improvement of vintage originations in 2008. medium 122 .24). Segment 1 has shown better lifetime performance (i.e. and one may also use the weighted partial likelihood estimation based on our discussion in Section 4. 1. The estimated of λf (m) and λg (t) are both affected after adding the vintage effects. The bottom panel takes into the vintage heterogeneity.

and high FICO buckets have decreasing levels of endogenous baseline hazard rates, where for Segment 2 the hazard rates are evidently non-proportional. The exogenous cycles of the low, medium and high FICO buckets co-move (more or less) within each segment. As for the vintage heterogeneity, it is interesting to note that for Segment 1 the high FICO vintages keep deteriorating from pre-2005 to 2008, which is a risky signal. The vintage heterogeneity in the other FICO buckets is mostly consistent with the overall pattern in Figure 4.4.

This example mainly illustrates the effectiveness of the dual-time nonparametric methods when there are limited or no covariates provided. By the next example of mortgage loans, we shall demonstrate the dual-time Cox regression modeling with various covariate effects on competing risks.

Figure 4.5: MEV decomposition of credit card risk for low, medium and high FICO buckets: Segment 1 (top panel), Segment 2 (bottom panel).

a parametric setting).S. Considered here is a tweaked sample of mortgages loans originated from year 2001 to 2007 with Lexis diagram of observations (a) left truncated at the beginning of 2005. the prime mortgages generally target the borrowers with good credit histories and normal down payments. residential mortgage market is huge. et al.S. where we use the up-to-date data sources from S&P/Case-Shiller and U.7 million loans) with a distribution of $1. and (c) up censored by 60 months-on-book maximum.2 Mortgage Competing Risks The U. see U. (b) right censored by October 2008.a. the 2007-08 collapse of mortgage market has rather complicated causes.6 for the plots of 2000-08 home price indices and unemployment rate. and the remaining being government guaranteed. the first-lien mortgage debt is estimated to be about $10 trillion (for 54. We make an attempt via dual-time survival analysis to understand the risk determinants of mortgage default and prepayment.2 trillion prime and near-prime combined. As of March 2008. $8. In brief. Federal Researve report by Frame. moral hazard).37) with presumed fixed lifetime baselines (therefore. see Figure 4. where the regional macroeconomic conditions have become worse and worse since middle 2006.4. The rise in mortgage defaults is unprecedented. 124 . (2008) and Goodman. Sherlund (2008) considered subprime mortgages by proportional hazards models (4. while the subprime mortgages are made to borrowers with poor credit and high leverage.5. Bureau of Labor Statistics. Quantitatively. the fall of home prices. the increased leverage. One may refer to Gerardi. Mayer.2 trillion subprime mortgages. the originate-thensecuritize incentives (a. In retrospect.k. and the worsening unemployment rates. Pence and Sherlund (2009) described the rising defaults in nonprime mortgages in terms of underwriting standards and macroeconomic factors. These loans resemble the nonconforming mortgages in California. (2008) for some further details of the subprime issue. Lehnert and Prescott (2008). including the loose underwriting standards.S. et al.

prepayment 125 .6: Home prices and unemployment rate of California state: 2000-2008.7: MEV decomposition of mortgage hazards: default vs.Figure 4. Figure 4.

as well as the vintage heterogeneity effects. we may fit two marginal hazard rate models for default and prepayment. The fitting results of MEV hazard decomposition are shown in Figure 4.6. One practically simple approach is to reduce the multiple events to individual survival analysis by treating alternative events as censored. In contrast. 2005 2006. §8) about robust variance estimation for the marginal Cox models of competing risks.7 for both default and prepayment. The exogenous cycle of the default risk has been low (with exogenous multiplier less than 1) prior to t = 20 (August 2006). It is obvious that the rise in mortgage defaults follows closely 126 . §9) for the modeling issues on survival analysis of competing risks. Conditional on the estimated endogenous baseline hazards of default or prepayment. In contrast. the vintage heterogeneity of prepayment keeps decreasing from pre2005 to 2007. et al. (2008). then become rising sharply up to about 50 times more risker at about t = 43 (July 2008). the exogenous cycle of the prepayment risk has shown a modest town-turn trend during the same calendar time period. respectively. One may refer to Therneau and Grambsch (2000. We begin with nonparametric exploration of endogenous and exogenous hazards.For a mortgage loan with competing risks caused by default and prepayment. the monthly vintages are grouped yearly to be pre-2005. the observed event time corresponds to time to the first event or the censoring time. Compare the estimated exogenous effects on default hazards to the macroeconomic conditions in Figure 4. Similar to the credit card risk modeling above. These findings match the empirical evidences presented lately by Gerardi. One may refer to Lawless (2003. we find that 1. 2. The vintage heterogeneity plot implies that the 2005-06 originations have much higher probability of default than the mortgage originations in other years. Then. 2007 originations.

t) = λf (m)λ(q) (t) exp θ1 CLTVji + θ2 FICOji + θ3 NoteRateji (m) g + θ4 DocFullji + θ5 IOji + θ6. because (a) the house prices and unemployment in California are themselves highly correlated from 2005 to 2008.p Purpose. we try to decode the vintage effect of unobserved heterogeneity by adding the loan-level covariates. One may of course perform multivariate time series analysis in order to study the correlation between exogenous hazards and macroeconomic indicators. §1) or Wikipedia. The dual-time Cox regression models considered here take the form of (4. where CLTV stands for the combined loan-to-value leverage ratio. We also include the mortgage note rate. adjustable).r Purpose. we consider only the few mortgage covariates listed in Table 4. and (b) the exogenous effects on default hazards could be also caused by other macroeconomic indicators.4) with both endogenous and exogenous baselines. However. (4. Next.purchaseji + θ6. and they could be way more complicated in real situations.the fall of home prices (positively correlated) and the worsening unemployment rate (negatively correlated). FICO is the same measurement of credit score as in the credit card risk. standard deviation and range statistics presented for the continuous covariates CLTV. either fixed or floating (i.2. and other underwriting covariates upon loan origination can be referred to Goodman. it is not possible to quantify what percentage of exogenous effects should be attributed to either house prices or unemployment. Note that the categorical variable settings are simplified in Table 4. based on our demonstration data.e. et al. After incorporating the underwriting and dynamic covariates in Table 4.. For demonstration purpose. The mean. which is beyond the scope of the present thesis.2.53) (q) (q) (q) (q) 127 .2. (2008. to demonstrate the capability of Cox modeling with time-varying covariates. FICO and NoteRate are calculated from our tweaked sample. we have (q) (q) (q) (q) (q) λji (m.refiji .

p and Purpose.841 −0.285 FICO −0. high NoteRate.944 0.5. refi or cash-out refi Table 4. and cash-out refi Purpose.432 −0.4.Table 4. 128 .3: Maximum partial likelihood estimation of mortgage covariate effects in dual-time Cox regression models. From Table 4.r −0. while in practice one may need strive to find the appropriate functional forms or break them into multiple buckets.091 DocFull 1. σ = 35 and range [530. the continuous type of covariates FICO. where NoteRate could be dynamic and others are static upon origination.307 where the superscript (q) indicates the competing-risk hazards of default or prepayment.2: Loan-level covariates considered in mortgage credit risk modeling. CLTV µ = 78.385 NoteRate 1.1 and range [1.284 0.3. low FICO.2862 −1.500 0.656 −0.202 0.3. 840] NoteRate µ = 6.r against the standard level of cash-out refi).p −1. we find that conditional on the nonparametric dual-time baselines. For simplicity. σ = 13 and range [10. we obtain the numerical results tabulated in Table 4.185 Purpose. limited documentation. we have assumed the log-linear effects of all 3 continuous covariates on both default and prepayment hazards. The default risk is driven up by high CLTV. σ = 1.141 IO Purpose. and the 3-level Purpose covariate is broken into 2 dummy variables (Purpose. where the coefficient estimates are all statistically significant based on our demonstration data (some other non-significant covariates were actually screened out and not considered here). Covariate Default Prepayment CLTV 2. By the partial likelihood estimation discussed in Section 4. Interest-Only. 1. 11] Documentation full or limited IO Indicator Interest-Only or not Loan Purpose purchase. CLTV and NoteRate are all scaled to have zero mean and unit variance. 125] FICO µ = 706.
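As a concrete illustration of the marginal modeling of the two competing risks described above (each treating the alternative event as censoring), the following is a hedged sketch with hypothetical column names: an event code taking values "default", "prepay" or "censor", counting-process times, the covariates of Table 4.2, and pre-computed calendar bins and vintage buckets.

```r
library(survival)

fit.default <- coxph(Surv(start, stop, event == "default") ~ CLTV + FICO + NoteRate +
                       DocFull + IO + factor(Purpose) + factor(cal.bin) + factor(vint.bkt),
                     data = mortgage, ties = "breslow")
fit.prepay  <- coxph(Surv(start, stop, event == "prepay") ~ CLTV + FICO + NoteRate +
                       DocFull + IO + factor(Purpose) + factor(cal.bin) + factor(vint.bkt),
                     data = mortgage, ties = "breslow")
```

Opposite signs of the CLTV, FICO and Purpose coefficients across the two fits would reflect the competing nature of default and prepayment noted in the text; robust variance estimation for such marginal models is discussed in Therneau and Grambsch (2000, §8).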

Some interesting findings in credit card and mortgage risk analysis are presented. 129 . For the structural approach based on inverse Gaussian first-passage-time distribution. . while its implementation is open to future investigation. which can be applied to model both corporate and retail credit. We have also remarked throughout the chapters the possible extensions and other open problems of interest. Developed in this thesis is mainly the dual-time analytics. which may shed some light on understanding the ongoing credit crisis from a new dual-time perspective. To end this thesis. high NoteRate. 3. It is actually reasonable to claim that a homeowner with good credit and low leverage would more likely prepay rather than choose to default. structural parameterization and semiparametric Cox regression. FICO and Purpose reveals the competing nature of default and prepayment risks. 4. the end is also a new beginning . So. . including assessment of the methodology through the simulation study. we have discussed the feasibility of its dual-time extension and the calibration procedure. Computational results are provided for other dual-time methods developed in this chapter.6 Summary This chapter is devoted completely to the survival analysis on Lexis diagram with coverage of nonparametric estimators. Interest-Only. we have demonstrated the application of dual-time survival analysis to the retail credit risk modeling. The covariates CLTV. high FICO. we conclude that statistical methods play a key role in credit risk modeling. limited documentation. including VDA (vintage data analysis) and DtSA (dual-time survival analysis). The prepayment risk is driven up by low CLTV. and purchase Purpose. Finally.2.

The log-likelihood function is given by n Mi = i=1 ∆i log λ(Mi . Press. λ (m. c. η)/∂η and ∂ 2 λ(m. η) ∆i Mi λ (s. η)/∂η. η)ds where λ (m. . Knowing the parametric likelihood given by (4. η) − Ui λ (s. By supplying the gradients. ∆i ). the optimization algorithms in most of standard softwares are faster and more reliable. . see e. i = 1.g. Mi . n each subject to left truncation and right censoring. Alternatively. b)T the structural parameters of inverse Gaussian hazard. η) and take their derivatives based on the explicit expressions below. η)ds Ui Mi λ (Mi .7 Supplementary Materials (A) Calibration of Inverse Gaussian Hazard Function Consider the calibration of the two-parameter inverse Gaussian hazard rate function λ(m. η)/∂η∂η T 130 . The complications lie in the evaluations of ∂λ(m.4.54) Taking the first and second order derivatives. η) (λ (Mi . η) = ∂ 2 λ(t. . et al. (2007). Denote by η = (c. η)ds (4. Provided below are some calculations of the derivatives for the gradient search purpose. b) of the form (4. the MLE requires nonlinear optimization that can be carried out by gradient search (e. η) − Ui λ(s.27). η)/∂η∂η T . η) λ2 (Mi . one may replace the second term of (4.54) by log S(Mi . Note that the evaluation of the integral terms above can be carried out by numerical integration methods like the composite trapezoidal or Simpon’s rule. η) − log S(Ui . and a⊗2 = aaT .g. we have the score and Hessian functions ∂ = ∂η ∂2 ∂η∂η T = i=1 n i=1 n ∆i − λ(Mi . η) = ∂λ(m.26) from the n observations (Ui . Newton-Raphason algorithm). η))⊗2 − λ(Mi . .

η) λ(m. 2π z1 = z2 = evaluate first the derivatives of the survival function (i. η) 1 ∂λ(m. η) ∂S(t. η) 2 ∂ λ(m. η) ∂S(m. Denoting η1 + η2 m √ . η) ∂η2 2 ∂ S(m. η)m 2 ∂η2 ∂η2 − − ∂S(m. η) ∂S(m. η) = −λ0 (m. η) ∂S(m. η) + η2 − − ∂η1 m η1 S(m. η) = − (η1 + η2 m) − λ(m. η) can be evaluated explicitly by η1 1 λ0 (m. η) ∂S(m.e. η) λ(m. m 1 2 φ(z) = √ e−z /2 . η) ∂η1 ∂η2 2 131 . η) = − (η1 + η2 m) − λ(m. η) ∂η1 ∂η2 S(m. η) 1 ∂λ(m.26)): ∂S(m. messy though. η) λ(m. η) ∂η1 S(m. S(m. which are. η) = − + η2 − + − λ(m. provided below. η) ∂η1 ∂η2 S (m. η) λ(m. η) ∂η1 2 λ(m. η) − + 2 2 S(m. η) − − + 2 . the partial derivatives of λ(m. η) 2 ∂η1 ∂ 2 S(m. η) ∂ 2 S(m. η) ∂ 2 S(m. η) ∂λ(m. η) 2 2 ∂η1 ∂η1 m η1 η1 m λ(m. η) ∂η1 ∂λ(m. η) ∂η1 ∂η2 ∂ 2 S(m. η) ∂S(m. η) ∂λ(m. η) − + 2 2 S(m. η) 2 ∂η2 1 = 2η2 e−2η1 η2 Φ(z2 ) + √ φ(z1 ) + e−2η1 η2 φ(z2 ) m √ = 2η1 e−2η1 η2 Φ(z2 ) + m φ(z1 ) − e−2η1 η2 φ(z2 ) 4η2 1 2 = −4η2 e−2η1 η2 Φ(z2 ) − √ e−2η1 η2 φ(z2 ) − φ(z1 )z1 − e−2η1 η2 φ(z2 )z2 m m = (2 − 4η1 η2 )e−2η1 η2 Φ(z2 ) − φ(z1 )z1 − e−2η1 η2 φ(z2 )z2 √ 2 = −4η1 e−2η1 η2 Φ(z2 ) + 4η1 me−2η1 η2 φ(z2 ) − m φ(z1 )z1 − e−2η1 η2 φ(z2 )z2 . the denominator of (4. η) ∂S(m. η) (η1 + η2 m) − ∂η2 S(m. η) ∂ 2 S(m. η) ∂η2 2 ∂ λ(m. η) ∂η1 ∂η1 S (m. Then. η) 1 ∂λ(m. η) ∂η2 S(m. η) ∂λ(m.26). η) η1 1 1 1 ∂ 2 λ(m. η) ∂S(m. η) = −λ(m. η) ∂η1 ∂S(m. η) ∂η2 ∂η2 S (m. m −η1 + η2 m √ . η) ∂η2 ∂λ(m. η) ∂η1 ∂η2 ∂η1 λ(m.based on the inverse Gaussian baseline hazard (4.

(B) Implementation of Cox Modeling with Time-varying Covariates

Consider the Cox regression model with time-varying covariates x(m) ∈ R^p:

    \lambda(m; x(m)) = \lambda_0(m) \exp\{\theta^T x(m)\},    (4.55)

where λ_0(m) is an arbitrary baseline hazard function and θ is the p-vector of regression coefficients. Taking λ_0 as the nuisance parameter, one may estimate θ by the partial likelihood; see Cox (1972, 1975) or Section 4.4. Provided below is the implementation of maximum likelihood estimation of θ and λ_0(m) by recasting the problem as a standard CoxPH model with time-independent covariates.

Let τ_i be the event time for each account i = 1, . . . , n that is subject to left truncation U_i and right censoring C_i. Given the observations

    (U_i, M_i, \Delta_i, x_i(m)), \quad m \in [U_i, M_i], \quad i = 1, \ldots, n,    (4.56)

in which M_i = min{τ_i, C_i} and ∆_i = I(τ_i < C_i), the complete likelihood is given by

    L = \prod_{i=1}^{n} \Big[\frac{f_i(M_i)}{S_i(U_i)}\Big]^{\Delta_i} \Big[\frac{S_i(M_i+)}{S_i(U_i)}\Big]^{1-\Delta_i} = \prod_{i=1}^{n} [\lambda_i(M_i)]^{\Delta_i}\, S_i(M_i+)\, S_i^{-1}(U_i).    (4.57)

Under the dynamic Cox model (4.55), the likelihood function is

    L(\theta, \lambda_0) = \prod_{i=1}^{n} \big[\lambda_0(M_i)\exp\{\theta^T x_i(M_i)\}\big]^{\Delta_i} \exp\Big\{-\int_{U_i}^{M_i} \lambda_0(s)\exp\{\theta^T x_i(s)\}\,ds\Big\}.    (4.58)

The model estimation with time-varying covariates can be implemented by standard software packages that, by default, handle only time-independent covariates. The trick is to construct little time segments such that x_i(m) can be viewed as constant within each segment; see Figure 4.8 for an illustration with the construction of monthly segments.

Figure 4.8: Split time-varying covariates into little time segments.

For each account i, let us partition m ∈ [U_i, M_i] as U_i ≡ U_{i,0} < U_{i,1} < · · · < U_{i,L_i} ≡ M_i such that

    x_i(m) = x_{i,l}, \quad m \in (U_{i,l-1}, U_{i,l}], \quad l = 1, \ldots, L_i.    (4.59)

Thus, we have converted the n-sample survival data (4.56) to the following independent observations, each with constant covariates:

    (U_{i,l-1},\, U_{i,l},\, \Delta_{i,l},\, x_{i,l}), \quad l = 1, \ldots, L_i, \quad i = 1, \ldots, n,    (4.60)

where ∆_{i,l} = ∆_i for l = L_i and 0 otherwise. It is straightforward to show that the likelihood (4.57) can be reformulated as

    L = \prod_{i=1}^{n} [\lambda_i(M_i)]^{\Delta_i} \prod_{l=1}^{L_i} S_i(U_{i,l})\, S_i^{-1}(U_{i,l-1}) = \prod_{i=1}^{n} \prod_{l=1}^{L_i} [\lambda_i(U_{i,l})]^{\Delta_{i,l}}\, S_i(U_{i,l})\, S_i^{-1}(U_{i,l-1}),    (4.61)

which becomes the likelihood function based on the enlarged data (4.60). Therefore, under the assumption (4.59), the parameter estimation by maximizing (4.58) can be equivalently handled by maximizing

    L(\theta, \lambda_0) = \prod_{i=1}^{n} \prod_{l=1}^{L_i} \big[\lambda_0(U_{i,l})\exp\{\theta^T x_{i,l}\}\big]^{\Delta_{i,l}} \exp\Big\{-\exp\{\theta^T x_{i,l}\} \int_{U_{i,l-1}}^{U_{i,l}} \lambda_0(s)\,ds\Big\}.    (4.62)

In R, one may perform the MLE directly by applying the CoxPH procedure (R: survival package) to the transformed data (4.60) with piecewise-constant covariates; see Therneau and Grambsch (2000) for computational details.

Furthermore, one may trick the implementation of MLE by logistic regression, provided that the underlying hazard rate function (4.55) admits the linear approximation up to the logit link

    \mathrm{logit}(\lambda_i(m)) \approx \eta^T \phi(m) + \theta^T x_i(m),    (4.63)

where logit(x) = log(x/(1 − x)) and φ(m) is the vector of pre-specified basis functions (e.g., splines). Given the survival observations (4.60), it is easy to show that the discrete version of the joint likelihood we have formulated in Section 4.2 can be rewritten as

    L = \prod_{i=1}^{n} \lambda_i(M_i)^{\Delta_i} [1-\lambda_i(M_i)]^{1-\Delta_i} \prod_{m \in (U_i, M_i)} [1-\lambda_i(m)] = \prod_{i=1}^{n} \prod_{l=1}^{L_i} \lambda_i(U_{i,l})^{\Delta_{i,l}} [1-\lambda_i(U_{i,l})]^{1-\Delta_{i,l}},    (4.64)

where ∆_{i,l} and x_{i,l} are the same as in (4.60), as constructed by (4.59). This is clearly the likelihood function of the independent binary data (∆_{i,l}, x_{i,l}) for l = 1, . . . , L_i and i = 1, . . . , n. Therefore, under the parameterized logistic model (4.63), one may perform the MLE for the parameters (η, θ) by a standard GLM procedure; see, e.g., Faraway (2006).
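To make the two implementation routes concrete, here is a minimal R sketch (not the thesis code). It assumes the segment-level data (4.60) are already held in a data frame seg with one row per account-segment and illustrative column names start, stop, event (= ∆_{i,l}) and covariates x1, x2 held constant within each segment; helpers such as survSplit() in the survival package can be used to construct such segments from account-level records.

    library(survival)
    library(splines)

    ## Route 1: CoxPH on the counting-process data, maximizing (4.62).
    fit.cox <- coxph(Surv(start, stop, event) ~ x1 + x2, data = seg)
    H0 <- basehaz(fit.cox, centered = FALSE)   # Breslow-type cumulative baseline hazard

    ## Route 2: discrete-time hazard via logistic regression, as in (4.63)-(4.64),
    ## treating each segment as an independent Bernoulli trial and using a
    ## natural-spline basis ns(stop, df = 4) for the baseline term phi(m).
    fit.glm <- glm(event ~ ns(stop, df = 4) + x1 + x2,
                   family = binomial(link = "logit"), data = seg)

When the per-segment hazard is small, the two routes typically give very similar covariate-effect estimates; the logistic route has the practical advantage that the baseline shape is estimated jointly with θ through the basis coefficients η.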

(C) Frailty-type Vintage Effects in Section 4.4.3

The likelihood function. By the independence of Z_1, . . . , Z_K, it is easy to write down the likelihood as

    L = \prod_{\kappa=1}^{K} E_{Z_\kappa}\bigg[ \prod_{j \in G_\kappa} \prod_{i=1}^{n_j} \big[\lambda_{ji}(M_{ji} \mid Z_\kappa)\big]^{\Delta_{ji}} \exp\Big\{-\int_{U_{ji}}^{M_{ji}} \lambda_{ji}(s \mid Z_\kappa)\,ds\Big\} \bigg].    (4.65)

Substituting λ_{ji}(m | Z_κ) = λ_f(m) Z_κ exp{ξ̃ᵀ x̃_{ji}(m)}, we obtain for each bucket G_κ the likelihood contribution

    L_\kappa = \prod_{j \in G_\kappa} \prod_{i=1}^{n_j} \big[\lambda_f(M_{ji}) \exp\{\tilde\xi^T \tilde{x}_{ji}(M_{ji})\}\big]^{\Delta_{ji}} \cdot E_Z\big[ Z^{d_\kappa} \exp\{-Z A_\kappa\} \big],    (4.66)

in which d_κ = \sum_{j \in G_\kappa} \sum_{i=1}^{n_j} \Delta_{ji} and A_κ = \sum_{j \in G_\kappa} \sum_{i=1}^{n_j} \int_{U_{ji}}^{M_{ji}} \lambda_f(s) \exp\{\tilde\xi^T \tilde{x}_{ji}(s)\}\,ds. The expectation term in (4.66) can be derived from the d_κ-th derivative of the Laplace transform of Z, which satisfies E_Z[Z^{d_κ} exp{−Z A_κ}] = (−1)^{d_κ} L^{(d_κ)}(A_κ); see, e.g., Aalen et al. (2008, §7.2). Thus, we obtain the joint likelihood

    L = \prod_{\kappa=1}^{K} \prod_{j \in G_\kappa} \prod_{i=1}^{n_j} \big[\lambda_f(M_{ji}) \exp\{\tilde\xi^T \tilde{x}_{ji}(M_{ji})\}\big]^{\Delta_{ji}} (-1)^{d_\kappa} L^{(d_\kappa)}(A_\kappa),    (4.67)

or the log-likelihood function of the form (4.51). For the gamma frailty Gamma(δ^{−1}, δ), L^{(d)}(x) = (−δ)^d (1 + δx)^{−1/δ−d} \prod_{i=1}^{d}(1/δ + i − 1).

EM algorithm for frailty model estimation.

1. E-step: given the fixed values of (ξ̃, λ_f), estimate the individual frailty levels {Z_κ}_{κ=1}^{K} by the conditional expectation

    \hat{Z}_\kappa = E[Z_\kappa \mid G_\kappa] = -\frac{L^{(d_\kappa+1)}(A_\kappa)}{L^{(d_\kappa)}(A_\kappa)},    (4.68)

which corresponds to the empirical Bayes estimate; see Aalen et al. (2008, §7). For the gamma frailty, the frailty level estimates are given by

    \hat{Z}_\kappa = \frac{1 + \delta d_\kappa}{1 + \delta A_\kappa}, \quad \kappa = 1, \ldots, K.    (4.69)

2. M-step: given the frailty levels {Ẑ_κ}_{κ=1}^{K}, estimate (ξ̃, λ_f) by partial likelihood maximization and the Breslow estimator, with appropriate modification based on the fixed frailty multipliers:

    PL(\tilde\xi) = \prod_{j=1}^{J} \prod_{i=1}^{n_j} \Bigg[ \frac{\exp\{\tilde\xi^T \tilde{x}_{ji}(M_{ji})\}}{\sum_{(j',i') \in R_{ji}} \hat{Z}_{\kappa(j')} \exp\{\tilde\xi^T \tilde{x}_{j'i'}(M_{ji})\}} \Bigg]^{\Delta_{ji}},    (4.70)

    \hat\Lambda_f(m) = \sum_{j=1}^{J} \sum_{i=1}^{n_j} \frac{I(m \ge M_{ji})\,\Delta_{ji}}{\sum_{(j',i') \in R_{ji}} \hat{Z}_{\kappa(j')} \exp\{\tilde\xi^T \tilde{x}_{j'i'}(M_{ji})\}},    (4.71)

where κ(j) indicates the bucket G_κ to which the vintage V_j belongs. In R, the partial likelihood maximization can be implemented by the standard CoxPH procedure with the log-frailty predictors supplied as an offset.

To estimate the frailty parameter δ, one may run the EM algorithm for each δ on a grid to obtain (ξ̃*(δ), λ_f*(·; δ)), and then find the maximizer of the profiled log-likelihood (4.51); details are omitted.
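As a concrete illustration of the E-step/M-step cycle, the following is a minimal R sketch (not the thesis code) for a fixed frailty variance δ. It assumes account-level data in a data frame dat with illustrative columns U, M, event, numeric covariates x1, x2, and a vintage bucket label; the cumulative baseline hazard is approximated from basehaz(), which is only one of several reasonable choices.

    library(survival)

    em.frailty <- function(dat, delta, n.iter = 20) {
      b <- as.integer(factor(dat$bucket))          # vintage bucket index 1..K
      Z <- rep(1, max(b))                          # initial frailty multipliers
      for (it in 1:n.iter) {
        ## M-step (4.70)-(4.71): Cox fit with the fixed log-frailty as an offset
        dat$logZ <- log(Z[b])
        fit <- coxph(Surv(U, M, event) ~ x1 + x2 + offset(logZ), data = dat)
        H0  <- basehaz(fit, centered = FALSE)      # Breslow-type baseline
        cumhaz <- stepfun(H0$time, c(0, H0$hazard))
        Lam <- cumhaz(dat$M) - cumhaz(dat$U)
        A   <- as.numeric(exp(as.matrix(dat[, c("x1", "x2")]) %*% coef(fit))) * Lam
        ## E-step (4.69): empirical Bayes gamma-frailty estimates per bucket
        d.k <- tapply(dat$event, b, sum)
        A.k <- tapply(A, b, sum)
        Z   <- as.numeric((1 + delta * d.k) / (1 + delta * A.k))
      }
      list(coef = coef(fit), frailty = Z)
    }

Profiling over δ then amounts to running em.frailty() on a grid of δ values and comparing the resulting log-likelihood (4.51).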

BIBLIOGRAPHY

[1] Aalen, O. O. and Gjessing, H. K. (2001). Understanding the shape of the hazard rate: a process point of view (with discussion). Statistical Science, 16, 1–22.
[2] Aalen, O. O., Borgan, O. and Gjessing, H. K. (2008). Survival and Event History Analysis: A Process Point of View. New York: Springer.
[3] Abramovich, F. and Steinberg, D. M. (1996). Improved inference in nonparametric regression using Lk-smoothing splines. Journal of Statistical Planning and Inference, 49, 327–341.
[4] Abramowitz, M. and Stegun, I. A. (eds.) (1972). Handbook of Mathematical Functions (10th printing). New York: Dover.
[5] Andersen, P. K., Borgan, O., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. New York: Springer-Verlag.
[6] Andersen, P. K. and Gill, R. D. (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics, 10, 1100–1120.
[7] Arjas, E. (1986). Stanford heart transplantation data revisited: a real time approach. In: Modern Statistical Methods in Chronic Disease Epidemiology (Moolgavkar, S. H. and Prentice, R. L., eds.), pp. 65–81. New York: Wiley.
[8] Berlin, M. and Mester, L. J. (eds.) (2004). Special issue of "Retail credit risk management and measurement". Journal of Banking and Finance, 28, 721–899.
[9] Bielecki, T. R. and Rutkowski, M. (2004). Credit Risk: Modeling, Valuation and Hedging. Berlin: Springer.
[10] Bilias, Y., Gu, M. and Ying, Z. (1997). Towards a general asymptotic theory for Cox model with staggered entry. Annals of Statistics, 25, 662–682.
[11] Black, F. and Cox, J. C. (1976). Valuing corporate securities: Some effects of bond indenture provisions. Journal of Finance, 31, 351–367.
[12] Bluhm, C. and Overbeck, L. (2007). Structured Credit Portfolio Analysis, Baskets and CDOs. Boca Raton: Chapman and Hall.
[13] Breeden, J. L. (2007). Modeling data with multiple time dimensions. Computational Statistics & Data Analysis, 51, 4761–4785.
[14] Breslow, N. E. (1972). Discussion of "Regression models and life-tables" by D. R. Cox. Journal of the Royal Statistical Society: Ser. B, 34, 216–217.
[15] Chen, L., Lesmond, D. A. and Wei, J. (2007). Corporate yield spreads and bond liquidity. Journal of Finance, 62, 119–149.
[16] Chhikara, R. S. and Folks, J. L. (1989). The Inverse Gaussian Distribution: Theory, Methodology, and Applications. New York: Dekker.
[17] Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Ser. B, 34, 187–220.
[18] Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
[19] Cox, J. C., Ingersoll, J. E. and Ross, S. A. (1985). A theory of the term structure of interest rates. Econometrica, 53, 385–407.
[20] Cressie, N. (1993). Statistics for Spatial Data. New York: Wiley.
[21] Das, S. R., Duffie, D., Kapadia, N. and Saita, L. (2007). Common failings: how corporate defaults are correlated. Journal of Finance, 62, 93–117.
[22] Deng, Y., Quigley, J. M. and Van Order, R. (2000). Mortgage terminations, heterogeneity and the exercise of mortgage options. Econometrica, 68, 275–307.
[23] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3), 425–455.
[24] Duchesne, T. and Lawless, J. (2000). Alternative time scales and failure time models. Lifetime Data Analysis, 6, 157–179.
[25] Duffie, D., Eckner, A., Horel, G. and Saita, L. (2008). Frailty correlated default. Journal of Finance, to appear.
[26] Duffie, D., Pan, J. and Singleton, K. (2000). Transform analysis and option pricing for affine jump-diffusions. Econometrica, 68, 1343–1376.
[27] Duffie, D. and Singleton, K. J. (2003). Credit Risk: Pricing, Measurement, and Management. Princeton University Press.
[28] Duffie, D., Saita, L. and Wang, K. (2007). Multi-period corporate default prediction with stochastic covariates. Journal of Financial Economics, 83, 635–665.
[29] Efron, B. (2002). The two-way proportional hazards model. Journal of the Royal Statistical Society: Ser. B, 64, 899–909.
[30] Esper, J., Cook, E. R. and Schweingruber, F. H. (2002). Low-frequency signals in long tree-ring chronologies for reconstructing past temperature variability. Science, 295, 2250–2253.
[31] Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. London: Chapman & Hall.
[32] Fan, J., Huang, T. and Li, R. (2007). Analysis of longitudinal data with semiparametric estimation of covariance function. Journal of the American Statistical Association, 102, 632–641.
[33] Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. New York: Springer.
[34] Fang, K.-T., Li, R. and Sudjianto, A. (2006). Design and Modeling for Computer Experiments. Boca Raton: Chapman and Hall.
[35] Faraway, J. J. (2006). Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. Boca Raton: Chapman and Hall.
[36] Farewell, V. T. and Cox, D. R. (1979). A note on multiple time scales in life testing. Applied Statistics, 28, 73–75.
[37] Frame, S., Lehnert, A. and Prescott, N. (2008). A snapshot of mortgage conditions with an emphasis on subprime mortgage performance. U.S. Federal Reserve, August 2008.
[38] Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19, 1–141.
[39] Fu, W. J. (2008). A smoothing cohort model in age-period-cohort analysis with applications to homicide arrest rates and lung cancer mortality rates. Sociological Methods and Research, 36, 327–361.
[40] Fu, W. J. (2000). Ridge estimator in singular design with applications to age-period-cohort analysis of disease rates. Communications in Statistics – Theory and Method, 29, 263–278.
[41] Gerardi, K., Lehnert, A., Sherlund, S. M. and Willen, P. (2008). Making sense of the subprime crisis. Brookings Papers on Economic Activity, 2008(2), 69–159.
[42] Giesecke, K. (2006). Default and information. Journal of Economic Dynamics and Control, 30, 2281–2303.
[43] Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
[44] Goodman, L. S., Li, S., Lucas, D. J., Zimmerman, T. A. and Fabozzi, F. J. (2008). Subprime Mortgage Credit Derivatives. Hoboken, NJ: Wiley.
[45] Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103, 1119–1130.
[46] Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
[47] Gu, C. (2002). Smoothing Spline ANOVA Models. New York: Springer.
[48] Gu, M. G. and Lai, T. L. (1991). Weak convergence of time-sequential censored rank statistics with applications to sequential testing in clinical trials. Annals of Statistics, 19, 1403–1433.
[49] Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models. New York: Chapman and Hall.
[50] Hougaard, P. (2000). Analysis of Multivariate Survival Data. New York: Springer.
[51] Jarrow, R. A. and Turnbull, S. M. (1995). Pricing derivatives on financial securities subject to credit risk. Journal of Finance, 50, 53–85.
[52] Keiding, N. (1990). Statistical inference in the Lexis diagram. Philosophical Transactions of the Royal Society of London A, 332, 487–509.
[53] Kimeldorf, G. and Wahba, G. (1970). A correspondence between Bayesian estimation of stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 495–502.
[54] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.
[55] Kupper, L. L., Janis, J. M., Karmous, A. and Greenberg, B. G. (1985). Statistical age-period-cohort analysis: a review and critique. Journal of Chronic Disease, 38, 811–830.
[56] Lando, D. (1998). On Cox processes and credit-risky securities. Review of Derivatives Research, 2, 99–120.
[57] Lando, D. (2004). Credit Risk Modeling: Theory and Applications. Princeton University Press.
[58] Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data (2nd ed.). Hoboken, NJ: Wiley.
[59] Lee, M.-L. T. and Whitmore, G. A. (2006). Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Statistical Science, 21, 501–513.
[60] Lexis, W. (1875). Einleitung in die Theorie der Bevölkerungsstatistik. Strassburg: Trübner. Pages 5–7 translated to English by N. Keyfitz and printed in Mathematical Demography (Smith, D. and Keyfitz, N., eds., 1977).
[61] Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines. Journal of the American Statistical Association, 92, 107–116.
[62] Madan, D. and Unal, H. (1998). Pricing the risks of default. Review of Derivatives Research, 2, 121–160.
[63] Mason, K. O., Mason, W. M., Winsborough, H. H. and Poole, W. K. (1973). Some methodological issues in cohort analysis of archival data. American Sociological Review, 38, 242–258.
[64] Mason, W. M. and Wolfinger, N. H. (2002). Cohort analysis. In: International Encyclopedia of the Social and Behavioral Sciences, pp. 2189–2194. New York: Elsevier.
[65] Marshall, A. W. and Olkin, I. (2007). Life Distributions: Structure of Nonparametric, Semiparametric, and Parametric Families. New York: Springer.
[66] Mayer, C., Pence, K. and Sherlund, S. M. (2009). The rise in mortgage defaults. Journal of Economic Perspectives, 23, 27–50.
[67] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman and Hall.
[68] Merton, R. C. (1974). On the pricing of corporate debt: The risk structure of interest rates. Journal of Finance, 29, 449–470.
[69] Moody's Special Comment: Corporate Default and Recovery Rates, 1920–2008. Data release: February 2009.
[70] Müller, H. G. and Stadtmüller, U. (1987). Variable bandwidth kernel estimators of regression curves. Annals of Statistics, 15, 182–201.
[71] Nelsen, R. B. (2006). An Introduction to Copulas (2nd ed.). New York: Springer.
[72] Nielsen, G. G., Gill, R. D., Andersen, P. K. and Sørensen, T. I. A. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics, 19, 25–43.
[73] Nychka, D. (1988). Bayesian confidence intervals for a smoothing spline. Journal of the American Statistical Association, 83, 1134–1143.
[74] Oakes, D. (1995). Multiple time scales in survival analysis. Lifetime Data Analysis, 1, 7–18.
[75] Pintore, A., Speckman, P. and Holmes, C. C. (2006). Spatially adaptive smoothing splines. Biometrika, 93, 113–125.
[76] Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press.
[77] Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of kernel functions. Annals of Statistics, 11, 453–466.
[78] Ramsay, J. O. and Li, X. (1998). Curve registration. Journal of the Royal Statistical Society: Ser. B, 60, 351–363.
[79] Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.
[80] Rice, J. and Rosenblatt, M. (1983). Smoothing splines: regression, derivatives and deconvolution. Annals of Statistics, 11, 141–156.
[81] Ruppert, D. and Carroll, R. J. (2000). Spatially-adaptive penalties for spline fitting. Australian & New Zealand Journal of Statistics, 42(2), 205–223.
[82] Schrödinger, E. (1915). Zur Theorie der Fall- und Steigversuche an Teilchen mit Brownscher Bewegung. Physikalische Zeitschrift, 16, 289–295.
[83] Sellke, T. and Siegmund, D. (1983). Sequential analysis of the proportional hazards model. Biometrika, 70, 315–326.
[84] Shepp, L. A. (1966). Radon–Nikodym derivatives of Gaussian measures. Annals of Mathematical Statistics, 37, 321–354.
[85] Sherlund, S. M. (2008). The past, present, and future of subprime mortgages. Finance and Economics Discussion Series 2008-63, Federal Reserve Board.
[86] Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric curve fitting (with discussion). Journal of the Royal Statistical Society: Ser. B, 47, 1–52.
[87] Slud, E. (1984). Sequential linear rank tests for two-sample censored survival data. Annals of Statistics, 12, 551–571.
[88] Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. New York: Springer.
[89] Sudjianto, A., et al. GAMEED documentation. Technical report, Enterprise Quantitative Risk Management, Bank of America.
[90] Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer-Verlag.
[91] Tweedie, M. C. K. (1957). Statistical properties of inverse Gaussian distributions, I and II. Annals of Mathematical Statistics, 28, 362–377 and 696–705.
[92] Vaupel, J. W., Manton, K. G. and Stallard, E. (1979). The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography, 16, 439–454.
[93] Vasicek, O. (1977). An equilibrium characterization of the term structure. Journal of Financial Economics, 5, 177–188.
[94] Vidakovic, B. (1999). Statistical Modeling by Wavelets. New York: Wiley.
[95] Wahba, G. (1983). Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society: Ser. B, 45, 133–150.
[96] Wahba, G. (1990). Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59. Philadelphia: SIAM.
[97] Wahba, G. (1995). Discussion of "Wavelet shrinkage: asymptopia?" by Donoho et al. Journal of the Royal Statistical Society: Ser. B, 57, 360–361.
[98] Wallace, N. (ed.) (2005). Special issue of "Innovations in mortgage modeling". Real Estate Economics, 33, 587–782.
[99] Wand, M. P. and Jones, M. C. (1995). Kernel Smoothing. London: Chapman & Hall.
[100] Wang, J.-L. (2005). Smoothing hazard rate. In: Encyclopedia of Biostatistics (2nd ed., Armitage, P. and Colton, T., eds.), Volume 7, pp. 4986–4997. Wiley.
[101] Wikipedia, The Free Encyclopedia. URL: http://en.wikipedia.org
