
Biometrics 61, 92–105

March 2005

Survival Model Predictive Accuracy and ROC Curves

Patrick J. Heagerty
Department of Biostatistics, University of Washington, P.O. Box 357232, Seattle,
Washington 98195-7232, U.S.A.

Yingye Zheng
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, MP 702, P.O. Box 19024,
Seattle, Washington 98109-1024, U.S.A.

Summary. The predictive accuracy of a survival model can be summarized using extensions of the
proportion of variation explained by the model, or $R^2$, commonly used for continuous response models, or
using extensions of sensitivity and specificity, which are commonly used for binary response models. In this
article we propose new time-dependent accuracy summaries based on time-specific versions of sensitivity
and specificity calculated over risk sets. We connect the accuracy summaries to a previously proposed global
concordance measure, which is a variant of Kendall's tau. In addition, we show how standard Cox regression
output can be used to obtain estimates of time-dependent sensitivity and specificity, and time-dependent
receiver operating characteristic (ROC) curves. Semiparametric estimation methods appropriate for both
proportional and nonproportional hazards data are introduced, evaluated in simulations, and illustrated
using two familiar survival data sets.
Key words: Cox regression; Discrimination; Prediction; Sensitivity; Specificity.

1. Introduction

In this article we propose a new method for characterizing the predictive accuracy of a regression model when the outcome of interest is a censored survival time. We focus on data obtained from a prospective study in which a continuous follow-up time is observed for each participant, but where follow-up can be terminated either by the occurrence of the event of interest or by censoring. Thus the essential outcome information is the combination of the status at the end of follow-up (binary) and the length of follow-up (continuous). Because censored data share features of both continuous response data and binary data, the accuracy concepts that are standard for either response type may be extended for survival outcomes. Previous research has focused on extending the proportion of variation explained by the covariates, or $R^2$, to censored data models (Schemper and Henderson, 2000; O'Quigley and Xu, 2001). In addition, limited work has explored the use of familiar binary outcome methods such as receiver operating characteristic (ROC) curves for application in the longitudinal setting (Etzioni et al., 1999; Heagerty, Lumley, and Pepe, 2000; Slate and Turnbull, 2000). Time-dependent ROC curves offer an alternative to the use of $R^2$ extensions for survival data. However, the goal of an ROC analysis is to characterize the prognostic potential of a marker (or model) by focusing on the correct classification rates. Methods that summarize the proportion of variation explained by covariates require a different estimation approach, and have a different ultimate objective. The goals of this article are to introduce new time-dependent sensitivity, specificity, and ROC concepts appropriate for survival regression models; to demonstrate the connection between time-dependent ROC methods and classical concordance summaries such as Kendall's tau or the c index (Harrell, Lee, and Mark, 1996); and to show how standard Cox regression estimation methods directly provide the ingredients needed to calculate the proposed time-dependent accuracy summaries.

1.1 Notation

Let $T_i$ be the survival time for subject $i$, and assume that we only observe the minimum of $T_i$ and $C_i$, where $C_i$ represents an independent censoring time. Define the follow-up time $X_i = \min(T_i, C_i)$, and let $\Delta_i = 1(T_i \le C_i)$ denote the censoring indicator. The survival time $T_i$ can also be represented through the counting process, $N_i^*(t) = 1(T_i \le t)$, or the corresponding increment, $dN_i^*(t) = N_i^*(t) - N_i^*(t-)$. Note that we focus on the counting process $N_i^*(t)$, which is defined solely in terms of the survival time $T_i$, rather than the more common notation $N_i(t) = 1(X_i \le t, \Delta_i = 1)$, which depends on the censoring time (Fleming and Harrington, 1991). Let $R_i(t) = 1(X_i \ge t)$ denote the at-risk indicator. We also assume that for each subject we have a collection of time-invariant covariates, $Z_i = (Z_{i1}, Z_{i2}, \ldots, Z_{ip})$.
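As a minimal sketch of the notation in Section 1.1, the following Python fragment builds the follow-up times, censoring indicators, at-risk indicators, and counting process from survival and censoring times. The function names and the three-subject data set are hypothetical and serve only as an illustration; in practice the true $T_i$ is not observed for censored subjects.

```python
def observe(survival_times, censoring_times):
    """Return follow-up times X_i = min(T_i, C_i) and indicators Delta_i = 1(T_i <= C_i)."""
    pairs = list(zip(survival_times, censoring_times))
    follow_up = [min(t, c) for t, c in pairs]
    delta = [1 if t <= c else 0 for t, c in pairs]
    return follow_up, delta


def at_risk(follow_up, t):
    """At-risk indicators R_i(t) = 1(X_i >= t)."""
    return [1 if x >= t else 0 for x in follow_up]


def counting_process(survival_times, t):
    """N*_i(t) = 1(T_i <= t), defined from the survival time only (not the censoring time)."""
    return [1 if s <= t else 0 for s in survival_times]


# Hypothetical survival and censoring times for three subjects.
T = [2.0, 5.0, 3.0]
C = [4.0, 1.0, 3.0]
X, delta = observe(T, C)
```

Here the second subject is censored at time 1, so that subject contributes to the risk set only for $t \le 1$.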


We focus here on using Cox model methods to both generate a model score and to evaluate the prognostic potential of the model score. However, the evaluation methods that we propose can be used to summarize the accuracy of a prognostic score generated through any alternative regression or predictive method, and in this case varying coefficient methods (Hastie and Tibshirani, 1993) such as locally weighted partial likelihood estimation (Cai and Sun, 2003) provide a convenient approach for estimating key accuracy summaries. Therefore, we briefly introduce the relevant aspects of partial likelihood estimation. Under the proportional hazards assumption, $\lambda(t \mid Z_i) = \lambda_0(t) \exp(Z_i^T \beta)$, where $\lambda(t \mid Z_i) = \lim_{\delta \to 0} \delta^{-1} P[T_i \in [t, t + \delta) \mid Z_i, T_i \ge t]$. The partial likelihood score equations can be written as

$$0 = \sum_i \Delta_i \Big[ Z_i - \sum_k \pi_k(\beta, X_i) Z_k \Big],$$

where $\pi_k(\beta, t) = R_k(t) \exp(Z_k^T \beta) / W(t)$, with $W(t) = \sum_j R_j(t) \exp(Z_j^T \beta)$. Solving these equations yields the consistent and asymptotically normal maximum partial likelihood estimator (MPLE) (Cox, 1972).

1.2 Proportion of Variance Approaches

Two main approaches exist for characterizing the proportion of variation explained by a survival model. Schemper and Henderson (2000) overview an approach where the survival time is characterized by a counting process representation, $N_i^*(t) = 1(T_i \le t)$, and time-integrated variances are used to form the summary measure. Alternatively, O'Quigley and Xu (2001) consider the proportion of variation in the covariate, $Z_i$, that is explained by the survival time $T_i$.

Schemper and Henderson (2000) build on earlier work that extends $R^2$ to Cox regression. Their approach focuses on using the counting process, $N_i^*(t)$, and marginal and conditional expectations given by the survival functions $S(t) = E[1 - N_i^*(t)]$ and $S(t \mid Z_i) = E[1 - N_i^*(t) \mid Z_i]$, respectively. Because the vital status indicator $N_i^*(t)$ is a binary variable, Schemper and Henderson (2000) propose using the marginal variance $S(t)[1 - S(t)]$ and the conditional variance $S(t \mid Z_i)[1 - S(t \mid Z_i)]$ to characterize the proportion of variation explained by the covariates $Z_i$. In particular, a finite time range $(0, \tau)$ is considered and time-average variances are formed:

$$D(\tau) = \int_0^\tau S(t)[1 - S(t)] f(t)\, dt \Big/ \int_0^\tau f(t)\, dt$$

$$D_Z(\tau) = \int_0^\tau E_Z\{S(t \mid Z)[1 - S(t \mid Z)]\} f(t)\, dt \Big/ \int_0^\tau f(t)\, dt,$$

where $f(t)$ is the marginal density of $T_i$. Our representation above differs by a factor of 2 from the proposal of Schemper and Henderson (2000) as they also consider the mean absolute deviation, $E[|N_i^*(t) - S(t)|] = 2S(t)[1 - S(t)]$. Finally, the summary $V(\tau) = [D(\tau) - D_Z(\tau)]/D(\tau)$ is proposed as the proportion of variation explained by covariates. Similarly, our approach views survival data through the counting process representation, $N_i^*(t)$, but because $N_i^*(t)$ is a binary outcome we explore the extension of standard binary response accuracy summaries such as ROC curves rather than considering an extension of $R^2$.

O'Quigley and Xu (2001) also develop $R^2$ summaries for Cox regression. In their approach the roles of survival time and covariate are reversed, and the proportion of variation in the covariate that is explained by survival is proposed. The authors exploit partial likelihood estimation methods because the methods provide model-based estimates of the distribution of covariates conditional on survival time. Focusing on a scalar covariate, Xu and O'Quigley (2000) show that $\pi_i(\beta, t) = R_i(t) \exp(Z_i \beta)/W(t)$ can be used to estimate the distribution of the covariate, $Z_i$, conditional on the event occurring at time $t$, $P(Z_i \le z \mid T_i = t) = \sum_j \pi_j(\beta, t)\, 1(Z_j \le z)$. O'Quigley and Xu (2001) obtain estimates of the conditional variance $\mathrm{var}(Z_i \mid T_i = t)$ and propose a global summary by integrating estimates of the marginal and conditional variance over the survival distribution. Our approach is similar in that we also use $\pi_i(\beta, t)$ to estimate conditional distributions, but rather than computing variances we estimate time-dependent versions of sensitivity and specificity defined in the following section.

1.3 Overview

In Section 2 we briefly review ROC methods proposed for summarizing the accuracy of a prognostic marker or model when the outcome of interest is a survival time. We then develop new definitions of time-dependent sensitivity and specificity that are strongly connected to partial likelihood concepts. Time-dependent accuracy measures can be used to calculate time-specific ROC curves, and time-specific area under the curve (AUC) summaries. We show that a global concordance measure is the integral, or weighted average, of time-specific AUC measures. In Section 3 we discuss the estimation of time-dependent ROC and AUC summaries and provide a method that is applicable to a proportional hazards model, and a more general method that can be used to characterize any scalar prognostic score even if proportional hazards do not obtain. Finally, in Section 4 we analyze two well-known data sets. We conclude the article with a brief discussion.

2. Censored Survival and Predictive Accuracy

2.1 Background on ROC Curve Analysis

When outcomes $Y_i$ are binary the accuracy of a prediction or classification rule is typically summarized through correct classification rates defined as sensitivity, $P(p_i > c \mid Y_i = 1)$, and specificity, $P(p_i \le c \mid Y_i = 0)$, where $p_i$ is a prediction, and $c$ is a criterion for classifying the prediction as positive ($p_i > c$) or negative ($p_i \le c$). When no a priori value of $c$ is indicated the full spectrum of sensitivities and specificities can be characterized using an ROC curve that plots the true-positive rate (sensitivity) versus the false-positive rate (1 − specificity) for all $c \in (-\infty, +\infty)$.

An ROC curve provides complete information on the set of all possible combinations of true-positive and false-positive rates, but is also more generally useful as a graphical characterization of the magnitude of separation between the case and control marker distributions. If case measurements and control measurements have no overlap then the ROC curve takes the value 1 (perfect true-positive rate) for any false-positive rate greater than 0. In this situation the marker is perfect at discriminating between cases and controls.
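The binary-outcome sensitivity, specificity, and ROC curve described above can be sketched empirically as follows. This is an illustration only: the function names and the four-subject data set are hypothetical, and the thresholds are simply taken at the observed prediction values. The example uses non-overlapping case and control values, the perfect-discrimination situation just described.

```python
def sensitivity(scores, labels, c):
    """Empirical P(p_i > c | Y_i = 1)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    return sum(s > c for s in cases) / len(cases)


def specificity(scores, labels, c):
    """Empirical P(p_i <= c | Y_i = 0)."""
    controls = [s for s, y in zip(scores, labels) if y == 0]
    return sum(s <= c for s in controls) / len(controls)


def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) pairs over all observed thresholds."""
    thresholds = [float("-inf")] + sorted(set(scores))
    return [
        (1 - specificity(scores, labels, c), sensitivity(scores, labels, c))
        for c in thresholds
    ]


# Hypothetical predictions: cases (label 1) and controls (label 0) do not overlap.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
points = roc_points(scores, labels)
```

With these data every point with a positive false-positive rate has a true-positive rate of 1, so the empirical curve passes through the corner (0, 1).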

Alternatively, if the case and control distributions are identical then the ROC curve lies on the 45° line, indicating that the marker is useless for separating cases from controls.

The area under the ROC curve, or AUC, is known to represent a measure of concordance between the marker and the disease status indicator (Hanley and McNeil, 1982). Specifically, the AUC measures the probability that the marker value for a randomly selected case exceeds the marker value for a randomly selected control and is directly related to the Mann–Whitney U statistic (Hanley and McNeil, 1982; Pepe, 2003). Finally, ROC curves are particularly useful for comparing the discriminatory capacity of different potential biomarkers. For example, if for each value of specificity one marker always has a higher sensitivity, then this marker will be a uniformly better diagnostic measurement. See Zhou, McClish, and Obuchowski (2002) or Pepe (2003) for more discussion of ROC analysis.

In this section we first review previous proposals for generalizing the concepts of sensitivity and specificity for application to survival endpoints. Definitions of sensitivity and specificity are given in terms of the actual survival time $T_i$. Censoring needs to be addressed for valid estimation. We then show that a certain choice of time-dependent true-positive and false-positive definitions leads to time-dependent ROC curves and time-dependent AUC summaries that are directly related to a previously proposed concordance summary for survival data.

2.2 Extensions of Sensitivity and Specificity

For survival data there are several potential extensions of cross-sectional sensitivity and specificity. Rather than a simple binary outcome, $Y_i = 1$, a survival time can be viewed as a time-varying binary outcome by focusing on the counting process representation $N_i^*(t) = 1(T_i \le t)$. Accuracy extensions are classified according to whether the cases used to define time-dependent sensitivity are incident cases, where $T_i = t$, or equivalently $dN_i^*(t) = 1$, is used to define cases for time $t$, or cumulative cases, where $T_i \le t$ or $N_i^*(t) = 1$ is used. We also consider whether controls are static, defined as subjects with $T_i > t^*$ for a fixed value of $t^*$, or whether controls are dynamic and defined for time $t$ as those subjects with $T_i > t$. We use the superscripts $C$ and $I$ to denote different definitions of sensitivity, and use the superscripts $\bar{D}$ and $D$ to denote different definitions of specificity. In this section we focus on a scalar marker value $M_i$ that is used as a predictor of death. When our interest is in the accuracy of a regression model we will use $M_i = Z_i^T \beta$.

2.2.1 Cumulative/dynamic. For a baseline marker value, $M_i$, Heagerty et al. (2000) propose versions of time-dependent sensitivity and specificity using the definitions

$$\text{sensitivity}^{C}(c, t): \; P(M_i > c \mid T_i \le t) = P\{M_i > c \mid N_i^*(t) = 1\}$$

$$\text{specificity}^{D}(c, t): \; P(M_i \le c \mid T_i > t) = P\{M_i \le c \mid N_i^*(t) = 0\}.$$

Using this approach, at any fixed time $t$ the entire population is classified as either a case or a control on the basis of vital status at time $t$. Also, each individual plays the role of a control for times $t < T_i$, but then contributes as a case for later times, $t \ge T_i$. Cumulative/dynamic accuracy summaries are most appropriate when a specific time $t^*$ (or a small collection of times $t_1^*, t_2^*, \ldots, t_m^*$) is important and scientific interest lies in discriminating between subjects who die prior to a given time $t^*$ and those who survive beyond $t^*$. ROC curves are defined as $\mathrm{ROC}_t(p) = \mathrm{TP}_t^{C}\{[\mathrm{FP}_t^{D}]^{-1}(p)\}$, where $\mathrm{TP}_t^{C}(c) = P(M_i > c \mid N_i^*(t) = 1)$, $\mathrm{FP}_t^{D}(c) = P(M_i > c \mid N_i^*(t) = 0)$, and $[\mathrm{FP}_t^{D}]^{-1}(p) = \inf_c \{c : \mathrm{FP}_t^{D}(c) \le p\}$. In the absence of censoring $\mathrm{ROC}_t(p)$ can be estimated using the empirical distribution of the marker separately among cases and controls. With censored survival times Heagerty et al. (2000) develop a nonparametric estimator based on the nearest-neighbor bivariate distribution estimator of Akritas (1994). A substantive application that demonstrates use of cumulative/dynamic ROC curves for a Cox regression model can be found in Fan et al. (2002).

2.2.2 Incident/static. Etzioni et al. (1999) and Slate and Turnbull (2000) adopt an alternative definition of time-dependent sensitivity and specificity using

$$\text{sensitivity}^{I}(c, t): \; P(M_i > c \mid T_i = t) = P\{M_i > c \mid dN_i^*(t) = 1\}$$

$$\text{specificity}^{\bar{D}}(c, t^*): \; P(M_i \le c \mid T_i > t^*) = P\{M_i \le c \mid N_i^*(t^*) = 0\},$$

where $dN_i^*(t) = N_i^*(t) - N_i^*(t-)$. Using this definition, each subject does not change disease status and is treated as either a case or a control. Cases are stratified according to the time at which the event occurs (incident) and controls are defined as those subjects who are event free through a fixed follow-up period, $(0, t^*)$ (static). These definitions facilitate the use of standard regression approaches for characterizing sensitivity and specificity because the event time, $T_i$, can simply be used as a covariate. To estimate the quantiles of the conditional distribution of the marker, $M_i$, given the event time, $T_i = t$, Etzioni et al. (1999) and Slate and Turnbull (2000) consider parametric methods that assume a normal distribution, but which allow the mean and variance to be functions of the measurement time, disease status, and the event time for the cases. Cai et al. (2003) propose methods for estimating time-dependent sensitivity and specificity when the event time is censored. Recently, Zheng and Heagerty (2004) have proposed regression quantile methods, which relax the parametric distributional assumptions of previous approaches.

2.2.3 Incident/dynamic. In this article we focus on the following definitions of sensitivity and specificity:

$$\text{sensitivity}^{I}(c, t): \; P(M_i > c \mid T_i = t) = P\{M_i > c \mid dN_i^*(t) = 1\}$$

$$\text{specificity}^{D}(c, t): \; P(M_i \le c \mid T_i > t) = P\{M_i \le c \mid N_i^*(t) = 0\}.$$

Using this approach a subject can play the role of a control for an early time, $t < T_i$, but then play the role of a case when $t = T_i$. This dynamic status parallels the multiple contributions that a subject can make to the partial likelihood function. Here sensitivity measures the expected fraction of subjects with a marker greater than $c$ among the subpopulation of individuals who die at time $t$, while specificity measures the fraction of subjects with a marker less than or equal to $c$ among those who survive beyond time $t$. Incident sensitivity and dynamic specificity are defined by dichotomizing the risk set at time $t$ into those observed to die (cases) and those observed to survive (controls). In Section 3 we discuss how the observed marker data among risk sets can be used to estimate time-dependent accuracy concepts.
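The three case/control definitions of Section 2.2 differ only in how subjects are split at time $t$. The following sketch makes the split explicit for uncensored event times; the function name, the string labels, and the five event times are hypothetical choices made for illustration, and censoring is deliberately ignored here.

```python
def cases_and_controls(event_times, t, definition, t_star=None):
    """Indices of cases and controls at time t for uncensored event times T_i.

    definition: "C/D" (cumulative/dynamic), "I/S" (incident/static),
    or "I/D" (incident/dynamic). t_star is the fixed time used for static controls.
    """
    idx = range(len(event_times))
    if definition == "C/D":
        cases = [i for i in idx if event_times[i] <= t]
        controls = [i for i in idx if event_times[i] > t]
    elif definition == "I/S":
        if t_star is None:
            raise ValueError("static controls need a fixed time t_star")
        cases = [i for i in idx if event_times[i] == t]
        controls = [i for i in idx if event_times[i] > t_star]
    elif definition == "I/D":
        cases = [i for i in idx if event_times[i] == t]
        controls = [i for i in idx if event_times[i] > t]
    else:
        raise ValueError("unknown definition: " + definition)
    return cases, controls


# Hypothetical event times for five subjects.
T = [1.0, 2.0, 3.0, 4.0, 5.0]
cd = cases_and_controls(T, 3.0, "C/D")
i_s = cases_and_controls(T, 3.0, "I/S", t_star=4.0)
i_d = cases_and_controls(T, 3.0, "I/D")
```

Under the incident/dynamic definition the subject with $T_i = 3$ is the only case at $t = 3$, while the same subject would have been a control at any earlier time.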

Incident sensitivity and dynamic specificity have some appealing characteristics relative to the alternative definitions. First, incident sensitivity and dynamic specificity are based on classification of the risk set at time $t$ into case(s) and controls, and are, therefore, a natural companion to hazard models. Second, the definitions easily allow extension to time-dependent covariates using $P[M_i(t) > c \mid T_i = t]$ to define incident sensitivity and $P[M_i(t) \le c \mid T_i > t]$ to define dynamic specificity with a longitudinal marker $M_i(t)$. Use of cumulative sensitivity does not permit a time-varying marker. Finally, use of incident sensitivity and dynamic specificity allows both time-specific accuracy summaries and, as shown in Section 2.4, time-averaged summaries that directly relate to a familiar global concordance measure. In contrast, methods have not been proposed for meaningfully averaging the time-specific incident/static or cumulative/dynamic accuracy summaries.

2.3 Time-Dependent ROC Curves

After selecting definitions for time-dependent sensitivity and specificity, ROC curves can be computed and interpreted. In this article we focus on incident/dynamic (I/D) ROC curves defined as the function $\mathrm{ROC}_t^{I/D}(p)$, where $p$ denotes the dynamic false-positive rate, and $\mathrm{ROC}_t^{I/D}(p)$ denotes the corresponding incident true-positive rate. Specifically, let $c^p$ be defined as the threshold that yields a false-positive rate of $p$: $P(M_i > c^p \mid T_i > t) = 1 - \text{specificity}^{D}(c^p, t) = p$. The true-positive rate, $\mathrm{ROC}_t^{I/D}(p)$, is the sensitivity that is obtained using this threshold, or $\mathrm{ROC}_t^{I/D}(p) = \text{sensitivity}^{I}(c^p, t) = P(M_i > c^p \mid T_i = t)$. Using the true- and false-positive rate functions $\mathrm{TP}_t^{I}(c) = \text{sensitivity}^{I}(c, t)$ and $\mathrm{FP}_t^{D}(c) = 1 - \text{specificity}^{D}(c, t)$ allows the ROC curve to be written as the composition of $\mathrm{TP}_t^{I}(c)$ and the inverse function $[\mathrm{FP}_t^{D}]^{-1}(p) = c^p$:

$$\mathrm{ROC}_t^{I/D}(p) = \mathrm{TP}_t^{I}\{[\mathrm{FP}_t^{D}]^{-1}(p)\}$$

for $p \in [0, 1]$. We use the notation $\mathrm{AUC}(t) = \int_0^1 \mathrm{ROC}_t^{I/D}(p)\, dp$ to denote the area under the I/D ROC curve for time $t$.

2.4 Time-Dependent AUC and Concordance

In the previous subsection we discussed how ROC methods can be used to characterize the ability of a marker to distinguish cases at time $t$ from controls at time $t$. However, in many applications no a priori time $t$ is identified, and a global accuracy summary is desired. In this subsection we show how time-dependent ROC curves are related to a standard concordance summary. The global summary we adopt is

$$C = P[M_j > M_k \mid T_j < T_k],$$

which indicates the probability that the subject who died at the earlier time has a larger value of the marker. This is not the usual form (i.e., $P[M_j > M_k \mid T_j > T_k]$), but reflects the conventions for ROC analysis.

In order to understand the relationship between this discrimination summary and ROC curves we assume independence of observations $(M_j, T_j)$ and $(M_k, T_k)$, and assume that $T_j$ is continuous such that $P(T_k = T_j) = 0$. We use $P(x)$ to denote probability or density depending on the context. These assumptions imply that the concordance summary $C$ is a weighted average of the area under time-specific ROC curves,

$$P[M_j > M_k \mid T_j < T_k] = \int_t 2\, P[\{M_j > M_k\} \mid \{T_j = t\} \cap \{t < T_k\}] \times P[\{T_j = t\} \cap \{t < T_k\}]\, dt$$

$$= \int_t \mathrm{AUC}(t)\, w(t)\, dt = E_T[\mathrm{AUC}(T) \times 2 \times S(T)]$$

with $w(t) = 2 \cdot f(t) \cdot S(t)$. The factor of 2 arises because $P(T_j < T_k) = 1/2$ for independent continuous event times.

In this notation $\mathrm{AUC}(t)$ is based on the I/D definition of sensitivity and specificity, $\mathrm{AUC}(t) = P(M_j > M_k \mid T_j = t, T_k > t)$. See the Appendix for a derivation.

In practice we would typically restrict attention to a fixed follow-up period $(0, \tau)$. The concordance summary can be modified to account for finite follow-up:

$$C^\tau = \int_0^\tau \mathrm{AUC}(t)\, w^\tau(t)\, dt,$$

where $w^\tau(t) = 2 \cdot f(t) \cdot S(t)/W^\tau$, $W^\tau = \int_0^\tau 2 \cdot f(t) \cdot S(t)\, dt = 1 - S^2(\tau)$. The restricted concordance summary remains a weighted average of the time-specific AUCs with the weights rescaled such that they integrate to 1.0 over the range $(0, \tau)$. The interpretation of $C^\tau$ is a slight modification of the original concordance, where $C^\tau = P[M_j > M_k \mid T_j < T_k, T_j < \tau]$. Thus $C^\tau$ is the probability that the predictions for a random pair of subjects are concordant with their outcomes, given that the smaller event time occurs in $(0, \tau)$.

The concordance summary $C$ is directly related to Kendall's tau. Specifically, $C = K/2 + 1/2$, where $K$ denotes Kendall's tau (see Agresti, 2002, p. 60 for definition). Korn and Simon (1990) and Harrell et al. (1996) discuss the use of Kendall's tau ($K$ or $\tau_a$) with survival data and propose modifications to account for censored observations.

2.5 Example: Gaussian Marker and Log-Normal Disease Time

To illustrate time-dependent accuracy concepts we consider a simple example where the marker $M_i$ and the log of survival time $\log(T_i)$ follow a bivariate normal distribution. By convention we consider a higher marker value as indicative of earlier disease onset and, therefore, explore bivariate distributions with a negative correlation between the marker and log(time).

If $[M_i, \log(T_i)]$ has a bivariate normal distribution with mean $(0, 0)$ and unit standard deviations then time-dependent incident sensitivity and dynamic 1 − specificity are

$$P\{M_i > c \mid dN_i^*(t) = 1\} = \mathrm{TP}_t^{I}(c) = \Phi\left[\frac{\rho \log(t) - c}{\sqrt{1 - \rho^2}}\right]$$

$$P\{M_i > c \mid N_i^*(t) = 0\} = \mathrm{FP}_t^{D}(c) = \frac{S_{2N}[c, \log(t); \rho]}{\Phi[-\log(t)]},$$

where $\Phi(x) = P(X < x)$ for $X \sim N(0, 1)$ and $S_{2N}[x, y; \rho] = P(X > x, Y > y)$ for $(X, Y)$ bivariate mean 0 unit normal with correlation $\rho$.

Figure 1a shows I/D ROC curves for $\rho = -0.8$. The solid line corresponds to $t = \exp(-2)$ and has an AUC of 0.923
Figure 1. Incident/dynamic ROC and AUC plots for a bivariate (log) normal distribution. (a) I/D ROC curves for log-normal: incident/dynamic ROC curves for a scalar marker and a disease time where $\{M_i, \log(T_i)\}$ is bivariate normal with $\rho = -0.8$, shown for $\log(t) = -2, -1, 0, 1, 2$. (b) AUC(t) curves for log-normal: plots of AUC(t) for a scalar marker and a disease time where $\{M_i, \log(T_i)\}$ is bivariate normal with $\rho$ taking the values $(-0.9, -0.8, -0.7, -0.6)$.
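The closed-form incident sensitivity of the bivariate normal example in Section 2.5 can be evaluated directly, and AUC(t) can be approximated by integrating the incident sensitivity against the marker density among controls. The sketch below uses only the Python standard library. The function names, the integration range, and the trapezoid rule are choices made here for illustration, not the article's computation, so the results should only be expected to be close to the values quoted in the text.

```python
import math


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def tp_incident(c, log_t, rho):
    """TP_t^I(c) = Phi[(rho * log(t) - c) / sqrt(1 - rho^2)]."""
    return Phi((rho * log_t - c) / math.sqrt(1.0 - rho * rho))


def control_density(c, log_t, rho):
    """Density of the marker among controls, i.e. given log(T) > log(t)."""
    s = math.sqrt(1.0 - rho * rho)
    return phi(c) * Phi((rho * c - log_t) / s) / Phi(-log_t)


def auc(log_t, rho, lo=-8.0, hi=8.0, steps=4000):
    """AUC(t) = P(case marker > control marker), by the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        c = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * tp_incident(c, log_t, rho) * control_density(c, log_t, rho)
    return total * h


sens_early = tp_incident(1.19, -2.0, -0.8)  # threshold quoted for log(t) = -2
sens_late = tp_incident(0.32, 0.0, -0.8)    # threshold quoted for log(t) = 0
auc_early = auc(-2.0, -0.8)
auc_late = auc(0.0, -0.8)
```

The two sensitivities can be compared with the operating points quoted in the text for a 10% false-positive rate, and the two areas with the AUC values quoted for $\log(t) = -2$ and $\log(t) = 0$.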

indicating very good separation between the distribution for $M_i$ among subjects with $T_i = \exp(-2)$ as compared to the marker distribution for subjects with $T_i > \exp(-2)$. Furthermore, if the threshold value $c^{10\%} = 1.19$ were used to indicate a positive test, then by definition, only 10% of the controls (i.e., $\log(T_i) > -2$) would have a value of $M_i$ greater than 1.19. The ROC plot shows that for this false-positive rate of 10% a sensitivity, or true-positive rate, of 75% can be obtained:

$\mathrm{TP}_t^{I}(1.19) = 0.752$. If we consider a later time such as $\log(t) = 0$ we find less overall discrimination with an AUC of 0.741. Again, specific operating points can be identified; for example, the ROC curve shows that if the false-positive rate is again controlled at 10% then a true-positive rate of only 30% is now obtained (here $c^{10\%} = 0.320$). One of the key advantages of an ROC curve is that it facilitates comparisons across different conditions in terms of the sensitivity of a marker where the specificity is controlled at a fixed level for each condition. Here we have evaluated the temporal variation in sensitivity while controlling 1 − specificity at 10%.

In Figure 1b we show the AUC(t) functions for different values of $\rho$. For each value of $\rho$ we find a decreasing AUC(t) with increasing time. In addition, with decreasing correlation between the marker and the disease time we find uniformly decreasing values for AUC(t). A global accuracy summary can be obtained using $C$, which integrates AUC(t) using the weight function proportional to $2 \cdot f(t) \cdot S(t)$. Figure 1b also displays the weight function, which for this example is $w(t) = 2 \cdot \phi(t)[1 - \Phi(t)]$, where $\phi(x)$ and $\Phi(x)$ are the standard normal density and distribution functions, respectively. In this bivariate normal situation there exists an analytical solution for the concordance: $C = \sin^{-1}(-\rho)/\pi + 0.5$. For $\rho = -0.9$ we find $C = 0.827$, while with $\rho = -0.6$ we find $C = 0.703$. Therefore, when the marker $M_i$ and log-survival time have a correlation of $-0.9$ there is an 82.7% chance that for a random pair of observations the marker value for the earlier survival time is greater than the marker value for the larger survival time. This concordance probability is reduced to 70.3% when $\rho = -0.6$.

3. Estimation of Incident/Dynamic Time-Dependent Accuracy

In this section we propose methods for the estimation of time-dependent accuracy summaries using a single scalar marker $M_i$. When interest is in the accuracy of a survival regression model we propose using the linear predictor as a scalar marker, $M_i = Z_i^T \beta$, and then using nonparametric or semiparametric methods to characterize the time-dependent sensitivity and specificity of the model score. In particular, we discuss how the Cox model and partial likelihood concepts can be conveniently used to provide semiparametric estimates of I/D accuracy. However, the methods that we propose do not require the model score, $M_i$, to be derived from a proportional hazards model and are potentially applicable for any prognostic scale.

3.1 Estimation: $\mathrm{TP}_t^{I}(c)$ and $\mathrm{FP}_t^{D}(c)$ under Proportional Hazards

Properties of the partial likelihood function make estimation of I/D ROC curves a natural companion to Cox regression. Here we assume that the censoring time $C_i$ is independent of the failure time $T_i$ and marker $M_i$. To clearly distinguish between the general model score, $M_i = Z_i^T \beta$, and a Cox model that uses this score, we denote $\gamma$ as the proportional hazards regression parameter, $\lambda(t \mid M_i) = \lambda_0(t) \exp(M_i \gamma)$. It is well known that under a proportional hazards model the weights, $\pi_i(\gamma, t) = R_i(t) \exp(M_i \gamma)/W(t)$ introduced in Section 1.1, are used to compute an estimate of the expected value of the marker given failure: $E(M_i \mid T_i = t) = \sum_k M_k \pi_k(\gamma, t)$. However, Xu and O'Quigley (2000) show that these weights can also be used to estimate the distribution of the covariate conditional on death at time $t$:

$$\widehat{\mathrm{TP}}_t^{I}(c) = \hat{P}(M_i > c \mid T_i = t) = \sum_k 1(M_k > c)\, \pi_k(\hat{\gamma}, t), \qquad (1)$$

where the estimate $\hat{P}(M_i > c \mid T_i = t)$ is a consistent estimator when the Cox model for $M_i$ holds. Estimation of $\gamma$ using partial likelihood provides a semiparametric estimate for $\mathrm{TP}_t^{I}(c)$. An empirical estimator can be used for $\mathrm{FP}_t^{D}(c)$:

$$\widehat{\mathrm{FP}}_t^{D}(c) = \hat{P}(M_i > c \mid T_i > t) = \sum_k 1(M_k > c)\, R_k(t+)/W^R(t+), \qquad (2)$$

where $R_k(t+) = \lim_{\delta \to 0} R_k(t + |\delta|)$, and $W^R(t+) = \sum_k R_k(t+)$. The term $W^R(t+)$ denotes the size of the control set at time $t$, where we define the control set as the risk set minus subjects who fail at time $t$. Essentially, $\widehat{\mathrm{FP}}_t^{D}(c)$ is the empirical distribution function for marker values among the control set, and $\widehat{\mathrm{TP}}_t^{I}(c)$ is an exponential tilt of the empirical distribution function for the marker among risk set subjects (Anderson, 1979).

3.2 Estimation: $\mathrm{TP}_t^{I}(c)$ and $\mathrm{FP}_t^{D}(c)$ under Nonproportional Hazards

In order to use equation (1) to estimate incident sensitivity the proportional hazards assumption must be satisfied. However, this aspect can be relaxed by adopting a varying-coefficient model of the form $\lambda(t \mid M_i) = \lambda_0(t) \exp[M_i \gamma(t)]$. The time-varying coefficient function $\gamma(t)$ can be estimated either in a one-step fashion based on routine Cox model residuals, or through locally weighted partial likelihood methods. Note that if proportional hazards do obtain then $\gamma(t) \equiv 1$ when $M_i = Z_i^T \beta$.

Grambsch and Therneau (1994) describe residual-based methods for assessing the proportional hazards model that can also be used to obtain estimates of time-varying coefficient functions. In order to define the residuals we adopt the following notation: $S^{(p)}(\beta, t) = \sum_k R_k(t) \exp(Z_k^T \beta)\, Z_k^{p}$, where $Z_k^{p}$ refers to 1, $Z_k$, and $Z_k Z_k^T$ for $p = 0, 1, 2$, respectively. The scaled Schoenfeld residuals are defined for each observed ordered failure time, $t_{(j)}$, as the vector

$$r_j(\beta) = V^{-1}[\beta, t_{(j)}]\,\{Z_{(j)} - e[\beta, t_{(j)}]\},$$

where $e[\beta, t_{(j)}] = S^{(1)}[\beta, t_{(j)}]/S^{(0)}[\beta, t_{(j)}]$, $V[\beta, t_{(j)}] = S^{(2)}[\beta, t_{(j)}]/S^{(0)}[\beta, t_{(j)}] - e[\beta, t_{(j)}]\, e[\beta, t_{(j)}]^T$, and $Z_{(j)}$ denotes the covariate for the subject observed to die at time $t_{(j)}$. Grambsch and Therneau (1994) show that $E\{r_j \mid \mathcal{F}[t_{(j)}]\} \approx [\beta(t) - \beta_0]$, where $\beta_0$ is the time-averaged coefficient and $\mathcal{F}(t)$ is the right-continuous filtration specifying the survival process history. This property is used to obtain focused tests of proportionality, and to obtain estimates of the time-varying coefficient function, $\beta_k(t)$, corresponding to covariate $Z_{i,k}$. As a graphical diagnostic tool standard regression-smoothing techniques are now commonly applied to the points $[t_{(j)}, \hat{\beta}_k + r_{j,k}(\hat{\beta})]$ following a Cox model fit in order to obtain estimates of time-dependent coefficient functions, $\beta_k(t)$.
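Equations (1) and (2) of Section 3.1 can be sketched in a few lines once a value of the Cox coefficient for the marker is available. In the fragment below the coefficient `gamma` is simply passed in as a number; how it is fitted is outside the sketch. The function names and the five-subject data set are hypothetical, and the code assumes a non-empty risk set and control set at the chosen time.

```python
import math


def tp_incident_hat(markers, follow_up, t, c, gamma):
    """Equation (1): sum_k 1(M_k > c) * pi_k(gamma, t),
    with pi_k(gamma, t) proportional to R_k(t) * exp(M_k * gamma)."""
    weights = [
        math.exp(gamma * m) if x >= t else 0.0  # R_k(t) = 1(X_k >= t)
        for m, x in zip(markers, follow_up)
    ]
    total = sum(weights)  # W(t); assumed positive
    return sum(w for w, m in zip(weights, markers) if m > c) / total


def fp_dynamic_hat(markers, follow_up, t, c):
    """Equation (2): proportion with M_k > c in the control set X_k > t."""
    controls = [m for m, x in zip(markers, follow_up) if x > t]
    return sum(m > c for m in controls) / len(controls)


# Hypothetical marker values and follow-up times.
markers = [2.0, 1.0, 0.5, -0.3, -1.0]
follow_up = [1.0, 2.0, 3.0, 4.0, 5.0]

tp_untilted = tp_incident_hat(markers, follow_up, t=2.0, c=0.0, gamma=0.0)
tp_tilted = tp_incident_hat(markers, follow_up, t=2.0, c=0.0, gamma=1.0)
fp = fp_dynamic_hat(markers, follow_up, t=2.0, c=0.0)
```

With `gamma = 0` the weights are equal and the estimate reduces to the empirical proportion over the risk set; a positive `gamma` tilts the weights toward larger marker values, which is the exponential tilt mentioned in Section 3.1. Passing a time-specific coefficient value in place of a constant gives the varying-coefficient version of the same calculation.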

For the evaluation of the accuracy of a marker, $M_i$, the smoothing of Schoenfeld residuals can be used to obtain a simple estimate of I/D AUC(t) by exploiting standard Cox model output. First a Cox model of the form $\lambda_0(t) \exp(M_i \gamma)$ is fit, followed by use of regression-smoothing methods to obtain $\hat{\gamma}(t)$. Second, equation (2) can still be used to obtain estimates of false-positive rates, and (1) can now be evaluated using $\hat{\gamma}(t)$ rather than a constant value $\hat{\gamma}$:

$$\widehat{\mathrm{TP}}_t^{I}(c) = \hat{P}(M_i > c \mid T_i = t) = \sum_k 1(M_k > c)\, \pi_k[\hat{\gamma}(t), t]. \qquad (3)$$

By using equation (3) we are adopting the flexible semiparametric hazard model, $\lambda_0(t) \exp[M_i \gamma(t)]$, which no longer assumes proportionality, but rather only assumes smoothly varying hazard ratios over time.

More formal flexible semiparametric statistical methods can be used to estimate a varying-coefficient hazard model and subsequently produce time-dependent accuracy summaries based on minimal model assumptions. For example, Hastie and Tibshirani (1993) discuss both smooth parametric methods and nonparametric penalized likelihood methods for estimating the function $\gamma(t)$ in the model $\lambda_i(t) = \lambda_0(t) \exp[M_i \gamma(t)]$. More recently Cai and Sun (2003) characterize the properties of locally weighted partial likelihood methods used to obtain varying coefficient estimates. Using kernel weights that are specified as a function of time, $t$, allows use of local-linear estimation methods. Cai and Sun (2003) prove the pointwise consistency and asymptotic normality of the resulting function estimator, $\hat{\gamma}(t)$. Smooth parametric and/or nonparametric methods allow valid estimation of accuracy summaries such as AUC(t) based on the minimal model assumptions because models of the form $\lambda_i(t) = \lambda_0(t) \exp[M_i \gamma(t)]$ only assume linearity in $M_i$ and smoothly varying hazard ratios over time. The linearity assumption can be relaxed by using a model with single or multiple transformations of $M_i$ and a vector of time-varying coefficients.

3.3 Estimation: $\mathrm{ROC}_t(p)$, AUC(t), and $C^\tau$

Given estimates of $\mathrm{TP}_t^{I}(c)$ and $\mathrm{FP}_t^{D}(c)$ the area under the

3.4 Inference for Incident/Dynamic Accuracy Summaries

Xu and O'Quigley (2000) show that the estimator $\widehat{\mathrm{TP}}_t^{I}(c)$ given in equation (1) is consistent provided that the proportional hazards model obtains, and provided the independent observations are subject to independent censoring. Parallel arguments apply for the estimator obtained using a varying-coefficient model given in equation (3) whenever a consistent estimator of $\gamma(t)$ is used. Cai and Sun (2003) show that the locally weighted MPLE is consistent under standard regularity conditions. In addition, because $\widehat{\mathrm{FP}}_t^{D}(c)$ is an empirical distribution function calculated over the control set (i.e., the risk set minus the case), consistency obtains provided the control set represents an unbiased sample (i.e., independent censoring). Therefore, consistent estimates of time-dependent sensitivity and specificity and corresponding AUC(t) and $C^\tau$ summaries are obtained under the proportional hazards assumption using equations (1) and (2), and under more general nonproportional hazards assumptions using equation (3). Finally, because the accuracy summaries are defined over the joint distribution of the marker $M_i$ and the survival time $T_i$, the nonparametric bootstrap of Efron (1979) based on resampling of observations $(M_i, X_i, \Delta_i)$ may be used to compute standard errors or to provide confidence intervals.

3.5 Discrete Times and General Hazard Models

Our motivation for developing tools to summarize predictive accuracy stems from interest in characterizing the prognostic potential of Cox models for continuous survival times. However, the basic time-dependent accuracy concepts and the estimation method outlined in Section 3.2 generalize to discrete survival times and/or alternative hazard regression models. The key to estimation of $\mathrm{TP}_t^{I}(c)$ presented in Sections 3.1 and 3.2 is that a hazard model can be used to reweight the empirical distribution of $M_i$ calculated over the risk set at time $t$. Equations (1) and (3) show specific details for Cox models. More generally, let $P(T_i = t \mid T_i \ge t, M_i)$ denote the hazard, where $P(t)$ represents either density for continuous survival times or probability for discrete times. A hazard regression model can be formulated as $g[P(T_i = t \mid T_i \ge t, M_i)] = \alpha(t) +$
ROC curve at time t, AUC(t), and the integrated area, C , Mi (t), where g(x) is a link function. The Cox model is a spe-
can be calculated. The estimated ROC curve is given as cial case where a log link is used; (t) = log 0 (t); and (t)
 D 1  under the proportional hazards assumption. Following ar-
 t (p) = TP
ROC t t
FP (p) , guments given in Xu and OQuigley (2000) the general model
where  t ]1 (p) = inf c {c : FP
[FP  t (c) p}. The estimated P (Mi = m | Ti = t)

AUC(t) is simply AUC(t)  t (p) dp estimated using
= ROC g 1 [(t) + m (t)] P (Mi = m | Ti t), (4)
standard numerical integration methods such as the trapezoid
rule. Finally, the estimated concordance is given by where P (Mi = m | Ti t) denotes either the marker den-
 sity or probability depending on whether a continuous or dis-
C =  w (t) dt,
crete marker distribution is assumed. See the Appendix for
a derivation. Equation (4) shows that P (Mi = m | Ti = t)
can be estimated from separate estimates of the hazard

where AUC(t) is given above and w (t) = 2 f(t) S(t)/ model and the distribution of the marker conditional on Ti
[1 S ( )]. The KaplanMeier estimator can be used for S(t),
t. Therefore, the general estimation approach outlined in
and a discrete approximation to f(t) can be used based on the Section 3.2 can be adopted for either discrete survival times
increments in the KaplanMeier estimator. If KaplanMeier or for general hazard regression models provided that con-

is used to estimate f (t) and S(t) then AUC(t) only needs to sistent estimates of [(t), (t)] and P (Mi = m | Ti t) are
be evaluated at the observed failure times in order to calculate available. Tied survival times impact choice of a method for
C . estimating the hazard model parameters. In addition, with

discrete survival times calculation of the concordance summary $C = \int \mathrm{AUC}(t)\,w(t)\,dt$ requires modification to account for the fact that $P(T_j = T_k) \ne 0$ and, therefore, the constant 2 in the weight $w(t) = 2 f(t) S(t)$ needs to be computed as $1/P(T_j < T_k)$. Finally, Cox models are convenient because the baseline hazard, $\alpha(t) = \log\lambda_0(t)$, drops out of (4), and is thus not required for estimation of $\mathrm{TP}^I_t(c)$.

3.6 Simulations to Evaluate Incident/Dynamic Estimation
In order to demonstrate the feasibility of using Cox regression methods and the marker distribution among risk sets for estimating I/D ROC curves and global concordance we conducted a set of simulation studies.

For each of m = 500 simulated data sets a sample of n = 200 marker values, $M_i$, and survival times, $T_i$, were generated such that $(M_i, \log T_i)$ is bivariate normal with a correlation of $\rho = -0.7$. An independent log-normal censoring time was generated to yield a fixed expected fraction of censored observations (either 20% or 40% censored). For each simulated data set we estimated the I/D AUC(t) function and the concordance summary $C^\tau$ using the largest observed survival time to truncate follow-up time. We applied four methods of estimation to the censored data: maximum likelihood assuming a bivariate normal distribution for the survival time and the marker; maximum partial likelihood using the Cox model, which for this example incorrectly assumes proportional hazards; locally weighted maximum partial likelihood (MPL) estimation for the model $\lambda_0(t)\exp[M_i\gamma(t)]$ using the method of Cai and Sun (2003); and simple local linear smoothing of the scaled Schoenfeld residuals. For local MPL estimation and local linear smoothing we used an Epanechnikov kernel with a span of $n^{-1/5}$ where n is the number of observations.

In order to estimate AUC(t) and $C^\tau$ using semiparametric methods the model for the survival time conditional on the marker, $\lambda_0(t)\exp[M_i\gamma(t)]$, is combined with the observed marker distribution within each risk set according to the methods described in Section 3.2. We have adopted a survival model that assumes that the log hazard increases linearly in $M_i$ for each time t. The true data-generating model is actually nonlinear with a concave risk function. Therefore, for this simulation our estimation used a first-order approximation to the true conditional hazard surface.

Table 1 displays the mean and standard deviation for the estimate of AUC(t) at various values of t when data are generated with 20% and with 40% censoring. When 20% of the observations are censored we find that the MLE for AUC(t) has minimal bias for log(t) between −2 and 2. Estimates based on the locally weighted MPLE and the residual smoother yield approximately unbiased estimates for all but the most extreme values of time with some negative bias observed for both the semiparametric estimators. For example, at log(t) = −2 the mean $\widehat{\mathrm{AUC}}(t)$ using the locally weighted MPLE is 0.860 (relative bias of 1 − 0.860/0.884 = 3%) and using the residual smoother the average is 0.881 (relative bias of

Table 1
Simulation results for estimation of I/D accuracy. Data $(M_i, \log T_i)$ were generated as bivariate normal with a correlation of $\rho = -0.7$. The sample size for each simulated data set was N = 200. The AUC(t) curve and the integrated curve, $C^\tau$, were estimated using: maximum likelihood assuming a bivariate normal model; Cox model, which assumes proportional hazards; local maximum partial likelihood for the varying-coefficient model $\lambda(t) = \lambda_0(t)\exp[\gamma(t)M_i]$; and a local linear smooth of the scaled Schoenfeld residuals to estimate the varying-coefficient model.

                        MLE            Cox model       Local MPLE      Residual smooth
Log time    AUC(t)    Mean    SD      Mean    SD      Mean    SD      Mean    SD
20% censoring
  -2.0      0.884     0.884   0.018   0.743   0.028   0.860   0.052   0.881   0.044
  -1.5      0.833     0.834   0.019   0.734   0.026   0.817   0.033   0.829   0.035
  -1.0      0.782     0.782   0.019   0.725   0.024   0.768   0.031   0.771   0.033
  -0.5      0.734     0.734   0.019   0.716   0.023   0.722   0.032   0.720   0.033
   0.0      0.693     0.693   0.018   0.707   0.021   0.688   0.034   0.686   0.034
   0.5      0.660     0.660   0.016   0.700   0.023   0.655   0.041   0.657   0.040
   1.0      0.634     0.634   0.015   0.691   0.028   0.633   0.044   0.637   0.041
   1.5      0.614     0.614   0.013   0.670   0.044   0.621   0.064   0.622   0.048
   2.0      0.598     0.598   0.012   0.600   0.075   0.579   0.076   0.573   0.060
   C^τ      0.741     0.741   0.016   0.720   0.020   0.737   0.018   0.740   0.018
40% censoring
  -2.0      0.884     0.884   0.019   0.749   0.031   0.859   0.054   0.875   0.048
  -1.5      0.833     0.834   0.021   0.742   0.029   0.818   0.035   0.827   0.037
  -1.0      0.782     0.782   0.021   0.732   0.026   0.770   0.035   0.772   0.035
  -0.5      0.734     0.734   0.020   0.722   0.024   0.724   0.038   0.722   0.039
   0.0      0.693     0.693   0.019   0.712   0.024   0.689   0.042   0.687   0.041
   0.5      0.660     0.660   0.018   0.702   0.026   0.654   0.045   0.655   0.043
   1.0      0.634     0.635   0.016   0.689   0.035   0.633   0.057   0.637   0.048
   1.5      0.614     0.614   0.015   0.653   0.055   0.617   0.075   0.614   0.051
   2.0      0.598     0.599   0.013   0.560   0.073   0.555   0.075   0.546   0.058
   C^τ      0.741     0.741   0.017   0.727   0.022   0.740   0.021   0.742   0.021

1 − 0.881/0.884 < 1%), while at log(t) = 2 the locally weighted MPLE mean estimate is 0.579 (relative bias = 1 − 0.579/0.598 = 3%) and for the residual smoother the mean is 0.573 (relative bias = 1 − 0.573/0.598 = 4%). As expected for local regression methods Table 1 shows that the nonparametric methods yield substantially greater variances for specific values of t compared to the MLE.

Incorrectly assuming proportional hazards leads to biased estimates. Table 1 shows that the estimated AUC(t) obtained using equation (1) with an estimated Cox model coefficient is negatively biased for log(t) < 0. For example, at log(t) = −2 we obtain a negative bias of 1 − 0.743/0.884 = 16%. For log(t) > 0 the estimates obtained using the Cox model and equation (1) are positively biased, indicating that direct use of the proportional hazards assumption produces an estimated AUC(t) curve that is flatter than the target with early underestimation and late overestimation.

When censoring is increased to 40% similar patterns are found for all estimators. Table 1 shows that the bias in $\widehat{\mathrm{AUC}}(t)$ is slightly larger with increased censoring. For example, at log(t) = 2 the mean estimate for the locally weighted MPLE is 0.555 (relative bias of 1 − 0.555/0.598 = 7%) and for the residual smoother it is 0.546 (relative bias of 1 − 0.546/0.598 = 9%). Therefore, even with 40% censoring the smooth semiparametric methods appear to perform adequately.

Finally, Table 1 also shows the results for the estimation of the global concordance summary $C^\tau$. In the simulations we estimate C using the analytical results for the MLE: $C = -\sin^{-1}(\rho)/\pi + 1/2$. For the methods that adopt a varying-coefficient hazard model we set $\tau$ equal to the largest uncensored survival time in the observed data and, therefore, truncate follow-up at slightly different times for each simulated data set. However, even with 40% censoring the largest uncensored time had a median value of exp(2.30) with an interquartile range of exp(2.04) to exp(2.65), and thus typically very little mass in the survival distribution is lost because $S[\exp(2.30)] = 1 - \Phi(2.30) = 0.01$. With 20% censoring the mean estimates for the MLE, locally weighted MPLE, and residual smoother are 0.741 (SD = 0.016), 0.737 (SD = 0.018), and 0.740 (SD = 0.018), respectively. In contrast the estimate obtained naively assuming proportional hazards is negatively biased with an average estimate of 0.720 (relative bias = 1 − 0.720/0.741 = 3%). These results suggest that the smooth semiparametric methods yield little bias, and for this example exhibit high efficiency relative to the MLE. A similar pattern is seen with 40% censoring where slightly increased standard deviations are observed relative to results obtained with 20% censoring.

4. Examples
In this section we illustrate the proposed methods using two well-studied data sets.

4.1 VA Lung Cancer Data
Kalbfleisch and Prentice (2002) present and analyze Veterans Administration (VA) lung cancer data from a clinical trial in which males with inoperable cancer were randomized to a standard treatment or a test therapy. Baseline covariates that were considered important predictors of mortality include: patient age, histological type of tumor, and a performance status measure known as the Karnofsky score. Schemper and Henderson (2000) use these covariates plus a treatment indicator and report an $R^2$ of V = 0.24. This would suggest that the covariates explain only 24% of the time-integrated variance in survival status.

Table 2
Cox regression estimates for the VA lung cancer data where follow-up is truncated at 500 days. The reference category for cell type is squamous.

Covariate            Estimate    SE       Z
Treatment              0.323    0.206    1.566
Age/10                -0.086    0.093   -0.937
Karnofsky score       -0.032    0.005   -5.931
Cell type (small)      0.841    0.270    3.116
Cell type (adeno)      1.151    0.295    3.896
Cell type (large)      0.350    0.285    1.231

For comparison we use the same covariates and Cox regression to create estimates of $\mathrm{ROC}_t(p)$ for select t, the AUC(t) function, and the concordance summary $C^\tau$. For our analysis we terminate follow-up at 500 days. Estimated model coefficients and standard errors are given in Table 2. Using the proportional hazards assumption we can employ equations (1) and (2) to estimate time-specific I/D ROC curves, and then integrate the ROC curve to obtain $\widehat{\mathrm{AUC}}(t)$. Estimates of AUC(t) and pointwise 90% confidence intervals are displayed in Figure 2a. Over the first 60 days of follow-up the AUC(t) ranges between 0.66 and 0.73. The substantive interpretation is: on any day, t, between 0 and 60, the probability that a subject who dies on day t has a model score greater than a subject who survives beyond day t is at least 0.66. The accuracy summaries suggest good short-term discriminatory potential of the model score. The estimated AUC(t) function tends to decline over time to approximately 0.65 for 100 < t < 300. Estimates of AUC(t) also become increasingly variable over time due to the diminishing size of the risk set. Using a follow-up of $\tau = 365$ days yields a concordance estimate of $\hat{C}^\tau = \int_0^\tau \widehat{\mathrm{AUC}}(t)\,\hat{w}^\tau(t)\,dt = 0.713$ with a standard error of 0.026. This implies that conditional on one event occurring within the first year, the probability that the model score is larger for the subject with the smaller event time is 71.3%. The concordance estimate $\hat{C}^\tau$ is relatively modest in magnitude, but is significantly different from the null value of 0.50 (95% CI for $C^\tau$: 0.661, 0.765).

To characterize the model score, $M_i = Z_i^T\beta$, using fewer assumptions we relax the proportional hazards assumption for $M_i$ by using a varying-coefficient model: $\lambda_0(t)\exp[M_i\gamma(t)]$. Note that we are still focusing on use of the Cox model with a proportional hazards assumption to generate the model score, but are relaxing the assumptions needed to characterize model accuracy. This highlights the fact that different methods can be used for generating and evaluating a survival regression model score. For the VA lung cancer data we simply use a kernel smooth of the scaled Schoenfeld residuals to estimate $\gamma(t)$. The estimate of $\gamma(t)$ suggests a decreasing log-relative hazard with increasing time (not shown).

Figure 2b shows estimates of AUC(t) based on equations (2) and (3), which relax the proportional hazards assumption.
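As a rough illustration of how an I/D AUC(t) value of this kind can be computed at one failure time, the sketch below reweights the marker distribution in the risk set by exp(M × gamma) to approximate the incident true-positive rate, uses the empirical distribution among the remaining controls for the dynamic false-positive rate, and integrates the resulting ROC points with the trapezoid rule. The function name `id_auc_at_time` and its arguments are hypothetical, the exact weight form is an assumption based on the risk-set reweighting described in Section 3.5, and `gamma` stands for either a constant Cox coefficient or a smoothed value at time t.

```python
import numpy as np

def id_auc_at_time(marker, time, event, t, gamma):
    """Sketch of I/D AUC(t) for a scalar marker at failure time t.

    gamma is the marker log hazard ratio used at time t: a constant estimate
    under proportional hazards, or a smoothed time-varying value otherwise.
    """
    marker = np.asarray(marker, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    at_risk = time >= t                        # risk set at time t
    is_case = at_risk & (time == t) & event    # subjects failing at t
    controls = at_risk & ~is_case              # risk set minus the case(s)
    if not at_risk.any() or not controls.any():
        return float("nan")

    # Incident TP: reweight the risk-set marker distribution by exp(M * gamma).
    m_risk = marker[at_risk]
    eta = gamma * m_risk
    weights = np.exp(eta - eta.max())          # shifted for numerical stability
    weights /= weights.sum()
    m_ctrl = marker[controls]

    # Evaluate TP(c) and FP(c) over a grid of thresholds c.
    cuts = np.concatenate(([-np.inf], np.unique(m_risk)))
    tp = np.array([weights[m_risk > c].sum() for c in cuts])
    fp = np.array([np.mean(m_ctrl > c) for c in cuts])

    # Trapezoid rule over ROC points ordered by increasing false-positive rate.
    fp_inc, tp_inc = fp[::-1], tp[::-1]
    return float(np.sum(np.diff(fp_inc) * (tp_inc[1:] + tp_inc[:-1]) / 2.0))
```

Calling the function at each observed failure time, with `gamma` fixed or varying, would trace out an AUC(t) curve; ties between case and control marker values are counted as one half by the trapezoid rule.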

[Figure 2: panel (a) "AUC based on Cox model" and panel (b) "AUC based on varying-coefficient Cox model"; x-axis: Time (days), 0 to 400.]

Figure 2. Incident/dynamic AUC plots for the VA lung cancer data. (a) Accuracy of the model score (linear predictor) under the assumption of proportional hazards. Estimates of I/D AUC(t) versus time with pointwise 90% confidence intervals. Using $\tau = 365$ we obtain $\hat{C}^\tau = \int_0^\tau \widehat{\mathrm{AUC}}(t)\,\hat{w}^\tau(t)\,dt = 0.713$ (SE = 0.026). (b) Accuracy of the model score (linear predictor) based on a varying-coefficient multiplicative hazard model. Estimates of I/D AUC(t) versus time with pointwise 90% confidence intervals. Using $\tau = 365$ we obtain $\hat{C}^\tau = \int_0^\tau \widehat{\mathrm{AUC}}(t)\,\hat{w}^\tau(t)\,dt = 0.738$ (SE = 0.022).
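The integrated summary quoted in the caption can be sketched as a weighted average of AUC(t) over the observed failure times up to the truncation time. The helper names `kaplan_meier` and `integrated_auc` are hypothetical. As a stated assumption, the weights 2 f(t) S(t) built from Kaplan–Meier increments are normalized by their own sum, which stands in for dividing by 1 − S²(τ); this is one possible discrete approximation, not necessarily the one used by the authors.

```python
import numpy as np

def kaplan_meier(time, event):
    """Unique event times and the Kaplan-Meier survival estimate just after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = np.empty(len(event_times))
    s = 1.0
    for j, t in enumerate(event_times):
        n_at_risk = np.sum(time >= t)
        n_events = np.sum((time == t) & event)
        s *= 1.0 - n_events / n_at_risk
        surv[j] = s
    return event_times, surv

def integrated_auc(event_times, surv, auc, tau):
    """Weighted average of AUC(t) over failure times <= tau, weights 2 f(t) S(t)."""
    event_times = np.asarray(event_times, dtype=float)
    keep = event_times <= tau
    s = np.asarray(surv, dtype=float)[keep]
    s_prev = np.concatenate(([1.0], s[:-1]))
    f = s_prev - s                      # Kaplan-Meier increments approximate f(t)
    w = 2.0 * f * s
    total = w.sum()
    if total <= 0:                      # e.g. no failure times before tau
        return float("nan")
    return float(np.sum((w / total) * np.asarray(auc, dtype=float)[keep]))
```

The `auc` argument is assumed to hold one AUC(t) value per entry of `event_times`, for example values produced by a risk-set calculation at each failure time.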

First, notice that the short-term accuracy of the model score remains good with AUC(t) between 0.70 and 0.78 over the first 60 days of follow-up. Second, the discriminatory ability of the model score declines substantially over time, and estimates of AUC(t) approach 0.50 after approximately 300 days, suggesting that the model score is essentially useless at discriminating incident cases from controls after 300 days. The 1-year concordance is estimated as $\hat{C}^\tau = 0.738$, a slight increase from the estimate obtained assuming proportional hazards. In this example the AUC(t) curve is particularly useful for displaying the fact that the baseline model score is good at discriminating early cases from early controls, but is of decreasing prognostic utility with increasing temporal distance from the baseline measurement. Declining prognostic value is not surprising, particularly because the Karnofsky score is actually a time-varying health status measure, but only the baseline value is available for the regression model. Figure 3 shows select estimates of I/D ROC curves based on the varying-coefficient model. Similar to the plot of AUC(t) the ROC curves show that predictive accuracy is uniformly decreasing with increasing time since baseline. For example, controlling the dynamic false-positive rate at 20% leads to an incident sensitivity of 56% at 30 days, decreasing to 45%, 42%, and 38% for 60, 90, and 120 days. The ROC curves also show details regarding the trade-off between sensitivity and specificity. If a stricter false-positive rate of 10% was desired then the corresponding sensitivity would only be 40% at 30 days and less than 30% for follow-up times of 60 days or greater.

[Figure 3: plot titled "I/D ROC curves for the model score"; curves for t = 30, 60, 90, and 120; x-axis: 1-specificity, 0.0 to 1.0.]

Figure 3. Incident/dynamic ROC curves for the VA lung cancer data. A model score is derived using Cox regression with Karnofsky score, age, and cell type. ROC curves are estimated using a varying-coefficient Cox model with the derived model score as the single predictor.

Table 3
Cox regression estimates for the PBC data

Covariate                Estimate    SE       Z
Model 1
Log(bilirubin)             0.877    0.099    8.866
Log(prothrombin time)      3.013    1.025    2.939
Edema                      0.785    0.300    2.617
Albumin                   -0.944    0.237   -3.985
Age                        0.033    0.009    3.881
Model 2
Log(prothrombin time)      4.141    0.870    4.758
Edema                      1.190    0.295    4.031
Albumin                   -1.314    0.223   -5.897
Age                        0.024    0.009    2.660

4.2 Mayo PBC Data
Next, we consider data from a randomized placebo-controlled trial of the drug D-penicillamine (DPCA) for the treatment of primary biliary cirrhosis (PBC) conducted at the Mayo Clinic between 1974 and 1984 (Fleming and Harrington, 1991). Among the 312 subjects randomized to the study, 125 died by the end of the follow-up. Although the study established that DPCA is not effective for the treatment of PBC, the data have been used to develop a commonly used clinical prediction model. We use this example to illustrate how ROC curves and/or AUC(t) summaries can be used to compare different model scores.

We first consider a Cox model containing five covariates: log(bilirubin), albumin, log(prothrombin time), edema, and age. Table 3 gives the regression estimates using the proportional hazards model with mortality as the response. Except for log(prothrombin time), all covariates are strong predictors of survival. The model has been used to create a widely used prognostic score. We now address the basic question: How well does the model score discriminate subjects who are likely to die from subjects who are likely to survive? In addition, we consider whether the accuracy of the score changes over time. Using the fitted linear predictor from the Cox model, we construct I/D time-dependent ROC curves and associated summaries for the Mayo model. Figure 4a plots AUC(t) evaluated at each failure time. The model score has very good discriminatory capacity for distinguishing those patients who die at time t from those who live beyond time t. The accuracy is especially good for follow-up times less than 1000 days, with early AUC(t) estimates exceeding 0.85. The accuracy of the model score gradually decreases with time. Based on $\widehat{\mathrm{AUC}}(t)$ and the Kaplan–Meier estimator of the marginal survival distribution we estimate a concordance summary, $C^\tau$, of 0.80, with $\tau$ fixed at 4000 days for this and subsequent analysis.

To quantify the impact of a single covariate on the accuracy of prediction we fit a second Cox regression model that does not include the covariate log(bilirubin). Table 3 displays coefficient estimates for this new four-covariate model. The estimate of $C^\tau$ drops from 0.80 to 0.73 when log(bilirubin) is excluded from the model. In addition, we can use the estimated AUC(t) curves shown in Figure 4a to quantify for each follow-up time t the additional predictive accuracy that

[Figure 4: panel (a) "AUC based on Cox model" (5 covariates: iAUC = 0.796; 4 covariates: iAUC = 0.733) and panel (b) "AUC based on varying-coefficient Cox model" (5 covariates: iAUC = 0.805; 4 covariates: iAUC = 0.719); x-axis: Time (days), 0 to 4000.]

Figure 4. Incident/dynamic AUC plots for the Mayo PBC data. (a) Accuracy of the model score using five covariates () log(bilirubin), log(prothrombin), edema, albumin, and age, and the model score using four covariates (+), where log(bilirubin) is excluded. Lines plot the estimates of I/D AUC(t) versus time under the assumption of proportional hazards. (b) Accuracy of the model score using five covariates () log(bilirubin), log(prothrombin), edema, albumin, and age, and the model score using four covariates (+), where log(bilirubin) is excluded. Estimation is based on a varying-coefficient multiplicative hazard model. Lines plot the estimates of I/D AUC(t) versus time.

is obtained by using bilirubin in addition to the other model covariates. Relative to the five-covariate model the estimated AUC(t) for the four-covariate model is approximately 0.10 units below the five-covariate model $\widehat{\mathrm{AUC}}(t)$ for t between 0 and 2000 days.

We then relax the proportional hazards assumption and use the time-varying coefficient models as described in Section 3.2 to characterize the accuracy of the model score $M_i = Z_i^T\beta$. The bottom panel of Figure 4 displays the AUC function based on the estimated time-varying coefficient obtained using locally weighted MPL. Early estimates of AUC(t) now exceed 0.90 and decline sharply to approximately 0.75 at 2000 days for the five-covariate model and to less than 0.65 at 2000 days for the four-covariate model. Using the estimated AUC(t) reveals that the Mayo model is excellent at short-term prediction but that the predictive accuracy declines to $\widehat{\mathrm{AUC}}(t) < 0.80$ by 1 year for the model without bilirubin, and to $\widehat{\mathrm{AUC}}(t) < 0.80$ by 5 years for the five-covariate model. Finally, using the time-varying coefficient produces a global concordance summary of 0.80 for the five-covariate model and 0.72 for the model that excludes bilirubin.

5. Discussion
This article introduces a new version of time-dependent sensitivity, specificity, and associated ROC curves that are useful for characterizing the predictive accuracy of a scalar marker, such as a derived model score, when the outcome is a censored survival time. We show that the area under the time-specific ROC curves can be plotted as a function of time to characterize temporal changes in accuracy, and can be integrated using the marginal distribution of the failure time to provide a global concordance summary. Incident sensitivity and dynamic specificity are shown to be easily estimated using a fitted hazard model and the empirical distribution of the marker data within risk sets. Using only routine Cox model output allows estimates of accuracy that assume proportional hazards, and simple regression smoothing of scaled Schoenfeld residuals provides accuracy summaries appropriate for markers that do not satisfy proportional hazards. Simulations suggest that residual smoothing and locally weighted partial likelihood estimators both provide feasible and accurate estimates.

Our methods explicitly decouple the generation of a predictive score from the evaluation of prognostic accuracy. An investigator may use Cox regression to create a model score $M_i = Z_i^T\beta$ that is a time-invariant linear combination of baseline covariates $Z_i$. However, using the flexible methods proposed in Section 3.2 to evaluate the prognostic potential of $M_i$ does not require commitment to the proportional hazards assumption. A practical advantage of using $M_i = Z_i^T\beta$ is that a single scoring of the baseline covariates is conducted to generate $M_i$, but if proportional hazards is clearly violated then a more general model such as $\lambda_0(t)\exp[Z_i^T\beta(t)]$ may be appropriate, and would lead to a time-varying score $M_i(t) = Z_i^T\beta(t)$.

A number of aspects warrant additional research. First, estimation methods proposed in Sections 3.1 and 3.2 assume that the censoring time is independent of the survival time. Relaxation to allow conditional independence given the marker, $M_i$, or covariates, $Z_i$, would be useful. Second, we have proposed estimators that assume a prospective study design. Extension to case–cohort data may be important for characterizing the accuracy of markers for rare diseases. Third, development of analytical approximations that characterize the large sample distribution of the proposed estimators would facilitate approximate inference for time-dependent ROC curves, the AUC(t) curve, or the concordance summary $C^\tau$. Finally, exploration of time-dependent accuracy methods with a longitudinal marker, $M_i(t)$, would be important for the common prospective medical setting in which predictive covariate information is updated over time.

Résumé
The predictive accuracy of a survival model can be summarized using extensions of the percentage of variability explained by the model, or $R^2$, commonly used for models of a continuous response, or using extensions of sensitivity and specificity, commonly used for predicting a binary response. In this article we propose a time-dependent version of accuracy, using time-specific functions of sensitivity and specificity calculated over risk sets. We connect the accuracy summaries to a previously proposed global concordance measure, which is an extension of Kendall's tau. In addition, we show how to use the results obtained from a Cox model to obtain estimates of time-dependent sensitivity and specificity as well as time-dependent ROC (receiver operating characteristic) curves. Semiparametric estimation methods suited to both proportional and nonproportional hazards models are presented, evaluated by simulations, and illustrated using two survival data sets.

References
Agresti, A. (2002). Categorical Data Analysis, 2nd edition. New York: John Wiley & Sons.
Akritas, M. G. (1994). Nearest neighbor estimation of a bivariate distribution under random censoring. Annals of Statistics 22, 1299–1327.
Anderson, J. A. (1979). Multivariate logistic compounds. Biometrika 66, 17–26.
Cai, T., Pepe, M. S., Lumley, T., Zheng, Y., and Jenny, N. S. (2003). The sensitivity and specificity of markers for event times. University of Washington Technical Report 188, 1–30.
Cai, Z. and Sun, Y. (2003). Local linear estimation for time-dependent coefficients in Cox's regression models. Scandinavian Journal of Statistics 30, 93–111.
Cox, D. R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, Methodological 34, 187–220.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1–26.
Etzioni, R., Pepe, M., Longton, G., Hu, C., and Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making 19, 242–251.
Fan, V., Au, D., Heagerty, P., Deyo, R., McDonell, M., and Fihn, S. (2002). Validation of case-mix measures derived

from self-reports of diagnoses and health. Journal of Clinical Epidemiology 55, 371–380.
Fleming, T. R. and Harrington, D. P. (1991). Counting Processes and Survival Analysis. New York: John Wiley & Sons.
Grambsch, P. M. and Therneau, T. M. (1994). Proportional hazards tests and diagnostics based on weighted residuals (Corr: 1995, 82, 668). Biometrika 81, 515–526.
Hanley, J. A. and McNeil, B. (1982). The meaning and use of the area under the receiver operating characteristic (ROC) curve. Radiology 143, 29–36.
Harrell, F. E., Lee, K. L., and Mark, D. B. (1996). Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Statistics in Medicine 15, 361
Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society, Series B 55, 757–796.
Heagerty, P. J., Lumley, T., and Pepe, M. S. (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics 56, 337–344.
Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. New York: John Wiley &
Korn, E. L. and Simon, R. (1990). Measures of explained variation for survival data. Statistics in Medicine 9, 487–503.
O'Quigley, J. and Xu, R. (2001). Explained variation in proportional hazards regression. In Handbook of Statistics in Clinical Oncology, J. Crowley (ed), 397–409. New York: Marcel Dekker.
Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford: Oxford University Press.
Schemper, M. and Henderson, R. (2000). Predictive accuracy and explained variation in Cox regression. Biometrics 56,
Slate, E. H. and Turnbull, B. W. (2000). Statistical models for longitudinal biomarkers of disease onset. Statistics in Medicine 19, 617–637.
Xu, R. and O'Quigley, J. (2000). Proportional hazards estimate of the conditional survival function. Journal of the Royal Statistical Society, Series B, Methodological 62,
Zheng, Y. and Heagerty, P. (2004). Semiparametric estimation of time-dependent ROC curves for longitudinal marker data. Biostatistics 5, 615–632.
Zhou, X.-H., McClish, D. K., and Obuchowski, N. A. (2002). Statistical Methods in Diagnostic Medicine. New York: John Wiley & Sons.

Received August 2003. Revised March 2004.
Accepted March 2004.

Appendix

Concordance as Function of AUC(t)
Assume independent observations $(M_j, T_j)$ and $(M_k, T_k)$, and assume that $T_j$ is continuous such that $P(T_k = T_j) = 0$. Let P(x) denote probability or density depending on the context:

\[
P[T_j < T_k] = \tfrac{1}{2} \quad \text{(by independence)}
\]

\[
\begin{aligned}
P[M_j > M_k \mid T_j < T_k]
&= P[\{M_j > M_k\} \cap \{T_j < T_k\}] \times 2 \\
&= \int_t P[\{M_j > M_k\} \cap \{T_j = t\} \cap \{t < T_k\}] \times 2 \, dt \\
&= \int_t P[\{M_j > M_k\} \mid \{T_j = t\} \cap \{t < T_k\}] \times 2 \times P[\{T_j = t\} \cap \{t < T_k\}] \, dt \\
&= \int_t \mathrm{AUC}(t) \times 2 \times P[T_j = t] \times P[t < T_k] \, dt \\
&= \int_t \mathrm{AUC}(t)\, w(t) \, dt = E_T[\mathrm{AUC}(T) \times 2 \times S(T)],
\end{aligned}
\]

with $w(t) = 2 f(t) S(t)$.

Hazard as Bridge from $P(M_i = m \mid T_i \ge t)$ to $P(M_i = m \mid T_i = t)$
Let P(x) denote probability or density depending on the context and specific assumptions. For either continuous or discrete survival times the conditional hazard can be defined as $\lambda(t \mid M_i = m) = P(T_i = t \mid M_i = m)/P(T_i \ge t \mid M_i = m)$. Let P(m) denote the marginal density or distribution of the marker M. Following Xu and O'Quigley (2000) we obtain the following general relationship:

\[
\begin{aligned}
P(M_i = m \mid T_i = t)
&= P(T_i = t \mid M_i = m) \times P(M_i = m)/P(T_i = t) \\
&= \lambda(t \mid M_i = m) \times P(T_i \ge t \mid M_i = m) \times P(M_i = m)/P(T_i = t) \\
&= \lambda(t \mid M_i = m) \times P(M_i = m \mid T_i \ge t) \times P(T_i \ge t)/P(T_i = t),
\end{aligned}
\]

\[
P(M_i = m \mid T_i = t) \propto \lambda(t \mid M_i = m) \times P(M_i = m \mid T_i \ge t).
\]
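The bridge relationship in the Appendix can be checked numerically on a small discrete example. The sketch below is only an illustration under assumed inputs: it draws an arbitrary joint probability table for a marker with three levels and a survival time with five levels, and the function name `bridge_check` is hypothetical. It compares the marker distribution among cases obtained directly with the one obtained by multiplying the hazard by the marker distribution in the risk set and normalizing over marker values.

```python
import numpy as np

def bridge_check(joint):
    """Compare P(M=m | T=t) computed directly and via hazard x risk-set distribution.

    joint[m, t] holds P(M = m, T = t) for a discrete marker and discrete time.
    """
    joint = np.asarray(joint, dtype=float)
    # P(M = m, T >= t): reverse cumulative sum along the time axis.
    tail = joint[:, ::-1].cumsum(axis=1)[:, ::-1]
    # Hazard: P(T = t | M = m) / P(T >= t | M = m) = joint / tail.
    hazard = joint / tail
    # Marker distribution within the risk set: P(M = m | T >= t).
    m_given_at_risk = tail / tail.sum(axis=0, keepdims=True)
    # Right-hand side of the proportionality, normalized over m.
    bridged = hazard * m_given_at_risk
    bridged /= bridged.sum(axis=0, keepdims=True)
    # Left-hand side: P(M = m | T = t).
    direct = joint / joint.sum(axis=0, keepdims=True)
    return bridged, direct

rng = np.random.default_rng(0)
table = rng.random((3, 5)) + 0.01   # arbitrary positive joint weights (assumed example)
table /= table.sum()
bridged, direct = bridge_check(table)
```

In this discrete setting the two arrays agree up to floating-point error, because the normalizing constant P(T ≥ t)/P(T = t) does not depend on the marker value.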