American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal
of the American Statistical Association.
Logistic Regression, Survival Analysis, and the Kaplan-Meier Curve
BRADLEY EFRON*
* Bradley Efron is Professor, Department of Statistics, Stanford University, Stanford, CA 94305. This article was stimulated by a talk of Wei-Yang Tsai concerning isotonic hazard rate estimation.
© 1988 American Statistical Association. Journal of the American Statistical Association, June 1988, Vol. 83, No. 402, Theory and Methods.
Figure 1. Kaplan-Meier Estimated Survival Curves, Arms A and B. [Plot omitted; horizontal axis: Months, 0 to 80; vertical axis: 0.0 to 1.0.] These estimates are taken from a study comparing radiation therapy alone (A) versus radiation plus chemotherapy (B) for the treatment of head and neck cancer. Treatment B is significantly better according to the Mantel-Haenszel test, significance level .01 (see Tables 1 and 2). "Death" actually means "recurrence of disease." The error bars indicate ± one standard error.

Table 1. Data for Arm A of the Head-and-Neck Cancer Study Conducted by the Northern California Oncology Group, Discretized by Months

Month  n_i  s_i  s_i'     Month  n_i  s_i  s_i'
  1     51    1    0        25     7    0    0
  2     50    2    0        26     7    0    0
  3     48    5    1        27     7    0    0
  4     42    2    0        28     7    0    0
  5     40    8    0        29     7    0    0
  6     32    7    0        30     7    0    0
  7     25    0    1        31     7    0    0
  8     24    3    0        32     7    0    0
  9     21    2    0        33     7    0    0
 10     19    2    1        34     7    0    0
 11     16    0    1        35     7    0    0
 12     15    0    0        36     7    0    0
 13     15    0    0        37     7    1    1
 14     15    3    0        38     5    1    0
 15     12    1    0        39     4    0    0
 16     11    0    0        40     4    0    0
 17     11    0    0        41     4    0    1
 18     11    1    1        42     3    0    0
 19      9    0    0        43     3    0    0
 20      9    2    0        44     3    0    0
 21      7    0    0        45     3    0    1
 22      7    0    0        46     2    0    0
 23      7    0    0        47     2    1    1
 24      7    0    0
                            Total  628   42    9

NOTE: n_i is the number of patients at risk at the beginning of month i, s_i the number of observed deaths, s_i' the number lost to follow-up. The survival times in days for the 51 patients were 7, 34, 42, 63, 64, 74+, 83, 84, 91, 108, 112, 129, 133, 133, 139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185+, 218, 225, 241, 248, 273, 277, 279+, 297, 319+, 405, 417, 420, 440, 523, 523+, 583, 594, 1,101, 1,116+, 1,146, 1,226+, 1,349+, 1,412+, 1,417 ("+" indicates lost to follow-up). The table was constructed from these data, taking one month to be 30.438 days.

For example, $n_3 = 48$ patients were alive at the beginning of the third month of observation, during which $s_3 = 5$ patients died and $s_3' = 1$ patient was lost to follow-up. This left $n_4 = 42$ patients still under study at the beginning of month 4. "Lost to follow-up," or "censored," data occurred mainly because patients entered the study at different calendar times, and some of them were still alive when the data were collected at the end of the study.
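The construction of Table 1 from the raw survival times can be sketched as follows. This is a minimal illustration, not the author's own procedure: it assumes that an event on day d falls in month ceil(d / 30.438), and the names `RAW`, `discretize`, and `table1` are hypothetical.

```python
import math
from collections import Counter

DAYS_PER_MONTH = 30.438  # conversion stated in the note to Table 1

# Arm A survival times in days; a trailing "+" marks a patient lost to follow-up.
RAW = ("7 34 42 63 64 74+ 83 84 91 108 112 129 133 133 139 140 140 146 149 154 "
       "157 160 160 165 173 176 185+ 218 225 241 248 273 277 279+ 297 319+ 405 "
       "417 420 440 523 523+ 583 594 1101 1116+ 1146 1226+ 1349+ 1412+ 1417")


def discretize(raw, days_per_month=DAYS_PER_MONTH):
    """Return rows (month, n_i, s_i, s_i') for months 1 through the last event."""
    deaths, losses = Counter(), Counter()
    tokens = raw.split()
    for token in tokens:
        censored = token.endswith("+")
        days = int(token.rstrip("+"))
        # Assumption: the event belongs to the month interval that contains it.
        month = math.ceil(days / days_per_month)
        (losses if censored else deaths)[month] += 1

    rows, at_risk = [], len(tokens)
    last_month = max(list(deaths) + list(losses))
    for month in range(1, last_month + 1):
        s, s_lost = deaths[month], losses[month]
        rows.append((month, at_risk, s, s_lost))
        at_risk -= s + s_lost  # n_{i+1} = n_i - s_i - s_i'
    return rows


table1 = discretize(RAW)
for row in table1[:5]:
    print(row)
```

Under that month-assignment assumption, the third row is (3, 48, 5, 1) and the column totals are 628, 42, and 9, as in Table 1.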
Table 2 shows the discretized data for arm B of the study. Here we have used N = 61 discrete intervals, not all of the same length. (The choice of discretization made little difference in the estimated hazard rates and survival curves; see Remark E, Sec. 3, and Remark I, Sec. 5.)

Our basic assumption is that for data of type (2.1), the number of deaths $s_i$ is binomially distributed, given $n_i$, say

$$s_i \mid n_i \sim \mathrm{Bi}(n_i, h_i) \quad \text{independently}, \quad i = 1, 2, \ldots, N. \tag{2.2}$$

In other words, $s_i$ has discrete density

$$\binom{n_i}{s_i} h_i^{s_i} (1 - h_i)^{n_i - s_i}, \quad s_i = 0, 1, 2, \ldots, n_i.$$

Here $h_i$ is the discrete hazard rate:

$$h_i = \Pr\{\text{patient dies during $i$th interval} \mid \text{patient survives until beginning of $i$th interval}\}. \tag{2.3}$$

The binomial assumption in (2.2) is basic to most work in survival analysis. Nice discussions appear in chapter 4 of Cox and Oakes (1984) and section (5.2) of Kalbfleisch and Prentice (1980). In what follows, we consider the $n_i$ to be fixed at their observed values, and take literally the independence assumption in (2.2). Although this assumption cannot be exactly true (see Sec. 3, Remark A), it leads to reasonable conclusions under the usual assumptions for censored data.

Figure 2. Hazard Rate Estimates for the Head-and-Neck Cancer Study. [Plot omitted; horizontal axis: Months.] There is an early high-risk period for both treatments. The hazard rates stabilize after one year, with treatment A having a hazard rate roughly 2.5 times that of treatment B. (The bullets are identifying symbols for curve A, not data points.) This figure is based on a parametric analysis described in Section 2.

The survival function for our discretized situation is

$$G_i = \prod_{1 \le j < i} (1 - h_j), \tag{2.4}$$

the probability that a patient does not die during the first $i - 1$ time intervals and thus survives at least until the beginning of the $i$th interval.
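As a small numerical illustration of the discrete hazard rate (2.3) and the survival function (2.4), the sketch below plugs in the nonparametric hazard estimate $s_i/n_i$ for the arm A counts of Table 1. The variable names are hypothetical, and the counts are typed in from Table 1.

```python
import numpy as np

# Arm A counts from Table 1 (months 1..47).
n = np.array([51, 50, 48, 42, 40, 32, 25, 24, 21, 19,
              16, 15, 15, 15, 12, 11, 11, 11, 9, 9]
             + [7] * 17 + [5] + [4] * 3 + [3] * 4 + [2] * 2, dtype=float)
s = np.array([1, 2, 5, 2, 8, 7, 0, 3, 2, 2,
              0, 0, 0, 3, 1, 0, 0, 1, 0, 2]
             + [0] * 16 + [1, 1] + [0] * 8 + [1], dtype=float)

# Nonparametric discrete hazard: observed deaths over number at risk.
h = s / n

# Survival to the END of month i: product of (1 - h_j) over j <= i.
G_end = np.cumprod(1.0 - h)

# Survival to the BEGINNING of month i, as in (2.4): product over j < i.
G_begin = np.concatenate(([1.0], G_end[:-1]))

for month in (3, 5, 9, 47):
    print(month, round(float(G_end[month - 1]), 3))
```

The end-of-month values printed for months 3 and 47 round to .843 and .063.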
Figure 4. Parametric Versus Nonparametric Survival Estimates, Arm B. [Plot omitted; horizontal axis: Months, 0 to 80; vertical axis: 0.0 to 1.0.] Life-table estimate for arm B (jagged line) is compared with partial logistic regression based on cubic-linear spline joined at 11 months [see (2.9)].

3. MAXIMUM LIKELIHOOD ESTIMATES AND STANDARD ERRORS

This section discusses calculation of maximum likelihood estimates and their standard errors, for partial logistic regression models. Using the arm A data of Table 1 as an example, we show how the parametric survival estimates approach the life-table curves and how their estimated standard errors approach those given by Greenwood's formula, as the parametric models become more complicated. The additional theory required for models involving join-point estimation, such as the cubic-linear spline (2.9), is discussed in Section 4.

Suppose, then, that we have $s_i \mid n_i \sim \mathrm{Bi}(n_i, h_i)$ as in (2.2), where the $n_i$ are considered fixed at their observed values, and the $s_i$ are taken to be independently distributed, given the $n_i$. (The independence assumption is further discussed in Remark A.) Also, assume that the logistic parameter $\lambda_i = \log[h_i/(1 - h_i)]$ follows the linear logistic model $\lambda_i = x_i \alpha$ as in (2.8), so $h_i = [1 + \exp(-\lambda_i)]^{-1}$. We occasionally write $\lambda_{i,\alpha}$ and $h_{i,\alpha}$ to emphasize the dependence on $\alpha$.

These assumptions describe a standard logistic regression model (e.g., see Cox 1970), so we will quote without proof the usual results for maximum likelihood estimation in such models. Let $s = (s_1, s_2, \ldots, s_N)'$, $nh_\alpha = (n_1 h_{1,\alpha}, n_2 h_{2,\alpha}, \ldots, n_N h_{N,\alpha})'$, and $X$ equal the $N \times p$ matrix having vector $x_i$ of (2.8) as its $i$th row. Then the $p$-dimensional score vector $\dot{l}_\alpha = (\ldots, (\partial/\partial\alpha_j) \log f_\alpha(s), \ldots)'$ is

$$\dot{l}_\alpha = X'(s - nh_\alpha). \tag{3.1}$$

The MLE of $\alpha$ is that $\hat\alpha$ that makes (3.1) equal 0.

The $p \times p$ second-derivative matrix $-\ddot{l}_\alpha$, with $jl$th element $-(\partial^2/\partial\alpha_j \partial\alpha_l) \log f_\alpha(s)$, is given by

$$-\ddot{l}_\alpha = X' \, \mathrm{diag}(n_i V_{i,\alpha}) \, X. \tag{3.2}$$

Here $V_{i,\alpha} \equiv h_{i,\alpha}(1 - h_{i,\alpha})$, and $\mathrm{diag}(n_i V_{i,\alpha})$ is the $N \times N$ diagonal matrix with diagonal elements $n_i V_{i,\alpha}$. The expected Fisher information matrix for $\alpha$, $\mathcal{I}_\alpha = E\{\dot{l}_\alpha(s)\dot{l}_\alpha(s)'\} = E\{-\ddot{l}_\alpha\}$, also equals $X' \, \mathrm{diag}(n_i V_{i,\alpha}) \, X$. The observed Fisher information matrix is defined to be $\hat{\mathcal{I}} = -\ddot{l}_{\hat\alpha}$, or equivalently from (3.2) $\hat{\mathcal{I}} = \mathcal{I}_{\hat\alpha} = X' \, \mathrm{diag}(n_i V_{i,\hat\alpha}) \, X$.

Estimated standard errors (SE's) for quantities of interest such as $\hat\alpha$, $\hat{h}_i$, and $\hat{G}_i$ are obtained from familiar maximum likelihood calculations:

$$\mathrm{Cov}(\hat\alpha) \approx \hat{\mathcal{I}}^{-1},$$

$$\mathrm{SE}(h_{i,\hat\alpha}) = V_{i,\hat\alpha} \, [x_i \hat{\mathcal{I}}^{-1} x_i']^{1/2},$$

$$\mathrm{SE}(G_{i,\hat\alpha}) = G_{i,\hat\alpha} \left[ \Big( \sum_{j<i} h_{j,\hat\alpha} x_j \Big) \hat{\mathcal{I}}^{-1} \Big( \sum_{j<i} h_{j,\hat\alpha} x_j \Big)' \right]^{1/2}. \tag{3.3}$$

Here $G_{i,\hat\alpha} = \prod_{j<i} (1 - h_{j,\hat\alpha})$. We usually use the shorter notation $\hat{h}_i = h_{i,\hat\alpha}$, $\hat{G}_i = G_{i,\hat\alpha}$.

Table 3 gives estimated hazard rates and their standard errors for three conditional logistic regressions (2.8), fit to the Table 1 arm A data: a linear model $x_i = (1, t_i)$, the cubic model (2.7), and the cubic-linear spline (2.9). (A
Table 3. Estimated Hazard Rates and Their Standard Errors at Selected Time Points, for Table 1 Arm A Data
Month  Linear  Cubic  Cubic-linear  Life table  LTSM  Linear  Cubic  Cubic-linear  LTSM
Table 5. Estimated Survival Functions and Standard Errors

              Estimated survival                      Estimated standard errors for log survival
Month  Linear  Cubic  Cubic-linear  Life table    Linear  Cubic  Cubic-linear  Life table
  1     .910   .947      .985          .980        .019   .019      .031          .033
  3     .759   .819      .860          .843        .054   .055      .059          .065
  5     .640   .677      .642          .642        .084   .086      .090          .095
  7     .545   .543      .483          .501        .109   .112      .129          .132
  9     .469   .433      .402          .397        .131   .143      .155          .168
 11     .407   .350      .359          .355        .150   .177      .186          .202
 15     .313   .250      .295          .261        .184   .236      .256          .256
 20     .236   .197      .235          .184        .222   .284      .286          .299
 25     .185   .176      .191          .184        .262   .312      .309          .325
 30     .150   .166      .159          .184        .307   .325      .319          .338
 35     .125   .158      .134          .184        .358   .341      .324          .348
 40     .107   .147      .115          .126        .413   .360      .338          .370
 45     .094   .108      .100          .126        .470   .445      .436          .493
 47     .089   .065      .095          .063        .493   .718      .686          .726

NOTE: Left panel: estimated survival G at the end of the indicated months, for four different estimators applied to the arm A data, Table 1. Right panel: estimated standard errors for log{G}. The standard error for the cubic-linear spline includes a term for the choice of join (see Sec. 4). Note that the life-table estimate is only slightly more variable than the cubic-linear spline.
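The maximum likelihood calculations of this section, the score (3.1), the information matrix (3.2), and the standard errors (3.3), can be sketched numerically for the linear model $x_i = (1, t_i)$ fit to the arm A counts of Table 1. This is only an illustrative sketch: the Newton iteration with step halving is a standard choice assumed here, not a method stated in the article, and all function and variable names are hypothetical.

```python
import numpy as np

# Arm A counts from Table 1 (months 1..47).
n = np.array([51, 50, 48, 42, 40, 32, 25, 24, 21, 19,
              16, 15, 15, 15, 12, 11, 11, 11, 9, 9]
             + [7] * 17 + [5] + [4] * 3 + [3] * 4 + [2] * 2, dtype=float)
s = np.array([1, 2, 5, 2, 8, 7, 0, 3, 2, 2,
              0, 0, 0, 3, 1, 0, 0, 1, 0, 2]
             + [0] * 16 + [1, 1] + [0] * 8 + [1], dtype=float)
t = np.arange(1, len(n) + 1, dtype=float)
X = np.column_stack([np.ones_like(t), t])  # rows x_i = (1, t_i)


def hazards(alpha):
    """h_i = 1 / (1 + exp(-x_i alpha))."""
    return 1.0 / (1.0 + np.exp(-(X @ alpha)))


def loglik(alpha):
    """Binomial log likelihood, up to a constant not depending on alpha."""
    lam = X @ alpha
    return float(np.sum(s * lam - n * np.logaddexp(0.0, lam)))


def score_and_info(alpha):
    h = hazards(alpha)
    score = X.T @ (s - n * h)                         # equation (3.1)
    info = X.T @ ((n * h * (1.0 - h))[:, None] * X)   # equation (3.2)
    return score, info


# Newton iteration with step halving (an assumed, standard solver).
alpha = np.zeros(X.shape[1])
for _ in range(100):
    score, info = score_and_info(alpha)
    if np.max(np.abs(score)) < 1e-8:
        break
    step = np.linalg.solve(info, score)
    scale = 1.0
    while loglik(alpha + scale * step) < loglik(alpha) and scale > 1e-8:
        scale /= 2.0
    alpha = alpha + scale * step

score, info = score_and_info(alpha)
cov = np.linalg.inv(info)               # estimated covariance of alpha-hat
h_hat = hazards(alpha)
V = h_hat * (1.0 - h_hat)

# SE of each fitted hazard: V_i * sqrt(x_i cov x_i').
se_h = V * np.sqrt(np.einsum("ij,jk,ik->i", X, cov, X))

# SE of log G_i, using the sums over j < i of h_j x_j.
C = np.vstack([np.zeros(X.shape[1]),
               np.cumsum(h_hat[:, None] * X, axis=0)[:-1]])
se_logG = np.sqrt(np.einsum("ij,jk,ik->i", C, cov, C))

print("alpha-hat:", alpha)
print("first fitted hazards:", h_hat[:3])
```

Because the model contains an intercept, the first component of the score equation forces the fitted expected deaths, the sum of $n_i \hat{h}_i$, to equal the observed total of 42, which gives a quick check on convergence.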
Suppose we are interested in estimating the survival function $G_i$ rather than the hazard rate $h_i$. In this case, parametric methods offer less impressive improvements over the nonparametric life-table approach. Table 5 compares the estimated survival curves $\hat{G}_i$ (from the linear, cubic, and cubic-linear spline models) with the nonparametric life-table estimate (2.5). The right panel shows estimated standard errors for $\log\{\hat{G}_i\}$. The cubic-linear spline, which was our only parametric model giving a satisfactory fit to the data in Table 1, is only slightly less variable than the life-table estimate. In the notation of (3.3),

$$\mathrm{SE}(\log G_{i,\hat\alpha}) = \left[ \Big( \sum_{j<i} \hat{h}_j x_j \Big) \hat{\mathcal{I}}^{-1} \Big( \sum_{j<i} \hat{h}_j x_j \Big)' \right]^{1/2}. \tag{3.8}$$

It is easier to compare standard errors for $\log G$ than for $G$ itself, because the factor $G_{i,\hat\alpha}$ in the third equation of formula (3.3) is removed. To further sharpen the comparison, all of the standard errors in Table 5 were calculated assuming that the cubic model was true.

Formula (3.8) is closely related to Greenwood's formula for the variance of the life-table estimate. Suppose in (2.8) we take $p = N$ and $x_i = e_i$, the $N$-dimensional vector $(0, 0, \ldots, 1, 0, \ldots, 0)$ with 1 in the $i$th place $(i = 1, 2, \ldots, N)$. Then $X$ equals the $N \times N$ identity matrix, and (3.1) shows that the MLE $\hat{h}_i$ equals $s_i/n_i = \tilde{h}_i$, the nonparametric MLE. In this case, the MLE of the survival curve, $\hat{G}_i$, equals (2.5), the life-table estimate $\tilde{G}_i = \prod_{1 \le j < i} (1 - \tilde{h}_j)$. The observed information matrix $\hat{\mathcal{I}} = X' \, \mathrm{diag}(n_i V_{i,\hat\alpha}) \, X$ equals $\mathrm{diag}(n_i \tilde{h}_i (1 - \tilde{h}_i))$, so (3.8) gives

$$\mathrm{SE}(\log \tilde{G}_i) = \left[ \sum_{j<i} \frac{s_j}{n_j (n_j - s_j)} \right]^{1/2}, \tag{3.9}$$

which is Greenwood's formula (see Miller 1981, p. 45).

This calculation (as well as common sense) leads us to expect that the variability of a survival curve $\hat{G}_i$, obtained from a $p$-parameter conditional logistic regression, approaches the variability of the life-table estimate $\tilde{G}_i$ as $p \to N$. What is surprising in Table 5 is how quickly the approach takes place. Even the cubic model, with only $p = 4$ parameters, has barely 10%-15% smaller standard errors than $\tilde{G}_i$. On the other hand, parametric models provide much greater improvements when estimating the hazard rate, as the theorem (3.7) shows.

Remark A. The independence assumption (2.2) cannot be literally true. For example, if there is no censoring, $s_i = n_i - n_{i+1}$. In this case, the sequence $s_1, s_2, \ldots$ is completely determined by the sequence $n_1, n_2, \ldots$, in contradiction to (2.2).

Nevertheless, calculations based on (2.2) give reasonable answers under reasonable assumptions. Using the notation of (2.1), let $v_i' = (s_1, s_1', s_2, s_2', \ldots, s_{i-1}', s_i)$ and $v_i = (s_1, s_1', s_2, s_2', \ldots, s_{i-1}, s_{i-1}')$. Starting with $n_1$ patients at risk at the beginning of observation (which we take to be a constant, fixed at its observed value), $v_i$ is the history of deaths and losses for the first $i - 1$ time intervals; $v_i'$ is the same history extended to include $s_i$. Here we follow the usual convention that the $s_i'$ losses in any one time interval occur after the $s_i$ deaths. Note that $n_2 = n_1 - s_1 - s_1'$, $n_3 = n_2 - s_2 - s_2'$, and so forth, so there is no need to indicate $n_2, n_3, \ldots, n_i$ in $v_i$ or $v_i'$.

We assume that $s_i$, given $v_i$, has a $\mathrm{Bi}(n_i, h_{i,\alpha})$ distribution, where $h_{i,\alpha} = [1 + \exp(-x_i \alpha)]^{-1}$, as in (2.8), and that $s_i'$, given $v_i'$, has a distribution depending on a nuisance parameter vector $\xi$, but not on $\alpha$:

$$f_{\alpha,\xi}(s_1, s_1', s_2, s_2', \ldots, s_N, s_N') = \left[ \binom{n_1}{s_1} h_{1,\alpha}^{s_1} (1 - h_{1,\alpha})^{n_1 - s_1} \right] f_\xi(s_1' \mid v_1') \times \left[ \binom{n_2}{s_2} h_{2,\alpha}^{s_2} (1 - h_{2,\alpha})^{n_2 - s_2} \right] f_\xi(s_2' \mid v_2') \times \cdots \times \left[ \binom{n_N}{s_N} h_{N,\alpha}^{s_N} (1 - h_{N,\alpha})^{n_N - s_N} \right] f_\xi(s_N' \mid v_N'). \tag{3.10}$$
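Greenwood's formula (3.9) is easy to evaluate directly. The sketch below, with hypothetical variable names, computes it for the arm A counts of Table 1 and checks that it agrees with the saturated-model form of (3.8), in which each term is $\tilde{h}_j / [n_j (1 - \tilde{h}_j)]$. Cumulative sums are taken through month i, so the values refer to survival to the end of month i. They are not expected to reproduce the life-table column of Table 5, whose standard errors were calculated assuming that the cubic model was true.

```python
import numpy as np

# Arm A counts from Table 1 (months 1..47).
n = np.array([51, 50, 48, 42, 40, 32, 25, 24, 21, 19,
              16, 15, 15, 15, 12, 11, 11, 11, 9, 9]
             + [7] * 17 + [5] + [4] * 3 + [3] * 4 + [2] * 2, dtype=float)
s = np.array([1, 2, 5, 2, 8, 7, 0, 3, 2, 2,
              0, 0, 0, 3, 1, 0, 0, 1, 0, 2]
             + [0] * 16 + [1, 1] + [0] * 8 + [1], dtype=float)

h = s / n  # nonparametric hazard estimate s_i / n_i

# Log of the life-table survival estimate at the end of each month.
log_G_end = np.cumsum(np.log1p(-h))

# Greenwood's formula: sqrt of the running sum of s_j / (n_j (n_j - s_j)).
greenwood_se = np.sqrt(np.cumsum(s / (n * (n - s))))

# Same quantity written with hazards: sum of h_j / (n_j (1 - h_j)).
via_hazards = np.sqrt(np.cumsum(h / (n * (1.0 - h))))

for month in (1, 5, 20, 47):
    i = month - 1
    print(month, round(float(np.exp(log_G_end[i])), 3),
          round(float(greenwood_se[i]), 3))
```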
= log(2), i = 37, ..., 61. (3.17)

These $a_i$ compensate for the differing lengths of the time intervals in Table 2. [See Eq. (5.3). The fitted hazards $\hat{h}_i$ for arm B were adjusted to one-month intervals, for example, by multiplying $\hat{h}_i$ by 2 for $i = 1, \ldots, 18$, in order to make the plotted hazard rates in Fig. 2 comparable.]

4. THE CUBIC-LINEAR SPLINE MODEL

This section discusses the maximum likelihood estimation of the join point in the cubic-linear spline model (2.9). We are particularly interested in assessing the increased standard error of quantities of interest, such as the hazard rate differences between arms A and B of the cancer study, due to the estimation of the join. The discussion here is very brief. Efron (1986, sec. 4) gave more details.

Tables 6 and 7 numerically summarize the rather technical results of Efron (1986). In Table 6 we see the MLE's of the hazard rates for arms A and B, based on the cubic

Month    Left panel (logit scale)    Right panel (hazard scale)
  1      1.31    1.41    1.03        .011    .012    1.00
  3       .43     .45    1.00        .022    .030    1.00
  5       .21     .37    1.00        .017    .044    1.00
  7       .39     .36    1.02        .032    .033    1.00
  9       .68     .45    1.04        .035    .024    1.00
 11       .80     .54    1.00        .028    .020    1.01
 15       .83     .48    1.00        .026    .016    1.02
 20       .85     .42    1.01        .024    .013    1.01
 25       .87     .44    1.01        .022    .013    1.00
 35       .90     .63    1.01        .019    .016    1.01
 45       .95     .92    1.01        .016    .020    1.02

NOTE: Left panel: the logit scale. Right panel: the hazard scale. The penalty ratio is now very small, so estimating the join point from the data adds little to the standard error.

in the pictured differences of the two hazards than their individual standard errors would suggest. (It is important to note that this statement depends on choosing the same join for both estimated hazard rates.)

The results in Table 6 are based on a generalization of model (2.8):

$$\lambda_{i,\alpha,\phi} = x_i(\phi)\alpha, \quad i = 1, 2, \ldots, N, \tag{4.1}$$
Table 8. Deviances as a Function of the Join Point $\phi$ for Arms A and B

                          Join point
Deviance     9.5       10        11        12        13
dev A      50.928    48.016    47.519    47.370*   47.669
dev B      34.011*   34.083    34.557    35.205    35.889
Total      84.939    82.099    82.076*   82.575    83.558

NOTE: The total deviance is minimized for $\phi = 11$.
* Minima.

It is easy to see the results of discretizing this continuous situation. Suppose that the $i$th discrete time interval has center point $t_i$ and length $\Delta_i$. Then, the discrete density $g_{i,\alpha} = \int_{t_i - \Delta_i/2}^{t_i + \Delta_i/2} g_\alpha(t) \, dt$ is obtained by a standard Taylor series argument:

$$g_{i,\alpha} = g_\alpha(t_i)\Delta_i + O(\Delta_i^3). \tag{5.1}$$

Similarly, the discrete survival function $G_{i,\alpha} = \sum_{j \ge i} g_{j,\alpha}$ and discrete hazard rate $h_{i,\alpha} = g_{i,\alpha}/G_{i,\alpha}$ are given by

$$G_{i,\alpha} = G_\alpha(t_i) + [g_\alpha(t_i)/2]\,\Delta_i + O(\Delta_i^2)$$
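The orders of the remainder terms in the Taylor expansions for the discrete density and discrete survival function can be checked numerically. The sketch below assumes, purely for illustration, a unit exponential lifetime (density and survival function both equal to exp(-t)); the function names are hypothetical. Halving the interval length should cut the density error by a factor near 8 (third order) and the survival error by a factor near 4 (second order).

```python
import math


def density(t):
    """Density g(t) of a unit exponential lifetime (illustrative assumption)."""
    return math.exp(-t)


def survival(t):
    """Survival function G(t) = Pr{T >= t} of the same lifetime."""
    return math.exp(-t)


def discretization_errors(t_mid, delta):
    """Errors of the leading-term approximations on an interval of length
    delta centered at t_mid."""
    left, right = t_mid - delta / 2.0, t_mid + delta / 2.0
    exact_g = survival(left) - survival(right)   # integral of g over the interval
    exact_G = survival(left)                     # Pr{T >= start of interval}
    err_g = abs(exact_g - density(t_mid) * delta)
    err_G = abs(exact_G - (survival(t_mid) + 0.5 * density(t_mid) * delta))
    return err_g, err_G


coarse = discretization_errors(1.0, 0.2)
fine = discretization_errors(1.0, 0.1)
ratio_g = coarse[0] / fine[0]   # near 8 for an O(delta^3) remainder
ratio_G = coarse[1] / fine[1]   # near 4 for an O(delta^2) remainder
print(ratio_g, ratio_G)
```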
where $Z$ is a standard one-sided exponential. This last result assumes that $\alpha_2 > 0$, so $G_\alpha(t)$ approaches 0 as $t \to \infty$. If $\alpha_2 < 0$, there is positive probability that the lifetime $T$ is infinite:

$$P_\alpha\{T = \infty\} = e^{\theta}, \quad \theta \equiv [\exp(\alpha_1)]/\alpha_2 < 0. \tag{5.9}$$

With this understanding, (5.7) remains true as stated.

All of the discrete models considered previously have the potential for estimating $P_\alpha\{T = \infty\}$ to be positive. (This happened in both arms of the cancer study; see Remark G.) This is an advantage of our approach. In practice, it is often desirable to include the possibility of a positive survival fraction, but this can be clumsy to do with the usual parametric models for lifetime distributions. Miller (1981, sec. 2.4) gave a brief discussion.

More complicated examples of model (5.4), such as $\log h_\alpha(t) = \alpha_1 + \alpha_2 t + \alpha_3 t^2$, do not yield simple expressions for the cdf or density. This is unimportant, since the estimation of parameters depends only on the log hazard rate, which is particularly easy to use for model (5.4).

The parameter vector $\alpha$ in model (5.4) is estimated as follows: Let $n(t)$ be the number of patients at risk just before time $t$. We assume that the occurrence of observed deaths is a Poisson point process, with intensity $n(t)h_\alpha(t) = n(t)e^{x(t)\alpha}$ at time $t$. This is the limiting process obtained from (2.2) by letting the discrete time intervals decrease to zero length (see Efron 1977). Suppose that out of all $n$ patients we observed $m$ deaths, at times, say, $T_1, T_2, \ldots, T_m$, with the other $n - m$ patients being lost to follow-up at various times during the study. Define $S = \sum_{i=1}^{m} x(T_i)'$. The score vector $\dot{l}_\alpha$ for the Poisson process is

$$\dot{l}_\alpha = S - \int_0^\infty n(t) x(t)' e^{x(t)\alpha} \, dt, \tag{5.10}$$

so the MLE $\hat\alpha$ is given by

$$S = \int_0^\infty n(t) x(t)' e^{x(t)\hat\alpha} \, dt. \tag{5.11}$$

The observed Fisher information matrix for $\alpha$ is

$$\hat{\mathcal{I}} = \int_0^\infty n(t) x(t)' x(t) e^{x(t)\hat\alpha} \, dt. \tag{5.12}$$

It is easy to see that (5.10) and (5.12) are simply the continuous analogs of (3.1) and (3.2). Conversely, one can look at (3.1) and (3.2) as convenient summation approximations to the integrals in (5.10)-(5.12). The connection between the discrete and continuous cases was drawn more carefully in Efron (1977), including a derivation of (5.10)-(5.12).

Remark G. A continuous cubic-linear spline model $\log h_\alpha(t) = \alpha_1 + \alpha_2 t + \alpha_3 (t - \phi)_-^2 + \alpha_4 (t - \phi)_-^3$ has

$$\log h_\alpha(t_1) = \log h_\alpha(t_0) + \alpha_2 (t_1 - t_0) \tag{5.13}$$

for values of $t_1$ and $t_0$ greater than the join point $\phi$. Calculations such as those for the Gompertz distribution (5.7) give

$$G_\alpha(t_1) = G_\alpha(t_0) \exp\left\{ -\frac{h_\alpha(t_0)}{\alpha_2} \left[ e^{\alpha_2 (t_1 - t_0)} - 1 \right] \right\}, \quad t_1 \ge t_0 > \phi. \tag{5.14}$$

If $\alpha_2 < 0$, then $P_\alpha\{T = \infty\}$ is positive and can be found by letting $t_1 \to \infty$ in (5.14):

$$P_\alpha\{T = \infty\} = G_\alpha(t_0) \exp[-h_\alpha(t_0)/(-\alpha_2)]. \tag{5.15}$$

For both arms of the cancer study, the MLE $\hat\alpha_2$ was negative. Formula (5.15), with $t_0 = 47$ months for arm A and $t_0 = 77$ months for arm B, gives the following estimates for the survival fractions:

$$P_{\hat\alpha_A}\{T = \infty\} = .025, \quad P_{\hat\alpha_B}\{T = \infty\} = .189. \tag{5.16}$$

Of course, estimates such as (5.16) should be interpreted with caution, since they represent heroic extrapolations beyond the observed data.

Remark H. This article concentrates on the one-sample situation, where all patients have the same survival curve $G_\alpha(t)$. Model (5.4) and its discrete analog extend easily to the regression situation, where patient $j$'s survival depends on a time-varying vector $z_j(t)$ of observed covariates, say

$$h_j(t) = \exp[x(t)\alpha + z_j(t)\beta]. \tag{5.17}$$

Model (5.17), and in particular its connection with Cox's partial likelihood, or proportional hazards, model was examined in Efron (1977). It is shown there that the fully parametric model (5.17) will usually not improve much on the partial likelihood model $h_j(t) = h_0(t) \exp[z_j(t)\beta]$, with $h_0(t)$ completely unspecified, at least not for the estimation of $\beta$. On the other hand, (5.17) can be effective in actually estimating the hazards $h_j(t)$, rather than just comparing them as the partial likelihood model does.

Remark I. Suppose that in the continuous Poisson process situation (5.4), we discretize to situation (2.2). How much information is lost? For convenience assume that the continuous lifetime variate $T$ takes its values in the unit interval [0, 1], and that the discretization of the data is into $N$ equal subintervals, as in Table 1. Let $\mathcal{I}_\alpha(N)$ be the Fisher information matrix for $\alpha$ based on the discrete data (2.2) (taking the independence assumption literally), and let $\mathcal{I}_\alpha(\infty)$ be the Fisher information matrix based on the original continuous data. Then, as $N \to \infty$,

$$\mathcal{I}_\alpha(N) \approx \mathcal{I}_\alpha(\infty) - c/N^2, \tag{5.18}$$

where $c = (1/12) \int_0^1 x(t)' x(t) n(t) h(t) \, dt$. Here the function $n(t)$ is considered fixed at its observed value, even though it is random [like the $n_i$ in (2.2)].

Result (5.18) says that the information loss due to discretization goes to 0 very quickly as $N$ grows large. Various alternative discretizations were tried on the cancer study data, such as discretizing arm A into the same intervals used for arm B in Table 2, with almost imperceptible changes in the results.
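The extrapolation in Remark G can be sketched numerically. Beyond the join point, (5.13) says the log hazard is linear with slope $\alpha_2$, so the hazard is $h(t_0)\exp[\alpha_2(t - t_0)]$; integrating it from $t_0$ to $t_1$ gives the cumulative hazard $[h(t_0)/\alpha_2](e^{\alpha_2(t_1 - t_0)} - 1)$, and letting $t_1 \to \infty$ with $\alpha_2 < 0$ leaves the survival fraction $G(t_0)\exp[h(t_0)/\alpha_2]$. The code below is a minimal sketch of that calculation; the function names and the input values are hypothetical and are not the fitted values from the cancer study.

```python
import math


def survival_beyond(t1, t0, G_t0, h_t0, alpha2):
    """G(t1) for t1 >= t0 when log h(t) is linear with nonzero slope alpha2
    beyond t0."""
    cum_hazard = (h_t0 / alpha2) * math.expm1(alpha2 * (t1 - t0))
    return G_t0 * math.exp(-cum_hazard)


def survival_fraction(G_t0, h_t0, alpha2):
    """Limit of G(t1) as t1 grows; positive only when alpha2 < 0."""
    if alpha2 >= 0:
        return 0.0
    return G_t0 * math.exp(h_t0 / alpha2)


def survival_by_quadrature(t1, t0, G_t0, h_t0, alpha2, steps=100_000):
    """Midpoint-rule check: integrate the hazard numerically from t0 to t1."""
    width = (t1 - t0) / steps
    total = 0.0
    for k in range(steps):
        t_mid = t0 + (k + 0.5) * width
        total += h_t0 * math.exp(alpha2 * (t_mid - t0)) * width
    return G_t0 * math.exp(-total)


# Hypothetical inputs, chosen only to exercise the formulas.
G0, h0, a2, t0 = 0.10, 0.02, -0.015, 47.0

closed = survival_beyond(147.0, t0, G0, h0, a2)
numeric = survival_by_quadrature(147.0, t0, G0, h0, a2)
fraction = survival_fraction(G0, h0, a2)
print(closed, numeric, fraction)
```

With these made-up inputs the closed form and the quadrature agree closely, and the survival curve levels off at a positive fraction because the hazard decays fast enough for the cumulative hazard to stay finite.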