1. INTRODUCTION
THE PROCEDURE for estimating a survival curve from data given in the life table
is well known. BERKSON and GAGE [1] and CUTLER and EDERER [2] give an
actuarial or life table method of estimating a survival curve. KAPLAN and MEIER
[3] give a maximum likelihood estimate called the product-limit estimate; the life
table method is nearly equivalent to this in large samples.
The procedure for estimating other functions of survival time is not so well
known, especially to applied research workers in the life sciences. Any distribution
of survival times can be characterized by three equivalent functions which may be
defined in words as follows :
Survivorship function, P(t): probability that an individual survives longer than t.
The survivorship function is often called a survival curve.
Hazard function, λ(t): probability that an individual dies in a short interval of
time, given survival to time t. The hazard function often is termed the force of
mortality or age-specific failure rate.
Probability density function, f(t): limit of the probability an individual dies in
the short interval t to (t + Δt) per unit width (Δt).
These functions are mathematically equivalent, as explained in the next section,
but each illustrates a different aspect of the data.
This paper has several purposes: (1) to provide methods for estimating the
three functions of survival time from the life table; (2) to give a procedure for
estimating median remaining life-time from life table data; and (3) to indicate how
a plot of the hazard function can be utilized to distinguish among theoretical forms
of survival time distribution. In particular, it will be seen that the log-normal
distribution implies a special form of hazard function that often is not fulfilled in
applications.
†This investigation was partially supported by PHS research grants FR00258 and FR00254
from the Division of Research Facilities and Resources to The University of Texas at
Houston and to the Common Research Computer Facility, Texas Medical Center, Houston,
Texas.
630 EDMUND A. GEHAN
If some survival data are available and the form of the theoretical distribution
is unknown, it is suggested that the various survival functions be estimated and a
theoretical distribution chosen that gives a good fit to the three survival functions.
The estimation of parameters of some survival distributions from estimates of
the hazard function is considered in GEHAN and SIDDIQUI [4].
There are often right-censored observations present in a set of survival data;
that is, some individuals were alive when last seen but have been
lost to follow-up or were withdrawn alive at the end of the period of study.
Estimation procedures will be considered only if they are appropriate when censored
data are present. Of course, the procedures given will also be valid when there is
complete ascertainment of survival times.
The term “survival time” is used throughout the paper, though it would be
equally proper to use length of response, time to recurrence of disease, time to
development of tumor or some other function of response time. In the following
sections, we give the relationship among the survival functions, a method for esti-
mating the various functions from the life table, and a method for estimating
median remaining life-time. Also, the hazard functions are given for some
theoretical distributions (exponential, Weibull, log-normal, Gamma and Gompertz).
Finally, estimates of the survival functions are obtained for an example.
Since T represents a survival time and is necessarily non-negative, f(t) is zero for
negative t and defined by the above definition for t ≥ 0. It can be shown that
$$f(t) = -\frac{dP(t)}{dt}.$$
Hence, given a form for the s.f., the p.d.f. is obtained by differentiation. Also,
$$\lambda(t) = \frac{f(t)}{P(t)} \qquad (2.3)$$
by definition and λ(t) is obtained from (2.3). Finally, if λ(t) is given, it can be
shown that
$$P(t) = \exp\left[-\int_0^t \lambda(u)\,du\right] \qquad (2.4)$$
and f(t) is then obtained from (2.3). Thus, given any of the three survival functions,
the other two can be derived.
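As a rough numerical illustration of these relationships, the sketch below starts from an assumed Weibull-type hazard a·b·t^(b−1), recovers the survivorship function from (2.4) by numerical integration, and then obtains the density as hazard times survivorship. The parameter values a and b and all function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def hazard(t, a=0.05, b=1.5):
    """Assumed Weibull-type hazard a*b*t**(b-1); a and b are illustrative."""
    return a * b * t ** (b - 1)

def survivorship_from_hazard(t, steps=10000):
    """P(t) = exp(-integral of the hazard from 0 to t), via the midpoint rule."""
    width = t / steps
    integral = sum(hazard((k + 0.5) * width) for k in range(steps)) * width
    return math.exp(-integral)

def density_from_hazard(t):
    """f(t) = hazard(t) * P(t), i.e. relation (2.3) rearranged."""
    return hazard(t) * survivorship_from_hazard(t)

t = 4.0
p_numeric = survivorship_from_hazard(t)
p_closed = math.exp(-0.05 * t ** 1.5)  # closed form for this assumed hazard
f_numeric = density_from_hazard(t)

# f(t) should also equal -dP/dt; check with a central difference.
eps = 1e-4
f_diff = -(survivorship_from_hazard(t + eps)
           - survivorship_from_hazard(t - eps)) / (2 * eps)
```

The two routes to f(t), through the hazard and through differentiation of P(t), should agree to within the numerical error of the integration and differencing steps.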
n′_i = n′_{i−1} − l_{i−1} − w_{i−1} − d_{i−1}. In hand calculation, these entries are usually determined
after the entries l_i, w_i and d_i have been made (i = 1, ..., s). Then,
n′_i = n′_{i+1} + l_i + w_i + d_i (i = s, ..., 1) and, as a check, n′_1 should equal the total sample
size.
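The hand-calculation check can be sketched in a few lines. The fragment below builds the numbers entering each interval backwards from the last interval, assuming that no one remains after the last interval (so the count carried into the recursion starts at zero). The function name and the three-interval table are hypothetical.

```python
def entering_counts(lost, withdrawn, died):
    """Number entering each interval, built backwards from the last interval.

    Uses n'_i = n'_{i+1} + l_i + w_i + d_i, with the assumption that
    nobody remains after the last interval (n'_{s+1} = 0).
    """
    entering = [0] * len(died)
    carried = 0
    for i in range(len(died) - 1, -1, -1):
        carried += lost[i] + withdrawn[i] + died[i]
        entering[i] = carried
    return entering

# Hypothetical three-interval table (illustrative numbers only).
lost = [2, 1, 0]
withdrawn = [0, 3, 4]
died = [5, 4, 1]

entering = entering_counts(lost, withdrawn, died)
total_sample = sum(lost) + sum(withdrawn) + sum(died)
```

Here the first entry of `entering` plays the role of n′_1 and should equal `total_sample`.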
The individuals whose survival experience is studied could be obtained from:
(1) a cohort study, i.e., a group of individuals studied from some zero point, say
time of diagnosis, to death, or (2) a series of cohorts analyzed at a particular date;
for example, the cohorts could be all cases of a disease diagnosed in 1960, 1961,
. . . and the time of analysis could be 1968, or (3) a clinical trial where each patient
is observed from time of start of treatment to conclusion of study. In each case, the
individual’s survival time is measured from his own zero point. Examples of zero
point are time of start of treatment, time of diagnosis and time of first symptoms.
(e) Number lost to follow-up (l_i). This is the number of individuals who are
lost to observation for some reason and whose survival status thus became unknown
in the ith interval (i = 1, ..., s). Individuals may be lost to observation if they
move, or fail to return for treatment, etc. Every attempt should be made to trace
such cases. See (f) for the assumption made about the survival experience of such
cases.
(f) Number withdrawn alive (w_i). Individuals withdrawn alive are those known
to be alive at the closing date of the study. Such observations arise in clinical
trials and cohort studies. In both cases, individuals are exposed to the risk of
death for varying periods of time depending on their date of entrance on study.
An individual entering study at a point in chronological time near the closing date
will have a period of exposure to the risk of failure shorter than an individual who
enters study substantially in advance of the closing date. The time to withdrawal
alive for an individual alive at the closing date is the length of chronological
time from his entrance into study to the closing date of the study. Thus, w_i is the
number of individuals withdrawn alive in the ith interval.
The assumption made for the life table calculations is that the survival experience
after the date of last contact of those lost to follow-up (l_i) and withdrawn alive
(w_i) is similar to that of the individuals who remain under observation. This
assumption seems reasonable for individuals withdrawn alive. However, as CUTLER
and EDERER [2] explain, the survival experience of lost individuals may be better,
the same, or worse than individuals continuing under observation. Consequently,
it is most important to keep the percentage of individuals lost as low as possible.
(g) Number exposed to risk (n_i). This number is defined as n_i = n′_i − 0.5(l_i + w_i),
(i = 1, ..., s). If there are no losses or withdrawals, then n_i = n′_i. Individuals lost
or withdrawn in an interval are credited with being exposed to the risk of failure
for one-half the interval. This is a basic assumption of the life table and should
be correct on the average. The presumption is that the times to loss or withdrawal
are approximately uniformly distributed throughout the interval.
(h) Number dying (d_i). This is the number dying in the ith interval. The time
to death for each individual is measured from his own zero point.
(i) Conditional proportion dying (q̂_i). This is given by q̂_i = d_i/n_i, (i = 1, ..., s − 1),
q̂_s = 1. The proportion is conditional since it is the probability of death in the
ith interval, given exposure to the risk of death in the ith interval.
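A minimal sketch of items (g) and (i) follows. The counts 1170 entering, 1116.5 exposed and 125 deaths are taken from one row of Table 2; the total of 107 lost or withdrawn is inferred from the difference between those two counts, and its attribution entirely to losses is an assumption made only for this example. The function names are hypothetical.

```python
def exposed_to_risk(entering, lost, withdrawn):
    """n_i = n'_i - 0.5 * (l_i + w_i): half-interval credit for those lost or withdrawn."""
    return entering - 0.5 * (lost + withdrawn)

def conditional_proportion_dying(died, exposed):
    """q_i = d_i / n_i."""
    return died / exposed

# 1170 entering and 125 deaths appear in Table 2; 107 lost is inferred
# (assumed split: all lost to follow-up, none withdrawn alive).
n_exposed = exposed_to_risk(1170, 107, 0)
q_hat = conditional_proportion_dying(125, n_exposed)
```

With these inputs the exposed number reproduces the tabulated 1116.5 and the conditional proportion dying rounds to the tabulated 0.1120.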
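Thus, f̂(x_mi) is the estimated probability of dying in the ith interval per unit width;
this is the definition of the p.d.f.
(m) Hazard function [λ̂(x_mi)]. The estimate of the h.f. for each interval is
$$\hat{\lambda}(x_{mi}) = \frac{d_i}{h_i\,(n_i - d_i/2)}, \qquad (3.1)$$
where h_i is the width of the ith interval.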
(or P̂_i), f̂(x_mi) and λ̂(x_mi).* Each survival function illustrates a different aspect of
the data. The survivorship function, P̂(x_i), is useful in obtaining median and other
percentile estimates of survival time. The median, x̃, is that value such that
P̂(x̃) = 0.5. Other percentiles of survival time are obtained similarly. The survivorship
function is also useful for estimating the percentage of patients surviving
longer than x time units. A common example is the proportion of patients with
a disease surviving longer than 5 yr. A plot of the hazard function, λ̂(x_mi),
characterizes the ageing of the population. Does the risk of death per unit time
increase, remain about the same, decrease or describe a more complex course? In
a later section, it will be shown that plotting the hazard function is helpful in making
a choice of theoretical distribution. The probability density function, f̂(x_mi), can
be used to estimate the proportion of deaths taking place in any interval of time.
It is also useful for estimating peaks of high frequency of death.
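The three estimates can be computed together in one pass over the life table. The sketch below assumes the usual actuarial working forms: the survivorship estimate starts at 1 and is multiplied by (1 − q̂_i) at each interval, the density estimate is the survivorship estimate times q̂_i divided by the interval width, and the hazard estimate is d_i divided by the width times (n_i − d_i/2). These forms, the function name and the three-interval table are assumptions made for illustration.

```python
def life_table_functions(exposed, died, widths):
    """Return lists (survivorship, density, hazard), one entry per interval.

    survivorship[i] is the estimate at the START of interval i;
    density[i] and hazard[i] refer to the midpoint of interval i.
    """
    surv, dens, haz = [], [], []
    p_cum = 1.0  # everyone is alive at the start of the first interval
    for n_i, d_i, h_i in zip(exposed, died, widths):
        q_i = d_i / n_i
        surv.append(p_cum)
        dens.append(p_cum * q_i / h_i)
        haz.append(d_i / (h_i * (n_i - d_i / 2.0)))
        p_cum *= 1.0 - q_i
    return surv, dens, haz

# Hypothetical three-interval table with unit-width intervals.
exposed = [100.0, 70.0, 40.0]
died = [20, 14, 10]
widths = [1.0, 1.0, 1.0]

surv, dens, haz = life_table_functions(exposed, died, widths)
```

Plotting `haz` against the interval midpoints gives the kind of hazard plot discussed above; `surv` gives the survival curve.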
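The variances of the estimates of the survival functions in the ith interval are:
$$\mathrm{Var}[\hat{P}(x_i)] \approx [\hat{P}(x_i)]^2 \sum_{j=1}^{i-1} \frac{\hat{q}_j}{n_j \hat{p}_j}, \qquad (3.3)$$
$$\mathrm{Var}[\hat{f}(x_{mi})] \approx \frac{[\hat{P}(x_i)\,\hat{q}_i]^2}{h_i^2} \left\{ \sum_{j=1}^{i-1} \frac{\hat{q}_j}{n_j \hat{p}_j} + \frac{\hat{p}_i}{n_i \hat{q}_i} \right\}, \qquad (3.4)$$
and
$$\mathrm{Var}[\hat{\lambda}(x_{mi})] \approx \frac{[\hat{\lambda}(x_{mi})]^2}{n_i \hat{q}_i} \left\{ 1 - \left[ \frac{h_i \hat{\lambda}(x_{mi})}{2} \right]^2 \right\}, \qquad (3.5)$$
where p̂_j = 1 − q̂_j.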
All the formulae are large-sample approximations. The formula for
Var [P̂(x_i)] is well-known, having been given first by GREENWOOD [11]. The other
two variance formulae do not seem to have been given before and are derived in the
appendix. These may be used to obtain approximate confidence limits for the
various survival functions.
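A sketch of how the three large-sample variances might be computed is given below. It assumes the Greenwood form for the survivorship estimate, the corresponding product-form approximation for the density estimate, and the form [λ̂²/(n_i q̂_i)]{1 − (h_i λ̂/2)²} for the hazard estimate; these working forms, the function name and the example table are assumptions for illustration, and intervals with q̂_i equal to 0 or 1 are not handled.

```python
def survival_variances(exposed, died, widths):
    """Return a list of (var_surv, var_dens, var_haz) tuples, one per interval.

    Assumes 0 < q_i < 1 in every interval supplied.
    """
    out = []
    p_cum = 1.0          # survivorship estimate at the start of the interval
    greenwood_sum = 0.0  # running sum of q_j / (n_j * p_j) over earlier intervals
    for n_i, d_i, h_i in zip(exposed, died, widths):
        q_i = d_i / n_i
        p_i = 1.0 - q_i
        f_i = p_cum * q_i / h_i
        lam_i = d_i / (h_i * (n_i - d_i / 2.0))
        var_surv = p_cum ** 2 * greenwood_sum
        var_dens = f_i ** 2 * (greenwood_sum + p_i / (n_i * q_i))
        var_haz = lam_i ** 2 / (n_i * q_i) * (1.0 - (h_i * lam_i / 2.0) ** 2)
        out.append((var_surv, var_dens, var_haz))
        greenwood_sum += q_i / (n_i * p_i)
        p_cum *= p_i
    return out

# Hypothetical three-interval table with unit-width intervals.
variances = survival_variances([100.0, 70.0, 40.0], [20, 14, 10], [1.0, 1.0, 1.0])
```

Approximate confidence limits can then be formed as the estimate plus or minus a normal multiplier times the square root of the corresponding variance.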
*A computer program is available to calculate the various survival functions and their
variances, given w_i, l_i, d_i and n′_i (i = 1, 2, ..., s). A copy may be obtained by writing the
author. Thanks to Mrs. Jane E. Putman for writing the program.
accomplished when P̂(x) is less than 0.5 for some x. The median is recommended
in elementary statistics texts for characterizing distributions skewed to the right. It
is possible to estimate it routinely from the life table. If the median and the
expectation of life are estimable, both could be estimated. When the mean is sub-
stantially larger than the median life-time, it is an indication that there is some
proportion of long term survivors.
The median remaining life-time at time x_i, (i = 1, ..., s − 1) is designated r̃_i and
defined as:
$$\tilde{r}_i = (x_j - x_i) + h_j \, \frac{\hat{P}_j - \hat{P}_i/2}{\hat{P}_j - \hat{P}_{j+1}}.$$
Here, P̂_j is the estimated proportion surviving beyond the lower limit of the class
interval containing the median. For the median to be defined, P̂_{j+1} must be less
than P̂_i/2.
The variance of this estimate is approximately
$$\mathrm{Var}(\tilde{r}_i) \approx \frac{\hat{P}_i^{\,2}}{4\, n_i\, [\hat{f}(x_{mj})]^2}, \qquad (4.1)$$
where f̂(x_mj) is the estimate of the probability density function in the interval
containing the median. This is an extension of the formula derived in KENDALL
and STUART [13] to the case of conditional probability density functions. The computer
program noted will also calculate estimates of r̃_i and √[Var (r̃_i)].
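The interpolation can be sketched as follows. The fragment assumes linear interpolation within the interval whose survivorship estimates bracket half of the value at the starting time, and a variance of the form P̂_i² / (4 n_i f̂²) with f̂ taken from that interval; these forms, the function name and the small table are assumptions for illustration.

```python
def median_remaining_lifetime(i, starts, widths, surv, exposed, dens):
    """Return (estimate, approximate variance) at the start of interval i (0-based).

    surv[k] is the survivorship estimate at starts[k]; surv has one more
    entry than the number of intervals so the last interval can be bracketed.
    Returns None when the survivorship estimate never falls below surv[i] / 2.
    """
    half = surv[i] / 2.0
    for j in range(i, len(surv) - 1):
        if surv[j] >= half > surv[j + 1]:
            estimate = (starts[j] - starts[i]) + widths[j] * (surv[j] - half) / (surv[j] - surv[j + 1])
            variance = surv[i] ** 2 / (4.0 * exposed[i] * dens[j] ** 2)
            return estimate, variance
    return None

# Hypothetical three-interval table with unit-width intervals.
starts = [0.0, 1.0, 2.0, 3.0]
widths = [1.0, 1.0, 1.0]
surv = [1.0, 0.8, 0.64, 0.48]
exposed = [100.0, 70.0, 40.0]
dens = [0.2, 0.16, 0.16]

result = median_remaining_lifetime(0, starts, widths, surv, exposed, dens)
```

Calling the function with i = 0 gives the ordinary median life-time; larger i gives the median remaining life-time for those who have survived to the start of interval i.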
The hazard function characterizes the ageing or wearing out process. The
simplest hazard function is constant with time; in this case, the p.d.f. is the
exponential distribution. This means that there is no ageing and failure is a random
event. Though this is a very simple distribution, it has been found to fit many kinds
of data. ZELEN [14] discusses the application of exponential models in cancer
research; EPSTEIN [15] gives its role in industrial life-testing in one of many papers
on the distribution. The exponential distribution is not always applicable; see
BERG and ROBBINS [16] for an example in which the data fit the exponential distribution
over the short term but not the long term.
Survival distributions other than the exponential have hazard functions that vary
with time. A complete discussion of various survival distributions is given by
Cox [6] and BUCKLAND [7]. Some of the results pertaining to hazard functions
are summarized here :
Distribution    Hazard function
Exponential     λ(x) = λ₀
Remarks: There is no ageing; failure is a random event.
Weibull         λ(x) = λ₀^{λ₁} λ₁ x^{λ₁ − 1}
Gamma
where
The exponential distribution is a special case of all the above distributions except
the log-normal. From the information given in the remarks, it should be evident that
the five survival distributions have a variety of types of hazard function. Suppose
a survival distribution is to be fitted to some data and there is no theoretical basis
for choosing among distributions. It is suggested that the sample hazard functions be
plotted and a distribution chosen using the information given in the remarks. Other
forms of survival distribution are given by Cox [6] and BUCKLAND [7].
A common choice of survival distribution among those in the life sciences is
the log-normal. Since most probability density functions are skewed to the right,
it is natural to take logarithms to obtain a more symmetrical distribution. How-
ever, choosing the log-normal distribution implies that there is an early period of
positive ageing followed by a period of negative ageing. The model should be
examined carefully and the sample hazard functions plotted before arguing that
survival is log-normal.
The log-normal distribution is difficult to distinguish from an exponential distribution,
especially when the distinction is attempted from a plot of the survivorship
functions. Fig. 1 gives a plot of the survivorship and hazard functions for an
exponential distribution and three forms of a log-normal distribution (coeff. of
var. = 0.5, 1.0 and 1.5).
The coeff. of var. is always one for the exponential distribution. The average survival
time in all cases is 20 time units.
The curves given are those based on known values of the parameters of the
distributions and do not represent a sample of real data. It should be evident
that if there is a moderate-sized sample of survival times with a sample coefficient
of variation near one, it will be nearly impossible to distinguish between an ex-
ponential and a log-normal distribution in terms of “goodness of fit”, especially if
only the survivorship functions are compared. Note that the hazard functions for
each form of log-normal rise to a peak and later decrease. This implies the highest
risk of death per unit time is in some period after the start of the study. Since the
exponential is a special case of the Gamma, Gompertz and Weibull distributions,
one of these distributions would probably fit as well or better than a log-normal
distribution when the coefficient of variation is near one.
If there is no theoretical basis for choosing between fitting a log-normal and an
exponential distribution and both fit the data about equally well, the exponential
should probably be selected because of its simplicity.
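The contrast between the two hazard shapes can be reproduced numerically. The sketch below evaluates the hazard of a log-normal distribution with mean 20 time units and coefficient of variation 1.0, as in the comparison above, and compares it with the constant exponential hazard 1/20. The parameter conversion (σ² = log(1 + cv²), μ = log(mean) − σ²/2) is the standard one for the log-normal; the function name and the evaluation grid of 1 to 100 time units are choices made only for this illustration.

```python
import math

def lognormal_hazard(x, mean, cv):
    """Hazard (density / survivorship) of a log-normal with given mean and coeff. of var."""
    sigma2 = math.log(1.0 + cv ** 2)
    sigma = math.sqrt(sigma2)
    mu = math.log(mean) - sigma2 / 2.0
    z = (math.log(x) - mu) / sigma
    density = math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))
    survivorship = 0.5 * math.erfc(z / math.sqrt(2.0))
    return density / survivorship

mean_survival = 20.0
exponential_hazard = 1.0 / mean_survival  # constant for the exponential

times = list(range(1, 101))
hazards = [lognormal_hazard(float(t), mean_survival, 1.0) for t in times]
peak_time = times[hazards.index(max(hazards))]
```

On this grid the log-normal hazard starts well below the exponential value, rises above it to an interior peak at `peak_time`, and then declines, which is the rise-and-fall pattern described in the text.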
[FIG. 1. Panels: SURVIVORSHIP FUNCTION and HAZARD FUNCTION, for the exponential and log-normal distributions.]
between 0.09 and 0.12. The hazard functions are generally higher after the tenth
year. Hence, the prognosis for a patient who has survived 1 yr is better than that
for a newly diagnosed patient if factors influencing prognosis are not considered.
A similar interpretation is reached by examining the median remaining life-times
by year. Initially, the estimated median life-time is 5.3 yr. From the first year to
the sixth, the median remaining life-time is above 5.3 yr and begins decreasing con-
sistently only from the seventh year on.
Further interpretation will not be attempted here, especially since PARKER et al.
[17] have a complete discussion. Also, it would be misleading to interpret these
data as representing the survival of a homogeneous group of individuals.
TABLE 2. SURVIVAL FUNCTION CALCULATIONS FOR MALES WITH ANGINA PECTORIS
Interval start (yr)   n′_i    n_i      d_i    q̂_i      p̂_i
5                     1170    1116.5   125    0.1120   0.8880
6                     938     871.5    ...    0.0952   0.9048
7                     722     671.0    ...    0.1103   0.8897
8                     546     512.0    ...    0.0996   ...
9                     427     395.0    ...    0.1063   ...
10                    321     298.5    43     0.1441   0.8559
11                    233     206.5    34     0.1646   0.8354
12                    146     129.5    18     0.1390   0.8610
13                    95      81.5     ...    0.1104   0.8896
14                    ...     47.5     6      0.1263   0.8737
[Figure: panels including HAZARD FUNCTION; horizontal axis: YEAR AFTER DIAGNOSIS.]
FIG. 2. Survival functions for males with angina pectoris.
REFERENCES
1. BERKSON, J. and GAGE, R. P.: Calculation of survival rates for cancer. Proc. Staff Meet.
Mayo Clin. 25, 270, 1950.
2. CUTLER, S. J. and EDERER, F.: Maximum utilization of the life table method in analyzing
survival. J. chron. Dis. 8 (6), 699, 1958.
3. KAPLAN, E. L. and MEIER, P.: Nonparametric estimation from incomplete observations.
J. Am. stat. Ass. 53 (1), 457, 1958.
4. GEHAN, E. A. and SIDDIQUI, M. M.: In preparation.
5. BROADBENT, S.: Simple mortality rates. Appl. Statist. 7 (2), 86, 1958.
6. Cox, D. R.: Renewal Theory, Chap. I. Wiley, New York, 1962.
7. BUCKLAND, W. R.: Statistical Assessment of the Life Characteristic. Hafner, New York,
1964.
APPENDIX
We first find the Var {λ̂(x_mi)} assuming that all individuals have failed. The result will
be given in terms of the true proportions failing in each interval. Estimates of these propor-
tions in the incomplete sample case will be substituted for the true proportions so that the
final variance formula is approximate and will be valid only for large samples.
If a sample of n individuals is followed until all fail and each death is recorded as occurring
in one of s fixed intervals, the joint distribution of the numbers of deaths is multinomial.
Suppose the set-up is as follows:

                                            Interval
                                            1     2     ...   i     ...   s
Sample       Alive entering interval        n_1   n_2   ...   n_i   ...   n_s
Population   Proportion dying in interval   θ_1   θ_2   ...   θ_i   ...   θ_s
             Deaths                         ν_1   ν_2   ...   ν_i   ...   ν_s

The sample size is n, so that $\sum_{i=1}^{s} \nu_i = n$. Here, n′_i = n_i since there are no losses or withdrawals.
Also, $\sum_{i=1}^{s} \theta_i = 1$. It is convenient to introduce
$$m_i = \sum_{j=1}^{i-1} d_j, \qquad \mu_i = \sum_{j=1}^{i-1} \nu_j \qquad \text{and} \qquad \phi_i = \sum_{j=1}^{i-1} \theta_j,$$
$$\delta m_i = m_i - \mu_i.$$
Then,
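writing d_i = ν_i + δd_i and m_i = μ_i + δm_i,
$$\mathrm{Var}\left\{ \frac{d_i}{h_i (n - m_i - d_i/2)} \right\} = \mathrm{Var}\left\{ \frac{\nu_i}{h_i (n - \mu_i - \nu_i/2)} \left( 1 + \frac{\delta d_i}{\nu_i} \right) \left[ 1 - \frac{\delta m_i}{n - \mu_i - \nu_i/2} - \frac{\delta d_i}{2 (n - \mu_i - \nu_i/2)} \right]^{-1} \right\}$$
$$\approx \frac{1}{h_i^2} \left\{ E(\delta d_i)^2 \left[ \frac{1}{(n - \mu_i - \nu_i/2)^2} + \frac{\nu_i}{(n - \mu_i - \nu_i/2)^3} + \frac{\nu_i^2}{4 (n - \mu_i - \nu_i/2)^4} \right] + \frac{\nu_i^2 \, E(\delta m_i)^2}{(n - \mu_i - \nu_i/2)^4} + 2 E(\delta d_i \, \delta m_i) \left[ \frac{\nu_i}{(n - \mu_i - \nu_i/2)^3} + \frac{\nu_i^2}{2 (n - \mu_i - \nu_i/2)^4} \right] \right\}. \qquad (A.1)$$
Now E(δd_i)² = nθ_i(1 − θ_i), E(δm_i)² = nφ_i(1 − φ_i)
and E(δd_i δm_i) = −nθ_iφ_i. Substituting in (A.1) and after considerable simplification, we obtain
$$\mathrm{Var}\{\hat{\lambda}(x_{mi})\} = \frac{\theta_i}{n h_i^2 (1 - \phi_i - \theta_i/2)^2} \left\{ 1 - \left[ \frac{\theta_i}{2 (1 - \phi_i - \theta_i/2)} \right]^2 \right\}. \qquad (A.2)$$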
This is the formula with complete ascertainment of survival times. For incomplete samples,
we use
θ_i → P̂_i q̂_i, φ_i → 1 − P̂_i, n(1 − φ_i) → n′_i,
where → means “is estimated by”. Of course, when there are losses or withdrawals, the actual
number starting study is n′_1 and those alive at the start of each interval are n′_i (i = 1, ..., s).
In all variance formulas, (3.3), (3.4), (3.5), and (4.1), n′_i is replaced by n_i since this is the estimated
number of individuals exposed to the risk of failure under the life-table assumption concerning
losses and withdrawals. This certainly affects the estimates of the survival functions and their
variances. The effect of this, while almost certainly slight in large samples, will not be inves-
tigated here.
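The algebra behind the complete-sample variance of the hazard estimate can be checked numerically. The sketch below computes the first-order (delta-method) variance of d/[h(n − m − d/2)] under multinomial sampling, using Var(d) = nθ(1 − θ), Var(m) = nφ(1 − φ) and Cov(d, m) = −nθφ, and compares it with the closed form θ/[n h² (1 − φ − θ/2)²]·{1 − [θ/(2(1 − φ − θ/2))]²}. The function names and the parameter values are hypothetical, and the closed form is written as it is understood here rather than quoted.

```python
def delta_method_variance(n, theta, phi, h):
    """First-order variance of d / (h * (n - m - d/2)) under multinomial sampling."""
    nu = n * theta               # expected deaths in the interval
    mu = n * phi                 # expected deaths before the interval
    denom = n - mu - nu / 2.0
    # Partial derivatives evaluated at the expected values.
    deriv_d = (1.0 / denom + nu / (2.0 * denom ** 2)) / h
    deriv_m = nu / (h * denom ** 2)
    var_d = n * theta * (1.0 - theta)
    var_m = n * phi * (1.0 - phi)
    cov_dm = -n * theta * phi
    return deriv_d ** 2 * var_d + deriv_m ** 2 * var_m + 2.0 * deriv_d * deriv_m * cov_dm

def closed_form_variance(n, theta, phi, h):
    """Simplified closed form in terms of theta, phi, n and h."""
    a = 1.0 - phi - theta / 2.0
    return theta / (n * h ** 2 * a ** 2) * (1.0 - (theta / (2.0 * a)) ** 2)

# Hypothetical parameter values for the comparison.
v_delta = delta_method_variance(500, 0.10, 0.30, 1.0)
v_closed = closed_form_variance(500, 0.10, 0.30, 1.0)
```

Agreement of the two values for several parameter choices gives some reassurance that the simplification has been carried through consistently.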
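With the above assumptions and replacements, (A.2) becomes
$$\mathrm{Var}\{\hat{\lambda}(x_{mi})\} \approx \frac{[\hat{\lambda}(x_{mi})]^2}{n_i \hat{q}_i} \left\{ 1 - \left[ \frac{h_i \hat{\lambda}(x_{mi})}{2} \right]^2 \right\}. \qquad (A.3)$$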
Since f̂(x_mi) = p̂_1 p̂_2 ... p̂_{i−1} q̂_i / h_i, the problem is to find the variance of a function of i random variables.
Using the large sample approximation formula in KENDALL and STUART [12] (p. 232), we have
(A.4)
where,
Now,
and
Var {q̂_j} = q̂_j(1 − q̂_j)/n_j, (j = 1, ..., i).
This is a large sample approximation formula and is defined only when n_i > 0, i = 1, ..., s − 1.