Subject CS2
Revision Notes
For the 2019 exams
Survival analysis
Booklet 4
covering
Copyright agreement
Legal action will be taken if these terms are infringed. In addition, we may
seek to take disciplinary action through the profession or through your
employer.
These conditions remain in force after you have finished using the course.
These chapter numbers refer to the 2019 edition of the ActEd course notes.
OVERVIEW
You should also be able to define and give examples of different types of
censoring, and to spot all the types of censoring that are present in a given
situation.
CORE READING
All of the Core Reading for the topics covered in this booklet is contained in
this section.
The text given in Arial Bold Italic font is additional Core Reading that is not
directly related to the topic being discussed.
____________
Chapter 6 – Survival models
1 Assumption
Typical values of $\omega$ for practical work are in the range 100–120. The
possibility of survival beyond age $\omega$ is excluded by the model for
convenience and simplicity.
____________
4 We often need to deal with ages greater than zero. To meet this need,
we define $T_x$ to be the future lifetime after age $x$ of a life who
survives to age $x$, for $0 \le x \le \omega$. Note that $T_0 = T$.
____________
For $0 \le x \le \omega$:
\[ F_x(t) = P[T_x \le t] = P[T \le x + t \mid T > x] = \frac{F(x+t) - F(x)}{S(x)} \]
____________
\[ {}_tq_x = F_x(t) \]
\[ {}_tp_x = 1 - {}_tq_x = S_x(t) \]
____________
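$q_x = {}_1q_x$ and $p_x = {}_1p_x$
\[ \mu_x = \lim_{h \to 0^+} \frac{1}{h} \times P[T \le x + h \mid T > x] \]
\[ {}_hq_x \approx h\,\mu_x \]
\[ \text{(1)} \quad \mu_{x+t} = \lim_{h \to 0^+} \frac{1}{h} \times P[T \le x + t + h \mid T > x + t] \]
\[ \text{(2)} \quad \mu_{x+t} = \lim_{h \to 0^+} \frac{1}{h} \times P[T_x \le t + h \mid T_x > t] \]
It is an easy exercise to show from the definitions that these are equal.
We will often use $\mu_{x+t}$ for a fixed age $x$ and $0 \le t < \omega - x$.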
____________
\[ = \frac{P[T > x + t]}{P[T > x]} = \frac{S(x+t)}{S(x)} \]
which can be expressed in actuarial notation as:
\[ {}_tp_x = \frac{{}_{x+t}p_0}{{}_xp_0} \]
____________
\[ {}_{s+t}p_x = \frac{{}_{x+s+t}p_0}{{}_xp_0} = \frac{{}_{x+s}p_0}{{}_xp_0} \times \frac{{}_{x+s+t}p_0}{{}_{x+s}p_0} = {}_sp_x \times {}_tp_{x+s} \]
Similarly:
\[ {}_{s+t}p_x = {}_tp_x \times {}_sp_{x+t} \]
____________
or by multiplying:
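Denote this by $f_x(t)$, and recall that $f_x(t) = \frac{d}{dt} F_x(t)$.
Then:
\[
\begin{aligned}
f_x(t) &= \frac{d}{dt} P[T_x \le t] \\
&= \lim_{h \to 0^+} \frac{1}{h} \times \left( P[T_x \le t + h] - P[T_x \le t] \right) \\
&= \lim_{h \to 0^+} \frac{P[T \le x + t + h \mid T > x] - P[T \le x + t \mid T > x]}{h} \\
&= \lim_{h \to 0^+} \frac{P[T \le x + t + h] - P[T \le x] - \left( P[T \le x + t] - P[T \le x] \right)}{S(x) \times h} \\
&= \lim_{h \to 0^+} \frac{P[T \le x + t + h] - P[T \le x + t]}{S(x) \times h}
\end{aligned}
\]
\[
\begin{aligned}
f_x(t) &= \frac{S(x+t)}{S(x)} \times \lim_{h \to 0^+} \frac{1}{h} \, \frac{P[T \le x + t + h] - P[T \le x + t]}{S(x+t)} \\
&= S_x(t) \times \lim_{h \to 0^+} \frac{1}{h} P[T \le x + t + h \mid T > x + t] \\
&= S_x(t) \times \mu_{x+t}
\end{aligned}
\]
\[ f_x(t) = {}_tp_x \, \mu_{x+t} \qquad (0 \le t < \omega - x) \]
\[ {}_hq_x \approx h\,\mu_x \qquad \text{(for small } h\text{)} \]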
\[ m_x = \frac{q_x}{\int_0^1 {}_tp_x \, dt} \]
spent alive between ages x and x + 1 by a life alive at age x and the
numerator is the probability of that life dying between exact ages x
and x + 1 .
____________
\[ \frac{\text{Number of deaths}}{\text{Total time spent alive and at risk}} \]
More recently, these statistics have been used to estimate the force of
mortality rather than $m_x$, because in that context they have a solid
basis in terms of a probabilistic model.
____________
\[ m_x = \frac{q_x}{\int_0^1 {}_tp_x \, dt} = \frac{\int_0^1 {}_tp_x \, \mu \, dt}{\int_0^1 {}_tp_x \, dt} = \mu \]
It is denoted $\mathring{e}_x$.
____________
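19 By definition:
\[
\begin{aligned}
\mathring{e}_x &= \int_0^{\omega - x} t \, {}_tp_x \, \mu_{x+t} \, dt \\
&= \int_0^{\omega - x} t \left( -\frac{\partial}{\partial t} \, {}_tp_x \right) dt \\
&= -\left[ t \, {}_tp_x \right]_0^{\omega - x} + \int_0^{\omega - x} {}_tp_x \, dt \qquad \text{(integrating by parts)} \\
&= \int_0^{\omega - x} {}_tp_x \, dt
\end{aligned}
\]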
____________
\[ K_x = [T_x] \]
\[ 0, 1, 2, \ldots, [\omega - x] \]
____________
\[
\begin{aligned}
P[K_x = k] &= P[k \le T_x < k + 1] \\
&= P[k < T_x \le k + 1] \qquad (*) \\
&= {}_kp_x \, q_{x+k}
\end{aligned}
\]
____________
\[ e_x = E[K_x] \]
____________
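24 Then:
\[
\begin{aligned}
e_x &= \sum_{k=0}^{[\omega - x]} k \, {}_kp_x \, q_{x+k} \\
&= {}_1p_x \, q_{x+1} \\
&\quad + {}_2p_x \, q_{x+2} + {}_2p_x \, q_{x+2} \\
&\quad + {}_3p_x \, q_{x+3} + {}_3p_x \, q_{x+3} + {}_3p_x \, q_{x+3} \\
&\quad + \ldots \\
&= \sum_{k=1}^{[\omega - x]} \sum_{j=k}^{[\omega - x]} {}_jp_x \, q_{x+j} \qquad \text{(summing columns)} \\
&= \sum_{k=1}^{[\omega - x]} {}_kp_x
\end{aligned}
\]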
____________
\[ \mathring{e}_x = \int_0^{\omega - x} {}_tp_x \, dt \]
\[ e_x = \sum_{k=1}^{[\omega - x]} {}_kp_x \]
____________
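\[ \mathring{e}_x \approx e_x + \tfrac{1}{2} \]
\[ \operatorname{var}[T_x] = \int_0^{\omega - x} t^2 \, {}_tp_x \, \mu_{x+t} \, dt - \mathring{e}_x^{\,2} \]
\[ \operatorname{var}[K_x] = \sum_{k=0}^{[\omega - x]} k^2 \, {}_kp_x \, q_{x+k} - e_x^2 \]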
In this section we give two important formulae, one for ${}_tq_x$ and one
for ${}_tp_x$. The first follows from the result that $f_x(t) = {}_tp_x \, \mu_{x+t}$. We
have:
\[ {}_tq_x = F_x(t) = \int_0^t f_x(s) \, ds = \int_0^t {}_sp_x \, \mu_{x+s} \, ds \]
Since it is impossible to die at more than one time, we simply add up,
or in the limit integrate, all these mutually exclusive probabilities.
____________
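\[ \frac{\partial}{\partial s} \, {}_sp_x = -\frac{\partial}{\partial s} \, {}_sq_x = -f_x(s) = -{}_sp_x \, \mu_{x+s} \]
\[ \frac{\partial}{\partial s} \log {}_sp_x = \frac{\frac{\partial}{\partial s} \, {}_sp_x}{{}_sp_x} \]
\[ \frac{\partial}{\partial s} \log {}_sp_x = -\mu_{x+s} \]
Hence:
\[ \int_0^t \frac{\partial}{\partial s} \log {}_sp_x \, ds = -\int_0^t \mu_{x+s} \, ds + c \]
\[ {}_tp_x = \exp\left\{ -\int_0^t \mu_{x+s} \, ds + c \right\} \]
Setting $t = 0$, where ${}_0p_x = 1$, gives $c = 0$, so:
\[ {}_tp_x = \exp\left\{ -\int_0^t \mu_{x+s} \, ds \right\} \]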
____________
\[ {}_tq_x = \int_0^t {}_sp_x \, \mu_{x+s} \, ds \qquad (6.1) \]
\[ {}_tp_x = \exp\left\{ -\int_0^t \mu_{x+s} \, ds \right\} \qquad (6.2) \]
____________
\[ \mu_x = \mu \]
\[ {}_tp_x = S_x(t) = \exp\left\{ -\int_0^t \mu \, ds \right\} = \exp\left\{ -[\mu s]_0^t \right\} = \exp(-\mu t) \]
\[ {}_tq_x = 1 - {}_tp_x = 1 - \exp(-\mu t) \]
____________
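# simulate 100 values from an exponential distribution with rate 0.5
rexp(100, rate = 0.5)
# plot the PDF on the grid 0, 1, ..., 5000
plot(seq(0, 5000), dexp(seq(0, 5000), rate = 0.5), type = "l")
# CDF at 2, ie P(T <= 2)
pexp(2, rate = 0.5)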
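\[ S_x(t) = \exp\left[ -\alpha t^\beta \right] \qquad (6.3) \]
Since:
\[ \mu_{x+t} = -\frac{\partial}{\partial t} \log[S_x(t)] \]
we see that:
\[ \mu_{x+t} = -\frac{\partial}{\partial t} \left[ -\alpha t^\beta \right] = -\left[ -\alpha\beta t^{\beta - 1} \right] = \alpha\beta t^{\beta - 1} \]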
∂t
____________
\[ \alpha\beta t^{\beta - 1} = \alpha \cdot 1 \cdot t^0 = \alpha \]
This can be seen also from the expression for $S_x(t)$ (6.3), from which it
is clear that, when $\beta = 1$, the Weibull model is the same as the
exponential model.
____________
The R code for simulating a random sample of 100 values from the
Weibull distribution with $c = 2$ and $\gamma = 0.25$ is:
Similarly, the PDF, CDF and quantiles can be obtained using the R
functions dweibull, pweibull and qweibull.
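The simulation described above can be sketched as follows. This is a minimal sketch, assuming that $c$ and $\gamma$ play the roles of $\alpha$ and $\beta$ in (6.3), so that $S(t) = \exp(-c\,t^\gamma)$. R's Weibull functions use $S(t) = \exp(-(t/\text{scale})^{\text{shape}})$, which under that assumption gives shape $= \gamma$ and scale $= c^{-1/\gamma}$. The variable names and the seed are illustrative only.

```r
# Assumed parametrisation from the notes: S(t) = exp(-c * t^gamma)
c_par     <- 2      # 'c' in the notes (renamed to avoid masking base::c)
gamma_par <- 0.25   # 'gamma' in the notes

# R uses S(t) = exp(-(t / scale)^shape), so shape = gamma, scale = c^(-1/gamma)
scale_par <- c_par^(-1 / gamma_par)   # 2^(-4) = 0.0625

set.seed(1)   # arbitrary seed, for reproducibility only
x <- rweibull(100, shape = gamma_par, scale = scale_par)

# Check the mapping at an arbitrary point t0: both CDFs should agree
t0        <- 0.5
cdf_r     <- pweibull(t0, shape = gamma_par, scale = scale_par)
cdf_notes <- 1 - exp(-c_par * t0^gamma_par)
```

If the two CDF values differ, the assumed parametrisation does not match the one intended in the notes.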
33 The Gompertz and Makeham laws of mortality are two further examples
of parametric survival models. They can be expressed as follows:
Gompertz’ Law: $\mu_x = B c^x$
Makeham’s Law: $\mu_x = A + B c^x$
____________
\[ {}_tp_x = \exp\left( -\int_0^t \mu_{x+s} \, ds \right) \]
____________
\[ {}_tp_x = g^{c^x (c^t - 1)} \]
where:
\[ g = \exp\left( \frac{-B}{\log c} \right) \]
____________
\[ {}_tp_x = s^t \, g^{c^x (c^t - 1)} \]
where:
\[ g = \exp\left( \frac{-B}{\log c} \right) \quad \text{and} \quad s = \exp(-A) \]
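The closed-form Gompertz survival probability can be checked numerically against the general formula ${}_tp_x = \exp\left(-\int_0^t \mu_{x+s}\,ds\right)$. The sketch below does this in base R. The parameter values $B$, $c$, $x$ and $t$ are hypothetical, chosen only for illustration.

```r
# Hypothetical Gompertz parameters (illustration only)
B     <- 0.00005
c_par <- 1.1      # 'c' in the notes (renamed to avoid masking base::c)
x     <- 50
t     <- 10

# Gompertz force of mortality: mu_x = B * c^x
mu <- function(age) B * c_par^age

# Closed form: t_p_x = g^(c^x * (c^t - 1)), with g = exp(-B / log(c))
g          <- exp(-B / log(c_par))
tpx_closed <- g^(c_par^x * (c_par^t - 1))

# Numerical check: t_p_x = exp(-integral from 0 to t of mu_{x+s} ds)
tpx_numeric <- exp(-integrate(function(s) mu(x + s), 0, t)$value)
```

The same check extends to Makeham's law by adding $A$ to `mu` and multiplying the closed form by $s^t$ with $s = \exp(-A)$.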
\[ \mu_x = b \, e^{a x} \]
shape $= \log c$
rate $= B$
to be 0.00135.
____________
47 Right censoring
48 Left censoring
49 Interval censoring
Both right and left censoring can be seen as special cases of interval
censoring.
____________
50 Random censoring
51 Type I censoring
52 Type II censoring
Let $t_1 < t_2 < \ldots < t_k$ be the ordered times at which deaths were
observed. We do not assume that $k = m$, so more than one death
might be observed at a single failure time.
57 We estimate the hazard within the interval containing event time $t_j$ as:
\[ \hat{\lambda}_j = \frac{d_j}{n_j} \]
Of course, effectively this formula is being used for all the other
intervals as well, but as $d_j = 0$ in all these intervals, the hazard will be
zero.
____________
\[ \prod_{j=1}^{k} \lambda_j^{d_j} (1 - \lambda_j)^{n_j - d_j} \]
\[ \hat{\lambda}_j = \frac{d_j}{n_j} \qquad (1 \le j \le k) \]
____________
\[ \lambda_j = P\left[ T = t_j \mid T \ge t_j \right] \qquad 1 \le j \le k \qquad (7.1) \]
(We use the symbol $\lambda$ to avoid confusion with the usual force of
mortality.)
____________
\[ 1 - F(t) = \prod_{t_j \le t} (1 - \lambda_j) \]
\[ \hat{S}(t) = \prod_{t_j \le t} \left( 1 - \hat{\lambda}_j \right) \]
____________
In effect, we choose finer and finer partitions of the time axis, and
estimate $1 - F(t)$ as the product of the probabilities of surviving each
sub-interval. Then, with the above definition of the discrete force of
mortality (7.1), we obtain the Kaplan-Meier estimate as the mesh of the
partition tends to zero. This is the origin of the name ‘product-limit’
estimate, by which the Kaplan-Meier estimate is sometimes known.
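The product formula for the Kaplan-Meier estimate can be sketched in base R as follows. The data are hypothetical. The sketch assumes the usual convention that a life censored at the same time as a death is still at risk at that time, so $n_j$ counts all lives with observed time at least $t_j$.

```r
# Hypothetical data: observed times and indicators (1 = death, 0 = censored)
time  <- c(2, 3, 3, 5, 6, 8, 8, 9, 11, 12)
delta <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 0)

# Distinct times at which deaths were observed
tj <- sort(unique(time[delta == 1]))

# d_j = deaths at t_j; n_j = lives at risk just before t_j
dj <- sapply(tj, function(u) sum(time == u & delta == 1))
nj <- sapply(tj, function(u) sum(time >= u))

# Discrete hazard estimates and the Kaplan-Meier survival estimate
lambda_hat <- dj / nj
S_hat      <- cumprod(1 - lambda_hat)

km <- data.frame(tj, nj, dj, lambda_hat, S_hat)
print(km)
```

The estimate is a step function: `S_hat[j]` applies from `tj[j]` up to (but not including) the next death time, and the estimate equals 1 before the first death.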
____________
R code:
Surv(time, delta)
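\[ \operatorname{var}\left[ \tilde{F}(t) \right] \approx \left( 1 - \hat{F}(t) \right)^2 \sum_{t_j \le t} \frac{d_j}{n_j (n_j - d_j)} \]
\[ \Lambda_t = \int_0^t \mu_s \, ds + \sum_{t_j \le t} \lambda_j \]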
where the integral deals with the continuous part of the distribution
and the sum with the discrete part. (Since this methodology was
developed by statisticians, the term ‘integrated hazard’ is in universal
use, and ‘integrated force of mortality’ is almost never seen.)
____________
\[ \hat{\Lambda}_t = \sum_{t_j \le t} \frac{d_j}{n_j} \]
____________
\[ \hat{S}(t) = \exp\left[ -\hat{\Lambda}_t \right] \]
\[ \hat{F}(t) = 1 - \exp\left[ -\hat{\Lambda}_t \right] \]
____________
\[ \operatorname{var}\left[ \tilde{\Lambda}_t \right] \approx \sum_{t_j \le t} \frac{d_j (n_j - d_j)}{n_j^3} \]
____________
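68 The Kaplan-Meier estimate can be approximated in terms of $\hat{\Lambda}_t$.
\[
\begin{aligned}
\hat{F}(t) &= 1 - \prod_{t_j \le t} \left( 1 - \frac{d_j}{n_j} \right) \\
&\approx 1 - \exp\left( -\sum_{t_j \le t} \frac{d_j}{n_j} \right) \\
&= 1 - \exp\left( -\hat{\Lambda}_t \right)
\end{aligned}
\]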
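The approximation above can be illustrated numerically. The sketch below compares the Nelson-Aalen and Kaplan-Meier survival estimates in base R, using hypothetical values of $n_j$ and $d_j$. Since $\exp(-u) \ge 1 - u$ for all $u$, the Nelson-Aalen survival estimate is never below the Kaplan-Meier one.

```r
# Hypothetical risk sets and death counts at the ordered death times
tj <- c(2, 3, 5, 8, 11)
nj <- c(10, 9, 7, 5, 2)
dj <- c(1, 1, 1, 2, 1)

# Nelson-Aalen estimate of the integrated hazard and implied survival function
Lambda_hat <- cumsum(dj / nj)
S_na       <- exp(-Lambda_hat)

# Kaplan-Meier estimate from the same data
S_km <- cumprod(1 - dj / nj)

# Estimated variance of the integrated hazard estimator, from the formula above
var_Lambda <- cumsum(dj * (nj - dj) / nj^3)

comparison <- data.frame(tj, Lambda_hat, S_na, S_km, var_Lambda)
print(comparison)
```

The two survival estimates are close while $d_j / n_j$ is small and diverge as the risk set shrinks.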
____________
Consider only the single year of age between exact ages x and x + 1 .
(b) they withdraw from the investigation between exact ages x and
x +1
Cases (b) and (c) are treated as censored at either the time of
withdrawal, or exact age x + 1 respectively.
Consider first those lives in category (a), who die before exact age
x + 1 . Suppose there are k of these.
Take the first of these, and suppose that he or she died at duration $t_1$.
Given only the data on this life, the value of $\mu$ that is most likely is the
value that maximises the probability that he or she actually dies at
duration $t_1$.
____________
However, in the investigation, we have more than one life that died.
Suppose a second life died at duration t2 .
____________
74 The probability of this happening is f (t2 ) , and the joint probability that
Given just these two lives, the value of $\mu$ we need will be that which
maximises $f(t_1) f(t_2)$.
____________
75 If we now consider all the $k$ lives that died, then the value of $\mu$ we
want is that which maximises:
\[ \prod_{\text{all lives that died}} f(t_i) \]
But what of the lives that were censored? Their experience must also
be taken into account.
Consider the first censored life, and suppose he or she was censored
at duration $t_{k+1}$. All we know about this person is that he or she was
still alive at duration $t_{k+1}$.
____________
\[ \prod_{\text{all censored lives}} S(t_i) \]
Now, putting the deaths and the censored cases together, we can write
down the probability of observing all the data we actually observe –
both censored lives and those that died. This probability is:
\[ \prod_{\text{all censored lives}} S(t_i) \prod_{\text{all lives which died}} f(t_i) \]
$\delta_i = 1$ if life $i$ died
$\delta_i = 0$ if life $i$ was censored
\[ L = \prod_{i=1}^{n} f(t_i)^{\delta_i} S(t_i)^{1 - \delta_i} \]
Now, since:
\[ f(t) = S(t) h(t) \]
\[ L = \prod_{i=1}^{n} h(t_i)^{\delta_i} S(t_i)^{\delta_i} S(t_i)^{1 - \delta_i} = \prod_{i=1}^{n} h(t_i)^{\delta_i} S(t_i) \]
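78 This produces:
\[ L = \prod_{i=1}^{n} \mu^{\delta_i} \exp(-\mu t_i) \]
\[ \log L = \sum_{i=1}^{n} \delta_i \log \mu - \sum_{i=1}^{n} \mu t_i \]
\[ \frac{\partial \log L}{\partial \mu} = \frac{\sum_{i=1}^{n} \delta_i}{\mu} - \sum_{i=1}^{n} t_i \]
\[ \frac{\sum_{i=1}^{n} \delta_i}{\mu} = \sum_{i=1}^{n} t_i \]
so that:
\[ \hat{\mu} = \frac{\sum_{i=1}^{n} \delta_i}{\sum_{i=1}^{n} t_i} \]
\[ \frac{\partial^2 \log L}{\partial \mu^2} = -\frac{\sum_{i=1}^{n} \delta_i}{\mu^2} \]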
Since $\sum_{i=1}^{n} \delta_i$ is just the total number of deaths in our data, and $\sum_{i=1}^{n} t_i$ is
the total time that the lives in the data are exposed to the risk of death,
our maximum likelihood estimate of the force of mortality (or hazard) is
just deaths divided by exposed to risk, which is intuitively sensible.
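The result above can be checked numerically. The sketch below, in base R with hypothetical data, computes the closed-form estimate (deaths divided by exposed to risk) and confirms it against a direct numerical maximisation of the log-likelihood. The variance line uses the standard asymptotic result that the variance of a maximum likelihood estimator is approximately the negative reciprocal of the second derivative of the log-likelihood. That step is an assumption added for illustration and is not stated in the text above.

```r
# Hypothetical data for lives observed between exact ages x and x + 1
t_i     <- c(0.25, 1.00, 0.60, 1.00, 0.80, 1.00, 0.10, 1.00)  # time observed (years)
delta_i <- c(1,    0,    0,    0,    1,    0,    1,    0)     # 1 = died, 0 = censored

# Log-likelihood under a constant force of mortality mu
loglik <- function(mu) sum(delta_i) * log(mu) - mu * sum(t_i)

# Closed-form MLE: deaths divided by exposed to risk
mu_hat <- sum(delta_i) / sum(t_i)

# Numerical maximisation, as a check on the closed form
mu_num <- optimize(loglik, interval = c(1e-6, 5), maximum = TRUE)$maximum

# Approximate variance: -1 / (second derivative of log L) evaluated at mu_hat
var_hat <- mu_hat^2 / sum(delta_i)
```

With these data there are 3 deaths and 5.75 years of exposure, so `mu_hat` is 3 / 5.75.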
____________
If we repeat the exercise for other years of age, we can obtain a series
of estimates for the different hazards in each year of age.
\[ \hat{S}_x(1) = \exp(-\hat{\mu}_x) \]
____________
80 To work out the probability that a person alive at exact age x will
survive to exact age x + 2 we note that this probability is equal to:
Therefore:
In general, therefore:
\[ \hat{S}_x(m) = {}_m\hat{p}_x = \exp\left( -\sum_{j=0}^{m-1} \hat{\mu}_{x+j} \right) \]
Covariates
The most widely used regression model in recent years has been the
proportional hazards model.
____________
\[ \lambda_i(t, z_i) = \lambda_0(t) \, g(z_i) \]
\[ \lambda_i(t, z_i) = \lambda_0(t) \, g(z_i, t) \]
but because the hazard no longer factorises into two terms, one
depending only on duration and the other depending only on the
covariates, these are not PH models. They are also both more complex
to interpret and more computer-intensive to estimate.
____________
\[ \lambda(t) = B c^t \]
\[ B = \exp\left( \beta z_i^T \right) \]
\[ \lambda_i(t, z_i) = c^t \exp\left( \beta z_i^T \right) \]
____________
91 Actuaries are frequently interested in both the baseline hazard and the
effect of the covariates. As long as numerical methods are available to
maximise the full likelihood (and find the information matrix), which
nowadays should not be a problem, it is not difficult to specify any
baseline hazard required and to estimate all the parameters
simultaneously, ie those in the baseline hazard and the regression
coefficients.
____________
\[ \frac{\lambda(t, z_1)}{\lambda(t, z_2)} = \frac{\exp\left( \beta z_1^T \right)}{\exp\left( \beta z_2^T \right)} \]
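Because the hazard ratio above does not depend on $t$, survival probabilities for two covariate profiles are linked by $S(t; z_1) = S(t; z_2)^{HR}$, where $HR = \exp(\beta z_1^T - \beta z_2^T)$. This follows from $S(t; z) = \exp\left(-\int_0^t \lambda(s; z)\,ds\right)$. The sketch below illustrates the calculation in base R. The parameter values, covariate vectors and the survival probability are all hypothetical.

```r
# Hypothetical regression parameters and covariate vectors (illustration only)
beta <- c(0.3, -0.2)
z1   <- c(1, 0)
z2   <- c(0, 1)

# Hazard ratio of profile z1 relative to profile z2 (constant over time)
hr <- exp(sum(beta * z1) - sum(beta * z2))

# Hypothetical survival probability for profile z2 at some fixed duration
S2 <- 0.8

# Implied survival probability for profile z1 at the same duration
S1 <- S2^hr
```

Here the hazard ratio exceeds 1, so profile `z1` has the lower survival probability.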
95 The Cox PH model proposes the following form of hazard function for
the $i$th life:
\[ \lambda(t; z_i) = \lambda_0(t) \exp\left( \beta z_i^T \right) \]
96 The utility of this model arises from the fact that the general ‘shape’ of
the hazard function for all individuals is determined by the baseline
hazard, while the exponential term accounts for differences between
individuals. So, if we are not primarily concerned with the precise form
of the hazard, but with the effects of the covariates, we can
ignore $\lambda_0(t)$ and estimate $\beta$ from the data irrespective of the shape of
the baseline hazard. This is termed a semi-parametric approach.
So useful and flexible has this proved, that the Cox model now
dominates the literature on survival analysis, and it is probably the tool
to which a statistician would turn first for the analysis of survival data.
____________
98 Let $R(t_j)$ denote the set of lives which are at risk just before the $j$th
observed lifetime and for the moment assume that there is only one
death at each observed lifetime, that is $d_j = 1$ $(1 \le j \le k)$.
\[ L(\beta) = \prod_{j=1}^{k} \frac{\exp\left( \beta z_j^T \right)}{\sum_{i \in R(t_j)} \exp\left( \beta z_i^T \right)} \]
99 Note that the baseline hazard cancels out and the partial likelihood
depends only on the order in which deaths are observed. (The name
‘partial’ likelihood arises because those parts of the full likelihood
involving the times at which deaths were observed and what was
observed between the observed deaths are thrown away.)
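The partial likelihood can be evaluated and maximised directly for a small data set. The sketch below does this in base R for a single covariate with no tied death times. The data are hypothetical, and $z_j$ is taken to be the covariate value of the life that dies at $t_j$. This is an illustration of the formula only, not a substitute for a fitting routine such as coxph.

```r
# Hypothetical data: no tied death times, one covariate z
time  <- c(3, 5, 7, 9, 12, 15)
delta <- c(1, 1, 0, 1, 1, 0)    # 1 = death observed, 0 = censored
z     <- c(1, 0, 1, 1, 0, 0)

# Log of the partial likelihood L(beta)
partial_loglik <- function(beta) {
  death_idx <- which(delta == 1)
  terms <- sapply(death_idx, function(j) {
    at_risk <- time >= time[j]                      # risk set R(t_j)
    beta * z[j] - log(sum(exp(beta * z[at_risk])))  # log of the j-th factor
  })
  sum(terms)
}

# Maximise over beta on a search interval chosen for this illustration
beta_hat <- optimize(partial_loglik, interval = c(-5, 5), maximum = TRUE)$maximum
```

Rescaling all the times by a positive constant leaves `partial_loglik` unchanged, which reflects the point that only the order of the observed deaths matters.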
____________
Breslow’s approximation
102 Accurate calculation of the partial likelihood in case (a) is messy, since
all possible combinations of $d_j$ deaths out of the $R(t_j)$ at risk at time
$t_j$ ought to contribute, and an approximation due to Breslow is often
used, namely:
\[ L(\beta) = \prod_{j=1}^{k} \frac{\exp\left( \beta s_j^T \right)}{\left( \sum_{i \in R(t_j)} \exp\left( \beta z_i^T \right) \right)^{d_j}} \]
103 As mentioned earlier, the partial likelihood behaves much like a full
likelihood; it yields an estimator for $\beta$ which is asymptotically
(multivariate) normal and unbiased, and whose asymptotic variance
matrix can be estimated by the inverse of the observed information
matrix.
____________
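\[ u(\beta) = \left( \frac{\partial \log L(\beta)}{\partial \beta_1}, \ldots, \frac{\partial \log L(\beta)}{\partial \beta_p} \right) \]
\[ I(\beta)_{ij} = -\frac{\partial^2 \log L(\beta)}{\partial \beta_i \, \partial \beta_j} \qquad (1 \le i, j \le p) \]
evaluated at $\hat{\beta}$.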
____________
104 A useful feature of most computer packages for fitting a Cox model is
that the information matrix evaluated at $\hat{\beta}$ is usually produced as a
by-product of the fitting process (it is used in the Newton-Raphson
algorithm) so standard errors of the components of $\hat{\beta}$ are available.
These are helpful in evaluating the fit of a particular model.
____________
Model fitting
\[ -2\left( L_p - L_{p+q} \right) \]
Strictly this statistic is based upon full likelihoods, but when fitting a
Cox model it is used with partial likelihoods.
level of $z_i^{(1)}$ representing sex and the level of $z_i^{(2)}$ representing blood
pressure.
Building models
(a) we start with the null model (one with no covariates) and add
possible covariates one at a time; or
(b) we start with a full model which includes all possible covariates,
and then try to eliminate those of no significant effect.
____________
R code:
coxph(formula)
The argument formula will be similar to that used when fitting a linear
model via lm() (see Subject CS1) except that the response variable
will be a survival object instead of a vector.
____________
This section contains all the relevant exam questions from 2008 to 2017 that
are related to the topics covered in this booklet.
Solutions are given after the questions. These give enough information for
you to check your answer, including working, and also show you what an
outline examination answer should look like. Further information may be
available in the Examiners’ Report, ASET or Course Notes. (ASET can be
ordered from ActEd.)
We first provide you with a cross-reference grid that indicates the main
subject areas of each exam question. You can use this, if you wish, to
select the questions that relate just to those aspects of the topic that you
may be particularly interested in reviewing.
Alternatively, you can choose to ignore the grid, and attempt each question
without having any clues as to its content.
Cross-reference grid
[Grid not reproduced. For each question it indicates which of the following
topics are covered: Lifetime RVs, Force of mortality, UDD assumption,
Central rate of mortality, Gompertz model, Weibull model, Hazard function,
Types of censoring, Kaplan-Meier model, Nelson-Aalen model, PH models.
A final column, ‘Question attempted’, is provided for recording progress.]
Instrument
    Piano          0
    Violin         [-0.05, 0.19]
    Trumpet        [0.07, 0.21]
Tuition method
    Traditional    0
    New            [-0.15, 0.05]
Sex
    Male           [-0.08, 0.12]
    Female         0
(i) Write down a general expression for the Cox proportional hazards model,
defining all terms that you use. [3]
(ii) State the regression parameters for the fitted model. [2]
(iii) Describe the class of children to which the baseline hazard applies. [1]
(iv) Discuss the suggestion that the new tuition method has improved the
chances of children continuing to play their instrument. [3]
(v) Calculate, using the results from the model, the probability that a boy will
still be playing the piano after 4 years if provided with the new tuition
method, given that the probability that a girl will still be playing the trumpet
after 4 years following the traditional method is 0.7. [3]
[Total 12]
1 2 Died
2 6 Died
3 12 Died
4 20 Left hospital
5 24 Left hospital
6 27 Died
7 30 Study ended
8 30 Study ended
9 30 Study ended
10 30 Study ended
(i) State whether the following types of censoring are present in this
investigation. In each case give a reason for your answer.
(a) Type I
(b) Type II
(ii) State, with a reason, whether the censoring in this investigation is likely to
be informative. [1]
(iii) Calculate the value of the Kaplan-Meier estimate of the survival function
at duration 28 days. [5]
(iv) Write down the Kaplan-Meier estimate of the hazard of death at duration 8
days. [1]
(i) Express the hazard rate at duration x months in terms of probabilities. [1]
(iii) Estimate confidence intervals around the integrated hazard using the
results from part (ii) to test the hypothesis that the rate of reconviction has
declined since the previous investigation. [6]
[Total 12]
(iii) Calculate mx under both of the assumptions (a) and (b) above. [5]
30 98,617
40 97,952
(i) Calculate ${}_5q_{30}$ under each of the two following alternative assumptions:
(ii) Calculate the number of survivors to exact age 35 years out of 100,000
births under each of the assumptions in (i) above. [1]
(i) Prove that, under Gompertz’s Law, the probability of survival from age $x$
to age $x + t$, ${}_tp_x$, is given by:
\[ {}_tp_x = \left[ \exp\left( \frac{-B}{\ln c} \right) \right]^{c^x (c^t - 1)} \qquad [3] \]
${}_1p_{50} = 0.995$
${}_2p_{50} = 0.989$
(iii) Comment on the calculation performed in (ii) compared with the usual
process for estimating the parameters from a set of crude mortality rates.
[3]
[Total 9]
Let Tx be a random variable denoting future lifetime after age x , and let T
be another random variable denoting the lifetime of a new-born person.
\[ S_x(t) = \exp\left( -(\lambda t)^\beta \right) \]
where $\lambda$ and $\beta$ are parameters ($\lambda, \beta > 0$).
(iv) Sketch, on the same graph, the Weibull force of mortality for $0 \le t \le 5$ for
the following pairs of values of $\lambda$ and $\beta$:
$\lambda = 1$, $\beta = 0.5$
$\lambda = 1$, $\beta = 1$
$\lambda = 1$, $\beta = 1.5$ [4]
[Total 10]
In your answer, explain the shape of the survival function between ages x
and y under each of the two assumptions. [2]
When the test was complete, the sub-contractor reported that he had
terminated the test after 150 days. He further reported that:
two batteries had failed after 97 days
three further batteries had failed after 120 days
two further batteries had failed after 141 days
one further battery had failed after 150 days.
(i) State, with reasons, the forms of censoring present in this study. [2]
(ii) Calculate the Kaplan-Meier estimate of the survival function based on the
information supplied by the sub-contractor. [5]
The study investigated the impact of age, sex and educational qualifications
on the hazard of returning to work using the following covariates:
S a dummy variable taking the value 1 if the person was male and 0 if the
person was female
E a dummy variable taking the value 1 if the person had passed a school
leaving examination in mathematics, and 0 otherwise
(ii) Explain why the Cox model is a popular model for the analysis of survival
data. [3]
(iii) (a) Write down the equation of the model that was estimated, defining
the terms you use (other than those defined above).
(b) List the characteristics of the young person to whom the baseline
hazard applies. [3]
Write down integral equations for the mean and variance of the complete
future lifetime at age x , Tx . [2]
The employer has maintained records for 23 of its students who all sat their
first examination in the first session of 2003. The students’ progress has been
recorded up to and including the last session of 2009. The following data
records the number of sessions which had been held before the specified
event occurred for a student in this cohort:
The remaining seven students were still studying for the examinations at the
end of 2009.
(i) Determine the median number of sessions taken to qualify for those
students who qualified during the period of observation. [2]
(ii) Calculate the Kaplan-Meier estimate of the survival function S(t ) for the
‘hazard’ of qualifying, where t is the number of sessions of examinations
since 1 January 2003. [5]
(iii) Hence estimate the median number of sessions to qualify for the students
of this employer. [2]
(iv) Explain the difference between the results in (i) and (iii) above. [2]
[Total 11]
(i) Write down the hazard function for the Cox proportional hazards model
defining all the terms that you use. [2]
Parameter
Variance
estimate
Chicken 0 0
Bird Duck –0.210 0.002
Goose 0.075 0.004
New 0.125 0.0015
Enclosure
Old 0 0
Male 0.2 0.0026
Sex
Female 0 0
(ii) State the features of the bird to which the baseline hazard applies. [1]
(b) Calculate the 95% confidence interval based on the standard error. [3]
(iv) Comment on the farmer’s belief that the new enclosure will result in an
increase in his birds’ life expectancy. [2]
(v) Calculate, using this model, the probability that a female duck in the new
enclosure has been killed by a predator at the end of six months, given
that the probability that a male goose in the old enclosure has been killed
at the end of the same period is 0.1 (all other decrements can be ignored).
[4]
[Total 12]
(ii) Calculate ${}_{0.5}p_{60}$ to six decimal places under both assumptions given
$q_{60} = 0.05$. [2]
(iii) Comment on the relative magnitude of your answers to part (ii). [1]
[Total 5]
(i) Describe the types of censoring which are present in the study. [2]
A study of the mortality of a certain species of insect reveals that for the first 30
days of life, the insects are subject to a constant force of mortality of 0.05.
After 30 days, the force of mortality increases according to the formula:
\[ \mu_{30+x} = 0.05 \exp(0.01x) \]
(i) Calculate the probability that a newly born insect will survive for at least
10 days. [1]
(ii) Calculate the probability that an insect aged 10 days will survive for at
least a further 30 days. [3]
(iii) Calculate the age in days by which 90 per cent of insects are expected
to have died. [4]
[Total 8]
(i) Explain whether each of the following types of censoring is present and
for those present explain where they occur:
right censoring
left censoring
informative censoring. [3]
(ii) Calculate the Kaplan-Meier estimate of the survival function for these
patients, stating all assumptions that you make. [6]
(iv) Estimate the probability that a patient will die within four weeks of surgery.
[1]
[Total 12]
A new weed killer was tested which was designed to kill weeds growing in
grass. The weedkiller was administered via a single application to 20 test
areas of grass. Within hours of applying the weedkiller, the leaves of all the
weeds went black and died, but after a time some of the weeds re-grew as
the weedkiller did not always kill the roots.
The test lasted for 12 months, but after six months five of the test areas were
accidentally ploughed up and so the trial on these areas had to be
discontinued. None of these five areas had shown any weed re-growth at
the time they were ploughed up.
Five areas still had no weed re-growth when the trial ended after 12
months.
(i) Describe, giving reasons, the types of censoring present in the data. [2]
(ii) Estimate the probability that there is no re-growth of weeds nine months
after application of the weedkiller using either the Kaplan-Meier or the
Nelson-Aalen estimator. [4]
[Total 6]
A study is made of the impact of regular exercise and gender on the risk of
developing heart disease among 50-70 year olds. A sample of people is
followed from exact age 50 years until either they develop heart disease or
they attain the age of 70 years. The study uses a Cox regression model.
(i) List reasons why the Cox regression model is a suitable model for
analyses of this kind. [3]
Z1 = 1 if male, 0 if female
Z2 = 1 if takes regular exercise, 0 otherwise.
The investigator then fitted three models, one with just gender as a covariate,
a second with gender and exercise as covariates, and a third with gender,
exercise and the interaction between them as covariates. The maximised
log-likelihoods of the three models and the maximum likelihood estimates of
the parameters in the third model were as follows:
Covariate Parameter
Gender 0.2
Exercise –0.3
Interaction –0.35
(ii) Show that the interaction term is required in the model by performing a
suitable statistical test. [5]
A new drug treatment for patients suffering from a chronic skin disease with
visible symptoms was tested. The drug was administered through a daily dose
for the duration of the trial. As soon as the drug regime started, the symptoms
disappeared in all patients, but after some time had a tendency to reappear as
the agent causing the disease developed resistance to the drug. The trial
lasted for six months.
The data below show the number of patients experiencing a return of their
symptoms in each month after the drug regime started.
Month    Number of patient-months    Number of patients experiencing
         exposed to risk             a return of their symptoms
1        200                         5
2        190                         8
3        175                         15
4        150                         10
5        135                         6
6        125                         3
(ii) Comment on the use of each of these models in this situation. [4]
[Total 6]
\[ L = \prod_{i=1}^{n} (A + B t_i)^{\delta_i} \exp\left[ -A t_i - \tfrac{1}{2} B t_i^2 \right] \qquad [3] \]
(ii) Derive two simultaneous equations from which the maximum likelihood
estimates of the parameters A and B can be calculated. [3]
[Total 6]
Mr Bunn the baker made 12 pies to sell in his shop. He placed the pies in
the shop at 9am. During the rest of the day the following events took place:
Time Event
10am A boy bought two pies
11am A man bought three pies
12 noon Mr Bunn accidentally sat on one pie and squashed it so it could
not be sold
1pm A woman bought two pies
2pm A dog from across the street ran into Mr Bunn’s shop and stole
two pies
3pm A girl on the way home from school bought one pie
5pm Mr Bunn closed for the day and the remaining pie was still in
the shop
(i) Estimate the time it takes Mr Bunn to sell 40% of the pies he makes, using
the Nelson-Aalen estimator. [6]
(ii) Comment on whether you think this estimate would be a good basis for
Mr Bunn to plan his future production of pies. [3]
[Total 9]
(ii) Write down a general expression for the Cox proportional hazards model,
defining all the terms you use. [2]
A life office is trying to understand the impact of certain factors on the lapse
rates of its policies. It has studied the lapse rates on a block of business
subdivided by:
• sex of policyholder (Male or Female)
• policy type (Term Assurance or Whole Life)
• sales channel (Internet, Direct Sales Force or Independent Financial
Adviser.)
The office has fitted a Cox proportional hazards model to the data and has
calculated the following regression parameters:
Female 0.2
Male 0
Internet 0.4
Independent Financial Adviser −0.2
Direct Sales Force 0
(iii) State the sex/sales channel/policy type combination to which the baseline
hazard relates. [1]
(iv) Calculate the probability that this term assurance is still in force after five
years given that 60% of whole life policies bought on the internet by males
have lapsed by the end of year five. [4]
[Total 8]
A certain town runs a training course for traffic wardens each year. The
course lasts for 30 days, but the examination which enables someone to
qualify as a traffic warden can be sat any day during the course. In 2011
there were 13 participants who started the training course. The following
table has been compiled to show the day each candidate qualified or the day
each candidate who did not qualify left the course.
(iii) Sketch a graph of the Kaplan-Meier estimate, labelling the axes. [2]
When the data were gathered, the reasons for exit of candidates D and H
were accidentally transposed, and those for candidates B and L were also
accidentally transposed.
(iv) Explain how your answer to part (ii) would change if you had access to the
correct (ie untransposed) data for candidates D, H, B and L. [3]
[Total 14]
(i) Define right censoring, Type I censoring and Type II censoring. [3]
(i) Show that, for furry animals that die at ages over five years, the average
age at death in years is $\frac{5\mu + 1}{\mu}$. [1]
A new investigation of this species of furry animal revealed that 30 per cent
of those born survived to exact age 10 years and 20 per cent of those born
survived to exact age 15 years.
(i) State the form of the hazard function for the Cox regression model,
defining all the terms used. [2]
Susanna is studying for an online test. She has collected data on past
attempts at the test and has fitted a Cox regression model to the success
rate using three covariates:
Employment 0.4
Attempt –0.2
Study time 1.15
Bill is an employee. He has taken study time and is attempting the test for
the second time. Ben is self-employed and is attempting the test for the first
time without taking study time.
(iii) Calculate how much more or less likely Ben is to pass, compared with Bill.
[3]
(iv) Explain how the model could be adjusted to take this into account. [2]
[Total 9]
The Shining Light company has developed a new type of light bulb which it
recently tested. 1,000 bulbs were switched on and observed until they
failed, or until 500 hours had elapsed. For each bulb that failed, the duration
in hours until failure was noted. Due to an earth tremor after 200 hours, 200
bulbs shattered and had to be removed from the test before failure.
The results showed that 10 bulbs failed after 50 hours, 20 bulbs failed after
100 hours, 50 bulbs failed after 250 hours, 300 bulbs failed after 400 hours
and 50 bulbs failed after 450 hours.
(i) Calculate the Kaplan-Meier estimate of the survival function S(t ) for the
light bulbs in the test. [6]
(iii) Estimate the probability that a bulb will not have failed after each of the
following durations: 300 hours, 400 hours and 600 hours. If it is not
possible to obtain an estimate for any of the durations without additional
assumptions, explain why. [3]
[Total 11]
(ii) Describe two types of censoring that are present and state to whom they
apply. [2]
2 6 3 2
1 7 1 10
1 10 3 13
2 14
(iii) Calculate the Nelson-Aalen estimate of the survival function for this trial.
[5]
(v) Estimate the probability that a person using the cream will still have
symptoms of the skin condition after two weeks. [1]
[Total 11]
loge x 0 x 1U 2I
(ii) Show that the model is both a Gompertz model and a proportional
hazards model. [3]
(iii) Calculate the predicted force of mortality for an urban resident aged 40
years with an annual income of $20,000. [2]
(iv) Calculate the additional income that an urban resident must have in order
to have the same force of mortality as a rural resident of the same age. [2]
(v) Calculate the 10-year survival probability for an urban resident aged 40
years whose annual income is $20,000. [2]
(vi) Determine the age of a rural resident with the same income as an urban
resident aged 40 years, who has the same chance of surviving for the
next 10 years. [4]
[Total 14]
An investigation has been performed into risk factors for liver disease in
persons currently resident in the United Kingdom (UK) and aged over 50
years. It considered the impact of three covariates: age at the start of the
investigation, weekly alcohol consumption and previous residence in a
tropical country.
The investigation used a Cox regression model for the hazard of developing
the disease, $h(t)$, with three parameters, $\beta_A$, $\beta_C$ and $\beta_T$, as follows:
\[ h(t) = h_0(t) \exp(\beta_A A + \beta_C C + \beta_T T) \]
A was defined as exact age at the start of the investigation less 50 years.
(iii) Show that the probability of a person of the same age and drinking habits,
who has lived for more than 12 months in a tropical country, remaining
free of the disease for 10 years is slightly over one half. [4]
[Total 10]
(ii) Explain what right censoring, left censoring and interval censoring are,
giving an example of each. [3]
A toy manufacturer is testing the lifetime of its new electric children’s toy.
500 are set going at 9am one morning on test rigs plugged into the electricity
supply and are run until 5pm the next day or until they fail, whichever comes
first. Unfortunately the cleaner unplugged a test rig on which 17 toys were
still working at 7pm on the first evening in order to plug his floor polisher in.
Then, as he left work three hours later, he took three of the still working toys
for his children to play with. Of the other 480 toys it was found that 12 failed
after four hours, 25 failed after 11 hours and a further 8 failed after 31 hours.
(iii) Explain which forms of censoring are present in this investigation. [2]
(vi) Comment on the length of time for which a new toy has a 60% probability
of surviving. [1]
[Total 14]
However, it is known that the probability that an animal aged exactly five
years will survive until exact age 10 years is twice the probability that an
animal aged exactly five years will survive until exact age 20 years.
Assume that the force of mortality $\lambda$ is constant at ages over five years
exact.
(iv) Calculate the expectation of life at birth for these animals if $\lambda = \mu$. [1]
(i) State, with reasons, whether the following types of censoring are present
in this investigation:
right
Type I
Type II
random. [4]
(iii) Calculate the Kaplan-Meier estimate of the survivor function for remaining
in the hospital. [6]
(iv) Sketch the Kaplan-Meier estimate of the survivor function, labelling the
axes. [2]
The mortality of a rare form of flying beetle is being studied. It has been
discovered that beetles kept in a protected environment have a constant
force of mortality $\mu$ but that those in the wild have a force of mortality which
is 50% higher. It has been proven that the beetles revert immediately to the
higher rate of mortality if they are released from the protected environment.
A beetle born and always living in the wild has a 58% chance of living for
eight days.
(ii) Outline three reasons why the Cox proportional hazards model is widely
used in empirical work. [3]
[Total 6]
A study was made of a group of people seeking jobs. 700 people who were
just starting to look for work were followed for a period of eight months in a
series of interviews after exactly one month, two months etc. If the job
seeker found a job during a month, the job was assumed to have started at
the end of the month. Unfortunately, the study was unable to maintain
contact with all the job seekers.
The data from the study are shown in the table below.
Months since
Found employment Contact lost
start of study
1 100 50
2 70 0
3 50 20
4 40 20
5 20 30
6 20 60
7 12 38
8 6 0
(ii) Calculate the Kaplan-Meier estimate of the function for ‘remaining without
employment’. [6]
(iii) Test the goodness of fit of the data to this Weibull distribution. [6]
[Total 15]
(ii) Explain whether each form of censoring listed in part (i) occurs in each of
the following situations. If it is not possible to state whether a form of
censoring occurs, explain why this is the case.
where:
h0 (t ) is the baseline hazard
z is a covariate taking the value 1 if the cow was assigned the new
treatment and 0 if the cow was assigned the previous treatment
x is a covariate denoting the length of time (in days) for which the cow
had been suffering from the condition when treatment was started
For a particular cow, the new treatment and the previous treatment have
exactly the same hazard.
(iii) Calculate the number of days for which that cow had the condition before
the initiation of treatment. [2]
Under the previous treatment, cows whose treatment began after they had
been suffering from the condition for three days had a median recovery time
of 14 days once treatment had started.
(iv) Calculate the proportion of these cows which would still have had the
condition after 14 days if they had been given the new treatment. [4]
[Total 10]
Last year 33 students started the course. Of these 13 dropped out before
completing the year, and 16 passed the test before the end of the year. The
last lesson attended by the students who did not stay for the whole 39
lessons is shown in the table below along with their reason for leaving.
(iii) Determine the probability that a student who starts the course passes by
the end of the year. [1]
Since only four students had not passed by the end of the year and a total of
16 had passed, the school claims in its publicity that 80% of students are
awarded the qualification by the end of the year.
(iv) Comment on the school’s claim in light of your answer to part (iii). [2]
[Total 10]
(ii) Give an estimate of the daily cost of the new scheme. [1]
(iii) Comment on the assumptions that you have made in obtaining the
estimate in (ii). [2]
[Total 9]
The company has looked at its records over recent years and has fitted a
Cox proportional hazards model to those who have transferred within the
first two years using the factors which appear to have the most impact on
early transfer rates.
(i) Give the hazard function for this Cox proportional hazard model defining
all the terms and covariates. [3]
(ii) State the features of the person to whom the baseline hazard applies. [1]
(iii) Calculate symmetric 95% confidence intervals for the parameters based
on the standard errors. [2]
(iv) Test the suggestion that women change energy providers more frequently
than men. [3]
(v) Calculate the probability that a male customer who is a high consumer of
energy and lives in a city centre remains with the company for at least two
years. [3]
(vi) Set out how you would determine whether the effect of any of the factors
depends upon any of the other factors. [5]
[Total 17]
1 20 9 15
2 19 10 15
3 18 11 15
4 18 12 15
5 17 13 13
6 17 14 12
7 17 15 10
8 16
He's fitted a model with what he assumes are the most common contributing
factors and has calculated the parameters as shown in the table below:
Gender Male 0
Female 0.065
(i) Give the hazard function for this Cox proportional hazards model, defining
all the terms and covariates. [4]
A male moderate drinker who does not smoke has a hazard of leaving
hospital after three days of 0.6.
(ii) Calculate the hazard rate at three days for a female smoker who is a
heavy drinker and who is still in hospital at that point. [3]
(iii) Explain how the researcher could test this suggestion statistically. [2]
Another colleague suggests that the original model is good, but could be
improved by including an additional factor as to whether a patient is married
or not.
(iv) Set out how the researcher could establish whether an additional factor
representing marital status would improve the model. [4]
[Total 13]
A study was made of the impact of drinking beer on men aged 60 years and
over. A sample of men was followed from their 60th birthdays until they died,
or left the study for other reasons. The baseline hazard of death, $\mu$, was
assumed to be constant, and a proportional hazards model was estimated with
a single covariate: the average daily beer intake in standard-sized glasses
consumed, $x$. The equation of the model is:
\[ h(t) = \mu \exp(\beta x) \]
(ii) Explain how $\mu$ and $\beta$ should be interpreted, in the context of this model.
[2]
(iii) Calculate the estimated hazard of death of a man aged exactly 62 years
who drinks two glasses of beer a day. [1]
A man is aged exactly 60 years and drinks three glasses of beer a day.
(iv) (a) Calculate the estimated probability that this man will still be alive in
10 years’ time.
(b) Calculate the expectation of life at age 60 years for this man. [2]
Another man is aged exactly 60 years. He drinks beer only in his local bar.
He drinks all the beer he buys and is expected to continue drinking the same
amount of beer every day until he dies. The owner of the bar is interested in
selling as much beer as possible.
(v) Determine the average number of glasses of beer a day the owner must
sell the man in order to maximise the total amount of beer the man buys
over his remaining lifetime. [4]
[Total 11]
As she runs a high quality establishment, she has lots of customers and some
of the cheese is sold. After ten days she decides the cheese will be too old to
sell and throws out the remaining packets.
(i) State, with reasons, THREE types of censoring present in this situation. [3]
(ii) Assess, for EACH type of censoring listed in your answer to part (i),
whether a change to the observational plan could be made which would
remove that type of censoring. [3]
(iii) Calculate the Kaplan-Meier estimate of the survival function for cheese
staying free from mould. [6]
[Total 12]
\[ h(t) = h_0(t) \exp(S \beta_S + A \beta_A + G \beta_G) \]
S represents the sex of the patient and takes a value of 1 if the patient
is female, 0 if male
A represents the age, in years minus 20, of the patient when the drug
was administered
The company has discovered the following, where the age given is the age
when the drug was administered:
a 25 year old female who attended a gym had a hazard of symptoms
disappearing equal to twice that of a male of the same age who did not
attend a gym
a 45 year old male who did not attend a gym had a hazard of symptoms
disappearing half that of a 43 year old male who attended a gym
a 32 year old female who attended a gym had a hazard of symptoms
disappearing 60% greater than that of a 45 year old female who did not
attend a gym.
(ii) Determine for which group of people the drug is most effective. [3]
The probability that a woman who attended a gym and was aged 38 years
when she was given the drug still had symptoms of the condition after 28 days
was found to be 0.75.
(iii) Calculate the probability of still having symptoms after 28 days for a male
aged 26 years when given the drug who did not attend a gym. [4]
[Total 12]
(i) Write down the formulae for the Kaplan-Meier estimator $\hat{S}(t)$ and
Nelson-Aalen estimator $S(t)$ of survival in the presence of a stated hazard,
defining all terms used. [2]
(ii) Demonstrate that the Nelson-Aalen estimator is never lower than the
Kaplan-Meier estimator. [2]
A trial is conducted amongst 20 patients who have suffered from eczema but
are in remission (that is, they are clear of the condition). The trial is to assess
whether continuing with periodic doses of a certain steroid cream in remission
reduces the rate at which eczema recurs. Patients are invited to tests every 3
months for a period of up to 5 years from when first declared to be in
remission.
The data for the trial are subdivided into a group who continued to receive the
steroid cream, and a control group who did not receive the steroid cream. The
data for the patients in the trial showing the quarterly test at which eczema
recurred, or censoring occurred, are as follows (an * indicates a patient who
was censored):
For group receiving steroid cream: 3, 5, 6*, 7*, 10, 10, 12*, 14*, 18, 19*
(b) Comment on the chance of being able to conclude from the trial data
that continuing to receive the steroid cream reduces the risk of
recurrence of eczema. [3]
[Total 18]
The solutions presented here are just outline solutions for you to use to
check your answers. See ASET for full solutions.
\[ \lambda(t, Z) = \lambda_0(t)\, e^{\beta Z^T} \]
where:
\[ Z_4 = \begin{cases} 1 & \text{if child is male} \\ 0 & \text{otherwise} \end{cases} \]
The class of children to which the baseline hazard applies is girls who learn
the piano using the traditional method.
The point estimate of $\beta_3$ is –0.05. Since this is negative, it suggests that the
new tuition method reduces the hazard rate, ie it improves the chances of
children continuing to play their instruments.
However, when we look at the 95% confidence interval for $\beta_3$, we see that it
contains 0. So $\beta_3$ does not appear to be significantly different from 0. This
implies that the difference made by the new tuition method is not significant.
It may also be worth looking at the breakdown by sex and instrument played to
see if the new method makes a significant difference for a particular sex or a
particular instrument.
(v) Probability
The probability that a girl, taught using the traditional method, will still be
playing the trumpet after 4 years is:
\[ \exp\left(-\int_0^4 \lambda_0(s)\, e^{\beta_2}\, ds\right) = 0.7 \]
\[ \left[\exp\left(-\int_0^4 \lambda_0(s)\, ds\right)\right]^{e^{0.14}} = 0.7 \]
\[ \Rightarrow \exp\left(-\int_0^4 \lambda_0(s)\, ds\right) = (0.7)^{e^{-0.14}} = 0.73339 \]
So the probability that a boy, taught using the new method, will still be playing
the piano after 4 years is:
\[ \exp\left(-\int_0^4 \lambda_0(s)\, e^{\beta_3 + \beta_4}\, ds\right) = \left[\exp\left(-\int_0^4 \lambda_0(s)\, ds\right)\right]^{e^{-0.05 + 0.02}} = (0.73339)^{e^{-0.03}} = 0.74014 \]
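The scaling used in this solution (under proportional hazards, multiplying the hazard by a factor raises the survival probability to that power) can be checked numerically. The following is a minimal sketch only; the function name `scale_survival` is an illustrative assumption, and the parameter values 0.14, -0.05 and 0.02 are the ones quoted in the solution.

```python
import math

def scale_survival(survival_prob, log_hazard_ratio):
    """Survival probability after the hazard is multiplied by exp(log_hazard_ratio).

    Under proportional hazards, S_new(t) = S_old(t) ** exp(log_hazard_ratio).
    """
    return survival_prob ** math.exp(log_hazard_ratio)

# Girl, traditional method, trumpet: survival 0.7 with linear predictor 0.14.
# Dividing the hazard by exp(0.14) recovers the baseline survival probability.
baseline = scale_survival(0.7, -0.14)

# Boy, new method, piano: linear predictor -0.05 + 0.02.
boy_new_piano = scale_survival(baseline, -0.05 + 0.02)

print(round(baseline, 5), round(boy_new_piano, 5))
```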
Type I censoring occurs at the end of the investigation, when the remaining
patients are censored. It is known in advance that any remaining patients will
be censored after 30 days.
Type II censoring does not occur here since the investigation does not end
once a predetermined number of patients have died.
Random censoring occurs since it is not known in advance when the patients
will leave hospital. The duration at which a patient leaves hospital can be
considered to be a random variable.
Informative censoring is likely to be present here since a patient who has left
hospital is likely to be in better health, and is therefore less likely to die, than
those that remain.
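\[ \hat{S}(t) = \begin{cases}
1 & \text{for } 0 \le t < 2 \\
\frac{9}{10} & \text{for } 2 \le t < 6 \\
\frac{9}{10} \times \frac{8}{9} = \frac{4}{5} & \text{for } 6 \le t < 12 \\
\frac{9}{10} \times \frac{8}{9} \times \frac{7}{8} = \frac{7}{10} & \text{for } 12 \le t < 27 \\
\frac{9}{10} \times \frac{8}{9} \times \frac{7}{8} \times \frac{4}{5} = \frac{14}{25} & \text{for } 27 \le t \le 30
\end{cases} \]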
[Graph: Kaplan-Meier estimate $\hat{S}(t)$ (vertical axis, 0 to 1) plotted as a step function against time $t$ (horizontal axis, 0 to 30)]
\[ \lambda_x = \lim_{h \to 0^+} \frac{1}{h} \cdot \frac{P(x < X \le x + h)}{P(X > x)} \]
\[ \hat{\Lambda}_t = \sum_{t_j \le t} \hat{\lambda}_j \]
where:
\[ \hat{\lambda}_j = \frac{d_j}{n_j} \]
$d_j$ = number of reconvictions at time $t_j$
$n_j$ = number of criminals at large at time $t_j$ (prior to any reconvictions
at time $t_j$)
$t_j$ = $j$th time at which a reconviction occurs.
\[ \hat{S}(t) = e^{-\hat{\Lambda}_t} \]
The table below shows the data in terms of the required values, and the
calculation of the cumulative hazard function at each t j .
$j$    $t_j$    $n_j$    $d_j$    $\hat{\lambda}_j$    $\hat{\Lambda}_j$
1      3        95       4        0.042105             0.042105
2      4        90       3        0.033333             0.075439
3      5        85       5        0.058824             0.134262
Range of $t$       $\hat{S}(t)$
$0 \le t < 3$      1
$3 \le t < 4$      0.95877
$4 \le t < 5$      0.92734
$5 \le t < 7$      0.87436
$t \ge 7$          $\le 0.87436$
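The Nelson-Aalen figures in the tables above can be reproduced with a few lines of code. This is an illustrative sketch, not part of the outline solution; the function name `nelson_aalen` and the tuple layout are assumptions, and the data are the three rows of $(t_j, n_j, d_j)$ shown in the table.

```python
import math

def nelson_aalen(events):
    """events: list of (t_j, n_j, d_j) in increasing order of t_j.

    Returns a list of (t_j, cumulative hazard, survival estimate), where the
    survival estimate is exp(-cumulative hazard).
    """
    rows, cumulative = [], 0.0
    for t, n, d in events:
        cumulative += d / n
        rows.append((t, cumulative, math.exp(-cumulative)))
    return rows

rows = nelson_aalen([(3, 95, 4), (4, 90, 3), (5, 85, 5)])
for t, cumulative_hazard, survival in rows:
    print(t, round(cumulative_hazard, 6), round(survival, 5))
```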
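\[ H_0 : F(6) = 0.2 \]
\[ H_1 : F(6) < 0.2 \]
\[ \left[0, F_u(6)\right] \]
\[ \mathrm{var}\left(\hat{\Lambda}_6\right) = \sum_{t_j \le 6} \frac{d_j (n_j - d_j)}{n_j^3} = \frac{4 \times 91}{95^3} + \frac{3 \times 87}{90^3} + \frac{5 \times 80}{85^3} = 0.00143391 \]
The standard error of $\hat{\Lambda}_6$ is then:
\[ SE = \sqrt{0.00143391} = 0.037867 \]
\[ \begin{aligned}
F_u(6) &= 1 - \exp\left[-\left(\hat{\Lambda}_6 + 1.6449 \times SE\right)\right] \\
&= 1 - \exp\left[-(0.134262 + 1.6449 \times 0.037867)\right] \\
&= 1 - e^{-0.196549} \\
&= 0.1784
\end{aligned} \]
The required 95% confidence interval for $F(6)$ is therefore $[0, 0.1784]$.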
As this excludes the value 0.2, then we can reject H0 , at least at the 5%
level, and so conclude that there is sufficient evidence to support the
hypothesis that 6-month reconviction rates have declined since the previous
investigation.
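The one-sided confidence limit above can be recomputed as follows. This is a sketch under the same assumptions as the outline solution (variance formula $\sum d_j (n_j - d_j)/n_j^3$ and the normal 5% point 1.6449); the variable names are illustrative.

```python
import math

# (n_j, d_j) for the reconviction times t_j <= 6, taken from the table above.
events = [(95, 4), (90, 3), (85, 5)]

cumulative_hazard = sum(d / n for n, d in events)
variance = sum(d * (n - d) / n**3 for n, d in events)
std_error = math.sqrt(variance)

z_upper_5pct = 1.6449  # upper 5% point of the standard normal distribution
upper_limit_F6 = 1 - math.exp(-(cumulative_hazard + z_upper_5pct * std_error))

# H0: F(6) = 0.2 is rejected if 0.2 lies above the upper limit.
reject_h0 = upper_limit_F6 < 0.2
print(round(variance, 8), round(std_error, 6), round(upper_limit_F6, 4), reject_h0)
```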
We have removed the parts of this question that refer to the Balducci
assumption as this assumption is beyond the scope of Subject CS2.
\[ m_x = \frac{q_x}{\int_0^1 {}_t p_x \, dt} \]
where $\int_0^1 {}_t p_x \, dt$ is the expected time spent alive in the year beginning at
exact age $x$.
\[ {}_t q_x = t \, q_x \]
\[ {}_t q_x = 1 - {}_t p_x = 1 - (p_x)^t = 1 - (1 - q_x)^t \]
(iii) Calculation of $m_x$
Now:
\[ \int_0^1 {}_t p_x \, dt = \int_0^1 (1 - {}_t q_x) \, dt = \int_0^1 (1 - t \, q_x) \, dt = \left[ t - \frac{t^2}{2} q_x \right]_0^1 = 1 - \tfrac{1}{2} q_x \]
Therefore:
\[ m_x = \frac{q_x}{1 - \tfrac{1}{2} q_x} = \frac{0.1}{1 - 0.05} = 0.10526 \]
\[ \int_0^1 {}_t p_x \, dt = \int_0^1 (p_x)^t \, dt = \int_0^1 (0.9)^t \, dt \]
Therefore:
\[ m_x = \frac{0.1}{0.949122} = 0.10536 \]
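Both values of $m_x$ can be checked directly. The sketch below assumes $q_x = 0.1$, as in the calculations above; the variable names are illustrative, and the closed form used for the constant force integral, $\int_0^1 (p_x)^t \, dt = (p_x - 1)/\ln p_x$, is standard calculus rather than something quoted in the notes.

```python
import math

q_x = 0.1
p_x = 1 - q_x

# UDD: the integral of t_p_x over [0, 1] is 1 - q_x / 2.
exposure_udd = 1 - q_x / 2
m_udd = q_x / exposure_udd

# Constant force: the integral of p_x ** t over [0, 1] is (p_x - 1) / ln(p_x).
exposure_const = (p_x - 1) / math.log(p_x)
m_const = q_x / exposure_const

print(round(m_udd, 5), round(exposure_const, 6), round(m_const, 5))
```

Under the constant force assumption the result equals $-\ln p_x$, ie the force of mortality itself.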
\[ m_x = \frac{\int_0^1 {}_t p_x \, \mu_{x+t} \, dt}{\int_0^1 {}_t p_x \, dt} \]
The UDD assumption produces a lower value, of 0.10526. UDD implies that
the force of mortality is rising over the year of age (in order to produce a
constant rate of the number dying as the number of survivors decreases
over the age range). As the weights (${}_t p_x$) are higher at the start of the year
and decrease over the year, the greatest weights will be applied to $\mu_{x+t}$
towards the start of the year, where mortality rates are lowest. This will tend
to reduce the weighted average mortality rate compared to (b).
At most points in the human lifespan, the force of mortality is rising with
increasing age. This means that the UDD assumption is likely to be the
most realistic assumption to make.
Both assumptions produce the same answer to the third significant figure (in
this case), so that for most practical purposes either assumption produces
acceptable results.
If deaths occur uniformly between the ages of 30 and 40, we would expect to
get 665 / 2 = 332.5 deaths between the ages of 30 and 35 (and the other
332.5 deaths between the ages of 35 and 40). So, under this assumption,
the probability of dying between the ages of 30 and 35 is:
\[ {}_5 q_{30} = \frac{332.5}{98{,}617} = 0.003372 \]
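\[ {}_{10} p_{30} = e^{-10\mu} \]
\[ {}_5 p_{30} = e^{-5\mu} = \left({}_{10} p_{30}\right)^{1/2} \]
\[ {}_{10} p_{30} = \frac{97{,}952}{98{,}617} = 0.993257 \]
So:
\[ {}_5 p_{30} = \sqrt{\frac{97{,}952}{98{,}617}} = 0.996623 \quad \text{and} \quad {}_5 q_{30} = 1 - 0.996623 = 0.003377 \]
\[ {}_5 q_{30} = 1 - e^{-5\mu} \]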
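Under the UDD assumption, the number of survivors at exact age 35 is
98,617 - 332.5 = 98,284.5. Under the constant force assumption, it is:
\[ 98{,}617 \times {}_5 p_{30} = 98{,}617 \sqrt{\frac{97{,}952}{98{,}617}} = 98{,}283.9 \]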
(iii) Comment
The actual number of survivors is 98,359 and this is higher than the figures
given by both the UDD and the constant force assumptions. This means
that there were more deaths between the ages of 35 and 40 than there were
between the ages of 30 and 35. In other words mortality is higher between
35 and 40 than it is between 30 and 35, which suggests that the force of
mortality is increasing between 30 and 40.
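The comparison in this solution can be reproduced numerically. The sketch below uses the figures quoted above (98,617 lives at age 30, 97,952 at age 40 and 98,359 actual survivors at age 35); the variable names are illustrative assumptions.

```python
import math

lives_30, lives_40 = 98_617, 97_952
actual_lives_35 = 98_359
deaths_30_to_40 = lives_30 - lives_40  # 665

# UDD: half of the deaths fall in each five-year interval.
q5_udd = (deaths_30_to_40 / 2) / lives_30
lives_35_udd = lives_30 - deaths_30_to_40 / 2

# Constant force: the 5-year survival probability is the square root
# of the 10-year survival probability.
p5_const = math.sqrt(lives_40 / lives_30)
q5_const = 1 - p5_const
lives_35_const = lives_30 * p5_const

print(round(q5_udd, 6), round(q5_const, 6))
print(round(lives_35_udd, 1), round(lives_35_const, 1), actual_lives_35)
```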
(i) Proof
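\[ \mu_x = Bc^x \]
\[ \begin{aligned}
\int_0^t \mu_{x+s} \, ds &= \int_0^t Bc^{x+s} \, ds = Bc^x \int_0^t c^s \, ds \\
&= Bc^x \int_0^t e^{s \ln c} \, ds = \frac{Bc^x}{\ln c} \left[ e^{s \ln c} \right]_0^t \\
&= \frac{Bc^x}{\ln c} \left[ c^s \right]_0^t = \frac{Bc^x (c^t - 1)}{\ln c}
\end{aligned} \]
Hence:
\[ {}_t p_x = \exp\left(-\int_0^t \mu_{x+s} \, ds\right) = \exp\left[\frac{-Bc^x (c^t - 1)}{\ln c}\right] = \left[\exp\left(\frac{-B}{\ln c}\right)\right]^{c^x (c^t - 1)} \]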
We have:
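\[ {}_1 p_{50} = \exp\left(\frac{-Bc^{50} (c - 1)}{\ln c}\right) = 0.995 \]
and:
\[ {}_2 p_{50} = \exp\left(\frac{-Bc^{50} (c^2 - 1)}{\ln c}\right) = 0.989 \]
\[ \frac{Bc^{50} (c - 1)}{\ln c} = -\ln 0.995 \quad (1) \]
and:
\[ \frac{Bc^{50} (c^2 - 1)}{\ln c} = -\ln 0.989 \quad (2) \]
Dividing (2) by (1):
\[ \frac{c^2 - 1}{c - 1} = \frac{\ln 0.989}{\ln 0.995} \]
\[ c + 1 = \frac{\ln 0.989}{\ln 0.995} \]
So $c = 1.20665$.
Substituting into (1):
\[ B = \frac{-\ln 0.995 \times \ln 1.20665}{1.20665^{50} (1.20665 - 1)} = 3.797 \times 10^{-7} \]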
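The Gompertz parameters can be recovered in code from the two survival probabilities. This is a minimal sketch of the algebra above; the helper name `gompertz_survival` is an assumption introduced for the round-trip check.

```python
import math

p1_50, p2_50 = 0.995, 0.989  # 1-year and 2-year survival probabilities from age 50

# (c^2 - 1) / (c - 1) = c + 1 = ln(p2_50) / ln(p1_50)
c = math.log(p2_50) / math.log(p1_50) - 1

# Rearranging B c^50 (c - 1) / ln c = -ln(p1_50) for B.
B = -math.log(p1_50) * math.log(c) / (c**50 * (c - 1))

def gompertz_survival(x, t):
    """t_p_x under Gompertz' law mu_x = B c^x, using the fitted B and c."""
    return math.exp(-B * c**x * (c**t - 1) / math.log(c))

print(round(c, 5), B)
print(gompertz_survival(50, 1), gompertz_survival(50, 2))
```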
(iii) Comment
Also, in part (ii), we have used Gompertz’ law for mortality. However, in
general, we may use another member of the Gompertz-Makeham family.
The general formula for these functions is:
(i)(a) Definition of $S_x(t)$
\[ S_x(t) = P(T_x > t) \]
(i)(b) Derivation
We could say:
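\[ S_x(t) = {}_t p_x = \frac{{}_{x+t} p_0}{{}_x p_0} = \frac{S(x + t)}{S(x)} \]
\[ \mu_{x+t} = \lim_{h \to 0} \frac{P(T_x \le t + h \mid T_x > t)}{h} \]
Since:
\[ \mu_{x+t} = -\frac{\partial}{\partial t} \ln {}_t p_x = -\frac{\partial}{\partial t} \ln S_x(t) \]
\[ \mu_{x+t} = \frac{\partial}{\partial t} (\lambda t)^\beta = \beta \lambda^\beta t^{\beta - 1} \]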
The diagram below shows the three force of mortality functions on the same
graph.
[Graph: force of mortality (vertical axis) against duration $t$ (horizontal axis, 0 to 5), showing three curves for the parameter pairs (1, 1.5), (1, 1) and (1, 0.5)]
UDD assumption
If deaths are uniformly distributed between the ages of x and y , then the
number of lives in the population decreases linearly between the ages of x
and y .
In general:
\[ S_x(t) = {}_t p_x = \exp\left(-\int_0^t \mu_{x+s} \, ds\right) \]
\[ S_x(t) = e^{-t\mu} \]
$j$    Failure time, $t_j$    $n_j$ = number at risk just before time $t_j$    $d_j$ = number of failures at time $t_j$    $(n_j - d_j)/n_j$
1      97                     12                                               2                                           10/12
2      120                    9                                                3                                           6/9
3      141                    6                                                2                                           4/6
4      150                    4                                                1                                           3/4
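\[ \hat{S}(t) = \prod_{t_j \le t} \left( \frac{n_j - d_j}{n_j} \right) = \begin{cases}
1 & \text{for } 0 \le t < 97 \\
\frac{5}{6} & \text{for } 97 \le t < 120 \\
\frac{5}{9} & \text{for } 120 \le t < 141 \\
\frac{10}{27} & \text{for } 141 \le t < 150 \\
\frac{5}{18} & \text{for } t = 150
\end{cases} \]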
The survival function is not defined at times greater than the time at which
the last censoring event took place, ie time 150 days in this case.
(iii) Explain why the figure is consistent with theft of the battery
The estimate of $S(150)$ calculated in part (ii) is $\frac{5}{18}$, or 0.2778, which is not
the same as the figure of 0.2727 in the sub-contractor’s report.
However, if the sub-contractor had stolen one of the batteries at the start of
the investigation, we would have:
$j$    Failure time, $t_j$    $n_j$ = number at risk just before time $t_j$    $d_j$ = number of failures at time $t_j$    $(n_j - d_j)/n_j$
1      97                     11                                               2                                           9/11
2      120                    9                                                3                                           6/9
3      141                    6                                                2                                           4/6
4      150                    4                                                1                                           3/4
So the Kaplan-Meier estimate of the survival function at time 150 would be:
\[ \hat{S}(150) = \frac{9}{11} \times \frac{6}{9} \times \frac{4}{6} \times \frac{3}{4} = \frac{3}{11} = 0.2727 \]
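The two Kaplan-Meier products can be compared exactly using rational arithmetic. The sketch below is illustrative; the function name `km_survival` is an assumption, and the only difference between the two inputs is the number at risk at the first failure time (12 versus 11).

```python
from fractions import Fraction

def km_survival(rows):
    """Product of (n_j - d_j) / n_j over the supplied (n_j, d_j) rows."""
    survival = Fraction(1)
    for n, d in rows:
        survival *= Fraction(n - d, n)
    return survival

all_batteries = km_survival([(12, 2), (9, 3), (6, 2), (4, 1)])
one_missing = km_survival([(11, 2), (9, 3), (6, 2), (4, 1)])

print(all_batteries, float(all_batteries))
print(one_missing, float(one_missing))
```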
\[ \lambda(t, Z) = \lambda_0(t) \exp\left(\beta Z^T\right) \]
where $\lambda_0(t)$ is the baseline hazard at age/duration/time $t$, $\beta$ is a vector of
parameters and $Z$ is a vector of covariates.
In a proportional hazards model, the hazards of different lives with the same
age/duration/time are in the same proportion at all times, eg in the Cox
model:
(ii) Why the Cox model is a popular model for the analysis of survival data
The Cox model ensures that the hazard is always positive and gives a linear
model for the log-hazard, which is convenient in theory and practice.
In the Cox model, the general shape of the hazard function for all individuals
is determined by the baseline hazard, while the exponential term accounts
for the differences between individuals. So if we are not primarily concerned
with the precise form of the hazard, but with the effects of the covariates, we
can ignore $\lambda_0(t)$ and estimate the parameters from the data irrespective of
the shape of the baseline hazard.
\[ \lambda(t, Z) = \lambda_0(t) \exp(\beta_A A + \beta_S S + \beta_E E) \]
The baseline hazard applies to the group of lives for whom all the covariates
are 0, ie females who were aged exactly 16 when they started to claim
benefit and had not passed the school leaving examination in mathematics.
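\[ \frac{\exp(\beta_A + \beta_S + \beta_E)}{\exp(\beta_S)} = 1.5 \]
Hence:
\[ \beta_A + \beta_E = \ln 1.5 \quad (1) \]
\[ \frac{\exp(\beta_E)}{\exp(\beta_S)} = 2 \]
and hence:
\[ \beta_E - \beta_S = \ln 2 \quad (2) \]
\[ \frac{\exp(4\beta_A + \beta_E)}{\exp(\beta_S + \beta_E)} = 2 \]
and hence:
\[ 4\beta_A - \beta_S = \ln 2 \quad (3) \]
\[ 4\beta_A - \beta_E = 0 \]
So:
\[ \beta_E = 4\beta_A \]
\[ 5\beta_A = \ln 1.5 \]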
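The elimination above can be completed numerically. The sketch below simply follows the substitutions shown and then checks the results against equations (1) to (3); the variable names are illustrative, and the printed values are computed here rather than quoted from the outline solution.

```python
import math

# From 5 * beta_A = ln 1.5 and beta_E = 4 * beta_A:
beta_A = math.log(1.5) / 5
beta_E = 4 * beta_A
# From equation (2): beta_E - beta_S = ln 2.
beta_S = beta_E - math.log(2)

# Residuals of equations (1), (2) and (3); each should be zero.
residuals = (
    beta_A + beta_E - math.log(1.5),
    beta_E - beta_S - math.log(2),
    4 * beta_A - beta_S - math.log(2),
)
print(round(beta_A, 4), round(beta_S, 4), round(beta_E, 4))
```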
\[ f(t) = {}_t p_x \, \mu_{x+t}, \quad t \ge 0 \]
\[ E(T_x) = \int_0^\infty t \, f(t) \, dt = \int_0^\infty t \; {}_t p_x \, \mu_{x+t} \, dt \]
$\mathring{e}_x$ is:
\[ E(T_x) = \int_0^\infty {}_t p_x \, dt \]
The first row of data tells us that 11 students qualified during the observation
period.
6 8 8 9 9 9 11 11 13 13 13
The median number of sessions for them is the middle (6th) value from this
ordered list, ie 9 sessions.
[Diagram: timeline of the number of sessions (4, 5, 6, 8, 9, 11, 13, 14), with each event at those times marked Q or S]
j tj dj nj
1 6 1 21
2 8 2 20
3 9 3 17
4 11 2 14
5 13 3 11
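\[ \hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{n_j} \right) \]
$0 \le t < 6$: $\hat{S}(t) = 1$
$6 \le t < 8$: $\hat{S}(t) = \frac{20}{21} = 0.95238$
$8 \le t < 9$: $\hat{S}(t) = \frac{20}{21} \times \frac{18}{20} = 0.85714$
$9 \le t < 11$: $\hat{S}(t) = \frac{20}{21} \times \frac{18}{20} \times \frac{14}{17} = 0.70588$
$11 \le t < 13$: $\hat{S}(t) = \frac{20}{21} \times \frac{18}{20} \times \frac{14}{17} \times \frac{12}{14} = 0.60504$
$13 \le t$: $\hat{S}(t) = \frac{20}{21} \times \frac{18}{20} \times \frac{14}{17} \times \frac{12}{14} \times \frac{8}{11} = 0.44003$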
(iii) Median
The median value we worked out in part (i) was the median time to qualify,
given that the student qualified during the period. This is not what we would
normally mean by the average time to qualify because this calculation
ignores the 7 students who were still studying at the end of the period.
The median value in part (iii), on the other hand, takes into account the
censored students, ie the students who dropped out and those still studying
at the end of the period. This provides a better estimate of the average time
to qualify.
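The Kaplan-Meier calculation for this question can be sketched in code using the $(t_j, d_j, n_j)$ table above. The function name `kaplan_meier` is illustrative, and reading the median as the first time at which the estimate falls to 0.5 or below is one common convention, assumed here rather than taken from the outline solution.

```python
def kaplan_meier(table):
    """table: list of (t_j, d_j, n_j) in increasing order of t_j.

    Returns a list of (t_j, estimate of S(t) from t_j until the next event time).
    """
    survival, estimate = 1.0, []
    for t, d, n in table:
        survival *= 1 - d / n
        estimate.append((t, survival))
    return estimate

table = [(6, 1, 21), (8, 2, 20), (9, 3, 17), (11, 2, 14), (13, 3, 11)]
estimate = kaplan_meier(table)

# Assumed convention: median = smallest t_j with estimated S(t_j) <= 0.5.
median_sessions = next(t for t, s in estimate if s <= 0.5)

for t, s in estimate:
    print(t, round(s, 5))
print(median_sessions)
```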
The general form of the Cox model models $\lambda(t; z_i)$, the hazard function at
time $t$ for individual $i$, as:
\[ \lambda(t; z_i) = \lambda_0(t) \exp\left(\beta z_i^T\right) \]
where:
$t$ is the duration (time)
$z_i$ is a row vector of covariates for individual $i$
$\beta$ is a row vector of regression parameters for the model
$\lambda_0(t)$ is the baseline hazard rate.
We can see from the table that the parameter estimates are zero for
‘chicken’, ‘old’ and ‘female’. So the baseline bird is a female chicken in the
old enclosure.
The hazard function for the Cox model in this example would have the
following form:
\[ \lambda(t; z) = \lambda_0(t) \exp(\beta_1 z_1 + \beta_2 z_2 + \beta_3 z_3 + \beta_4 z_4) \]
where:
$t$ is the duration (time)
$z_1 = 1$ for a duck and 0 for other birds
$z_2 = 1$ for a goose and 0 for other birds
$z_3 = 1$ for the new enclosure and 0 for the old enclosure
$z_4 = 1$ for male and 0 for female
the $\beta$’s are the corresponding parameters to be estimated
$\lambda_0(t)$ is the baseline hazard rate.
The parameter $\beta_1$ quantifies the effect of the type of bird (the differential
between duck and chicken) on the birds’ mortality rates.
The parameter $\beta_2$ quantifies the effect of the type of bird (the differential
between goose and chicken) on the birds’ mortality rates.
The parameter $\beta_3$ quantifies the effect of the enclosure (new versus old) on
the birds’ mortality rates.
The parameter $\beta_4$ quantifies the effect of sex (male versus female) on the
birds’ mortality rates.
Since the confidence interval for $\beta_3$ does not contain the value 0 and the
parameter estimate is positive, this implies that birds in the new enclosure
have a significantly higher mortality rate. So it appears that the new
enclosure will result in a reduction in the birds’ life expectancy, rather than
an increase.
The hazard rate for a male goose in the old enclosure at time $t$ is:
\[ = \lambda_0(t) e^{0.275} \]
We are given that at time 6 months ($= T$, say), the survival probability is:
We can use this to find $S_0(T)$, the survival probability of the baseline bird up
to time $T$:
The hazard rate for a female duck in the new enclosure at time $t$ is:
\[ = \lambda_0(t) e^{-0.085} \]
The probability that the female duck in the new enclosure has been killed by
6 months is therefore 1 - 0.92913 = 0.07087 .
We have removed the parts of this question that refer to the Balducci
assumption as this assumption is beyond the scope of Subject CS2.
(i)(a) UDD
\[ {}_t q_x = t \, q_x \]
\[ {}_t q_x = 1 - {}_t p_x = 1 - e^{-\mu t} = 1 - \left(e^{-\mu}\right)^t = 1 - (p_x)^t = 1 - (1 - q_x)^t \]
\[ {}_{0.5} p_{60} = (p_{60})^{0.5} = 0.95^{0.5} = 0.974679 \]
(iii) Comment
The lighter the mortality in the first half of the year, the higher the value
of ${}_{0.5} p_{60}$. This will occur when the force of mortality is increasing over the
year between age $x$ and age $x + 1$.
The UDD assumption implies that the force of mortality is increasing and so
this gives a higher value for ${}_{0.5} p_{60}$.
The constant force assumption says that mortality rates are the same in the
first half and in the second half of the year and so this gives a lower value
for ${}_{0.5} p_{60}$.
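The ordering described in this comment can be confirmed with a quick calculation. The sketch assumes $q_{60} = 0.05$ (consistent with the $0.95^{0.5}$ shown above); the UDD figure is computed here from ${}_t q_x = t \, q_x$, and the variable names are illustrative.

```python
q_60 = 0.05

# UDD: 0.5_q_60 = 0.5 * q_60, so 0.5_p_60 = 1 - 0.5 * q_60.
half_year_survival_udd = 1 - 0.5 * q_60

# Constant force: 0.5_p_60 = (p_60) ** 0.5.
half_year_survival_const = (1 - q_60) ** 0.5

print(round(half_year_survival_udd, 6), round(half_year_survival_const, 6))
```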
Type I censoring is present since we know in advance that all surviving lives
(who have not previously withdrawn) will be censored 5 years after their
operations. Type I censoring is a special case of right censoring.
Random censoring occurs when patients withdraw from the study since the
withdrawal times are not known in advance. This is another special case of
right censoring.
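where $\hat{\Lambda}(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$. So:
\[ \sum_{t_j \le t} \frac{d_j}{n_j} = -\ln \hat{S}(t) \]
\[ \hat{S}(1) = 0.9355 \Rightarrow \frac{d_1}{n_1} = -\ln 0.9355 = 0.0667 = \frac{1}{15} \]
\[ \hat{S}(3) = 0.7122 \Rightarrow \frac{d_1}{n_1} + \frac{d_2}{n_2} = -\ln 0.7122 \]
\[ \Rightarrow \frac{d_2}{n_2} = -\ln 0.7122 - \frac{1}{15} = 0.2727 = \frac{3}{11} \]
Also:
\[ \hat{S}(4) = 0.6285 \Rightarrow \frac{d_1}{n_1} + \frac{d_2}{n_2} + \frac{d_3}{n_3} = -\ln 0.6285 \]
\[ \Rightarrow \frac{d_3}{n_3} = -\ln 0.6285 - \frac{1}{15} - \frac{3}{11} = 0.1250 = \frac{1}{8} \]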
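The back-calculation above, which recovers each $d_j / n_j$ from successive Nelson-Aalen survival estimates, can be sketched as follows. The dictionary layout is an assumption, and limiting the denominator to 20 when matching each increment to a simple fraction is an illustrative choice, not something stated in the notes.

```python
import math
from fractions import Fraction

# Nelson-Aalen survival estimates quoted above, keyed by event time.
survival_estimates = {1: 0.9355, 3: 0.7122, 4: 0.6285}

increments = {}
previous_cumulative = 0.0
for t in sorted(survival_estimates):
    cumulative = -math.log(survival_estimates[t])
    increments[t] = cumulative - previous_cumulative  # this is d_j / n_j
    previous_cumulative = cumulative

# Assumed bound of 20 on the denominator when matching to a fraction.
ratios = {t: Fraction(v).limit_denominator(20) for t, v in increments.items()}
for t in sorted(ratios):
    print(t, round(increments[t], 4), ratios[t])
```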
\[ \begin{aligned}
{}_{10} p_0 &= \exp\left(-\int_0^{10} \mu_s \, ds\right) \\
&= \exp\left(-\int_0^{10} 0.05 \, ds\right) \\
&= \exp(-0.5) \\
&= 0.6065
\end{aligned} \]
Because the form of the force of mortality changes after 30 days, we need to
split up our calculation into the parts before and after 30 days:
\[ \int_0^{20} 0.05 \, ds = 0.05 \times 20 = 1 \]
\[ \int_0^{10} 0.05 e^{0.01s} \, ds = \left[ \frac{0.05 e^{0.01s}}{0.01} \right]_0^{10} = 5\left(e^{0.1} - 1\right) \]
So we get:
\[ {}_{30} p_{10} = \exp(-1) \times \exp\left[-5\left(e^{0.1} - 1\right)\right] = 0.2174 \]
(iii) Age by which 90% have died
Let $y$ denote the age (in days) by which 90% have died. This will satisfy
the equation ${}_y p_0 = 0.1$. This age is likely to be greater than 30 days. So,
to evaluate the LHS, we need to split the age range as before:
\[ {}_y p_0 = {}_{30} p_0 \times {}_{y-30} p_{30} = \exp(-0.05 \times 30) \times \exp\left[-5\left(e^{0.01(y - 30)} - 1\right)\right] \]
So:
\[ \exp(-1.5) \times \exp\left[-5\left(e^{0.01(y - 30)} - 1\right)\right] = 0.1 \]
Taking logs:
\[ -1.5 - 5\left(e^{0.01(y - 30)} - 1\right) = \ln 0.1 \]
\[ \Rightarrow e^{0.01(y - 30)} = \frac{3.5 - \ln 0.1}{5} \]
\[ \Rightarrow y - 30 = 100 \ln\left(\frac{3.5 - \ln 0.1}{5}\right) = 14.89 \]
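The piecewise calculation in this solution can be sketched in code. The force of mortality assumed here is the one implied by the integrals above (0.05 up to age 30 days, then $0.05 e^{0.01(\text{age} - 30)}$); the function names are illustrative.

```python
import math

def cumulative_hazard(age):
    """Integrated force of mortality from age 0 to `age` (in days)."""
    if age <= 30:
        return 0.05 * age
    # 0.05 * 30 for the first 30 days, then the integral of 0.05 * exp(0.01 s).
    return 0.05 * 30 + 5 * (math.exp(0.01 * (age - 30)) - 1)

def survival(from_age, to_age):
    """Probability of surviving from `from_age` to `to_age`."""
    return math.exp(-(cumulative_hazard(to_age) - cumulative_hazard(from_age)))

p_0_to_10 = survival(0, 10)
p_10_to_40 = survival(10, 40)

# Age by which 90% have died, from the closed-form rearrangement above.
age_90pct_dead = 30 + 100 * math.log((3.5 - math.log(0.1)) / 5)

print(round(p_0_to_10, 4), round(p_10_to_40, 4), round(age_90pct_dead, 2))
```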
(i) Censoring
Right censoring
This is true for the individuals who left hospital during the study (B, H and M)
and for the individuals who were still alive and in hospital when the study
period finished (C, D, F, I, L and N). For these individuals all we can
establish is a lower limit for their time of death ti .
Left censoring
Left censoring is present when we are only able to establish an upper limit
for the time of death ti , not a precise value. This may be true to some
extent for the individuals who died during the study, since we are only given
the date of surgery and date of death, not the precise time of day.
It is likely that patients will only be discharged from hospital if they are well
enough to be able to cope by themselves. So those remaining in the
hospital will tend to be the more seriously ill patients, which implies
informative censoring.
The table below shows the times of death and censoring in ascending order,
measured in days from the date of surgery. A ’+’ indicates a right-censored
observation. For duration 56, we have followed the usual convention that
deaths are assumed to occur before censoring.
We can now construct the usual summary table for calculations based on
the Kaplan-Meier model:
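\[ \hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{n_j} \right) \]
$0 \le t < 2$: $\hat{S}(t) = 1$
$2 \le t < 5$: $\hat{S}(t) = \frac{13}{14} = 0.92857$
$5 \le t < 32$: $\hat{S}(t) = \frac{13}{14} \times \frac{11}{13} = \frac{11}{14} = 0.78571$
$32 \le t < 56$: $\hat{S}(t) = \frac{13}{14} \times \frac{11}{13} \times \frac{9}{10} = \frac{99}{140} = 0.70714$
$56 \le t < 92$: $\hat{S}(t) = \frac{13}{14} \times \frac{11}{13} \times \frac{9}{10} \times \frac{7}{8} = \frac{99}{160} = 0.61875$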
(iii) Graph
[Graph: estimate of $S(t)$ (vertical axis, 0 to 1) against duration $t$ in days (horizontal axis, 0 to 100)]
\[ \hat{F}(28) = 1 - \hat{S}(28) = 1 - \frac{11}{14} = \frac{3}{14} = 0.21429 \]
The initial rate of mortality q x is the probability that a life aged exactly x will
die during the next year, ie before reaching exact age x + 1 .
Type I censoring
Type I censoring is present since the study was terminated after a fixed time
period.
Right censoring
The test areas that were accidentally ploughed up and the ones that still
showed no re-growth after 12 months are subject to right censoring, since
we only know that their re-growth time exceeded a given number.
Non-informative censoring
We are told that it was an accident that the areas were ploughed up. So this
would be non-informative censoring.
Random censoring
Random censoring occurs when a decrement occurs at a time that was not
scheduled.
The table below summarises the information given, with the times arranged
in ascending order. A ’+’ sign indicates a right-censored observation.
We can now construct the usual summary table for calculations of empirical
survival rates:
Kaplan-Meier approach
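\[ \hat{S}(t) = \prod_{t_j \le t} \left( 1 - \frac{d_j}{n_j} \right) \]
\[ \hat{S}(9) = \frac{19}{20} \times \frac{16}{19} \times \frac{14}{16} \times \frac{5}{9} = \frac{35}{90} = \frac{7}{18} = 0.3889 \]
Nelson-Aalen approach
\[ \hat{\Lambda}_t = \sum_{t_j \le t} \frac{d_j}{n_j} \]
\[ \hat{\Lambda}_9 = \frac{1}{20} + \frac{3}{19} + \frac{2}{16} + \frac{4}{9} = 0.7773 \]
\[ \hat{S}(t) = \exp\left(-\hat{\Lambda}_t\right) \]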
So the estimated probability of no re-growth by 9 months is:
the study can be thought of in terms of hazard rates, which form the
basis for the Cox (proportional hazards) model
we are primarily interested in comparing the effect of covariates (gender
and exercise regime)
we may want to analyse interactions between the covariates (which the
model can incorporate)
we are not particularly interested in the baseline hazard rate itself (and
Cox partial likelihoods allow us to strip this out)
the Cox model is a well-known model that is commonly used
the structure of the Cox model ensures that the hazard rate is always
positive
l (t ; Z1, Z2, Z3 ) = l0 (t )e b1 Z1 + b 2 Z2 + b 3 Z3
where Z3 = Z1 Z2 .
We need to test:
$H_0: \beta_3 = 0$ versus $H_1: \beta_3 \ne 0$
To do this we can use the likelihood ratio test to compare two models:
$-2(\ell_2 - \ell_3) = -2\left(-1{,}250 - (-1{,}246)\right) = 8$
Assuming that the sample size for the study is sufficiently large, under $H_0$,
this should come from the $\chi^2_1$ distribution. Since the observed value of the
test statistic exceeds 3.841, the upper 5% point of $\chi^2_1$, we reject $H_0$ and
conclude that the $\beta_3$ parameter is required, ie there is a significant
interaction effect present.
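The test above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the p-value uses the identity P(X > x) = erfc(sqrt(x/2)) for a chi-square variable with 1 degree of freedom, so no statistics library is needed.

```python
import math

def likelihood_ratio_test_1df(loglik_without, loglik_with):
    """Return the statistic -2(l_without - l_with) and its chi-square(1) p-value."""
    statistic = -2.0 * (loglik_without - loglik_with)
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Log-likelihoods quoted in the solution above.
statistic, p_value = likelihood_ratio_test_1df(-1250.0, -1246.0)
reject_h0 = statistic > 3.841  # upper 5% point of chi-square(1)

print(statistic, round(p_value, 5), reject_h0)
```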
The factors in the hazard rate for the four possible combinations of
individuals are:
The hazard rate for each month is obtained by dividing the number of
transitions by the exposed to risk, as shown in the table below.
(ii) Comment
Gompertz model
This formula implies that the hazard rate either increases or decreases
monotonically over time (or remains constant if c = 1 ). However, the results
from the table in part (i) indicate that the hazard rate reaches a peak in
month 3 and then starts to decline. This suggests that the Gompertz model
is not suitable here.
The Gompertz model does not allow us to model differences in the hazard
rate attributable to factors other than age or time.
Semi-parametric
A semi-parametric model specifies a formula for the hazard rate that is partly
parametric and partly non-parametric. The parametric component allows the
model to be fitted to the individual data.
The baseline hazard l0 (t ) , which is the same for all individuals, allows us to
incorporate a humped graph consistent with the results we can see in the
table.
However, the actual pattern of the hazard rate might not be consistent with
the proportional hazards assumption underlying the Cox model.
(i) Likelihood
For those lives who died, we know the exact values of their lifetimes. So the
contribution made to the likelihood function by the i th death is fTi (ti ) , where
f denotes the PDF of the complete future lifetime random variable for life i .
where:
$= \exp\left(-\int_0^{t_i} (A + Bs)\, ds\right)$
$= \exp\left(-\left[As + \tfrac{1}{2}Bs^2\right]_0^{t_i}\right)$
$= \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
So the total contribution made to the likelihood function by the deaths is:
$\prod_{\text{deaths}} (A + Bt_i) \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
We do not know the exact lifetime of each of the survivors. For the i th
survivor, all we know is that his/her lifetime is more than ti . So the
contribution made to the likelihood function by the i th survivor is:
$\prod_{\text{survivors}} \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
So the overall likelihood function is:
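$L = \prod_{\text{deaths}} (A + Bt_i) \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right) \prod_{\text{survivors}} \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
$= \prod_{\text{deaths}} (A + Bt_i) \prod_{\text{all}} \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
$= \prod_{\text{all}} (A + Bt_i)^{d_i} \prod_{\text{all}} \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$
$= \prod_{i=1}^{n} (A + Bt_i)^{d_i} \exp\left(-At_i - \tfrac{1}{2}Bt_i^2\right)$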
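$\frac{\partial}{\partial A} \ln L = \sum_{i=1}^{n} \left\{ \frac{d_i}{A + Bt_i} - t_i \right\}$
$\frac{\partial}{\partial B} \ln L = \sum_{i=1}^{n} \left\{ \frac{d_i t_i}{A + Bt_i} - \tfrac{1}{2}t_i^2 \right\}$
$\sum_{i=1}^{n} \left\{ \frac{d_i}{A + Bt_i} - t_i \right\} = 0$ or $\sum_{i=1}^{n} \frac{d_i}{A + Bt_i} = \sum_{i=1}^{n} t_i$
$\sum_{i=1}^{n} \left\{ \frac{d_i t_i}{A + Bt_i} - \tfrac{1}{2}t_i^2 \right\} = 0$ or $\sum_{i=1}^{n} \frac{d_i t_i}{A + Bt_i} = \tfrac{1}{2} \sum_{i=1}^{n} t_i^2$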
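The partial derivatives above can be sanity-checked numerically. The sketch below codes the log-likelihood for the hazard A + Bt and compares the analytic derivatives with central finite differences. The observation times, death indicators and parameter values are made up purely for illustration and are not taken from the question.

```python
import math

def log_likelihood(A, B, times, deaths):
    """ln L = sum of d_i * ln(A + B t_i) - A t_i - 0.5 B t_i^2."""
    return sum(d * math.log(A + B * t) - A * t - 0.5 * B * t * t
               for t, d in zip(times, deaths))

def score(A, B, times, deaths):
    """Analytic partial derivatives of ln L with respect to A and B."""
    d_A = sum(d / (A + B * t) - t for t, d in zip(times, deaths))
    d_B = sum(d * t / (A + B * t) - 0.5 * t * t for t, d in zip(times, deaths))
    return d_A, d_B

# Hypothetical data: t_i is the observed time, d_i = 1 for a death, 0 for a survivor.
times = [1.0, 2.5, 3.0, 4.0, 5.0]
deaths = [1, 0, 1, 1, 0]
A, B = 0.1, 0.05
step = 1e-6

analytic_dA, analytic_dB = score(A, B, times, deaths)
numeric_dA = (log_likelihood(A + step, B, times, deaths)
              - log_likelihood(A - step, B, times, deaths)) / (2 * step)
numeric_dB = (log_likelihood(A, B + step, times, deaths)
              - log_likelihood(A, B - step, times, deaths)) / (2 * step)

print(analytic_dA, numeric_dA)
print(analytic_dB, numeric_dB)
```

The maximum likelihood estimates are the values of A and B at which both derivatives are zero. In practice these two equations would be solved numerically.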
$\hat{S}(t) = \exp\left(-\hat{\Lambda}(t)\right)$
where the integrated hazard is calculated as $\hat{\Lambda}(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$.
So the estimated time at which 40% of the pies are sold, which is the same
as when the estimated survival probability equals 60%, will satisfy:
$\hat{\Lambda}(t) = -\ln 0.6 = 0.5108$
ie:
$\sum_{t_j \le t} \frac{d_j}{n_j} = 0.5108$
We can now construct the usual summary table for calculations of empirical
survival rates:
We can then calculate the estimates of the hazard rate and the integrated
hazard between the times of each sale:
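Interval          $d_j / n_j$                        $\hat{\Lambda}(t)$
$0 \le t < 1$     0                                  $\hat{\Lambda}(t) = 0$
$1 \le t < 2$     $\frac{2}{12} = \frac{1}{6}$       $\hat{\Lambda}(t) = \frac{1}{6} = 0.1667$
$2 \le t < 4$     $\frac{3}{10}$                     $\hat{\Lambda}(t) = \frac{1}{6} + \frac{3}{10} = \frac{7}{15} = 0.4667$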
(ii) Comment
The true probability could be very different due to sampling error. It might be
better to work out a confidence interval.
It is not obvious how Mr Bunn could use this particular figure to plan
production, as the 40% figure seems an arbitrary percentage. Perhaps 90%
would be more meaningful, as this would be a time by which almost all the
pies have been sold.
Also, if he bakes twice as many pies, unless sales increase, it will take him
longer to sell 40% of them.
The Nelson-Aalen model assumes that all pies have the same probability of
being sold, which should be reasonable if they are all of a similar size and
none of them are burnt or damaged.
It also assumes that sales are independent. This is not the case here since
some of the sales and censoring events involved several pies in a single
event.
Hopefully, Mr Bunn sitting on a pie and the dog stealing two pies were one-
off events that we would not expect to be repeated.
This allows us to parameterise the model from the data irrespective of the
shape of the baseline hazard.
The general form of the Cox model models l (t ; z i ) , the hazard function at
time t for individual i , as:
$\lambda(t; z_i) = \lambda_0(t) \exp\left(\beta z_i^T\right)$
where:
t is the duration (time)
The baseline hazard relates to males who bought Whole Life Assurance
through the direct sales channel.
(iv) Probability that the term assurance is still in force after five years
We are told that 60% of whole life policies bought on the internet by males
have lapsed by the end of year five. We can interpret this as the probability
that such a policy has lapsed by the end of year 5:
$1 - \exp\left\{ -\int_0^5 \lambda(t; z_i)\, dt \right\} = 0.6$
$\Leftrightarrow \exp\left\{ -e^{0.4} \int_0^5 \lambda_0(t)\, dt \right\} = 0.4$
$\Leftrightarrow \int_0^5 \lambda_0(t)\, dt = -(\ln 0.4)\, e^{-0.4}$
We can use this to calculate the probability that a term assurance sold to a
female by an IFA is still in force after five years:
$\exp\left\{ -e^{0.2 - 0.1 - 0.2} \int_0^5 \lambda_0(t)\, dt \right\}$
$= \exp\left\{ e^{0.2 - 0.1 - 0.2} (\ln 0.4)\, e^{-0.4} \right\}$
$= \exp\left\{ \ln(0.4)\, e^{-0.5} \right\}$
$= 0.57364$
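The two-step calculation above (back out the integrated baseline hazard from the known lapse proportion, then rescale it for the other policy) can be reproduced as follows. The variable names are illustrative. The linear predictor values 0.4 and 0.2 - 0.1 - 0.2 are simply the exponents that appear in the working.

```python
import math

lapsed_by_year_5 = 0.6           # proportion of the reference policies lapsed by year 5
eta_reference = 0.4              # exponent for the reference policy in the working above
eta_target = 0.2 - 0.1 - 0.2     # exponent for the term assurance sold to a female by an IFA

# Integrated baseline hazard over [0, 5], from exp(-e^0.4 * H0) = 0.4.
baseline_cum_hazard = -math.log(1.0 - lapsed_by_year_5) * math.exp(-eta_reference)

# Probability that the target policy is still in force after five years.
in_force_prob = math.exp(-math.exp(eta_target) * baseline_cum_hazard)

print(round(in_force_prob, 4))
```

Equivalently, the answer is 0.4 raised to the power e^(-0.5), since the two hazards differ only by the constant factor e^(-0.1 - 0.4).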
Interval censoring is not present here because we know the day that each
traffic warden left without having qualified.
Right censoring is present here because the investigation ends after 30 days
and not all participants have qualified.
Informative censoring is present because the traffic wardens that left without
qualifying probably did so because they knew that it was going to take them
a long time to qualify.
$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$

$t_j$    $d_j$    $n_j$    $\hat{\lambda}_j = d_j / n_j$    $\hat{S}(t) = \prod_{t_j \le t} (1 - \hat{\lambda}_j)$
1        1        13       0.07692                          0.92308
5        1        12       0.08333                          0.84615
12       2        10       0.20000                          0.67692
15       1        8        0.12500                          0.59231
19       1        7        0.14286                          0.50769
24       1        4        0.25000                          0.38077
So, we have:
$\hat{S}(t) = \begin{cases}
1 & 0 \le t < 1 \\
0.92308 & 1 \le t < 5 \\
0.84615 & 5 \le t < 12 \\
0.67692 & 12 \le t < 15 \\
0.59231 & 15 \le t < 19 \\
0.50769 & 19 \le t < 24 \\
0.38077 & 24 \le t \le 30
\end{cases}$
[Sketch: step plot of $S(t)$ against $t$, for $0 \le t \le 30$]
(iv) How the answer would change with access to the correct data
Therefore, the fact that the reasons for exit of candidates D and H were
accidentally transposed is irrelevant. Both candidates left at time 19 so
swapping them will only change their labelling. There will be no effect on the
answer in part (ii).
The fact that the reasons for exit of candidates B and L were accidentally
transposed, means that the ‘death’ that we thought occurred at time 5,
actually occurred at time 10. So, the estimate of the survival probability,
Ŝ (t ) , will increase for 5 £ t < 10 and reduce for 10 £ t .
(i) Definitions
Right censoring
Right censoring occurs when observations in progress are cut short. So the
precise duration of the event is not known, only that it exceeds a certain
value.
Type I censoring
Type II censoring
Since the lives have a constant future force of mortality $\mu$, their future
lifetimes have an $\text{Exp}(\mu)$ distribution and their expected future lifetime is
$1/\mu$. However, they have already lived for 5 years. So their average age at
death will be $5 + 1/\mu$, which can be written in the equivalent form $\frac{5\mu + 1}{\mu}$.
For ages 5 and above, the force of mortality takes a constant value $\mu$, and
we have:
${}_t p_x = \exp\left(-\int_0^t \mu_{x+s}\, ds\right) = \exp\left(-\int_0^t \mu\, ds\right) = e^{-\mu t}$
${}_{10}p_0 - {}_{15}p_0 = {}_5p_0 \times {}_5p_5 - {}_5p_0 \times {}_{10}p_5$
$= {}_5p_0 \left({}_5p_5 - {}_{10}p_5\right)$
$= {}_5p_0 \left(e^{-5\mu} - e^{-10\mu}\right)$
(iii) Calculate $\mu$ and ${}_5p_0$
$e^{5\mu} = \frac{0.3}{0.2} = 1.5 \Rightarrow \mu = \frac{1}{5} \ln 1.5 = 0.08109$
${}_5p_0 = 0.3\, e^{5\mu} = 0.3 \times \frac{0.3}{0.2} = \frac{0.3^2}{0.2} = 0.45$
where:
b1, b 2, are the regression parameters estimated from the data.
The Cox model is a commonly used model and reliable software is available
for carrying out the required calculations.
The discrete-time hazard rates for Bill and Ben at time t are:
and:
So Ben is 42.3% ( = 1 - 0.577 ) less likely to pass the exam than Bill.
If the effect of the number of attempts is different for employees and the self-
employed, this means that there is an interaction between these two factors.
$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$

Time (hours)    Outcome              Number of bulbs affected
0               Start of study       1,000
50              Failed               10
100             Failed               20
200             Tremor (censored)    200
250             Failed               50
400             Failed               300
450             Failed               50
500             End of study
We can now construct the usual summary table for calculations of empirical
survival rates:
We can then calculate the estimates of the survival function between the
times of each failure:
Interval               $1 - d_j / n_j$                          $\hat{S}(t)$
$0 \le t < 50$         –––                                      $\hat{S}(t) = 1$
$50 \le t < 100$       $1 - \frac{10}{1{,}000} = 0.99$          $\hat{S}(t) = 1 \times 0.99 = 0.99$
$100 \le t < 250$      $1 - \frac{20}{990} = 0.979798$          $\hat{S}(t) = 0.99 \times 0.979798 = 0.97$
$250 \le t < 400$      $1 - \frac{50}{770} = 0.935065$          $\hat{S}(t) = 0.97 \times 0.935065 = 0.907013$
$400 \le t < 450$      $1 - \frac{300}{720} = 0.583333$         $\hat{S}(t) = 0.907013 \times 0.583333 = 0.529091$
$450 \le t < 500$      $1 - \frac{50}{420} = 0.880952$          $\hat{S}(t) = 0.529091 \times 0.880952 = 0.466104$
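The table can be rebuilt mechanically from the raw event list. The sketch below is one way to do it. The function name and tuple layout are illustrative, and it assumes that any censoring recorded at a given time is removed from the risk set after the failures at that time.

```python
def kaplan_meier_table(initial_at_risk, events):
    """Build Kaplan-Meier rows from (time, failures, censored) tuples in time order.

    Returns a list of (time, failures, at_risk, survival) for each failure time.
    """
    at_risk = initial_at_risk
    survival = 1.0
    rows = []
    for time, failures, censored in events:
        if failures > 0:
            survival *= 1.0 - failures / at_risk
            rows.append((time, failures, at_risk, survival))
        # Failures and censored items both leave the risk set.
        at_risk -= failures + censored
    return rows

# Light bulb data from the table above: (time in hours, failed, censored).
bulb_events = [
    (50, 10, 0),
    (100, 20, 0),
    (200, 0, 200),   # tremor: 200 bulbs censored
    (250, 50, 0),
    (400, 300, 0),
    (450, 50, 0),
]

bulb_rows = kaplan_meier_table(1000, bulb_events)
for time, failures, at_risk, survival in bulb_rows:
    print(f"t = {time:>3}  d = {failures:>3}  n = {at_risk:>5}  S = {survival:.6f}")
```

The at-risk counts produced (1,000, 990, 770, 720, 420) match the denominators used in the table.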
(ii) Sketch
[Sketch: step plot of $\hat{S}(t)$ against duration (hours), from 0 to 500]
(iii) Probabilities
Sˆ (300) = 0.907013
Sˆ (400) = 0.529091
We cannot estimate S(600) since time 600 lies outside the range of our
observations.
Type 1 censoring
Type 1 censoring is present since the study was terminated after a fixed time
period. So we do not know the time of recovery of any sufferers who were
still receiving the treatment at the end of the study.
Right censoring
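Interval            $d_j / n_j$                   $\hat{\Lambda}_t = \sum_{t_j \le t} \frac{d_j}{n_j}$    $\hat{S}(t) = \exp(-\hat{\Lambda}_t)$
$0 \le t < 6$       –––                           0                                                      1
$6 \le t < 7$       $\frac{2}{97} = 0.02062$      0.02062                                                0.97959
$7 \le t < 10$      $\frac{1}{95} = 0.01053$      0.03114                                                0.96934
$10 \le t < 14$     $\frac{1}{94} = 0.01064$      0.04178                                                0.95908
$14 \le t \le 28$   $\frac{2}{89} = 0.02247$      0.06426                                                0.93777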
(iv) Sketch
[Sketch: step plot of the estimated survival function against duration (days), from 0 to 30]
(v) Probability
The probability that a person using the cream will still have symptoms after
two weeks is Sˆ (14) , which equals 0.93777, ie approximately 94%.
The Gompertz model is a simple model that is easy to apply. It has been
found to give a reasonably good description of human mortality at the older
ages where the rates increase exponentially.
Gompertz
x
x e 0 x 1U 2I e 1U 2I e 0 Bc x
where B e 1U 2I and c e 0 are constants for each individual. So this
matches the form of a Gompertz model.
Proportional hazards
e 0 x 1U 2I e
0 x 1U 2I
e
x
( t ;z1,z2 ) 0 ( t ) e 1z1 2z2
Here we have:
$x = 40$, $U = 1$ and $I = 20{,}000$
x e 0 x 1U 2I
$= e^{-9.0 + 0.09 \times 40 + 0.3 \times 1 - 0.0001 \times 20{,}000} = e^{-7.1} = 0.000825$
Let y denote the additional income required (with I being the rural
resident’s income). If we equate the force of mortality for the urban and rural
residents, we get:
0 x 1 2 (I y ) 0 x 2I
e
e
urban rural
e 1 2 y 1
1 0.3
y 3,000
2 0.0001
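The predicted force of mortality at age $40 + t$ for this individual, who is the
same as in part (iii), is:
$\mu_{40+t} = e^{-7.1 + 0.09t}$
So:
$-\log S(10) = \int_0^{10} \mu_{40+t}\, dt$
$= \int_0^{10} e^{-7.1 + 0.09t}\, dt$
$= e^{-7.1} \int_0^{10} e^{0.09t}\, dt$
$= e^{-7.1} \left[ \frac{e^{0.09t}}{0.09} \right]_0^{10}$
$= e^{-7.1}\, \frac{e^{0.9} - 1}{0.09} = 0.01338$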
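As a cross-check on the integration, the sketch below compares the closed-form value with a numerical integral of the force of mortality e^(-7.1 + 0.09t) over 10 years, using Simpson's rule. The helper names are illustrative only.

```python
import math

def force_of_mortality(t):
    """Force of mortality at age 40 + t for this individual, as used in the working."""
    return math.exp(-7.1 + 0.09 * t)

def simpson(f, lower, upper, intervals=1000):
    """Composite Simpson's rule; `intervals` must be even."""
    width = (upper - lower) / intervals
    total = f(lower) + f(upper)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * f(lower + i * width)
    return total * width / 3.0

closed_form = math.exp(-7.1) * (math.exp(0.9) - 1.0) / 0.09
numeric = simpson(force_of_mortality, 0.0, 10.0)
survival_10 = math.exp(-closed_form)

print(round(closed_form, 5), round(numeric, 5), round(survival_10, 5))
```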
We need to find the age $40 + z$ for which $S_{\text{rural}}(10) = S_{\text{urban}}(10)$, or
equivalently $\log S_{\text{rural}}(10) = \log S_{\text{urban}}(10)$.
We already know from part (v) that, for an individual with an income of
$20,000:
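$-\log S_{\text{urban}}(10) = e^{-7.1}\, \frac{e^{0.9} - 1}{0.09}$
$-\log S_{\text{rural}}(10) = \int_0^{10} \mu_{40+z+t}\, dt$
$= \int_0^{10} e^{-7.4 + 0.09z + 0.09t}\, dt$
$= e^{-7.4 + 0.09z}\, \frac{e^{0.9} - 1}{0.09}$
$-7.1 = -7.4 + 0.09z \Rightarrow z = \frac{0.3}{0.09} = 3\tfrac{1}{3}$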
The baseline hazard applies to individuals who were exact age 50 at the
start of the investigation, who are not heavy drinkers and who have not lived
for 12 months in a tropical country.
The hazard rates for the individuals in the first bullet point are
$h_0(t) \exp(10\beta_A + \beta_C)$ and $h_0(t) \exp(0)$.
The hazard rates for the individuals in the second bullet point are
$h_0(t) \exp(\beta_A A + \beta_C + \beta_T)$ and $h_0(t) \exp(\beta_A A)$.
The hazard rates for the individuals in the third bullet point are
$h_0(t) \exp(\beta_A A + \beta_C C + \beta_T)$ and $h_0(t) \exp(\beta_A A + \beta_C C)$.
Working in reverse order, we can now find the values of the three $\beta$'s:
$\beta_T = 1.0986$, $\beta_C = 0.2877$, $\beta_A = 0.04055$
(i) Censoring
Left censoring is where we don’t know the precise time of entry, only that it
occurred before a certain time. An example would be in a medical
investigation of a disease when we don’t know the precise time of onset.
Interval censoring is where we can only say that the duration at the time of
the event of interest lies within a certain interval. An example would be in a
medical investigation where patients are only observed at six-monthly
intervals.
Right censoring is present here for the toys that were unplugged / stolen.
We don’t know when they would have failed if they had been allowed to
continue operating.
Random censoring is present here for the toys that were unplugged / stolen.
These events could not have been anticipated.
Non-random censoring is present here for the toys that were still operating at
the end of the observation period (5pm on the second day). All observations
were due to stop at this point anyway.
Non-informative censoring is present for the toys that were unplugged, since
there was nothing special to distinguish the 17 that were affected.
$\hat{S}(t) = \exp\left(-\hat{\Lambda}(t)\right)$
where the integrated hazard is calculated as $\hat{\Lambda}(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$.
We can now construct the usual summary table for calculations of empirical
survival rates:
We can then calculate the estimates of the hazard rate and the integrated
hazard between each of the failure times:
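Interval            $d_j / n_j$                     $\hat{\Lambda}(t)$                      $\hat{S}(t)$
$0 \le t < 4$       0                               0                                       1
$4 \le t < 11$      $\frac{12}{500} = 0.024$        0.024                                   0.97629
$11 \le t < 31$     $\frac{25}{471} = 0.05308$      $0.024 + 0.05308 = 0.07708$             0.92582
$31 \le t \le 32$   $\frac{8}{443} = 0.01806$       $0.07708 + 0.01806 = 0.09514$           0.90925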
(v) Graph
[Graph: step plot of the estimated survival function against duration (hours), from 0 to 35]
(vi) Comment
At the highest duration (32 hours) we have estimated the survival probability
to be 90.92%, which is much higher than 60%. So a survival probability of
60% corresponds to a time greater than 32 hours. However, we have no
data for failures beyond this time. So all we can say is that the length of time
for which a new toy has a 60% survival probability exceeds 32 hours.
The probability that an animal will survive from birth to exact age 5 years
is $e^{-5\mu}$.
(iii) Calculate $\lambda$
${}_5p_5 = 2 \times {}_{15}p_5$
Since the force of mortality from age 5 onwards has a constant value $\lambda$, this
is:
$e^{-5\lambda} = 2e^{-15\lambda}$
So: $e^{10\lambda} = 2 \Rightarrow \lambda = \frac{1}{10} \ln 2 = 0.0693$
Method 1
In this case, the force of mortality takes a constant value $\mu$ throughout the
whole of the animal’s life.
So the lifetime will have an $\text{Exp}(\mu)$ distribution, which has a mean of $\frac{1}{\mu}$.
Method 2
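The formula for calculating the expected lifetime is $\mathring{e}_x = \int_0^\infty {}_t p_x\, dt$. If the
force of mortality takes a constant value $\mu$, we have:
$\mathring{e}_0 = \int_0^\infty {}_t p_0\, dt = \int_0^\infty e^{-\mu t}\, dt = \left[ -\frac{1}{\mu} e^{-\mu t} \right]_0^\infty = -\frac{1}{\mu}(0 - 1) = \frac{1}{\mu}$
$\frac{1}{\mu} = \frac{1}{0.0693} = 14.427$ years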
In this case:
•
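$\mathring{e}_0 = \int_0^\infty {}_t p_0\, dt$
$= \int_0^5 {}_t p_0\, dt + \int_5^\infty {}_t p_0\, dt$
$= \int_0^5 {}_t p_0\, dt + \int_5^\infty {}_5 p_0 \; {}_{t-5}p_5\, dt$
So we get:
$\mathring{e}_0 = \int_0^5 e^{-\mu t}\, dt + \int_5^\infty e^{-5\mu} e^{-\lambda(t-5)}\, dt$
$= \left[ -\frac{1}{\mu} e^{-\mu t} \right]_0^5 + e^{-5\mu} \left[ -\frac{1}{\lambda} e^{-\lambda(t-5)} \right]_5^\infty$
$= \frac{1}{\mu}\left(1 - e^{-5\mu}\right) + \frac{1}{\lambda} e^{-5\mu}$ or $\frac{1}{\mu} + e^{-5\mu}\left(\frac{1}{\lambda} - \frac{1}{\mu}\right)$
$\mathring{e}_0 = \frac{1}{\mu}\left(1 - e^{-5\mu}\right) + 14.43\, e^{-5\mu}$ or $\frac{1}{\mu} + e^{-5\mu}\left(14.43 - \frac{1}{\mu}\right)$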
Right censoring
Type I censoring
Type I censoring is present since the study was terminated after a fixed time
period. So we do not know the time of leaving the hospital for patients who
were still in the hospital at the end of the study.
Type II censoring
Random censoring
Random censoring is present for those patients who died or had a second
operation before the end of the study, as these events could not be
predicted in advance.
Informative censoring is present where lives who leave the study through
censoring can be expected to influence the likelihood of decrements within
the remaining population. This is likely to be the case here since the
patients removed by censoring (death or a second operation) will be in a
worse condition than those remaining in the hospital and so would have
probably had longer stays than those who were not censored.
We first need to calculate the number of days each patient was in the study
by subtracting the date of the operation from the date that observation
ended. This leads to the values:
We can now construct the usual summary table for calculations of empirical
survival rates (where ‘failure’ corresponds to leaving the hospital in this
scenario):
$j$    $t_j$    $d_j$    $n_j$
1      14       3        8
2      15       1        5
3      24       1        4
4      31       1        2

$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$
We can then calculate the estimates of the survival function between the
times of each failure:
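Interval             $1 - d_j / n_j$                            $\hat{S}(t)$
$0 \le t < 14$       –––                                        $\hat{S}(t) = 1$
$14 \le t < 15$      $1 - \frac{3}{8} = \frac{5}{8} = 0.625$    $\hat{S}(t) = 1 \times \frac{5}{8} = \frac{5}{8}$ $(= 0.625)$
$15 \le t < 24$      $1 - \frac{1}{5} = \frac{4}{5} = 0.8$      $\hat{S}(t) = \frac{5}{8} \times \frac{4}{5} = \frac{1}{2}$ $(= 0.5)$
$24 \le t < 31$      $1 - \frac{1}{4} = \frac{3}{4} = 0.75$     $\hat{S}(t) = \frac{1}{2} \times \frac{3}{4} = \frac{3}{8}$ $(= 0.375)$
$31 \le t \le 36$    $1 - \frac{1}{2} = \frac{1}{2} = 0.5$      $\hat{S}(t) = \frac{3}{8} \times \frac{1}{2} = \frac{3}{16}$ $(= 0.1875)$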
(iv) Sketch
[Sketch: step plot of $\hat{S}(t)$ against duration (days), from 0 to 36]
We can see that a lot of patients left the hospital after 14 or 15 days. This
suggests that the typical recovery period is 2 to 3 weeks, or that patients are
routinely kept in hospital for 2 weeks after surgery before they are
discharged.
We can also see from the table that the two deaths occurred very soon after
the original surgery. This suggests that the mortality rate is highest in the
first few days.
The mortality rates for the beetle are $\mu$ in the protected environment and
$1.5\mu$ in the wild.
$e^{-1.5\mu \times 8} = 0.58$
$\Rightarrow e^{-12\mu} = 0.58$
$\Rightarrow \mu = -\frac{1}{12} \ln 0.58 = 0.04539$
12
We can calculate this probability by splitting the time period into two parts –
before and after time 6:
$\underbrace{e^{-\mu \times 6}}_{\text{in protected environment}} \times \underbrace{e^{-1.5\mu \times 2}}_{\text{in wild}} = e^{-9\mu} = e^{-9 \times 0.04539} = 0.6646$
or approximately 66%.
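The two steps for the beetle (solve for the force of mortality, then multiply the survival probabilities for the two periods) can be checked with a few lines of Python. The variable names are illustrative.

```python
import math

# Step 1: 8 years in the wild at force 1.5 * mu has survival probability 0.58.
mu = -math.log(0.58) / (1.5 * 8)

# Step 2: 6 years protected (force mu), then 2 years in the wild (force 1.5 * mu).
prob_protected = math.exp(-mu * 6)
prob_wild = math.exp(-1.5 * mu * 2)
prob_total = prob_protected * prob_wild

print(round(mu, 5), round(prob_total, 4))
```

Because the total exposure is 9 units of mu against 12 in the first step, the answer is also 0.58 raised to the power 9/12.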
The hazard rate for each individual consists of a baseline hazard, which is a
non-parametric component that depends only on the duration, multiplied by
a parametric function that depends only on the values of the covariates for
the individual.
The model is ‘proportional’ because the hazard rate for each individual
always remains in the same proportion to the baseline hazard (and hence
also to other individuals).
The Cox model is a commonly used model and reliable software is available
for carrying out the required calculations.
The exponential function ensures that the hazard rate is always positive.
Right censoring is present here for the people the researchers lost contact
with and for the people who still did not have a job at the end of the study
period. We don’t know when they would have found a job if we had
continued to monitor them.
Random censoring is also present here for the people the researchers lost
contact with. The timing of these events could not have been anticipated.
$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$
We can see from the table that, at the end of the first month, we had lost
contact with 50 people. However, it is not clear from the information given
whether we should treat these people as having been censored:
(b) at the end of the first month (when we find that they have not turned up
for their interview).
This affects the calculations because in case (a) the censoring obviously
occurred before the decrements at time 1, whereas in case (b), we would
need to assume that the censoring occurred after the decrements at time 1
(as specified in the Core Reading).
We will adopt approach (b), but the examiners have said that either
approach was equally valid, provided that the assumption made was clearly
stated.
Counter $j$    Time $t_j$    Found employment $d_j$    Censored $c_j$    At risk $n_j$
1              1             100                       50                700
2              2             70                        0                 550
3              3             50                        20                480
4              4             40                        20                410
5              5             20                        30                350
6              6             20                        60                300
7              7             12                        38                220
8              8             6                         164               170
We can then calculate the estimates of the survival function for each time
interval.
Interval          $1 - d_j / n_j$                      $\hat{S}(t)$
$0 \le t < 1$     –––                                  $\hat{S}(t) = 1$
$1 \le t < 2$     $1 - \frac{100}{700} = 0.85714$      $\hat{S}(t) = 1 \times 0.85714 = 0.85714$
$2 \le t < 3$     $1 - \frac{70}{550} = 0.87273$       $\hat{S}(t) = 0.85714 \times 0.87273 = 0.74805$
$3 \le t < 4$     $1 - \frac{50}{480} = 0.89583$       $\hat{S}(t) = 0.74805 \times 0.89583 = 0.67013$
$4 \le t < 5$     $1 - \frac{40}{410} = 0.90244$       $\hat{S}(t) = 0.67013 \times 0.90244 = 0.60475$
$5 \le t < 6$     $1 - \frac{20}{350} = 0.94286$       $\hat{S}(t) = 0.60475 \times 0.94286 = 0.57019$
$6 \le t < 7$     $1 - \frac{20}{300} = 0.93333$       $\hat{S}(t) = 0.57019 \times 0.93333 = 0.53218$
$7 \le t < 8$     $1 - \frac{12}{220} = 0.94545$       $\hat{S}(t) = 0.53218 \times 0.94545 = 0.50315$
$t = 8$           $1 - \frac{6}{170} = 0.96471$        $\hat{S}(t) = 0.50315 \times 0.96471 = 0.48539$
The test statistic for this test is: $\chi^2 = \sum_{t=1}^{8} \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$.
Observed
Month Expected Contribution
h(t ) = l b b t b -1 nj number
(t ) number (dj ) to c 2
This is a one-sided test and, under the null hypothesis, this statistic should
have a chi-squared distribution. There are 8 months and 2 parameters have
been estimated from the data. So the number of degrees of freedom in this
case is 6.
(i) Censoring
Type I censoring occurs when the censoring times are known in advance
and individuals under observation are considered to have been censored if
the event of interest has not occurred by a specified date.
Random censoring occurs when the censoring times are not known in
advance but are considered to be random variables for individuals that are
removed from observation before the event of interest has occurred.
Type I censoring did not occur because the time when the policyholder was
censored was not known in advance.
Random censoring is present because the time when the policyholder was
censored was not known in advance.
Type I censoring may be present if it was known at the outset that this
migration would occur.
Random censoring may be present if it was not known at the outset that this
migration would occur.
Type I censoring is present since it was known at the outset that the
policyholder’s policy would mature on this date.
Random censoring is not present since it was known at the outset that the
policyholder’s policy would mature on this date.
The hazard rate for each individual consists of a baseline hazard, which is a
non-parametric component that depends only on the duration, multiplied by
a parametric function that depends only on the values of the covariates for
the individual.
The model is ‘proportional’ because the hazard rate for each individual
always remains in the same proportion to the baseline hazard (and hence
also to other individuals).
(ii) Baseline
The ‘baseline cow’ is one for which the covariates x and z equal zero, ie a
cow assigned to the previous treatment, with treatment starting immediately.
The hazard rate for a cow receiving the new treatment is:
The hazard rate for a cow receiving the previous treatment is:
hOLD (t , x ) = h0 (t ) exp(0.4 x )
We are told that the median recovery time for cows on the previous
treatment with x = 3 was 14 days. So 50% of the cows will still have the
condition after 14 days, ie:
The proportion having the condition after 14 days with the new treatment is:
$= \left[ \exp\left( -\int_0^{14} h_0(t)\, e^{1.2}\, dt \right) \right]^{e^{0.5}} = 0.5^{e^{0.5}} = 0.3189$
Assuming that the decrement of interest is passing the exam, we can use
the data provided to construct the usual summary table required for
calculating empirical survival rates:
$\hat{S}(t) = \exp\left(-\hat{\Lambda}(t)\right)$
where the integrated hazard is calculated as $\hat{\Lambda}(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$.
We can then calculate the estimates of the hazard rate and the integrated
hazard between each of the times when students passed:
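Interval             $d_j / n_j$                   $\hat{\Lambda}(t)$                       $\hat{S}(t)$
$0 \le t < 7$        0                             0                                        1
$7 \le t < 14$       $\frac{2}{27} = 0.07407$      0.07407                                  0.92860
$14 \le t < 27$      $\frac{5}{23} = 0.21739$      $0.07407 + 0.21739 = 0.29147$            0.74717
$27 \le t < 36$      $\frac{6}{18} = 0.33333$      $0.29147 + 0.33333 = 0.62480$            0.53537
$36 \le t \le 39$    $\frac{3}{7} = 0.42857$       $0.62480 + 0.42857 = 1.05337$            0.34876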
(ii) Graph
[Graph: step plot of the estimated survival function against duration (weeks), from 0 to 40]
From our calculations based on the Nelson-Aalen method in part (i), the
probability that a student who starts the course will pass the exam by the
end of the year (after adjusting for those who drop out) is:
(iv) Comment
The school’s logic is that 16 students passed while 4 remained at the end of
the year who had not passed, so the pass rate is $\frac{16}{16 + 4} = \frac{16}{20} = 80\%$.
$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$
We first need to calculate the duration in the queue for each customer. For
those who made a purchase, this is the difference between the ‘time joined’
and the ‘time purchase completed’. For those who were censored, ie left
without making a purchase, this is the difference between the ‘time joined’
and the ‘time left without making purchase’. Using ‘+’ to denote a censored
observation, these are:
We can construct the usual summary table for estimating empirical survival
rates.
Counter $j$    Time $t_j$    At risk $n_j$    Made a purchase $d_j$    Censored in $(t_j, t_{j+1})$ $c_j$
1              2             12               2                        0
2              4             10               2                        1
3              6             7                3                        1
4              8             3                1                        1
5              11            1                1                        0
We can then calculate the Kaplan-Meier estimates of the survival function for
each time interval.
$20{,}000 \times \$2 \times \hat{S}(10) = \$40{,}000 \times \frac{16}{63} = \$10{,}159$
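The fraction 16/63 can be confirmed with exact arithmetic. The sketch below uses Python's `fractions` module on the (t_j, n_j, d_j) values from the summary table. The function name is illustrative.

```python
from fractions import Fraction

# (t_j, n_j, d_j) from the summary table: time, number at risk, purchases made.
queue_events = [(2, 12, 2), (4, 10, 2), (6, 7, 3), (8, 3, 1), (11, 1, 1)]

def kaplan_meier_at(time, events):
    """Exact Kaplan-Meier estimate of S(time) as a Fraction."""
    survival = Fraction(1)
    for t_j, n_j, d_j in events:
        if t_j <= time:
            survival *= Fraction(n_j - d_j, n_j)
    return survival

survival_10 = kaplan_meier_at(10, queue_events)
amount = 20000 * 2 * survival_10

print(survival_10, round(float(amount)))
```

Working in fractions avoids rounding at intermediate steps, which is the same approach the hand calculation takes.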
(iii) Assumptions
The hazard rate for a customer whose contract began t years ago will be:
$z = \frac{-0.25 - 0}{\sqrt{0.015}} = -2.041$
Since this is less than –1.6449 (the 5th percentile of the standard normal
distribution), we reject H0 and conclude that the data provides evidence at
the 5% significance level to suggest that women do change providers more
often than men.
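The one-sided test can be reproduced as below. This is a sketch under an assumption: the figure 0.015 is treated as the estimated variance of the parameter, which is the reading that gives the quoted value of -2.041. The normal distribution function is built from `math.erf`.

```python
import math

def standard_normal_cdf(z):
    """Phi(z) for the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

parameter_estimate = -0.25
variance = 0.015  # assumption: this is the variance, not the standard error

z_statistic = (parameter_estimate - 0.0) / math.sqrt(variance)
p_value = standard_normal_cdf(z_statistic)   # lower-tail probability
reject_h0 = z_statistic < -1.6449            # 5th percentile of N(0, 1)

print(round(z_statistic, 3), round(p_value, 4), reject_h0)
```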
(v) Probability
A male who is a low energy consumer and lives in a rural area has
covariates $z_1 = 1$, $z_2 = 0$, $z_3 = 0$ and $z_4 = 1$. Using the estimated
parameter values, his hazard rate after $t$ years is $\lambda_0(t) e^{-0.6}$.
The probability that this customer remains with the company for at least two
years is:
A male who is a high energy consumer and lives in a city centre has
covariates $z_1 = 1$, $z_2 = 1$, $z_3 = 1$ and $z_4 = 0$. Again, using the estimated
parameter values, his hazard rate after $t$ years is $\lambda_0(t) e^{0.26}$.
The probability that this customer remains with the company for at least two
years is:
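$S_2(2) = \exp\left( -\int_0^2 \lambda_0(t)\, e^{0.26}\, dt \right)$
$= \exp\left( -e^{0.86} \int_0^2 \lambda_0(t)\, e^{-0.6}\, dt \right) = \left[ \exp\left( -\int_0^2 \lambda_0(t)\, e^{-0.6}\, dt \right) \right]^{e^{0.86}} = (0.3)^{e^{0.86}} = 0.058$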
To test whether there is an interaction in the effect arising from gender and
energy consumption (say), we need to introduce an extra term $\beta_{12} z_1 z_2$ in
the model and test whether $\beta_{12}$ is significantly different from zero.
This can be done by applying a likelihood ratio test with the hypotheses:
$H_0: \beta_{12} = 0$ vs $H_1: \beta_{12} \ne 0$
The test statistic is:
$-2\left( \ln L_{\text{original model}} - \ln L_{\text{model with interaction}} \right)$
Under $H_0$, this test statistic has a chi-square distribution with 1 degree of
freedom. If the observed value exceeds 3.841, we reject $H_0$ at the 5%
significance level and we retain this interaction term in the model.
We will assume that the cups are stolen at the end of the day.
We can then use the data provided to construct the usual summary table
required for calculating empirical survival rates:
$j$    $t_j$    $d_j$    $n_j$
1      3        1        19
2      5        1        18
3      8        1        17
4      9        1        16
5      14       1        13
6      15       2        12
$\hat{S}(t) = \exp\left(-\hat{\Lambda}(t)\right)$
where the integrated hazard is calculated as $\hat{\Lambda}(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$.
We can then calculate the estimates of the hazard rate and the integrated
hazard:
Interval            $d_j / n_j$                   $\hat{\Lambda}(t)$                       $\hat{S}(t)$
$0 \le t < 3$       0                             0                                        1
$3 \le t < 5$       $\frac{1}{19} = 0.05263$      0.05263                                  0.94873
$5 \le t < 8$       $\frac{1}{18} = 0.05556$      $0.05263 + 0.05556 = 0.10819$            0.89746
$8 \le t < 9$       $\frac{1}{17} = 0.05882$      $0.10819 + 0.05882 = 0.16701$            0.84619
$9 \le t < 14$      $\frac{1}{16} = 0.06250$      $0.16701 + 0.06250 = 0.22951$            0.79492
$14 \le t < 15$     $\frac{1}{13} = 0.07692$      $0.22951 + 0.07692 = 0.30643$            0.73607
$t = 15$            $\frac{2}{12} = 0.16667$      $0.30643 + 0.16667 = 0.47310$            0.62307
(ii) Sketch
[Sketch: step plot of the estimated survival function against duration (days), from 0 to 15]
The hazard rate for a customer who has been in hospital for a length of time
t will be:
where l0 (t ) is the baseline hazard rate (taken to be the rate for male
smokers who are moderate drinkers).
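$z_1 = \begin{cases} 1 & \text{Female} \\ 0 & \text{Male} \end{cases} \qquad z_2 = \begin{cases} 1 & \text{Non-smoker} \\ 0 & \text{Smoker} \end{cases}$
$\beta_1, \beta_2, \beta_3, \beta_4$ are regression parameters. The estimated values for these are
$\hat{\beta}_1 = 0.065$, $\hat{\beta}_2 = -0.035$, $\hat{\beta}_3 = -0.06$ and $\hat{\beta}_4 = 0.085$.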
For the male moderate drinker who does not smoke, we have $z_1 = 0$,
$z_2 = 1$, $z_3 = 0$ and $z_4 = 0$.
$H_0: \beta_1 = 0$ vs $H_1: \beta_1 \ne 0$
$-2(\ell_3 - \ell_4)$
If the value of the test statistic exceeds the critical value 3.841 (the upper 5%
point of this distribution), we reject H0 and conclude that gender does have
a material impact. Otherwise we cannot conclude that gender has a material
impact.
If we are only considering the distinction between patients who are married
and patients who are not married, we could introduce an extra covariate z5
with a corresponding parameter b 5 defined as follows (with ‘Not married’
specified as the baseline level):
$z_5 = \begin{cases} 1 & \text{Married} \\ 0 & \text{Not married} \end{cases}$
$H_0: \beta_5 = 0$ vs $H_1: \beta_5 \ne 0$
$-2(\ell_4 - \ell_5)$
If the value of the test statistic exceeds the critical value 3.841, we reject H0
and conclude that marital status does have a material impact. Otherwise we
cannot conclude that marital status has a material impact.
The probability that a 93-year-old will receive a card at age 100 but not at
age 105 is:
The force of mortality takes the constant value 0.20 over the range
(100,105) .
So:
Noting that the form of the mortality function changes at age 95, we also
have:
$e^{-0.95} \times \left(1 - e^{-1}\right) = 0.38674 \times (1 - 0.36788) = 0.24447$
In a proportional hazards model, the ratio $\frac{h_i(t)}{h_j(t)}$ of the hazard rates for two
individuals $i$ and $j$ does not depend on the duration $t$.
The $\mu$ in the formula denotes the baseline hazard rate, which is used as a
reference value for the hazard rate at age $60 + t$. It is the hazard rate for a
man who does not drink any beer.
The parameter $\beta$ specifies how much the hazard rate increases by with
each extra glass of beer drunk.
(iii) Hazard rate for a man aged 62 who drinks 2 glasses a day
The hazard rate for a man aged 62 who drinks 2 glasses of beer a day is
estimated to be:
The hazard rate at age 60 + t for a man who drinks 3 glasses of beer a day
is estimated to be:
So the probability that this man will still be alive in 10 years’ time is estimated
to be:
So the length of his future lifetime will have an exponential distribution with
parameter $\lambda = 0.05466$.
His expectation of life equals the mean of this distribution, which is:
$\frac{1}{\lambda} = \frac{1}{0.05466} = 18.29$ years
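Using the same method, if this man drinks $x$ glasses of beer a day, his
hazard rate will be $\mu e^{\beta x}$ and his expectation of life will be $\frac{1}{\mu e^{\beta x}}$ years. So
the expected total number of beers he will buy during his lifetime is:
$365.25 \times \frac{1}{\mu e^{\beta x}} \times x = \frac{365.25}{\mu}\, x e^{-0.2x}$
Differentiating with respect to $x$ gives $\frac{365.25}{\mu} (1 - 0.2x)\, e^{-0.2x}$.
This equals 0 when $x = \frac{1}{0.2} = 5$.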
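The maximising value can also be confirmed by a simple grid search over the factor x e^(-0.2x), since the constant 365.25/mu does not affect where the maximum occurs. The function name and grid spacing below are illustrative choices.

```python
import math

def beers_factor(glasses_per_day, beta=0.2):
    """The part of the expected lifetime total that depends on x: x * exp(-beta * x)."""
    return glasses_per_day * math.exp(-beta * glasses_per_day)

# Search x = 0.00, 0.01, ..., 20.00 glasses per day.
grid = [i / 100 for i in range(0, 2001)]
best_x = max(grid, key=beers_factor)

print(best_x, round(beers_factor(best_x), 4))
```

The factor rises up to x = 5 and falls afterwards, so the stationary point found by differentiation is a maximum.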
Right censoring occurs when observations are cut off so that we don’t know
when the event of interest would have occurred. The packets of cheese that
were sold or thrown out at the end were subject to right censoring.
The right censoring and random censoring when packets of cheese were
sold could be removed by closing the shop or removing the cheese from
sale.
If you mentioned informative censoring in your answer to part (i), then you
could say that this could be removed if the shopkeeper selected a packet at
random to give to the customers. Non-informative censoring could be
removed by telling customers to select the freshest cheese to buy.
$\hat{S}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$
We can construct the usual summary table for estimating empirical survival
rates, where t denotes the number of days until the cheese goes mouldy.
$j$    $t_j$    $n_j$    $d_j$    $c_j$
1      3        16       1        0
2      4        15       2        4
3      8        9        2        2
4      10       5        3        2
We can then calculate the Kaplan-Meier estimates of the survival function for
each time interval.
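Interval           $1 - d_j / n_j$                     $\hat{S}(t)$
$3 \le t < 4$      $1 - \frac{1}{16} = \frac{15}{16}$    $\hat{S}(t) = 1 \times \frac{15}{16} = \frac{15}{16} = 0.9375$
$4 \le t < 8$      $1 - \frac{2}{15} = \frac{13}{15}$    $\hat{S}(t) = \frac{15}{16} \times \frac{13}{15} = \frac{13}{16} = 0.8125$
$8 \le t < 10$     $1 - \frac{2}{9} = \frac{7}{9}$       $\hat{S}(t) = \frac{13}{16} \times \frac{7}{9} = \frac{91}{144} = 0.6319$
$t = 10$           $1 - \frac{3}{5} = \frac{2}{5}$       $\hat{S}(t) = \frac{91}{144} \times \frac{2}{5} = \frac{91}{360} = 0.2528$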
We can deduce the following relationships from the three statements from
the company.
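$e^{\beta_S + 5\beta_A + \beta_G} = 2e^{5\beta_A}$
$\Rightarrow e^{\beta_S + \beta_G} = 2$
$\Rightarrow \beta_S + \beta_G = \ln 2$ … (1)
$e^{25\beta_A} = \tfrac{1}{2} e^{23\beta_A + \beta_G}$
$\Rightarrow e^{2\beta_A - \beta_G} = \tfrac{1}{2}$
$\Rightarrow 2\beta_A - \beta_G = -\ln 2$ … (2)
$e^{\beta_S + 12\beta_A + \beta_G} = 1.6\, e^{\beta_S + 25\beta_A}$
$\Rightarrow e^{-13\beta_A + \beta_G} = 1.6$
$\Rightarrow -13\beta_A + \beta_G = \ln 1.6$ … (3)
Adding equations (2) and (3):
$-11\beta_A = -\ln 2 + \ln 1.6 \Rightarrow \beta_A = 0.02029$
$\beta_G = 2\beta_A + \ln 2 = 0.73372$
$\beta_S = \ln 2 - \beta_G = -0.04057$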
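The three simultaneous equations can be solved in the same order in Python as a check on the arithmetic. The variable names are illustrative.

```python
import math

# Equations (2) + (3): -11 * beta_A = -ln 2 + ln 1.6
beta_A = (math.log(2) - math.log(1.6)) / 11
# Equation (2): 2 * beta_A - beta_G = -ln 2
beta_G = 2 * beta_A + math.log(2)
# Equation (1): beta_S + beta_G = ln 2
beta_S = math.log(2) - beta_G

print(round(beta_A, 5), round(beta_G, 5), round(beta_S, 5))

# Residuals of the three original equations (each should be ~0).
residual_1 = beta_S + beta_G - math.log(2)
residual_2 = 2 * beta_A - beta_G + math.log(2)
residual_3 = -13 * beta_A + beta_G - math.log(1.6)
```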
$h(t)$ is the ‘hazard’ of symptoms disappearing. So the group for which the
drug is most effective will have the highest value of $h(t)$.
$h_{\text{Male}}(t) = h_0(t)\, e^{6\beta_A}$ $(= 1.12943\, h_0(t))$
(i) Formulae
$\hat{S}_{KM}(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$ for the Kaplan-Meier model
$\hat{S}_{NA}(t) = \exp\left(-\sum_{t_j \le t} \frac{d_j}{n_j}\right)$ for the Nelson-Aalen model.
Here:
$t_j$ is the $j$th time at which a hazard event occurs
$$\hat{S}_{NA}(t) = \exp\left(-\sum_{t_j \le t} \frac{d_j}{n_j}\right) = \prod_{t_j \le t} \exp\left(-\frac{d_j}{n_j}\right) \ge \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right) = \hat{S}_{KM}(t)$$
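The inequality in the middle of this chain holds factor by factor. One way to see it, given here as a supporting sketch rather than as part of the original solution, is to use the elementary result $e^{-x} \ge 1 - x$ with $x = d_j / n_j$:

```latex
% For any x >= 0:
e^{-x} - (1 - x) = \int_0^x \left(1 - e^{-u}\right) du \ \ge\ 0
% Setting x = d_j / n_j, which lies in [0, 1]:
0 \ \le\ 1 - \frac{d_j}{n_j} \ \le\ \exp\left(-\frac{d_j}{n_j}\right)
% All factors are non-negative, so the inequality is preserved by the product over t_j <= t.
```

Equality holds only when no events have occurred by time $t$, so in practice the Nelson-Aalen estimate of the survival function is strictly above the Kaplan-Meier estimate after the first event.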
Type I censoring occurs when the observational plan specifies a fixed end
date. Here patients who are still under observation after 5 years will be
subject to Type I censoring.
There is also interval censoring (resulting from the 3-month gap between the
tests). You could also have mentioned informative or non-informative
censoring, provided you justify these. For example, if patients drop out
because they think they have recovered, this would be informative
censoring.
For the group receiving treatment the usual summary table required for
calculating empirical survival rates looks like this:
$j$    $t_j$    $d_j$    $n_j$
1      3        1        10
2      5        1        9
3      10       2        6
4      18       1        2
We can then calculate the Kaplan-Meier estimates of the survival function for
each time interval.
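Time interval        $1 - d_j/n_j$                         $\hat{S}(t)$
$0 \le t < 3$        $1$                                   $\hat{S}(t) = 1$
$3 \le t < 5$        $1 - \frac{1}{10} = \frac{9}{10}$     $\hat{S}(t) = 1 \times \frac{9}{10} = \frac{9}{10} = 0.9$
$5 \le t < 10$       $1 - \frac{1}{9} = \frac{8}{9}$       $\hat{S}(t) = \frac{9}{10} \times \frac{8}{9} = \frac{8}{10} = 0.8$
$10 \le t < 18$      $1 - \frac{2}{6} = \frac{2}{3}$       $\hat{S}(t) = \frac{8}{10} \times \frac{2}{3} = \frac{8}{15} = 0.5333$
$18 \le t < 20$      $1 - \frac{1}{2} = \frac{1}{2}$       $\hat{S}(t) = \frac{8}{15} \times \frac{1}{2} = \frac{4}{15} = 0.2667$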
Control group
$j$    $t_j$    $d_j$    $n_j$
1      6        1        10
2      8        2        9
3      14       1        4
4      18       2        2
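Time interval        $1 - d_j/n_j$                         $\hat{S}(t)$
$0 \le t < 6$        $1$                                   $\hat{S}(t) = 1$
$6 \le t < 8$        $1 - \frac{1}{10} = \frac{9}{10}$     $\hat{S}(t) = 1 \times \frac{9}{10} = \frac{9}{10} = 0.9$
$8 \le t < 14$       $1 - \frac{2}{9} = \frac{7}{9}$       $\hat{S}(t) = \frac{9}{10} \times \frac{7}{9} = \frac{7}{10} = 0.7$
$14 \le t < 18$      $1 - \frac{1}{4} = \frac{3}{4}$       $\hat{S}(t) = \frac{7}{10} \times \frac{3}{4} = \frac{21}{40} = 0.525$
$18 \le t < 20$      $1 - \frac{2}{2} = 0$                 $\hat{S}(t) = \frac{21}{40} \times 0 = 0$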
The standard deviation can be calculated using Greenwood’s formula for the
variance of the Kaplan-Meier estimator, which is given in the Tables.
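A minimal sketch of how Greenwood's formula could be evaluated is given below. It assumes the form of the formula quoted in the factsheet, $[1 - \hat{F}(t)]^2 \sum d_j / (n_j (n_j - d_j))$, and applies it to the treatment group's summary table. The function name `greenwood` is illustrative, and the time at which the question needs the standard deviation is not stated here, so the code reports every event time.

```python
import math

def greenwood(rows):
    """rows of (t_j, d_j, n_j) -> list of (t_j, S_hat, approximate variance)."""
    s_hat = 1.0
    running_sum = 0.0
    results = []
    for t_j, d_j, n_j in rows:
        s_hat *= 1 - d_j / n_j
        # Greenwood term; undefined if n_j == d_j (everyone at risk has the event).
        running_sum += d_j / (n_j * (n_j - d_j))
        results.append((t_j, s_hat, s_hat ** 2 * running_sum))
    return results

# (t_j, d_j, n_j) from the treatment group's summary table.
treatment_rows = [(3, 1, 10), (5, 1, 9), (10, 2, 6), (18, 1, 2)]
treatment_var = greenwood(treatment_rows)

for t_j, s_hat, var in treatment_var:
    print(f"t = {t_j}: S_hat = {s_hat:.4f}, sd = {math.sqrt(var):.4f}")
```

The same function cannot be applied to the control group's final event time, where $n_j = d_j$ and the Greenwood term involves division by zero.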
(v)(b) Comment
Also, we have very small sample sizes for both groups, making a statistically
significant result unlikely.
In fact, if we compare the estimates of the survival functions for both groups,
we can see that they cross over. For example:
but:
This means that we definitely won’t be able to conclude that the cream is
effective over the whole period.
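The crossing of the two estimated survival functions can be seen by evaluating both step functions at a few common times. The sketch below uses the two summary tables above; the helper names `km_steps` and `s_hat` and the choice of comparison times are illustrative only.

```python
def km_steps(rows):
    """rows of (t_j, d_j, n_j) -> list of (t_j, S_hat from t_j onwards)."""
    survival = 1.0
    steps = []
    for t_j, d_j, n_j in rows:
        survival *= 1 - d_j / n_j
        steps.append((t_j, survival))
    return steps

def s_hat(steps, t):
    """Evaluate the right-continuous step function at time t."""
    value = 1.0
    for t_j, survival in steps:
        if t_j <= t:
            value = survival
    return value

treatment = km_steps([(3, 1, 10), (5, 1, 9), (10, 2, 6), (18, 1, 2)])
control = km_steps([(6, 1, 10), (8, 2, 9), (14, 1, 4), (18, 2, 2)])

for t in (6, 8, 10, 14):
    print(t, round(s_hat(treatment, t), 4), round(s_hat(control, t), 4))
```

Neither curve lies above the other throughout, which is what prevents a conclusion covering the whole period.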
FACTSHEET
$${}_tq_x = F_x(t) = P[T_x \le t] = \int_0^t {}_sp_x\, \mu_{x+s}\, ds$$
$${}_tp_x = 1 - {}_tq_x = S_x(t) = 1 - F_x(t) = P[T_x > t] = \exp\left\{-\int_0^t \mu_{x+s}\, ds\right\}$$
$${}_{t+s}p_x = {}_tp_x \times {}_sp_{x+t} = {}_sp_x \times {}_tp_{x+s}$$
Force of mortality
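$$\mu_x = \lim_{h \to 0^+} \frac{1}{h} \times P[T \le x + h \mid T > x]$$
$$\mu_x = \lim_{h \to 0^+} \frac{1}{h} \times {}_hq_x \quad \text{so} \quad {}_hq_x \approx h\,\mu_x \quad \text{(for small } h\text{)}$$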
Derivatives of ${}_tp_x$
$$\frac{\partial}{\partial t}\, {}_tp_x = -\,{}_tp_x\, \mu_{x+t}$$
$$\frac{\partial}{\partial x}\, {}_tp_x = -\,{}_tp_x\, (\mu_{x+t} - \mu_x)$$
$$m_x = \frac{q_x}{\int_0^1 {}_tp_x\, dt} = \frac{\int_0^1 {}_tp_x\, \mu_{x+t}\, dt}{\int_0^1 {}_tp_x\, dt}$$
$$f_x(t) = \frac{d}{dt} F_x(t) = {}_tp_x\, \mu_{x+t} \qquad (0 \le t \le \omega - x)$$
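$$\mathring{e}_x = E[T_x] = \int_0^{\omega - x} {}_tp_x\, dt$$
$$\text{var}[T_x] = \int_0^{\omega - x} t^2\, {}_tp_x\, \mu_{x+t}\, dt - \mathring{e}_x^{\,2}$$
$$P(K_x = k) = {}_kp_x\, q_{x+k} = {}_{k|}q_x = \frac{d_{x+k}}{l_x}$$
$$e_x = E[K_x] = \sum_{k=1}^{[\omega - x]} {}_kp_x \approx \mathring{e}_x - \tfrac{1}{2}$$
$$\text{var}[K_x] = \sum_{k=0}^{[\omega - x]} k^2\, {}_kp_x\, q_{x+k} - e_x^{\,2}$$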
Human mortality is relatively high at very young ages, peaks again at ages
around 20 (the ‘accident hump’) and increases exponentially at older ages.
Exponential model
This model assumes that the future lifetime random variable Tx follows an
exponential distribution. The hazard function (or force of mortality) is
constant under this model and the survival function is:
$${}_tp_x = e^{-\mu t}$$
Weibull model
This model assumes that the future lifetime random variable Tx follows a
Weibull distribution. For this model:
$$\mu_{x+t} = \alpha \beta\, t^{\beta - 1}$$
If $\beta = 1$, the hazard function is constant and the Weibull model is the same
as the exponential model.
Gompertz’ law
$$\mu_x = B c^x$$
$${}_tp_x = g^{c^x (c^t - 1)} \quad \text{where } g = \exp\left(\frac{-B}{\log c}\right)$$
Makeham’s law
$$\mu_x = A + B c^x$$
$${}_tp_x = s^t\, g^{c^x (c^t - 1)} \quad \text{where } g = \exp\left(\frac{-B}{\log c}\right) \text{ and } s = \exp(-A)$$
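The closed-form survival probability under Makeham's law can be checked against the general result that the survival probability equals the exponential of minus the integrated force of mortality. The sketch below does this numerically. The parameter values are hypothetical and chosen only for illustration, and the function names are not from the course notes. Setting `A = 0` gives Gompertz' law.

```python
import math

def makeham_tpx(x, t, A, B, c):
    """Closed form: s^t * g^(c^x * (c^t - 1))."""
    s = math.exp(-A)
    g = math.exp(-B / math.log(c))
    return s ** t * g ** (c ** x * (c ** t - 1))

def tpx_by_integration(x, t, A, B, c, steps=10000):
    """exp(-integral of mu_{x+s} over [0, t]) using the midpoint rule."""
    h = t / steps
    integral = sum((A + B * c ** (x + (i + 0.5) * h)) * h for i in range(steps))
    return math.exp(-integral)

# Hypothetical parameters, for illustration only.
A, B, c = 0.0005, 0.00003, 1.1

closed_form = makeham_tpx(60, 10, A, B, c)
numerical = tpx_by_integration(60, 10, A, B, c)
gompertz_closed = makeham_tpx(60, 10, 0.0, B, c)
gompertz_numerical = tpx_by_integration(60, 10, 0.0, B, c)

print(closed_form, numerical)
```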
We have summarised below the key features of each model. The bullet
points cover:
how the model is parameterised
how other decrements are handled
what the key assumptions are
how the likelihood function is determined
how the model is fitted (estimating the parameters)
key formulae / results
what types of exam questions tend to come up
other specific points relating to the model
Kaplan-Meier model
non-parametric
– models the cumulative distribution of the lifetime random
variable Tx
$\hat{\lambda}_j = \frac{d_j}{n_j}$    estimate of the discrete hazard
$\text{var}[\tilde{F}(t)] \approx [1 - \hat{F}(t)]^2 \sum_{t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$    Greenwood’s formula
(on page 33 of Tables)
exam questions
– numerical calculations
– identifying/defining types of censoring
– sketching graphs (usually the distribution function of Tx or the
survival function)
specific points
– also known as the ‘product-limit’ estimator
– very similar to Nelson-Aalen in terms of application
Nelson-Aalen model
non-parametric
– models the integrated hazard corresponding to the lifetime random
variable Tx
$\hat{\Lambda}_j = \sum_{t_j \le t} \frac{d_j}{n_j}$    estimate of integrated hazard
$\hat{F}(t) = 1 - \exp(-\hat{\Lambda}_j)$    estimate of distribution function
$\text{var}[\tilde{\Lambda}(t)] = \sum_{t_j \le t} \frac{d_j (n_j - d_j)}{n_j^3}$    variance formula
specific points
– very similar to Kaplan-Meier in terms of application
– can be obtained as an approximation to the Kaplan-Meier model
– has better statistical properties in small samples than the Kaplan-
Meier estimation procedure
parametric
– the lifetime distribution is assumed to belong to a given family of
parametric distributions
other decrements must be handled using conditional probabilities, which
complicates the likelihood function
key assumptions
– no heterogeneity (everyone has the same hazard rate)
fitting the model
– parameters of the model can be estimated using the method of
maximum likelihood
key formulae / results
possible distributions for the hazard rate include:
– exponential (constant hazard)
– Weibull (monotonic hazard)
– Gompertz-Makeham (exponential hazard)
– log-logistic (‘humped’ hazard)
exam questions
– numerical / algebraic calculations
– using likelihood functions to estimate the parameter(s)
specific points
– can be applied to a single homogeneous group or separately to a
small number of homogeneous groups
– also has applications in modelling loss distributions for insurance
claims data
parametric or non-parametric
$$\lambda_i(t, z_i) = \lambda_0(t)\, g(z_i)$$
exam questions
– numerical calculations
– comparing rates and probabilities for different people
– determining whether specified models are proportional or not
specific points
– this model can handle heterogeneity (ie different rates for different
people)
– can be extended to a form in which the effect of the covariates
changes with duration
semi-parametric
key assumptions
– hazard rates are proportional to baseline hazard
– proportional adjustment is exponential-linear
likelihood function / fitting the model
$L(\beta) = \prod_{j=1}^{k} \frac{\exp(\beta z_j^T)}{\sum_{i \in R(t_j)} \exp(\beta z_i^T)}$    partial likelihood
exam questions
– numerical calculations
– comparing rates and probabilities for different people
– determining whether specified models are proportional or not
– interpreting the parameters (the $\beta$’s)
specific points
– it is a proportional hazards model
– this model can handle heterogeneity (ie different rates for different
people)
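The partial likelihood above can be evaluated directly once the risk set at each observed event time is known. The sketch below assumes one event at each event time (no ties) and uses a small hypothetical data set with a single covariate. The function names and the data are illustrative and are not taken from the course notes.

```python
import math

def linear_predictor(beta, z):
    """beta z^T for a parameter vector beta and covariate vector z."""
    return sum(b * z_k for b, z_k in zip(beta, z))

def partial_likelihood(beta, events):
    """events: one (z_j, risk_set) pair per observed event time t_j, where z_j is
    the covariate vector of the life that has the event and risk_set lists the
    covariate vectors of every life at risk just before t_j (including that life)."""
    likelihood = 1.0
    for z_j, risk_set in events:
        numerator = math.exp(linear_predictor(beta, z_j))
        denominator = sum(math.exp(linear_predictor(beta, z_i)) for z_i in risk_set)
        likelihood *= numerator / denominator
    return likelihood

# Hypothetical data: three lives with covariate values 0, 1 and 1, no censoring.
# The life with z = 1 has the first event, then the life with z = 0, then the other.
events = [
    ([1.0], [[0.0], [1.0], [1.0]]),
    ([0.0], [[0.0], [1.0]]),
    ([1.0], [[1.0]]),
]

null_likelihood = partial_likelihood([0.0], events)
fitted_example = partial_likelihood([0.5], events)
print(null_likelihood, fitted_example)
```

With every parameter set to zero, each factor reduces to one over the size of the risk set, which gives a simple check on the implementation. The baseline hazard does not appear anywhere in the calculation.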
NOTES