
Department of Biostatistics

University of Aarhus

May 11 2004
Michael Væth

STATISTICAL ANALYSIS OF SURVIVAL DATA
IN CLINICAL RESEARCH 4

The main topic in the third period is analysis of
aggregated survival time data, i.e. data in which a
record reflects the survival experience of several
individuals in a given time and/or age period.
Such data are often encountered in epidemiological
studies and the methods presented below are
essentially identical to methods used to analyze
incidence rates and mortality rates in epidemiology.
A comprehensive coverage of analysis of aggregated
survival time data is beyond the scope of this course,
but the main approaches will be presented and
exemplified.
One additional topic related to survival time data with
individual records is also presented today:
calculation of the expected survival curve based on
life tables for an external reference population.


ANALYSIS OF SURVIVAL TIME DATA:
RELATION TO METHODS USED IN EPIDEMIOLOGY
The statistical methods described in period 1 and 2 have
focused on mortality rates and modeled how the rates
depend on prognostic factors.
A clinical study of a group of patients followed until death
or some common closing date may be viewed as an
epidemiological study of a fixed cohort.
The methods for analysis of survival time data are
closely related to methods used for analysis of incidence
and mortality rates in epidemiological cohort studies.
In epidemiology event rates are computed as
$$\text{rate} = \frac{\text{number of events}}{\text{total time at risk}}$$
The time scale and/or age scale is often split in a
number of intervals (e.g. 5-year intervals) and separate
rates are computed for each time/age interval.
The effect of an exposure on the occurrence of the
event can be expressed as a rate ratio, which can be
estimated as a crude rate ratio or stratified on age/time
categories as well as other risk factors. Poisson
regression is used for more comprehensive analysis.

In the Cox regression analysis the hazard rate is
unspecified, i.e. no restriction is imposed on the way the
hazard rate depends on time, and the shape of the
estimated baseline functions (hazard rate and survival
function) is determined completely by the data.
Alternatively, a parametric description of the hazard
rate may be postulated and the unknown parameters
are then estimated from the data, typically as
maximum-likelihood estimates.
A simple parametric model: the exponential
distribution
The simplest possible parametric model for the hazard
rate assumes an unknown, constant rate. The
distribution of life times with a constant hazard rate is
called the exponential distribution.
In this case we have that
$$\lambda(t) = \lambda, \qquad S(t) = \exp(-\lambda t) = e^{-\lambda t}$$
The maximum-likelihood estimate of the constant
hazard rate is
$$\hat\lambda = \frac{d}{s} = \frac{\text{number of events}}{\text{total time at risk}}$$
The standard error of $\hat\lambda$ is estimated by
$$SE(\hat\lambda) = \frac{\sqrt{d}}{s}$$

A 95% confidence interval for the unknown rate is
usually obtained by computing a symmetric confidence
interval for ln(rate) and transforming this interval back to
the original scale.
One may show that
$$SE(\ln(\hat\lambda)) = \frac{1}{\sqrt{d}},$$
so a 95% confidence interval for the constant hazard
rate has
$$\text{lower bound} = \hat\lambda / C, \qquad \text{upper bound} = \hat\lambda \cdot C$$
where
$$C = \exp\left(1.96 / \sqrt{d}\right)$$
Note: The individual survival times are not needed to
compute the estimate, the standard errors and the
confidence limits. They can all be obtained directly
from the aggregated data d and s.
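The formulas above can be checked numerically. The following Python sketch (Python is used here only for illustration and is not part of the course material) computes the rate, its standard error and the 95% confidence limits from d and s alone, using the totals of the melanoma example in these notes (d = 57 deaths, s = 1208.2793 person-years). The function name is hypothetical.

```python
import math

def exponential_rate_ci(d, s, z=1.96):
    """Rate d/s with a confidence interval computed on the ln(rate) scale."""
    rate = d / s
    se_rate = math.sqrt(d) / s        # SE of the rate itself
    c = math.exp(z / math.sqrt(d))    # C = exp(1.96 * SE(ln(rate)))
    return rate, se_rate, rate / c, rate * c

rate, se_rate, lower, upper = exponential_rate_ci(57, 1208.2793)
print(f"rate={rate:.5f}  SE={se_rate:.5f}  95% CI=({lower:.5f}, {upper:.5f})")
```

Only the two aggregated numbers enter the calculation, which is the point of the note above.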
Example: Survival with malignant melanoma
Consider the data used in Exercise 12. First a patient
identification number is generated (this is needed for
some of the commands), then the data are defined as
survival time data:
gen id=_n
stset survtime , failure(status==1) id(id) ///
      noshow scale(365.25)
stptime

The stptime command generates the following output. The
calculations are based on the formulas above.
  Cohort |  person-time  failures        rate   [95% Conf. Interval]
---------+------------------------------------------------------------
   total |    1208.2793        57   .04717452   .0363884   .0611578

Separate rates for each category of a covariate are also
available:
stptime , by(sex)
     sex |  person-time  failures        rate   [95% Conf. Interval]
---------+------------------------------------------------------------
  female |    787.46886        28   .03555696   .0245506   .0514976
    male |     420.8104        29   .06891465   .0478903   .0991689
---------+------------------------------------------------------------
   total |    1208.2793        57   .04717452   .0363884   .0611578

Only one variable is allowed in the by option. To get
separate rates for intervals of follow-up time use the
option at(), which may be combined with the by option,
e.g.
stptime , at(2(2)8)
stptime , at(2(2)8) by(sex)

Output from the first command
  Cohort |  person-time  failures        rate   [95% Conf. Interval]
---------+------------------------------------------------------------
 (0 - 2] |    387.25394        15   .03873427   .0233516   .0642502
 (2 - 4] |    338.72005        21    .0619981   .0404232    .095088
 (4 - 6] |    241.72895        14   .05791611    .034301   .0977896
 (6 - 8] |    131.79329         5    .0379382   .0157909   .0911477
     > 8 |    108.78303         2   .01838522   .0045981   .0735122
---------+------------------------------------------------------------
   total |    1208.2793        57   .04717452   .0363884   .0611578

Apparently, the mortality rate initially increases to reach
a plateau and then decreases.
The underlying life time distribution is no longer
exponential when the rate is computed for different
follow-up time intervals. The rate is now piecewise
constant.
Distributions with piecewise constant hazard rate
constitute a flexible class of distributions, which are just
as easy to work with as the exponential distribution,
which usually provides too crude a picture of the
distribution of lifetimes.
The distributions are characterized by the value of the
hazard rate in each of a number of disjoint intervals:
$$\lambda(t) = \begin{cases} \lambda_1 & 0 < t \le \tau_1 \\ \lambda_2 & \tau_1 < t \le \tau_2 \\ \vdots & \\ \lambda_r & \tau_{r-1} < t \end{cases}$$
For interval j, from $\tau_{j-1}$ to $\tau_j$ (with $\tau_0 = 0$), let
$d_j$ = the number of events
$s_j$ = the total time at risk
Knowledge of the statistics $d_1, d_2, \ldots, d_r, s_1, s_2, \ldots, s_r$
permits calculation of all relevant estimates and test
statistics.

For each interval the value of the hazard rate is
estimated by
$$\hat\lambda_j = d_j / s_j,$$
the corresponding standard error becomes
$$SE(\hat\lambda_j) = \sqrt{d_j} / s_j,$$
and 95% confidence intervals are obtained as
before.
A distribution with piecewise constant hazard rate can
be viewed as a parametric version of the life table and
we may in fact estimate the survival function in a way
very similar to the one used when computing the life
table estimate of the survival function. The probability of
surviving the jth interval given alive at the beginning of
the interval is estimated by
$$\hat p_j = \exp\left(-\hat\lambda_j (\tau_j - \tau_{j-1})\right),$$
where $\tau_{j-1}$ and $\tau_j$ are the end points of interval j,
and the probability of surviving from time 0 until the end
of the jth interval is then estimated by
$$\hat S(\tau_j) = \hat p_1 \hat p_2 \cdots \hat p_j$$
Distributions with piecewise constant hazards provide
the link between the methods used for analysis of
survival data and the methods used for analysis of
epidemiological cohort studies.
Survival analysis methodology uses individual records
whereas the epidemiological analysis usually is based
on a multi-way table of aggregated data.
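As an illustration of the survival estimate for a piecewise constant hazard, the following Python sketch uses the event counts and person-time of the four closed follow-up intervals (0-2], (2-4], (4-6], (6-8] from the stptime output for the melanoma data. The open interval beyond 8 years is left out because it has no upper end point. The function and variable names are assumptions made for this example.

```python
import math

# Interval end points (years) and, per interval, events and person-years
# taken from the stptime , at(2(2)8) output for the melanoma data.
cuts = [0, 2, 4, 6, 8]
d = [15, 21, 14, 5]
s = [387.25394, 338.72005, 241.72895, 131.79329]

def piecewise_survival(cuts, d, s):
    """Return [(tau_j, S(tau_j))] for a piecewise constant hazard."""
    surv, out = 1.0, []
    for j in range(len(d)):
        rate = d[j] / s[j]                                  # lambda_j = d_j / s_j
        surv *= math.exp(-rate * (cuts[j + 1] - cuts[j]))   # multiply by p_j
        out.append((cuts[j + 1], surv))
    return out

for tau, surv in piecewise_survival(cuts, d, s):
    print(f"S({tau}) = {surv:.4f}")
```

Each step multiplies the running product by the conditional survival probability of one interval, exactly as in a life table.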

Example: Survival with malignant melanoma
The STATA command
stptime , at(2(2)8) by(sex)

produces the following output
     sex |  person-time  failures        rate   [95% Conf. Interval]
---------+------------------------------------------------------------
female   |
 (0 - 2] |    243.64408         5   .02052174   .0085417   .0493041
 (2 - 4] |    221.13621        11    .0497431   .0275477   .0898214
 (4 - 6] |    159.99452         8   .05000171   .0250057   .0999839
 (6 - 8] |    86.639288         2   .02308422   .0057733   .0923008
     > 8 |    76.054757         2   .02629684   .0065768   .1051463
---------+------------------------------------------------------------
male     |
 (0 - 2] |    143.60986        10    .0696331   .0374664   .1294164
 (2 - 4] |    117.58385        10    .0850457   .0457592   .1580614
 (4 - 6] |    81.734428         6   .07340848   .0329795   .1633984
 (6 - 8] |    45.154004         3   .06643929   .0214281   .2059996
     > 8 |    32.728268         0           0          .          .
---------+------------------------------------------------------------
   total |    1208.2793        57   .04717452   .0363884   .0611578

To compare the survival for males and females
controlling for follow-up time (categorized in five time
intervals) using standard epidemiological methods only
the 2x5 person-times and the 2x5 failure counts are needed.
STATA has the commands stsplit and collapse, which
can be used to form an aggregated data set. This new
data set can then be analyzed by a series of commands
for analysis of aggregated data.
First the individual records are split after 2, 4, 6, and 8
years of follow-up.


stsplit timecat , at(2(2)8)

We then define the new variables died and risktime
from the system variables _d, _t0 and _t, which for each
interval give the event count (0 or 1), the start time and
the end time of the interval:
gen died=_d
gen risktime=_t-_t0

All the individual contributions are then aggregated (i.e.
summed) in a two-way table of sex versus time interval
and the result is saved in a new file:
collapse (sum) died risktime , by(timecat sex)
save e:\kurser\survival\melatimesex.dta

To see the contents of the new file write
use e:\kurser\survival\melatimesex.dta
list
. list
     +------------------------------------+
     |    sex   timecat   died   risktime |
     |------------------------------------|
  1. | female         0      5   243.6441 |
  2. |   male         0     10   143.6099 |
  3. | female         2     11   221.1362 |
  4. |   male         2     10   117.5838 |
  5. | female         4      8   159.9945 |
     |------------------------------------|
  6. |   male         4      6   81.73443 |
  7. | female         6      2   86.63929 |
  8. |   male         6      3     45.154 |
  9. | female         8      2   76.05476 |
 10. |   male         8      0   32.72827 |
     +------------------------------------+

Note that timecat takes the lower limit of the interval as
category value.
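The logic of splitting the individual records and summing the contributions can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how stsplit and collapse are implemented. The three records are hypothetical, and the category value is the lower limit of the interval, as in the listing above.

```python
from collections import defaultdict

def split_follow_up(time, event, cuts):
    """Split one follow-up period (0, time] at the cut points.

    Returns a list of (interval_start, risktime, died) pieces; only the
    last piece can carry the event.
    """
    bounds = [0] + [c for c in cuts if c < time] + [time]
    pieces = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        died = int(bool(event) and stop == time)
        pieces.append((start, stop - start, died))
    return pieces

# Hypothetical individual records: (sex, follow-up time in years, died)
records = [("female", 5.0, 1), ("male", 1.5, 1), ("female", 9.2, 0)]

table = defaultdict(lambda: [0, 0.0])   # (timecat, sex) -> [died, risktime]
for sex, time, event in records:
    for start, risk, died in split_follow_up(time, event, cuts=[2, 4, 6, 8]):
        table[(start, sex)][0] += died
        table[(start, sex)][1] += risk

for key in sorted(table):
    print(key, table[key])
```

A person followed for 5 years and dying at that time contributes 2 years without an event to each of the first two intervals and 1 year with an event to the third.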


To compute the mortality rate ratio for males versus
females stratified on categories of follow-up time write
ir died sex risktime , by(timecat)

The command syntax is
ir event-variable exposure-variable time-at-risk-variable , ///
      by(stratum-variable)

The exposure variable must have two categories and
only one stratum variable is allowed.
Output
. ir died sex risktime , by(timecat)
     timecat |       IRR   [95% Conf. Interval]   M-H Weight
-------------+------------------------------------------------
           0 |  3.388889   1.055401   12.63597      1.85567
           2 |  1.702619   .6482636   4.416032     3.828909
           4 |  1.463415   .4185228   4.809543     2.710744
           6 |       2.9   .3322018   34.72104     .6818182
           8 |         0          0   12.26261     .6055046
-------------+------------------------------------------------
       Crude |  1.936121   1.111589   3.377279      (exact)
M-H combined |  1.936666   1.147481   3.268616
--------------------------------------------------------------
Test of homogeneity (M-H)   chi2(4) = 1.60   Pr>chi2 = 0.8096

Essentially the same result is obtained by a Cox
regression analysis of the original data set with
individual records:
--------------------------------------------------------------
     _t | Haz. Ratio  Std. Err.     z   P>|z|  [95% Conf. Inter]
--------+-----------------------------------------------------
    sex |   1.939011  .5140979   2.50  0.013  1.153182 3.260339
--------------------------------------------------------------


EXAMPLE: THE LIFE SPAN STUDY
The mortality of approximately 100,000 survivors of
the atomic bombings of Hiroshima and Nagasaki has
been followed since October 1950 in a still ongoing
study called the Life Span Study (LSS).
The table below gives aggregated data on cancer
mortality from 1971 to 1990 in Hiroshima. Only two dose
categories are considered here.

Age at             No. in                Cancer    Risk time
exposure    Sex    1950     Dose (Gy)    deaths    (in 1000 y)
--------------------------------------------------------------
0-19        M        392    1                23        6.77
                    3696    0.005           105       67.33
            F        406    1                22        7.12
                    4221    0.005            71       79.03
20-39       M        219    1                37        2.92
                    1774    0.005           218       23.93
            F        506    1                69        7.82
                    4272    0.005           266       71.48
40-         M        431    1                40        1.59
                    3355    0.005           233       13.50
            F        376    1                37        2.47
                    4052    0.005           255       27.19
--------------------------------------------------------------
LSS: Mortality, all cancers combined, Hiroshima, 1971-90

The STATA file hiro7190.dta contains these data. The
variable names are agex, sex, dose, cases, and pyr.


The following STATA commands may be used to
estimate the effect of exposure stratified on
age-at-exposure and sex:
use e:\kurser\survival\hiro7190.dta
//combine agex and sex into a single variable
egen agexsex=group(agex sex)
//avoid rounding error
replace pyr=pyr*1000
ir cases dose pyr , by(agexsex)

Output
group(agex sex) |      IRR   [95% Conf. Interval]   M-H Weight
----------------+----------------------------------------------
              1 | 2.178505   1.323522   3.445787     9.593117
              2 |  3.43935   2.029444   5.615747     5.867905
              3 | 1.390929   .9539309   1.977719     23.70801
              4 | 2.371075   1.792351   3.100524     26.23102
              5 | 1.457608    1.01505   2.045349      24.5507
              6 | 1.597253   1.099477    2.26114     21.23567
----------------+----------------------------------------------
          Crude | 1.955327   1.688804   2.255824
   M-H combined | 1.852352   1.607143   2.134972
---------------------------------------------------------------
Test of homogeneity (M-H)   chi2(5) = 15.53   Pr>chi2 = 0.0083

Note: The last stratum variable is moving fastest, i.e.
1 ~ 0-19 M, 2 ~ 0-19 F, 3 ~ 20-39 M, etc.
The rate ratio for exposure effect is almost 2 and highly
significant. The test of homogeneity (identical
stratum-specific rate ratios) is, however, also statistically
significant, indicating that the effect of exposure
depends on sex and/or age at exposure of the survivor.
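The crude and the Mantel-Haenszel combined rate ratios can be reproduced directly from the aggregated LSS table. The Python sketch below assumes the usual Mantel-Haenszel estimator for person-time data, in which stratum i contributes a_i T0_i / T_i to the numerator and b_i T1_i / T_i to the denominator (a and b are the exposed and unexposed deaths, T1 and T0 the corresponding person-years, T their sum). The function names are hypothetical.

```python
# One tuple per stratum of age-at-exposure and sex, in the order
# 0-19 M, 0-19 F, 20-39 M, 20-39 F, 40- M, 40- F:
# (exposed deaths, exposed pyr, unexposed deaths, unexposed pyr)
strata = [
    (23, 6770, 105, 67330),
    (22, 7120,  71, 79030),
    (37, 2920, 218, 23930),
    (69, 7820, 266, 71480),
    (40, 1590, 233, 13500),
    (37, 2470, 255, 27190),
]

def crude_rate_ratio(strata):
    """Rate ratio ignoring the stratification."""
    a = sum(x[0] for x in strata)
    t1 = sum(x[1] for x in strata)
    b = sum(x[2] for x in strata)
    t0 = sum(x[3] for x in strata)
    return (a / t1) / (b / t0)

def mantel_haenszel_rate_ratio(strata):
    """Mantel-Haenszel combined rate ratio for person-time data."""
    num = sum(a * t0 / (t1 + t0) for a, t1, b, t0 in strata)
    den = sum(b * t1 / (t1 + t0) for a, t1, b, t0 in strata)  # M-H weights
    return num / den

crude = crude_rate_ratio(strata)
mh = mantel_haenszel_rate_ratio(strata)
print(f"crude IRR = {crude:.4f}, M-H combined IRR = {mh:.4f}")
```

The terms of the denominator are the values listed in the M-H Weight column of the output.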


A further investigation of this effect modification requires
a more refined method of analysis, a so-called Poisson
regression analysis.

POISSON REGRESSION
In the Poisson regression analysis the number of
events in a given cell of the multi-way table is treated as
a Poisson variable with mean equal to rate × risktime.
Comment: A Poisson distribution with mean $\mu$ is the
limiting distribution of a binomial distribution (n,p) as
n goes to infinity and p tends to zero such that the mean
$\mu = n p$ is fixed. Poisson distributions are often used to
model occurrence of random events.
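The limiting property in the comment can be illustrated numerically. The Python sketch below compares binomial and Poisson probabilities for a fixed mean while n grows. The mean of 3 and the range of counts are arbitrary choices made for the illustration.

```python
import math

def binomial_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mu):
    return math.exp(-mu) * mu**k / math.factorial(k)

def max_gap(n, mu, kmax=10):
    """Largest absolute difference between the two pmfs for k = 0..kmax."""
    p = mu / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu))
               for k in range(kmax + 1))

mu = 3.0   # hypothetical mean, chosen only for illustration
for n in (10, 100, 10000):
    print(f"n={n:6d}  p={mu / n:.4f}  max |binomial - Poisson| = {max_gap(n, mu):.6f}")
```

The difference shrinks as n increases with n p held fixed.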
In a Poisson regression model the rate in a given cell is
modeled as a product of factors reflecting the effect of
the category levels of the variables defining the
multi-way table.
Example: Cancer mortality in the LSS
The LSS data above form a 3x2x2 table with agex, sex,
and dose as classifying variables.
A Poisson regression model specifies a multiplicative
structure for the mortality rate $\lambda_{ijk}$ in the cell given by
agex=i, sex=j, dose=k (i = 0,1,2, j = 0,1, k = 0,1).
If a reference category is chosen for each of the
classifying variables (e.g. i = j = k = 0), the Poisson
regression model with no interaction assumes that the
rates satisfy
$$\lambda_{ijk} = \lambda_{000}\, a_i\, b_j\, c_k$$
The parameters $a_i$, $b_j$, and $c_k$ are rate ratios
(with $a_0 = b_0 = c_0 = 1$). The
parameter $a_1$ represents the rate ratio of the mortality in
the second age-at-exposure category relative to the first
category when controlling for sex and dose as
independent risk factors.
Poisson models are usually specified as additive models
for the ln(rate). Using dummy variables we have
$$\ln(\text{rate}) = \ln(\lambda_{ijk}) = \beta_0 + \sum_i \beta_i^{(1)} z_i^{(1)} + \sum_j \beta_j^{(2)} z_j^{(2)} + \sum_k \beta_k^{(3)} z_k^{(3)}$$
The constant, the parameter $\beta_0$, is the ln(rate) in the
reference cell and the other $\beta$-parameters are
logarithms of rate ratios. Models with interaction terms
may also be used.
Poisson regression with STATA
The following commands fit the above Poisson
regression model to the LSS data in hiro7190.dta.
Note that the output from the default version (the first
command) gives the log-linear parameter estimates. To
get rate ratios the option irr must be added (second
version).
xi:poisson cases i.sex i.agex i.dose , ///
      exposure(pyr) nolog
xi:poisson cases i.sex i.agex i.dose , ///
      exposure(pyr) nolog irr

Output
. xi:poisson cases i.sex i.agex i.dose , exposure(pyr) nolog
i.sex     _Isex_1-2    (naturally coded; _Isex_1 omitted)
i.agex    _Iagex_1-3   (naturally coded; _Iagex_1 omitted)
i.dose    _Idose_0-1   (naturally coded; _Idose_0 omitted)

Poisson regression                   Number of obs =      12
                                     LR chi2(4)    = 1150.09
                                     Prob > chi2   =  0.0000
Log likelihood = -47.475645          Pseudo R2     =  0.9237
--------------------------------------------------------------
   cases |     Coef.  Std. Err.      z   P>|z|  [95% Conf. Int.]
---------+----------------------------------------------------
 _Isex_2 | -.6606352   .054719  -12.07  0.000  -.767882  -.55339
_Iagex_2 |  1.529529  .0798769   19.15  0.000   1.37297  1.68609
_Iagex_3 |   2.29477  .0796431   28.81  0.000   2.13867  2.45087
_Idose_1 |  .6165749  .0725606    8.50  0.000   .474358   .75879
   _cons | -6.357755  .0712617  -89.22  0.000  -6.49743  -6.2181
     pyr | (exposure)
--------------------------------------------------------------
. xi:poisson cases i.sex i.agex i.dose , exposure(pyr) irr
******************* first part as above **********************
--------------------------------------------------------------
   cases |       IRR  Std. Err.      z   P>|z|  [95% Conf. Int.]
---------+----------------------------------------------------
 _Isex_2 |  .5165231  .0282636  -12.07  0.000   .463995  .574998
_Iagex_2 |  4.616002   .368712   19.15  0.000   3.94707  5.39830
_Iagex_3 |  9.922157  .7902317   28.81  0.000   8.48816  11.5984
_Idose_1 |  1.852572  .1344237    8.50  0.000   1.60698  2.13569
     pyr | (exposure)
--------------------------------------------------------------

The reference group is unexposed males, age 0-19 in
August 1945. Note that the constant term is not printed
when rate ratios are requested.
Parameter estimates are maximum-likelihood
estimates. The dose effect is extremely significant and
almost identical to the one found previously: in the
stratified analysis we had 1.852352. As expected the
mortality also depends on sex and age-at-exposure.
Does the model fit the data?
The table below compares the observed counts with
the expected counts predicted by the Poisson model fitted
above. We have
$$\text{Expected}_{ijk} = \hat\lambda_{ijk} \cdot \text{PYR}_{ijk}, \qquad \text{where } \hat\lambda_{ijk} = \hat\lambda_{000}\, \hat a_i\, \hat b_j\, \hat c_k .$$


                          unexposed               exposed
Age at exposure       0-19  20-39    40-     0-19  20-39    40-
Males    observed      105    218    233       23     37     40
         expected    116.7  191.5  232.2     21.7   43.3   50.7
Females  observed       71    266    255       22     69     37
         expected     70.8  295.4  241.5     11.8   59.9   40.6

Illustration: For exposed males aged 20-39 at exposure
(agex=1, sex=0, dose=1) we have e.g.
$$\text{Expected}_{101} = \exp(-6.358) \cdot 4.616 \cdot 1.853 \cdot 2.92 \cdot 1000 = 43.279$$
The usual $\chi^2$ goodness-of-fit test becomes 22.27 with
12 - 5 = 7 degrees of freedom giving p = 0.0023.
STATA's command poisgof computes this statistic and
the corresponding likelihood ratio test:
poisgof
poisgof , pearson

Output
. poisgof
Goodness-of-fit chi2 = 20.61145
Prob > chi2(7)       =   0.0044
. poisgof , pearson
Goodness-of-fit chi2 = 22.27476
Prob > chi2(7)       =   0.0023
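The expected counts and both goodness-of-fit statistics can be recomputed from the printed estimates. The Python sketch below takes the constant and the rate ratios of the main-effects model and the observed counts and person-years of the LSS table. The likelihood ratio statistic is computed with the standard Poisson deviance formula, 2 Σ [O ln(O/E) - (O - E)], which is assumed here to be what the default poisgof output reports. All names are hypothetical.

```python
import math

# Estimates from the main-effects model: ln(rate) in the reference cell
# (males, age 0-19 at exposure, unexposed) and the rate ratios.
ln_rate_ref = -6.357755
rr_sex = {"M": 1.0, "F": 0.5165231}
rr_agex = {"0-19": 1.0, "20-39": 4.616002, "40-": 9.922157}
rr_dose = {"unexp": 1.0, "exp": 1.852572}

# (agex, sex, dose): (observed cancer deaths, person-years)
cells = {
    ("0-19", "M", "exp"): (23, 6770),    ("0-19", "M", "unexp"): (105, 67330),
    ("0-19", "F", "exp"): (22, 7120),    ("0-19", "F", "unexp"): (71, 79030),
    ("20-39", "M", "exp"): (37, 2920),   ("20-39", "M", "unexp"): (218, 23930),
    ("20-39", "F", "exp"): (69, 7820),   ("20-39", "F", "unexp"): (266, 71480),
    ("40-", "M", "exp"): (40, 1590),     ("40-", "M", "unexp"): (233, 13500),
    ("40-", "F", "exp"): (37, 2470),     ("40-", "F", "unexp"): (255, 27190),
}

def expected_count(agex, sex, dose, pyr):
    """Fitted rate in the cell multiplied by the person-years in the cell."""
    rate = math.exp(ln_rate_ref) * rr_agex[agex] * rr_sex[sex] * rr_dose[dose]
    return rate * pyr

expected = {key: expected_count(*key, pyr) for key, (obs, pyr) in cells.items()}

pearson = sum((obs - expected[key]) ** 2 / expected[key]
              for key, (obs, pyr) in cells.items())
deviance = 2 * sum(obs * math.log(obs / expected[key]) - (obs - expected[key])
                   for key, (obs, pyr) in cells.items())

print(f"expected, exposed males 20-39: {expected[('20-39', 'M', 'exp')]:.2f}")
print(f"Pearson chi2 = {pearson:.2f}, deviance = {deviance:.2f}")
```

Both statistics are referred to a chi-square distribution with 12 - 5 = 7 degrees of freedom.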


The fit of the model can be improved by adding
interaction terms. The following output shows the result
of a series of such model fits. Only the output from the
final model is shown. Note that the first model, which
includes the agex*sex interaction, corresponds to the
stratified analysis above.
. quietly xi:poisson cases i.sex*i.agex i.dose , exposure(pyr) irr
. poisgof
Goodness-of-fit chi2 = 14.97348
Prob > chi2(5)       =   0.0105
. quietly xi:poisson cases i.sex*i.agex i.dose i.dose*i.sex , ///
      exposure(pyr) irr
. poisgof
Goodness-of-fit chi2 = 9.508152
Prob > chi2(4)       =   0.0496
. quietly xi: poisson cases i.sex*i.agex i.dose i.dose*i.sex ///
      i.dose*i.agex , exposure(pyr) irr
. poisgof
Goodness-of-fit chi2 = 1.890253
Prob > chi2(2)       =   0.3886
. poisson //with no argument the previous fit is displayed
Poisson regression                   Number of obs =      12
                                     LR chi2(9)    = 1168.81
                                     Prob > chi2   =  0.0000
Log likelihood = -38.115048          Pseudo R2     =  0.9388
--------------------------------------------------------------
       cases |      IRR  Std. Err.     z   P>|z|  [95% Conf. Inter]
-------------+------------------------------------------------
     _Isex_2 |  .588025   .082229  -3.80  0.000  .447058  .773442
    _Iagex_2 |  5.79417   .660672  15.41  0.000  4.63376  7.24516
    _Iagex_3 |  11.3722   1.28203  21.57  0.000  9.11773   14.184
_IsexXage_~2 |  .715692    .11442  -2.09  0.036   .52317   .97907
_IsexXage_~3 |  .891059   .143171  -0.72  0.473   .65034  1.22088
    _Idose_1 |  2.27988   .411002   4.57  0.000  1.60127  3.24609
_IdosXsex_~2 |  1.43131   .210778   2.44  0.015  1.07246  1.91022
_IdosXage_~2 |  .679969   .135972  -1.93  0.054   .45949  1.00624
_IdosXage_~3 |  .557839   .115872  -2.81  0.005  .371279  .838141
         pyr | (exposure)
--------------------------------------------------------------

. testparm *sXa* //testing no dose by age interaction
 ( 1) [cases]_IdosXage_1_2 = 0
 ( 2) [cases]_IdosXage_1_3 = 0
      chi2( 2)    =   7.90
      Prob > chi2 = 0.0193
. testparm *xXa* //testing no sex by age interaction
 ( 1) [cases]_IsexXage_2_2 = 0
 ( 2) [cases]_IsexXage_2_3 = 0
      chi2( 2)    =   5.74
      Prob > chi2 = 0.0566

Comments
The final model is consistent with the data, but gives a
rather complex description.
The dose effect is modified by both sex (larger rate ratio
for females) and age-at-exposure (the dose effect
decreases with age-at-exposure).
Having 10 estimated parameters, the final model is only
slightly simpler than the saturated model (i.e. the model
with 12 freely varying rates).
Note also
The goodness-of-fit test is not very reliable in large
tables with many small counts. In such circumstances
one may instead compare a given model with a much
richer model that e.g. includes a lot of interaction
parameters.
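To see what the effect modification means in practice, the dose rate ratio for each combination of sex and age-at-exposure can be obtained by multiplying the main effect of dose by the relevant interaction factors from the final model. The Python sketch below does this with the printed estimates. The dictionary names are hypothetical, and the reference categories (males, age 0-19) have factor 1.

```python
# Estimates from the final model: dose main effect and the dose-by-sex and
# dose-by-age interaction factors (reference categories have factor 1).
dose_main = 2.27988
dose_by_sex = {"M": 1.0, "F": 1.43131}
dose_by_agex = {"0-19": 1.0, "20-39": 0.679969, "40-": 0.557839}

# Dose rate ratio (exposed versus unexposed) within each sex/age group
dose_rr = {(sex, agex): dose_main * f_sex * f_agex
           for sex, f_sex in dose_by_sex.items()
           for agex, f_agex in dose_by_agex.items()}

for (sex, agex), value in dose_rr.items():
    print(f"{sex}  {agex:6s} dose rate ratio = {value:.2f}")
```

The products are larger for females than for males in every age group and decrease with age-at-exposure, in line with the comments above.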


FROM SURVIVAL TIME DATA
TO POISSON REGRESSION ANALYSIS
In the LSS example the cancer mortality rate in each of
the 12 groups was assumed to be constant during the
follow-up from 1971 to 1990. This is a highly unrealistic
model, since it is well known that cancer mortality rates
increase dramatically with age.
In analyses of data from large epidemiological cohort
studies the dependence of rates on age and calendar
time is usually described by piecewise constant
hazard rate models.
This gives much more realistic models with a better
correction for confounding effects of age and/or calendar
time.
The analysis of such models is based on event counts
and risk times in a multi-way table and in this context
the method of analysis is usually denoted Poisson
regression, since the analysis is formally identical (i.e.
gives the same maximum likelihood estimates) to a
regression model for counts described by Poisson
distributions.
Individual data records are initially aggregated to form
the multi-way table of event counts and
person-years-at-risk, see the figure below.

[Figure: grid of cells defined by age (vertical axis, 10 to 25)
and calendar time (horizontal axis, 1980 to 2000).]
In each cell compute:
Number of events D
Total time at risk S

In the analysis of the LSS data multi-way tables with
3000-8000 cells are routinely used. These tables are
e.g. formed by a cross-classification on age (5-year
intervals), calendar time (5-year intervals), sex, city
and dose (8-12 categories) and separate analyses are
carried out for the most common cancer types.

For each entry in the multi-way table a crude rate can be
estimated as D/S = events/risktime. In the analysis the
dependence of these rates on the classifying factors is
studied using Poisson regression models very similar to
Cox regression models.
Main difference: the unspecified baseline hazard of the
Cox regression model is replaced by a piecewise
constant hazard.
In Poisson regression models effects of categorical
covariates used as classifying factors are described by
rate ratios. Both models with internal reference rates
and models with external reference rates are available.
The analysis requires software that can
1. Form the multi-way table of counts and person-years,
2. Perform a Poisson regression analysis of the
aggregated data.
Software:
Forming the table:
EPICURE, SAS, STATA (but not SPSS)
Poisson regression:
EPICURE, EGRET, SAS, STATA, S-Plus, Genstat,
GLIM, Statistix etc. (SPSS: not really).


FORMING EVENT-RISKTIME TABLES WITH STATA
The STATA commands stsplit and collapse are used
to transform survival time data with individual records
into a multi-way event-risktime table to be analyzed with
Poisson regression.
A few examples illustrate some of the possibilities. The
manual presents many more; stsplit is a highly
versatile command.
Example 1: Splitting on time in study
In a clinical trial the data are usually described by
stset time , failure(status==1) noshow

To split the data at 1, 2, 3, and 5 years of follow-up write
stsplit timecat , at(1,2,3,5)

and the data are split in 5 time categories.
Example 2: Splitting on age
If the survival time data are defined by
stset outdate , failure(status==1) enter(time indate) ///
      origin(time bdate) scale(365.25) noshow

the time scale is age in years and we may consider
using
stsplit agecat , at(10(10)70)

Example 3: Splitting on age and time in study
The data considered in example 2 can be split on both
age and time in study with the commands
stset outdate , failure(status==1) enter(time indate) ///
      origin(time bdate) scale(365.25) noshow
stsplit agecat , at(10(10)70)
stsplit timecat , at(5(5)25) ///
      from(time indate)

After the data have been split the multi-way table is
formed by the commands
gen event=_d
gen risktime=_t-_t0
collapse (sum) event risktime , by(varlist)
save newfilename
use newfilename
xi: poisson event varlist1 , exposure(risktime) other options
etc.
where varlist1 is a subset of the variables defining the
multi-way table and interaction terms.
Note:
The data do not have to be collapsed to do Poisson
regression, but the data set may become very large if split on
several time scales in many intervals, and collapsing the
data may speed up computation. Also consider deleting
unnecessary variables first.
POISSON REGRESSION:
MALIGNANT MELANOMA DATA
To compare the results from a Cox regression analysis
with those from a Poisson regression model of the same
covariates consider
use "E:\kurser\survival\melanoma.dta"
* generate a person id number
gen id=_n
* define data as survival time data
stset survtime , ///
      failure(status==1) noshow scale(365.25) id(id)
* for later comparison we fit the following
* Cox model
xi:stcox i.sex i.invasion i.ecells ///
      i.ulcerat , nolog
* now split on follow-up time
stsplit timecat , at(2(2)8)
gen died=_d
gen risktime=_t-_t0
collapse (sum) risktime died , ///
      by(timecat sex invasion ecells ulcerat)
* and save the multi-way table
save e:\kurser\survival\mmtable.dta
use e:\kurser\survival\mmtable.dta
* fit the corresponding
* poisson regression model
xi:poisson died ///
      i.sex i.invasion i.ecells i.ulcerat ///
      i.timecat , exposure(risktime) irr

Apart from the baseline hazard rate the two models are
identical and both give results as rate ratios.
Selected Output
Cox regression -- no ties
No. of subjects =         205        Number of obs =     205
No. of failures =          57
Time at risk    = 1208.279261
                                     LR chi2(5)    =   44.51
Log likelihood  =  -260.94353        Prob > chi2   =  0.0000
--------------------------------------------------------------
          _t | Haz. Ratio Std. Err.    z   P>|z|  [95% Conf. Int]
-------------+------------------------------------------------
     _Isex_1 |   1.87870  .509345   2.33  0.020  1.10429  3.19618
_Iinvasion_1 |   2.14216  .711768   2.29  0.022  1.11693  4.10845
_Iinvasion_2 |   2.78566  1.09658   2.60  0.009  1.28781  6.02569
  _Iecells_1 |   1.79241  .547121   1.91  0.056  .985399   3.2603
 _Iulcerat_1 |   2.75137   .88215   3.16  0.002   1.4677  5.15780
--------------------------------------------------------------

Poisson regression                   Number of obs =     109
                                     LR chi2(9)    =   50.08
                                     Prob > chi2   =  0.0000
Log likelihood = -87.668122          Pseudo R2     =  0.2222
--------------------------------------------------------------
        died |       IRR  Std. Err.    z   P>|z|  [95% Conf. Int]
-------------+------------------------------------------------
     _Isex_1 |   1.85962  .504960   2.28  0.022  1.09217  3.16635
_Iinvasion_1 |    2.1712  .719682   2.34  0.019  1.13382  4.15761
_Iinvasion_2 |   2.71461  1.06812   2.54  0.011  1.25541  5.86988
  _Iecells_1 |   1.81955  .555404   1.96  0.050  1.00033   3.3097
 _Iulcerat_1 |   2.76710  .885904   3.18  0.001  1.47743  5.18254
 _Itimecat_2 |   1.88015  .638056   1.86  0.063  .966772  3.65645
 _Itimecat_4 |   1.79353  .668697   1.57  0.117  .863671  3.72451
 _Itimecat_6 |   1.27918  .662529   0.48  0.635  .463521  3.53018
 _Itimecat_8 |   .613017  .462909  -0.65  0.517  .139541   2.6930
    risktime | (exposure)
--------------------------------------------------------------

The ratio between corresponding estimates varies
between 0.974 and 1.015, so estimated rate ratios are
indeed very similar in the two models. This is not
surprising since a piecewise constant hazard rate based
on 5 time intervals is rather flexible and it is therefore
possible to approximate the shape of a wide range of
baseline hazard rate functions.
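The quoted range of the ratios can be verified from the two sets of printed estimates. The short Python sketch below divides each Poisson rate ratio by the corresponding Cox hazard ratio. The dictionary keys are shortened labels chosen for this example.

```python
# Hazard ratios from the Cox model and rate ratios from the Poisson model,
# copied from the selected output for the melanoma data.
cox = {"sex": 1.87870, "invasion_1": 2.14216, "invasion_2": 2.78566,
       "ecells": 1.79241, "ulcerat": 2.75137}
poisson = {"sex": 1.85962, "invasion_1": 2.1712, "invasion_2": 2.71461,
           "ecells": 1.81955, "ulcerat": 2.76710}

ratios = {name: poisson[name] / cox[name] for name in cox}
for name, value in ratios.items():
    print(f"{name:11s} {value:.3f}")
print(f"range: {min(ratios.values()):.3f} to {max(ratios.values()):.3f}")
```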


USING POPULATION MORTALITY RATES IN
THE ANALYSIS
Main types of problems
Comparison of mortality (or survival) in a study group
with that of an external reference population for which
the mortality is known from e.g. published life tables.
Comparison of the excess mortality (relative to an
external reference group) found in two or several
subgroups of a study.
First carefully consider:
Why introduce an external reference population? Is it
really necessary or just a "convenient" way to correct for
age or sex?
Also consider:
Which external reference population should be
used? The whole country? The county? The
individuals in the work force? etc.
Which endpoint? All causes of death or specific
causes that are expected to be particularly
relevant?


Here mainly a discussion of "how to do it" without
taking a random sample from the background
population.
The statistical methods which include the mortality of the
background population can roughly be divided in two
groups:
RELATIVE SURVIVAL
The statistical methods in this group include:
The expected survival curve
"Crude", "corrected" and "relative" survival
Excess mortality parameters usually describe
additive effects on the mortality rate.
RELATIVE MORTALITY
The statistical methods in this group include:
The expected number of deaths
The person-year method
Standardized mortality ratios (SMR)
Poisson regression with external rates
Excess mortality parameters usually describe
multiplicative effects on the mortality rate.
FIRST:
What kind of information is available about the mortality
of the "normal" population, and how can it be utilized?

NATIONAL LIFE TABLES AND MORTALITY STATISTICS
Sources:
Most countries regularly (typically once a year) publish a cross-sectional population life table. A standard layout and terminology are used.
In Denmark:
Publications from the National Bureau of Statistics (Danmarks Statistik), including "Statistisk årbog" and "Befolkningens bevægelser", contain life tables for the Danish population based on one-year or five-year periods for each sex and single-year age intervals from 0 to 99 years.
Life tables since 1981 can be found on the website
http://www.statistikbanken.dk/
which also gives access to other tables with mortality statistics: select the link to Population and elections (Befolkning og valg).
A typical life table is included on the last page.
Sundhedsstyrelsen (the National Board of Health) publishes information on causes of death (based on the death certificates) each year in "Dødsårsagerne i Danmark". Cancer incidence rates are available from Kræftens Bekæmpelse (the Danish Cancer Society).


THE COLUMNS OF THE LIFE TABLE


For each sex and single-year age intervals from 0 to 99 years:

Age-specific mortality proportion (Aldersklassens dødshyppighed):
$q_x = q(x)$: the probability of dying at the age of x years given alive on the x-th birthday.
The table gives $100000 \cdot q_x$.

Survival function (Overlevende):
$S_x = S(x)$: the probability for a new-born of surviving until the x-th birthday.
The table gives $100000 \cdot S_x$.

Expected remaining lifetime (Middellevetid):
$e_x = e(x)$: the expected remaining lifetime for an x-year-old from the x-th birthday.

Interrelationships:
\[ q_x = \frac{S_x - S_{x+1}}{S_x} = 1 - \frac{S_{x+1}}{S_x} = 1 - p_x \]
\[ S_{x+1} = S_x \, p_x \]
\[ S_x = p_0 \, p_1 \, p_2 \cdots p_{x-1} \]
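The interrelationships between the life table columns can be checked numerically. The following Python sketch is only an illustration: the survivors column is made up (not taken from a Danish life table) and the function name `life_table_columns` is hypothetical. It derives the conditional survival and mortality proportions and rebuilds the survival function as a product of one-year survival probabilities.

```python
def life_table_columns(survivors):
    """Derive p_x and q_x from a survivors column (100000 * S_x, x = 0, 1, ...)."""
    p = [survivors[x + 1] / survivors[x] for x in range(len(survivors) - 1)]
    q = [1.0 - px for px in p]
    return p, q


# Hypothetical survivors column for ages 0..3 (not real life-table values).
survivors = [100000, 99500, 99450, 99420]
p, q = life_table_columns(survivors)

# Rebuild S_x as the product p_0 * p_1 * ... * p_(x-1).
S = [1.0]
for px in p:
    S.append(S[-1] * px)

for x, sx in enumerate(S):
    print(f"age {x}: S_x = {sx:.5f}")
```

The rebuilt values agree with the survivors column divided by 100000, which is the content of the product formula.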


COMPUTING MORTALITY RATES FROM THE LIFE TABLE
If the national mortality rate is piecewise constant on 1-year intervals, i.e. $\lambda(x) = \lambda_x$ for x in 1-year intervals, the following relation holds:
\[ q_x = 1 - \frac{S_{x+1}}{S_x} = 1 - \exp(-\lambda_x) \]
The (total) mortality rate can therefore be obtained from the first or the second column of the life table as
\[ \lambda_x = -\ln\left(\frac{S_{x+1}}{S_x}\right) = -\ln(1 - q_x) \]
Notes
The age-specific mortality proportion $q_x$ is a probability and has no dimension, whereas the mortality rate $\lambda_x$ has dimension per time unit.
In epidemiology both are often called the mortality rate.
The mortality rate is always numerically larger than the corresponding age-specific mortality proportion, but apart from extremely old ages the discrepancy is very small.
The plots below show the ratio $\lambda_x / q_x$ plotted against the proportion $q_x$ and against age for each sex for the 2000-01 Danish life table.

The plots indicate that $\lambda_x \approx q_x$ is essentially correct for ages below 60 and that an improved approximation can be obtained as
\[ \lambda_x \approx q_x (1 + q_x / 2) . \]
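To get a feeling for the size of the discrepancy between the mortality proportion and the mortality rate, the conversion can be tried on a few values. The Python sketch below assumes a piecewise constant rate on 1-year intervals, as in the text; the proportions are hypothetical and the function names are chosen for illustration only.

```python
import math


def rate_from_proportion(q):
    """Exact rate under a piecewise constant rate: -ln(1 - q)."""
    return -math.log(1.0 - q)


def rate_approximation(q):
    """Second-order approximation q * (1 + q / 2)."""
    return q * (1.0 + q / 2.0)


# Hypothetical mortality proportions, from young adult to very old ages.
for q in (0.001, 0.01, 0.1, 0.3):
    exact = rate_from_proportion(q)
    print(f"q = {q:5.3f}  rate = {exact:.5f}  ratio = {exact / q:.4f}  "
          f"approx = {rate_approximation(q):.5f}")
```

For small proportions the ratio is very close to 1, and the second-order approximation stays closer to the exact rate than the proportion itself.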


CAUSE-SPECIFIC MORTALITY
Simple and reasonably accurate estimates of cause-specific mortality rates can be derived from the relation
\[ \lambda_{CS}(x) = \pi_{CS}(x) \, \lambda(x) \]
where $\lambda_{CS}(x)$ is the cause-specific mortality rate, $\pi_{CS}(x)$ the proportion of deaths from the specified cause at age x, and $\lambda(x)$ the total mortality rate.
Estimation of the total mortality rate $\lambda(x)$ has already been described.
For each sex and age in 5-year intervals the proportion $\pi_{CS}(x)$ can be estimated from tables of number of deaths by cause published each year in "Causes of death in Denmark" (Dødsårsagerne i Danmark), or on the website mentioned above, as
\[ \hat\pi_{CS}(x) = \frac{\text{no. of cause-specific deaths in age interval}}{\text{total no. of deaths in age interval}} , \]
where the "age interval" refers to the five-year age interval containing age x.
In each 5-year interval the total mortality rates (one for each of the 5 years) are then multiplied by this estimate to give the corresponding cause-specific mortality rates.
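The calculation of cause-specific rates amounts to one proportion per 5-year interval applied to five single-year total rates. A minimal Python sketch, with hypothetical death counts and rates and an illustrative function name:

```python
def cause_specific_rates(total_rates, cause_deaths, all_deaths):
    """Scale single-year total rates by the proportion of deaths from the cause."""
    proportion = cause_deaths / all_deaths
    return [proportion * rate for rate in total_rates]


# Hypothetical input for the age interval 60-64:
# total rates per 1000 years for ages 60, 61, 62, 63, 64 ...
total_rates = [9.0, 10.0, 11.0, 12.0, 13.0]
# ... and death counts in the interval (150 of 500 deaths from the cause).
cs_rates = cause_specific_rates(total_rates, cause_deaths=150, all_deaths=500)

for age, rate in zip(range(60, 65), cs_rates):
    print(f"age {age}: cause-specific rate = {rate:.2f} per 1000 years")
```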


THE EXPECTED SURVIVAL CURVE, RELATIVE SURVIVAL AND CORRECTED SURVIVAL
The expected survival curve:
Typical area of application: A clinical follow-up study.
Here: classical version based on grouped survival times
Follow-up   Alive at start   During the year of follow-up   Modified number
year        of year          dead          censored         at risk
1. year     Y0               d1            c1               n1
2. year     Y1               d2            c2               n2
3. year     Y2               d3            c3               n3
...         ...              ...           ...              ...

The modified number at risk is obtained as
\[ n_i = Y_{i-1} - c_i / 2 \]
For each follow-up year the mortality proportion is estimated by
\[ \tilde q_i = d_i / n_i \]
and the corresponding (conditional) survival proportion is
\[ \tilde p_i = 1 - \tilde q_i \]
The usual life table estimate of the survival function, i.e. of the probability of surviving until the end of period i, is
\[ \tilde S_i = \tilde p_1 \, \tilde p_2 \cdots \tilde p_i . \]
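The classical life table estimate from grouped survival times can be sketched in a few lines of Python. The counts are hypothetical and the function name `actuarial_survival` is chosen for illustration; the sketch assumes, as in the text, that censored individuals count as at risk for half of the year in which they are censored.

```python
def actuarial_survival(alive_at_start, deaths, censored):
    """Life table (actuarial) survival estimate at the end of each year."""
    survival = []
    s = 1.0
    y = alive_at_start
    for d, c in zip(deaths, censored):
        n = y - c / 2.0      # modified number at risk
        q = d / n            # mortality proportion for the year
        s *= 1.0 - q         # multiply by the conditional survival proportion
        survival.append(s)
        y = y - d - c        # alive and under observation at the next year's start
    return survival


# Hypothetical study: 100 patients, two years of follow-up.
surv = actuarial_survival(100, deaths=[10, 8], censored=[4, 6])
for year, s in enumerate(surv, start=1):
    print(f"year {year}: estimated survival = {s:.4f}")
```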

Computation of the corresponding "expected" survival curve involves the following steps:

First follow-up year:
Consider the $Y_0$ individuals alive at the start of the year. For $j = 1, 2, \ldots, Y_0$ let
$p^*_{1j}$ = the probability according to the published life table of surviving one year for an individual of the same sex and age as individual j.
The average expected survival probability for the first year:
\[ p_1^* = \frac{1}{Y_0} \sum_{j=1}^{Y_0} p^*_{1j} \]

Second follow-up year:
Consider the $Y_1$ individuals alive at the start of the year. For $j = 1, 2, \ldots, Y_1$ let
$p^*_{2j}$ = the probability according to the published life table of surviving one year for an individual of the same sex and age as individual j.
The average expected survival probability for the second year:
\[ p_2^* = \frac{1}{Y_1} \sum_{j=1}^{Y_1} p^*_{2j} \]

For each of the following years of follow-up, average expected survival probabilities $p_3^*$, $p_4^*$ etc. are computed similarly.
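The averaging step can be illustrated with a small Python sketch. The one-year survival probabilities in the dictionary are hypothetical stand-ins for a published life table, and the names `one_year_survival` and `average_expected_survival` are illustrative assumptions.

```python
def average_expected_survival(alive, one_year_survival):
    """Average life-table one-year survival over those alive at the start of a year."""
    probs = [one_year_survival[(sex, age)] for sex, age in alive]
    return sum(probs) / len(probs)


# Hypothetical life-table values p_x = 1 - q_x by sex and age.
one_year_survival = {
    ("f", 60): 0.992, ("f", 61): 0.991,
    ("m", 60): 0.986, ("m", 61): 0.984,
}

# First year: sex and current age of the individuals alive at entry.
alive_year1 = [("f", 60), ("m", 60), ("m", 60), ("f", 60)]
p1_star = average_expected_survival(alive_year1, one_year_survival)

# Second year: those still alive, now one year older.
alive_year2 = [("f", 61), ("m", 61), ("f", 61)]
p2_star = average_expected_survival(alive_year2, one_year_survival)

print(f"p1* = {p1_star:.4f}, p2* = {p2_star:.4f}")
```

Note that the second average is taken only over the individuals who are actually alive at the start of the second year, so the composition of the group changes from year to year.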

After i years of follow-up the expected survival curve takes the value
\[ S_i^* = p_1^* \, p_2^* \cdots p_i^* \]
The corrected survival curve is defined as the ratio of the estimated (crude) survival to the expected survival:
\[ S_i^C = \tilde S_i / S_i^* . \]
Note that the corrected survival curve is not necessarily decreasing!
For each follow-up interval the relative survival is defined as the ratio of the estimated conditional survival probability to the corresponding conditional expected survival probability:
\[ r_i = \tilde p_i / p_i^* . \]
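A short Python sketch shows how the corrected and the relative survival follow from the yearly observed and expected survival proportions, and why the corrected curve need not decrease. All numbers are hypothetical and the function name is illustrative.

```python
def corrected_and_relative(observed_p, expected_p):
    """Return yearly relative survival and the corrected survival curve."""
    relative = [po / pe for po, pe in zip(observed_p, expected_p)]
    corrected = []
    s_obs = s_exp = 1.0
    for po, pe in zip(observed_p, expected_p):
        s_obs *= po                    # crude (observed) survival
        s_exp *= pe                    # expected survival
        corrected.append(s_obs / s_exp)
    return relative, corrected


# Hypothetical: high mortality in year 1, better than expected in year 2.
relative, corrected = corrected_and_relative([0.80, 0.97], [0.95, 0.95])
print("relative survival: ", [round(r, 3) for r in relative])
print("corrected survival:", [round(s, 3) for s in corrected])
```

In this made-up example the observed survival in the second year exceeds the expected survival, so the corrected curve increases from the first to the second year.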

Software for calculation of expected survival curves
To my knowledge none of the commercial statistical software packages is able to compute expected survival, corrected survival and relative survival, but several public-domain products are available. See e.g.
http://www.cancerregistry.fi/surv2/
A locally developed PASCAL program is available from the Department of Biostatistics.


THE STATISTICAL MODEL BEHIND EXPECTED SURVIVAL AND CORRECTED SURVIVAL
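The mortality rate for patient j at time t is the sum of two terms: the background mortality for a person of the same age and sex and the excess mortality "caused" by the disease in question:
\[ \lambda_j(t) = \lambda^*(\text{age}, \text{sex}) + \gamma(t) , \]
where $\lambda^*(\text{age}, \text{sex})$ is the population mortality rate and $\gamma(t)$ the excess mortality rate.
Let
\[ A_i = A(t_i) = \int_0^{t_i} \gamma(t) \, dt \]
and let $\Lambda_i^*$ denote the integrated population mortality rate up to time $t_i$. Then
$S_i^*$ is an estimate of $\exp(-\Lambda_i^*)$,
$\tilde S_i$ is an estimate of $\exp(-\Lambda_i^* - A_i)$.
The corrected survival $S_i^C = \tilde S_i / S_i^*$ can therefore be viewed as an estimate of
\[ \exp(-A_i) = \exp\left( -\int_0^{t_i} \gamma(t) \, dt \right) , \]
the survival function corresponding to the excess mortality rate $\gamma(t)$.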


RELATIVE MORTALITY,
THE PERSON-YEAR METHOD
Notation:
$\lambda(t)$ = mortality rate in the study group
$\lambda^*(t)$ = mortality rate in the reference population
Survival function and integrated mortality rate in the reference population are denoted $S^*(t)$ and $\Lambda^*(t)$.
A simple statistical model:
Assume that the mortality rate in the study group is proportional to that of the reference group:
\[ \lambda(t) = \theta \, \lambda^*(t) \]
Two situations:
1. Age a is chosen as the underlying time t. With t = a the model becomes
\[ \lambda(a, \text{sex}) = \theta \, \lambda^*(a, \text{sex}) \]
2. Follow-up time is chosen as the underlying time scale. If e denotes the age at entry the model becomes
\[ \lambda(t, e, \text{sex}) = \theta \, \lambda^*(t + e, \text{sex}) \]
The dependence on sex is suppressed below.


The parameter $\theta$ is the mortality rate ratio or the relative mortality. In epidemiology the estimate of $\theta$ is usually called the standardized mortality ratio (SMR).
If $\theta > 1$ ($\theta < 1$) the mortality in the study group is higher (lower) than the mortality in the reference population.
Generalizations: The relative mortality may depend on e.g. sex, age-at-entry, follow-up time or risk factors, which are known for each individual.
Estimation of the relative mortality
Data:
A record for each individual with:
Age at entry in the study, $t_{ENTRY}$
Age at exit from the study, $t_{EXIT}$
Status at exit (dead or alive).
The maximum likelihood estimate of $\theta$ becomes
\[ \hat\theta = \frac{D}{\sum_{\text{persons}} \left( \Lambda^*(t_{EXIT}) - \Lambda^*(t_{ENTRY}) \right)} = \frac{D}{E} \]
The numerator D:
D = the observed number of deaths during follow-up.
The denominator E:
E = the expected number of deaths during follow-up. This is a convenient terminology, but not quite correct. E is rather the number of deaths to be expected given the observed follow-up times.

Confidence intervals for the relative mortality
An approximate 95% confidence interval for $\theta$ is obtained by using
\[ SE(\ln \hat\theta) = 1 / \sqrt{D} \]
A symmetric confidence interval for $\ln(\theta)$ is transformed back to an asymmetric confidence interval for $\theta$:
\[ \hat\theta / c \le \theta \le c \, \hat\theta , \]
where
\[ c = \exp\left( 1.96 / \sqrt{D} \right) \]
Null hypothesis:
The mortality in the study group is identical to the mortality in the reference population, i.e. $\theta = 1$.
The expected value of D - E is 0 under the null hypothesis,
\[ \mathrm{E}(D - E) = 0 , \]
and Var(D - E) can be estimated by E.
These results lead to the following test statistic
\[ X^2 = \frac{(D - E)^2}{E} \]
which under the null hypothesis is approximately a $\chi^2$ variate on 1 degree of freedom. Large values provide evidence against the null hypothesis.
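The estimate, the approximate confidence interval and the test statistic require only the observed and the "expected" number of deaths. The Python sketch below is an illustration with an assumed function name (`smr_summary`); it uses the standard error 1/sqrt(D) for the log of the estimate and is applied to the male patients of the manic-depressive example in this document (159 observed and 73.34 "expected" deaths).

```python
import math


def smr_summary(observed, expected, z=1.96):
    """Return SMR, approximate 95% limits, X^2 and its p-value (1 d.f.)."""
    smr = observed / expected
    c = math.exp(z / math.sqrt(observed))        # SE(ln SMR) = 1/sqrt(D)
    x2 = (observed - expected) ** 2 / expected   # Var(D - E) estimated by E
    p_value = math.erfc(math.sqrt(x2 / 2.0))     # upper tail of chi-square, 1 d.f.
    return smr, smr / c, smr * c, x2, p_value


smr, lower, upper, x2, p_value = smr_summary(159, 73.34)
print(f"SMR = {smr:.2f}, 95% CI {lower:.2f} - {upper:.2f}, X2 = {x2:.1f}")
```

The p-value uses the fact that a chi-square variate on 1 degree of freedom is the square of a standard normal variate, so no statistical library is needed.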

THE EXPECTED NUMBER OF DEATHS


Above, the expected number of deaths was computed as the sum of individual contributions of the form
\[ \Lambda^*(t_{EXIT}) - \Lambda^*(t_{ENTRY}) \]
Since the mortality rate is assumed constant on a number of age intervals (typically 1-year or 5-year intervals) we have
\[ \Lambda^*(t_{EXIT}) - \Lambda^*(t_{ENTRY}) = \int_{t_{ENTRY}}^{t_{EXIT}} \lambda^*(s) \, ds = \sum_x \lambda_x^* \, t_x , \]
where $t_x$ is the time the individual spends in age category x, i.e. the individual's contribution to the time-at-risk in age category x. Often it is simpler to calculate the expected number of deaths by first computing the total time-at-risk in each age category, multiplying by the age-specific mortality rate, and summing the contributions from each age category, i.e.
\[ E = \sum_{\text{persons}} \left( \Lambda^*(t_{EXIT}) - \Lambda^*(t_{ENTRY}) \right) = \sum_{\text{persons}} \sum_x \lambda_x^* \, t_x = \sum_x \lambda_x^* \, PYR_x \]
where $PYR_x$ is the person-years at risk in age category x.
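The person-year calculation can be sketched as follows in Python. The reference rates (per person-year), the entry and exit ages, and the function names are hypothetical; age categories are assumed to be of equal width and identified by their lower bound.

```python
def time_in_age_bands(entry_age, exit_age, band_starts, band_width):
    """Years an individual spends in each age band [start, start + width)."""
    out = {}
    for start in band_starts:
        overlap = min(exit_age, start + band_width) - max(entry_age, start)
        if overlap > 0:
            out[start] = overlap
    return out


def expected_deaths(persons, rates, band_width):
    """Return E and the person-years at risk in each age band."""
    pyr = {start: 0.0 for start in rates}
    for entry_age, exit_age in persons:
        bands = time_in_age_bands(entry_age, exit_age, rates.keys(), band_width)
        for start, years in bands.items():
            pyr[start] += years
    e = sum(rates[start] * pyr[start] for start in rates)
    return e, pyr


# Hypothetical reference rates per person-year for ages 30-39, 40-49, 50-59.
rates = {30: 0.001, 40: 0.003, 50: 0.007}
# Hypothetical (age at entry, age at exit) for two individuals.
persons = [(35.0, 42.5), (48.0, 55.0)]

e, pyr = expected_deaths(persons, rates, band_width=10)
print("person-years by age band:", pyr)
print(f"expected number of deaths E = {e:.4f}")
```

Summing the person-years first and multiplying by the rates afterwards gives the same E as adding up the individual contributions, which is the point of the last equality in the formula for E.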

The following figure illustrates the two different ways of calculating E.
[Figure: age-specific mortality rates $\lambda^*_{29}, \ldots, \lambda^*_{34}$ plotted against age, with the age axis running from 30 to 34.]

The "expected" number of deaths E depends on the survival times and is therefore a random variable (i.e. subject to random variation) and not really an expected number (i.e. a constant).
If the mortality in the study group is identical to the mortality in the reference population, i.e. if $\theta = 1$, one may show that
\[ \mathrm{E}(E) = \mathrm{E}(D) = \text{expected number of deaths} , \]
i.e. on average the "expected" number of deaths is equal to the expected number of deaths!

Example:
Mortality for patients diagnosed with manic-depressive psychosis (Weeke, Juel & Væth, J. Affective Disorders 1987; 13: 287-292).
Data:
Patients admitted to a psychiatric hospital for the first time in the period April 1, 1970 - March 31, 1972 and followed until March 31, 1977.
Number of patients N = 2168.
17 patients were lost to follow-up due to emigration and were censored on the date of emigration.
Results:
               Observed   "Expected"   $\hat\theta$     $X^2$
All patients      309       176.55        1.75       99.4
Males             159        73.34        2.17      100.0
Females           150       103.21        1.45       21.2

95% confidence intervals for the relative mortality:
                   All patients    Males          Females
Method above       1.57 - 1.97     1.86 - 2.53    1.24 - 1.71
"Exact" Poisson    1.56 - 1.96     1.84 - 2.53    1.24 - 1.71


The results above are often derived from a Poisson model for the observed number of deaths, assuming that the "expected" number E is fixed. Confidence intervals based on this Poisson model are referred to as "exact" confidence intervals.
Here we can compare the excess mortality among men with the excess mortality among women.
Null hypothesis: The relative mortality does not depend on the sex of the patient: $\theta_m = \theta_f$.
Test statistic:
Under the null hypothesis the following test statistic is approximately distributed as a $\chi^2$ variate on 1 degree of freedom:
\[ W = \frac{\left( \ln(\hat\theta_m) - \ln(\hat\theta_f) \right)^2}{D_m^{-1} + D_f^{-1}} \]
We find
\[ W = \frac{\left( \ln(2.17) - \ln(1.45) \right)^2}{159^{-1} + 150^{-1}} = 12.35 \]
(evaluated with the unrounded estimates), giving p = 0.00044.
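The comparison of two relative mortalities can be reproduced from the observed and "expected" numbers in the results table. The following Python sketch is an illustration with an assumed function name; it computes the test statistic from the unrounded estimates and obtains the p-value from the chi-square distribution on 1 degree of freedom via the complementary error function.

```python
import math


def compare_relative_mortality(d1, e1, d2, e2):
    """Wald-type test of equal relative mortality in two groups."""
    diff = math.log(d1 / e1) - math.log(d2 / e2)
    w = diff ** 2 / (1.0 / d1 + 1.0 / d2)      # Var(ln SMR) estimated by 1/D
    p_value = math.erfc(math.sqrt(w / 2.0))    # upper tail of chi-square, 1 d.f.
    return w, p_value


# Males: D = 159, E = 73.34; females: D = 150, E = 103.21.
w, p_value = compare_relative_mortality(159, 73.34, 150, 103.21)
print(f"W = {w:.2f}, p = {p_value:.5f}")
```

Inserting the rounded estimates 2.17 and 1.45 instead gives a slightly larger value of W, so the rounding of intermediate results matters at this level of precision.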


USING STATA TO COMPUTE STANDARDIZED MORTALITY RATIOS
The STATA command stptime can compute the SMR
relative to a set of reference rates read from a separate
file.
Example: Diabetes mortality data
The STATA file diabetes.dta contains data on mortality
for patients with diabetes from Green & Hougaard,
Diabetologia 1984; 26: 190-194, see Exercise 8 for a
variable description.
Here we compare the mortality in this group with the
mortality in the general population represented by a life
table based on the calendar years 1976-1980.
For simplicity 10-year age intervals are used in the rate file. The file kvrater7680-10.dta contains the following female mortality rates (per 1000 years):
agecat      rate
     0      .461
    10      .125
    20      .205
    30      .417
    40     1.204
    50     2.757
    60     6.128
    70    16.824
    80    50.955
Note that agecat gives the lower bound of the age interval.

The rates are computed from the life table as
\[ \text{rate}_x = -1000 \cdot \ln\left( S_f^*(x + 10) / S_f^*(x) \right) / 10 \]
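The average rate over a 10-year age interval follows from two values of the life table survival function. A minimal Python sketch, with hypothetical survival function values and an illustrative function name:

```python
import math


def average_rate_per_1000(s_start, s_end, width_years=10):
    """Average mortality rate per 1000 years over an age interval."""
    return -1000.0 * math.log(s_end / s_start) / width_years


# Hypothetical survival function values S(30) and S(40).
rate_30 = average_rate_per_1000(0.9800, 0.9750)
print(f"rate for agecat 30: {rate_30:.3f} per 1000 years")
```

Applying a constant rate of this size for 10 years reproduces the conditional probability S(40)/S(30) of surviving the interval.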


A similar file, marater7680-10.dta, contains the mortality rates for males. Both files must be sorted on agecat before saving them.
Note: stptime only allows one set of rates, so a combined analysis is not possible unless the same set of rates is applied to both men and women.
The following commands produce the expected number of deaths and the SMR for each sex separately.
* defining the survival time data with
* age as time scale
gen exitage=entryage+futime/365.25
stset exitage , ///
    failure(status==1) entry(time entryage) ///
    id(id) noshow

* calculations for females
stptime if(sex==0) , ///
    smr(agecat rate) ///
    using(E:\kurser\survival\kvrater7680-10.dta) ///
    at(30(10)80) trim per(1000)
* calculations for males
stptime if(sex==1) , ///
    smr(agecat rate) ///
    using(E:\kurser\survival\marater7680-10.dta) ///
    at(30(10)80) trim per(1000)

The option trim specifies that follow-up time at ages below 30 or above 80, i.e. outside the range given in at(), is to be excluded from the computations.
Output
. stptime if(sex==0) , smr(agecat rate) using(E:\kurser\survival\aarhus2003\data\kv
> rater7680-10.dta) at(30(10)80) trim per(1000)
         |               observed   expected
  Cohort | person-time   failures   failures      SMR   [95% Conf. Interval]
---------+------------------------------------------------------------------
(30 - 40]|   646.53937         11     .26951   40.815    22.6032    73.6995
(40 - 50]|   676.54623          8    .814799   9.8184    4.91015    19.6329
(50 - 60]|   787.27589         22    2.17029   10.137    6.67464    15.3951
(60 - 70]|      959.54         55    5.87991   9.3539    7.18152    12.1834
(70 - 80]|   723.50156         87    12.1722   7.1475    5.79286    8.81881
---------+------------------------------------------------------------------
   total |    3793.403        183    21.3067   8.5889    7.43041    9.92792

. stptime if(sex==1) , smr(agecat rate) using(E:\kurser\survival\aarhus2003\data\ma
> rater7680-10.dta) at(30(10)80) trim per(1000)
         |               observed   expected
  Cohort | person-time   failures   failures      SMR   [95% Conf. Interval]
---------+------------------------------------------------------------------
(30 - 40]|   954.56259         17    .652274   26.063    16.2021    41.9243
(40 - 50]|   957.03772         22     1.6278   13.515    8.89909    20.5258
(50 - 60]|   970.16295         37    4.54827    8.135    5.89412    11.2277
(60 - 70]|    800.0821         63    9.53987   6.6039    5.15889    8.45355
(70 - 80]|   462.71036         82     13.747   5.9649    4.80402    7.40634
---------+------------------------------------------------------------------
   total |   4144.5557        221    30.1153   7.3385    6.43202    8.37266

For both women and men the mortality is considerably higher than the mortality in the general population.
The SMR is slightly larger for women, but a clear trend with age is seen in both sexes, so the overall SMR is less relevant.


The command strate also has options for simple comparisons with external rates.


POISSON REGRESSION WITH EXTERNAL REFERENCE RATES USING STATA
A computation of a standardized mortality ratio for a
group of individuals or patients is a rather crude
comparison with the mortality in a reference population.
Often further insight can be gained by studying how the
relative mortality depends on a number of covariates.
Such models, SMR regression models, are conveniently
expressed as a Poisson regression model for the
aggregated data. The relevant parameters are estimated
by choosing the expected number of deaths as time-at-risk. The following example illustrates the approach
using STATA.
Example: Diabetes mortality data
The data in diabetes.dta from Green & Hougaard (1984) are first split on 5-year age categories and collapsed into a multi-way event-time table with sex, agecat, and dxacat (age-at-diagnosis) as classifying factors (output omitted)
egen dxacat=cut(dxage) , at(0,20,40,60,120)
gen exitage=entryage+futime/365.25
stset exitage , ///
    failure(status==1) entry(time entryage) id(id) noshow
stsplit agecat , at(5(5)95)
gen died=_d
gen risktime=_t-_t0
collapse (sum) died risktime , by(sex agecat dxacat)
save e:\kurser\survival\diabetes-coll-agecat2.dta

The national mortality rates (per 1000 years) on 20 five-year intervals for each sex are placed in the file mort7680.dta. The file contains data in variables sexage and mrate, where sexage takes the values from 1 to 40:
sex       agecat    sexage
female    0-4       1
female    5-9       2
female    10-14     3
etc.
male      90-94     39
male      95-99     40

Apart from now using 5-year intervals, mrate is computed from the life table as before. The file mort7680.dta must be sorted on sexage before it is saved.
The reference rates are now appended to the multi-way event-time table using the commands
use e:\kurser\survival\diabetes-coll-agecat2.dta
egen sexage=group(sex agecat)
sort sexage
merge sexage ///
    using e:\kurser\survival\mort7680.dta
save e:\kurser\survival\diabetes-coll-agecat3.dta

We have now added a column, mrate, to the file. The new column contains the reference rate (per 1000 years) in the appropriate sex and age category.

In a Poisson regression model the number of events in a cell of the multi-way table is treated as a Poisson variate with mean rate × risktime (see page 19).
The present table has sex, agecat and dxacat as classifying factors and a total of 93 non-empty entries with event, risktime and mrate in each cell.
A Poisson regression model with external reference rates specifies a multiplicative structure for the mortality rate $\lambda_{ijk}$ in the cell given by sex=i, agecat=j, dxacat=k (i = 0,1, j = 0,5,..,95, k = 0,20,40,60):
\[ \lambda_{ijk} = \theta_{ijk} \, \lambda_{ij}^* , \]
where $\theta_{ijk}$ is the relative mortality in the i,j,k-cell and $\lambda_{ij}^*$ is the sex- and age-specific reference rate.
The model therefore specifies that the number of events in the i,j,k-cell has mean
\[ \text{rate} \times \text{risktime}_{ijk} = \theta_{ijk} \, \lambda_{ij}^* \, \text{risktime}_{ijk} = \theta_{ijk} \, E_{ijk} , \]
where $E_{ijk}$ is the expected number of deaths in the cell according to the reference rates.


If we use $E_{ijk}$ instead of risktime in the Poisson regression we have a regression model for the relative mortality and may fit models like e.g.
\[ \theta_{ijk} = \theta_{000} \, a_i \, b_j \, c_k \]
A couple of examples illustrate the possibilities.
The irr option is not used since the constant term is not displayed if this option is used. Rather inconvenient.
gen expected = risktime*mrate/1000
xi: poisson died , exposure(expected) nolog

Output
Poisson regression                        Number of obs   =         93
                                          LR chi2(0)      =       0.00
                                          Prob > chi2     =          .
Log likelihood = -221.45684               Pseudo R2       =     0.0000
---------------------------------------------------------------------
    died |     Coef.   Std. Err.      z    P>|z|  [95% Conf. Interval]
---------+-----------------------------------------------------------
   _cons |  1.911351   .0451294   42.35   0.000    1.82290    1.99980
expected | (exposure)
---------------------------------------------------------------------

The coefficient is equal to ln(SMR), so
SMR = exp(1.911351) = 6.762
A similar calculation gives the 95% confidence interval for the SMR:
Lower limit = exp(1.82290) = 6.190
Upper limit = exp(1.99980) = 7.388
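The back-transformation from the log scale can be checked with a few lines of Python, using the coefficient, standard error and confidence limits from the Stata output. The last line relies on a property of the constant-only Poisson model, stated here as background rather than taken from the output: the standard error of the constant equals one over the square root of the total number of deaths.

```python
import math

# Values copied from the Stata output for the constant-only model.
coef, std_err = 1.911351, 0.0451294
ci_lower, ci_upper = 1.82290, 1.99980

smr = math.exp(coef)
smr_lower = math.exp(ci_lower)
smr_upper = math.exp(ci_upper)
print(f"SMR = {smr:.3f}, 95% CI {smr_lower:.3f} - {smr_upper:.3f}")

# In a constant-only Poisson model SE(constant) = 1/sqrt(total deaths),
# so the output implies the total number of deaths in the table.
implied_deaths = 1.0 / std_err ** 2
print(f"implied number of deaths = {implied_deaths:.1f}")
```

The standard error therefore agrees with the simple formula used for the confidence interval of the relative mortality in the person-year method.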


Next see if the relative mortality depends on sex
xi: poisson died i.sex , exposure(expected) nolog

Output
Poisson regression                        Number of obs   =         93
                                          LR chi2(1)      =       0.64
                                          Prob > chi2     =     0.4236
Log likelihood = -221.13671               Pseudo R2       =     0.0014
---------------------------------------------------------------------
    died |     Coef.   Std. Err.      z    P>|z|  [95% Conf. Interval]
---------+-----------------------------------------------------------
 _Isex_1 |   .072242   .0903129    0.80   0.424   -.104768    .249252
   _cons |  1.874631    .064957   28.86   0.000   1.747318   2.001945
expected | (exposure)
---------------------------------------------------------------------

The constant is ln(SMR) for females and the coefficient for sex is ln(SMRm / SMRf). The SMRs for males and females are not significantly different (the ln(ratio) is close to 0).
We find
SMRf = exp(1.874631) = 6.518
SMRm / SMRf = exp(0.072242) = 1.075
and therefore
SMRm = SMRf × (SMRm / SMRf) = 6.518 × 1.075 = 7.007
Confidence limits can be computed similarly.
Finally see if the relative mortality depends on age-at-diagnosis
xi: poisson died i.sex i.dxacat , ///
    exposure(expected) nolog
testparm *xa*

Output
Poisson regression                        Number of obs   =         93
                                          LR chi2(4)      =      61.16
                                          Prob > chi2     =     0.0000
Log likelihood = -190.87923               Pseudo R2       =     0.1381
-----------------------------------------------------------------------
       died |     Coef.   Std. Err.      z    P>|z|  [95% Conf. Interval]
------------+----------------------------------------------------------
    _Isex_1 | -.0495702   .0923718   -0.54   0.592   -.230616    .131475
_Idxacat_20 | -.7117244   .1615624   -4.41   0.000   -1.02838    -.39507
_Idxacat_40 | -.7557131   .1461376   -5.17   0.000   -1.04214   -.469289
_Idxacat_60 | -1.277969   .1599211   -7.99   0.000   -1.59141   -.964530
      _cons |  2.776317   .1412925   19.65   0.000    2.49939    3.05325
   expected | (exposure)
-----------------------------------------------------------------------
. testparm *dx*
 ( 1) [died]_Idxacat_20 = 0
 ( 2) [died]_Idxacat_40 = 0
 ( 3) [died]_Idxacat_60 = 0
      chi2( 3) =  64.81
    Prob > chi2 = 0.0000

The relative mortality clearly depends on age at diagnosis, but this may well be a time-since-diagnosis effect that is showing up here. Further analysis is needed to uncover this. The model predicts the following SMRs
dxage     female    male
0-20      16.06     15.28
20-40      7.88      7.50
40-60      7.54      7.18
60-        4.47      4.26

Example of obtaining an SMR from the coefficients:
SMR(male, dxage 20-40) = exp(2.7763 - 0.0496 - 0.7117) = exp(2.015) = 7.50
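The full table of predicted SMRs can be rebuilt from the coefficients in the Stata output by adding the relevant terms on the log scale and exponentiating. The Python sketch below copies the coefficients from the output; the function name `predicted_smr` is illustrative, and sex is coded as in the data (0 = female, 1 = male).

```python
import math

# Coefficients copied from the Stata output (log scale).
coef = {
    "_cons": 2.776317,
    "_Isex_1": -0.0495702,
    "_Idxacat_20": -0.7117244,
    "_Idxacat_40": -0.7557131,
    "_Idxacat_60": -1.277969,
}


def predicted_smr(male, dx_group):
    """Predicted SMR; dx_group is the lower bound 0, 20, 40 or 60."""
    linear_predictor = coef["_cons"]
    if male:
        linear_predictor += coef["_Isex_1"]
    if dx_group != 0:  # 0-20 is the reference category
        linear_predictor += coef[f"_Idxacat_{dx_group}"]
    return math.exp(linear_predictor)


for dx_group in (0, 20, 40, 60):
    female = predicted_smr(False, dx_group)
    male = predicted_smr(True, dx_group)
    print(f"dxage from {dx_group:2d}: female {female:5.2f}  male {male:5.2f}")
```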


Analysis of censored survival data: Cox regression or Poisson regression?
Time-to-event data can be analyzed with both Cox regression and Poisson regression models.
To use Poisson regression the individual data records must first be aggregated in an event-time table using special software.
This table will often be considerably smaller than the original data set and computations will therefore be faster.
Poisson regression is mainly preferable in large studies with relatively few covariates. Time-dependent covariates can be defined and used when setting up the event-time table. Several time scales are easily accommodated.
Cox regression is mainly preferable in studies with many covariates and if the analyses include more exploratory aspects of working with time-dependent covariate information, e.g. selecting the best way to define a time-dependent covariate. Once a proper representation is found it may be advantageous to continue with Poisson regression.
