
College of Health and Medical Science

Department of Epidemiology and Biostatistics

SURVIVAL DATA ANALYSIS

Biruk. Sh (BSc., MPH)

9/8/2022 BY BIRUK.S (BSC., MPH) 1


Course logistics

• Personal computer with Stata installed

• Flash disk for data and software sharing

• Broadband Internet

• Open discussion with facilitators and hands-on practice



Objectives

In this course, our main objectives are to:


I. Describe survival data
II. Compare survival of several groups
III. Explain survival with covariates



Type of Outcome Measure Determines Statistical Methods

Variable                        Methods
Continuous                      t-test, ANOVA, linear regression, etc.
Dichotomous/categorical         Chi-squared test, Fisher’s test, logistic regression, etc.
Count                           Poisson regression, log-linear, etc.
Time-to-event with censoring    Survival analyses



Survival Data Analysis
• Survival analysis is a set of statistical methods for analyzing time-to-event data.
• Outcome data = (t, c) (two elements instead of one), where:
  t = observed event time if c = 1
  t = censored time if c = 0
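As a small illustration of the (t, c) outcome, the sketch below builds the observed pair for one subject from an event time and a censoring time. It is written in Python only to show the arithmetic; the function name `observe` and its arguments are hypothetical and not part of the course material.

```python
def observe(event_time, censor_time):
    """Return the observed outcome (t, c) for one subject.

    event_time:  time at which the event would occur
    censor_time: time at which follow-up stops (end of study, loss, withdrawal)
    """
    if event_time <= censor_time:
        return event_time, 1   # event observed: t is the event time, c = 1
    return censor_time, 0      # censored: t is the censoring time, c = 0


print(observe(5, 10))   # event happens before follow-up ends
print(observe(12, 10))  # follow-up ends first, so the subject is censored
```

In real data only the pair (t, c) is recorded; for a censored subject the true event time is not known.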

Different names for survival analysis:

✓ Reliability analysis in engineering

✓ Duration analysis in economics

✓ Event history analysis in social science



What is time to event?

• It is the time taken for an event to occur.
• Other names: failure time or survival time.
Examples:
✓ Time to death
✓ Time to onset of a disease
✓ Time to failure in mechanical systems
✓ Response time to stimulus
✓ Time to graduation
✓ Time to divorce



Time to Event (cont’d)
✓ Survival time or failure time random variables (T) are always non-negative. That is, 𝑇 ≥ 0.
✓ In order to define T (the failure time random variable), we need to know:
1. Time origin (e.g. time of birth, initiation of treatment, registration for organ transplant)
2. Time scale (e.g. real time such as days, months, years)
3. Definition of the event (e.g. death, onset of the disease, cure of a condition, discharge from a program, breakdown of a system)



Time to Event (cont’d)
• Time to event or failure time may not be completely observed
for some patients
• That is, for some patients we may know that their survival time was
at least equal to some time t.
• Whereas, for other subjects, we will know their exact time of
event.
• Incompletely observed survival times are censored



Types of censoring
1. Right censoring is present when we know only that a subject’s event time is later than some observed time, but we don’t know the exact event time.
There are generally three reasons why right censoring might occur:
✓ A subject does not experience the event before the study ends
✓ A person is lost to follow-up during the study period
✓ A person withdraws from the study



Types of censoring (cont’d)
Right censoring: the occurrence of the event, if possible at all, would be in the future (to the right of the stopping time).
[Figure: timeline showing the censoring point followed later by the event]



Types of censoring (cont’d)
2. Left censoring: subjects in a study have already experienced the event of interest at the start of the study, but the exact time when they first experienced the event is unknown.
Example: a study of age at which children learn a given task. Some
already knew (left-censored), some learned during study (exact),
some had not yet learned by end of study (right-censored).



Types of censoring (cont’d)
Left censoring: the event of interest has already occurred for the individual before that person is observed in the study.
[Figure: timeline showing the event, then the start of the study, then the end of the study]





Type of Censoring (cont’d)
[Figure: follow-up timelines for subjects A to D]
A  ...........
B  ._______________________________.
C  ._______________________________...............
D  ._____________________________.......

Time axis: recruitment interval, then additional follow-up interval



Type of Censoring (cont’d)

▪ A is left censored.
▪ B is fully observed.
▪ C is right censored because the subject is lost to follow-up. This type of right censoring does not cause any problems if the censoring is random.
▪ D is right censored because the observation period ends before the event has occurred. This type of censoring does not cause any problems for the analysis.



Type of Censoring (cont’d)

• If there is no censoring, standard regression procedures could be used.
• However, these may be inadequate because:
✓ Time to event is restricted to be positive and has a skewed distribution.
✓ The probability of surviving past a certain point in time may be of more interest than the expected time of event.
✓ The hazard function, used for regression in survival analysis, can lend more insight into the failure mechanism than linear regression.



Limitation of regression and logistic regression when data are censored

Analysis | Outcome | Assumption | Scale | Notes
Regression | T (time) | No censoring | Continuous | Biased if ignoring censoring; censoring provides more info than missing
Logistic regression | C (censoring) | Same follow-up length; ignore time | Binary | Biased if follow-up is not the same; less efficient because only C used and t ignored
Survival | (T, C) | Combine time and censoring | Combined | Overcome the limitations



Summary: What is Survival Analysis?
• A method for modeling time to event (or failure)
• It is commonly used in medicine, public health, biology, finance, engineering, social science, etc.
• It accounts for censoring
• It allows comparison between 2 or more groups
• It helps to assess the effect of risk factors (covariates) on time to event (failure time or survival time)



Survivor Function
• The survival time T may be regarded as a random variable with cumulative distribution function F(t) = P(T ≤ t) and probability density function f(t).
• An obvious quantity of interest is the probability of surviving beyond time t, the survivor function or survival curve S(t), which is given by

• S(t) = P(T > t) = 1 − F(t).



Properties of Survivor Function

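• S(0) = 1: no one has experienced the event at time zero (no death at birth)
• S(t1) ≥ S(t2) for all t1 < t2: a non-negative, non-increasing function of time
• S(∞) = a
o typically a = 0
o a > 0 means that a proportion a of the population never experiences the event (for example, an incurable proportion when the event is cure)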



Hazard Function
• A further function of interest for survival data is the hazard function.

• It represents the instantaneous death rate: roughly, the probability per unit of time that an individual experiences the event of interest in a short interval after time t, given that the event has not yet occurred by time t.

• The hazard function is given by h(t) = f(t) / S(t)

• That is, the instantaneous density of death at time t divided by the probability of surviving up to time t.
• The hazard function can be thought of as an instantaneous incidence rate. It is a rate, not a probability, so it can exceed 1.
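The relation h(t) = f(t) / S(t) can be written out as a limit, which may make the "instantaneous rate" reading more concrete. The cumulative hazard symbol H(t) below is notation introduced here for illustration; the slides mention the cumulative hazard but do not name it.

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0}
        \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)}, \\
H(t) &= \int_0^t h(u)\,du,
\qquad
S(t) = \exp\{-H(t)\}.
\end{align*}
```

The last identity shows that the survivor function is determined by the hazard, so modeling h(t) is enough to recover S(t).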



Estimating Survival Function
• Nonparametric methods
• Parametric methods
Nonparametric methods
There are three commonly used nonparametric methods for estimating the survivorship function S(t) = P(T > t):
1. Life-table (Actuarial Estimator)
2. Kaplan-Meier
3. Nelson-Aalen or Fleming-Harrington (via estimating the cumulative hazard)
The first two will be our focus.



1. Life-table (Actuarial Estimator)
✓ A life table is the set of probabilities used to estimate the probability of occurrence of an event (or of survival) in each year and the cumulative probability of survival to each year.

✓ To carry out the calculation, we first set out for each year (x):
➢ Number alive at start = nx
➢ Number withdrawn during year = wx
➢ Number dying = dx
➢ Number at risk = rx
➢ Prob. of death = qx
➢ Prob. of surviving = px
➢ Cumulative prob. of surviving = Px

Formulas for the life table:
1. rx = nx − ½wx
2. qx = dx / rx
3. px = 1 − qx
4. Px = px × Px−1 (with P0 = 1)
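The four life-table formulas can be checked with a short calculation. The sketch below is in Python purely to show the arithmetic (the course itself uses Stata's ltable); the function name `life_table` and the list `PARATHYROID` are illustrative names, and the input triples (nx, wx, dx) are copied from the parathyroid cancer example used in these slides.

```python
def life_table(rows):
    """Actuarial life table.

    rows: list of (n_x, w_x, d_x) tuples, one per year, in order.
    Returns one dict per year with r_x, q_x, p_x and cumulative P_x.
    """
    table = []
    cumulative = 1.0  # P_0 = 1 before the first year
    for year, (n, w, d) in enumerate(rows, start=1):
        r = n - 0.5 * w   # number at risk: r_x = n_x - w_x / 2
        q = d / r         # probability of death in year x
        p = 1.0 - q       # probability of surviving year x
        cumulative *= p   # P_x = p_x * P_(x-1)
        table.append({"x": year, "r": r, "q": q, "p": p, "P": cumulative})
    return table


# (n_x, w_x, d_x) for years 1 to 18 of the parathyroid cancer example
PARATHYROID = [
    (20, 2, 1), (17, 2, 0), (15, 0, 1), (14, 0, 0), (14, 1, 0), (13, 1, 0),
    (12, 1, 2), (9, 0, 1), (8, 1, 0), (7, 0, 2), (5, 2, 0), (3, 0, 1),
    (2, 0, 0), (2, 0, 0), (2, 0, 1), (1, 0, 0), (1, 0, 0), (1, 1, 0),
]

for row in life_table(PARATHYROID):
    print(f"{row['x']:>2} {row['r']:>5.1f} {row['q']:.4f} {row['p']:.4f} {row['P']:.4f}")
```

Withdrawals are counted as being at risk for half of the year, which is why rx subtracts only half of wx.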



1. Life-table (cont’d)
Life table calculation for parathyroid cancer survival: the survival times are given in years after diagnosis

Columns: x = year; nx = number at start; wx = withdrawn during year; rx = at risk; dx = deaths; qx = prob. of death; px = prob. of surviving year x; Px = cumulative prob. of surviving x years. The columns rx, qx, px and Px are blank here and are to be computed.

 x   nx   wx   rx     dx   qx       px       Px
 1   20   2           1
 2   17   2           0
 3   15   0           1
 4   14   0           0
 5   14   1           0
 6   13   1           0
 7   12   1           2
 8    9   0           1
 9    8   1           0
10    7   0           2
11    5   2           0
12    3   0           1
13    2   0           0
14    2   0           0
15    2   0           1
16    1   0           0
17    1   0           0
18    1   1           0



1. Life-table (cont’d)
Life table calculation for parathyroid cancer survival: the survival times are given in years after diagnosis

Columns: x = year; nx = number at start; wx = withdrawn during year; rx = at risk; dx = deaths; qx = prob. of death; px = prob. of surviving year x; Px = cumulative prob. of surviving x years.

 x   nx   wx   rx     dx   qx       px       Px
 1   20   2    19     1    0.0526   0.9474   0.9474
 2   17   2    16     0    0        1        0.9474
 3   15   0    15     1    0.0667   0.9333   0.8842
 4   14   0    14     0    0        1        0.8842
 5   14   1    13.5   0    0        1        0.8842
 6   13   1    12.5   0    0        1        0.8842
 7   12   1    11.5   2    0.1739   0.8261   0.7304
 8    9   0    9      1    0.1111   0.8889   0.6493
 9    8   1    7.5    0    0        1        0.6493
10    7   0    7      2    0.2857   0.7143   0.4638
11    5   2    4      0    0        1        0.4638
12    3   0    3      1    0.3333   0.6667   0.3092
13    2   0    2      0    0        1        0.3092
14    2   0    2      0    0        1        0.3092
15    2   0    2      1    0.5000   0.5000   0.1546
16    1   0    1      0    0        1        0.1546
17    1   0    1      0    0        1        0.1546
18    1   1    0.5    0    0        1        0.1546



1. Life-table (cont’d)

• The median survival time (call it τ) is the time by which 50% of the subjects have experienced the event.

• That means the median survival time is the time where S(τ) = 0.5.

• In practice, however, the estimated survival function is a step function and usually does not equal 0.5 exactly at any of the failure times.

• In this case, the estimated median survival is the smallest time τ such that S(τ) ≤ 0.5.
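The median rule can be applied mechanically: scan the cumulative survival probabilities in time order and stop at the first one that has dropped to 0.5 or below. The Python sketch below is only an illustration of that scan; the names `median_survival`, `YEARS` and `CUM_SURV` are hypothetical, and the probabilities are the rounded Px values of the parathyroid cancer life table in these slides.

```python
def median_survival(times, surv):
    """Return the first time at which the survival estimate is <= 0.5.

    times and surv must be in increasing time order.
    Returns None if survival never drops to 0.5 (median not reached).
    """
    for t, s in zip(times, surv):
        if s <= 0.5:
            return t
    return None


YEARS = list(range(1, 19))
CUM_SURV = [
    0.9474, 0.9474, 0.8842, 0.8842, 0.8842, 0.8842, 0.7304, 0.6493, 0.6493,
    0.4638, 0.4638, 0.3092, 0.3092, 0.3092, 0.1546, 0.1546, 0.1546, 0.1546,
]

print(median_survival(YEARS, CUM_SURV))
```

When follow-up ends before half of the subjects have had the event, the function returns None, which corresponds to a median that cannot be estimated from the data.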



1. Life-table (cont’d)
Life table calculation for parathyroid cancer survival: the survival times are given in years after diagnosis

Columns: x = year; nx = number at start; wx = withdrawn during year; rx = at risk; dx = deaths; qx = prob. of death; px = prob. of surviving year x; Px = cumulative prob. of surviving x years.

 x   nx   wx   rx     dx   qx       px       Px
 1   20   2    19     1    0.0526   0.9474   0.9474
 2   17   2    16     0    0        1        0.9474
 3   15   0    15     1    0.0667   0.9333   0.8842
 4   14   0    14     0    0        1        0.8842
 5   14   1    13.5   0    0        1        0.8842
 6   13   1    12.5   0    0        1        0.8842
 7   12   1    11.5   2    0.1739   0.8261   0.7304
 8    9   0    9      1    0.1111   0.8889   0.6493
 9    8   1    7.5    0    0        1        0.6493
10    7   0    7      2    0.2857   0.7143   0.4638   <- median survival time
11    5   2    4      0    0        1        0.4638
12    3   0    3      1    0.3333   0.6667   0.3092
13    2   0    2      0    0        1        0.3092
14    2   0    2      0    0        1        0.3092
15    2   0    2      1    0.5000   0.5000   0.1546
16    1   0    1      0    0        1        0.1546
17    1   0    1      0    0        1        0.1546
18    1   1    0.5    0    0        1        0.1546

Px first falls below 0.5 in year 10, so the estimated median survival time is 10 years.



Practical section for
Life-table
using Stata software



Data set- infant data
• Infants in Southwest Ethiopia were studied for one year.
• Measurements were taken approximately every two months.
• Nearly 8000 infants at baseline.
• Data on infant, maternal and household characteristics were collected.

• Death of infants within their first year is an important problem with these
data.
• We will therefore analyze the time from birth to death (in days).
• For infants still alive when these data were collected, time is the time from
birth to the time of data collection.
• The variable event is an indicator for whether time refers to death (1) or end of
study (0).
Data set- infant data (cont’d)
• Possible explanatory variables for time-to-death could be place of residence and
sex of infants, among others.
• These data can be described as survival data.
• Duration or survival data can generally not be analyzed by conventional
methods such as linear regression.
• The main reason for this is that some durations are usually right-censored.
• That is, the endpoint of interest has not occurred during the period of
observation.
• Another reason is that survival times tend to have positively skewed
distributions.



Life-table in Stata Software
• Before any analysis, we declare the data as being of the form ‘st’ (for survival
time) using the ‘stset’ command.

• Stata code: stset duration, failure(event)


• Stata outputs



Life table in Stata Software (cont’d)
• Stata code: ltable duration event, survival intervals(50)
• Stata outputs



Life table in Stata Software (cont’d)

Students class activity 1

1. Compute the life table with an interval of 20 days

2. Find the median survival time



2. Kaplan-Meier (KM) estimator

• The KM estimator helps us to estimate S(t) when there are censored data.

• To find the KM estimator, we break up the survival probability into a sequence of conditional probabilities.

• The probability of surviving t (t ≥ 2) or more years from the beginning of the study is the product of the conditional survival probabilities for each year, i.e.

S(t) = p1 × p2 × p3 × … × pt



Kaplan-meier estimator (cont’d)

Mathematically we can put the KM estimator as:

Ŝ(t) = ∏ pj = ∏ (nj − dj) / nj, with the product taken over all event times tj ≤ t

▪ pj = probability of surviving through tj given survival beyond tj−1, estimated by the proportion (nj − dj) / nj

▪ nj = Number at risk at time tj

▪ dj = Number who died at time tj

▪ nj − dj = Number who survived beyond tj
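To make the roles of nj and dj concrete, here is a minimal Python sketch of the product-limit calculation. It is not the course's Stata workflow (sts list and sts graph do this in Stata); the function name `kaplan_meier` and the five-subject data set are hypothetical and chosen only so the arithmetic can be followed by hand.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of S(t).

    times:  observed times (event or censoring)
    events: 1 if the time is an event, 0 if it is censored
    Returns a list of (t_j, n_j, d_j, S(t_j)) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    table = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Collect everyone whose observed time equals t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            surv *= (n_at_risk - deaths) / n_at_risk  # p_j = (n_j - d_j) / n_j
            table.append((t, n_at_risk, deaths, surv))
        n_at_risk -= leaving  # events and censored subjects both leave the risk set
    return table


# Hypothetical data: five subjects, two of them censored (event = 0)
for t, n, d, s in kaplan_meier([2, 3, 3, 5, 6], [1, 1, 0, 1, 0]):
    print(t, n, d, round(s, 3))
```

A subject censored at time tj is still counted in nj at that time and leaves the risk set afterwards, which is how censored observations contribute information without counting as events.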



Practical section for
Kaplan-Meier curve
using Stata software



Kaplan-meier estimates in Stata Software
• Stata code: sts graph, ylabel(0.7(0.1)1.0)
• Stata outputs
[Figure: Kaplan-Meier survival estimate; x-axis: analysis time (0 to 500); y-axis from 0.70 to 1.00]



Kaplan-meier estimates in Stata Software (cont’d)
• Stata code: sts graph, by( CatBwt ) ylabel(0.7(0.1)1.0)
• Stata outputs
[Figure: Kaplan-Meier survival estimates by group; x-axis: analysis time (0 to 500); y-axis from 0.70 to 1.00; curves: CatBwt = normal, CatBwt = underweight]



Kaplan-meier estimates in Stata Software (cont’d)

Students class activity 2

1. Construct the KM survival curve for the variables sexChild and catgravidity



Limitations of Kaplan-Meier

✓Mainly descriptive

✓Doesn’t control for covariates

✓Requires categorical predictors

✓Can’t accommodate time-dependent variables



Comparison of Survival Curves (Function)
• After estimating the survival function, S(t), over time for a group of individuals, our next step is to compare the survival estimates between two or more groups.

• We can do this with the Mantel-Haenszel log-rank test.

• The log-rank test is the most well known and widely used test for this purpose.

• The null hypothesis for the log-rank test is that there is no difference in survival probabilities between the groups.
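For intuition about what the log-rank test computes, the Python sketch below compares, at each event time, the observed number of events in one group with the number expected if both groups shared the same survival. It is a two-group illustration under stated assumptions, not the Stata implementation: the function name `logrank_test`, the 0/1 group coding and the six-subject data set are all hypothetical.

```python
import math


def logrank_test(times, events, groups):
    """Two-group log-rank test.

    times:  observed times; events: 1 = event, 0 = censored
    groups: 0 or 1 for each subject
    Returns (chi-square statistic with 1 df, p-value).
    Assumes the variance is non-zero (both groups at risk at some event time).
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    observed = expected = variance = 0.0
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                     # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        observed += d1
        expected += d * n1 / n                                    # expected events in group 1
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square with 1 df
    return chi2, p_value


# Hypothetical data: group 0 fails early, group 1 fails late, no censoring
chi2, p = logrank_test([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1], [0, 0, 0, 1, 1, 1])
print(round(chi2, 3), round(p, 3))
```

A small p-value leads to rejecting the null hypothesis of equal survival. With more than two groups the statistic has more degrees of freedom, and this sketch does not cover that case.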



Practical section for
Log rank test
using Stata software



log rank test in Stata Software
Hypothesis
Ho: There is no difference in survival probabilities between normal and underweight infants
Ha: There is a difference in survival probabilities between normal and underweight infants

Stata code: sts test CatBwt, logrank

Stata outputs

Conclusion: We reject Ho and conclude that there is a difference in survival probabilities between normal and underweight infants.



log rank test in Stata Software (cont’d)

Students class activity 3

1. Use the log-rank test to check whether there is a significant difference in survival probabilities across the categories of sexChild and of catgravidity



Modeling Survival Data



Basics of Survival analysis
Three types of analysis are different from each other.



Cox Proportional Hazards Model

• Kaplan-Meier estimates and significance tests (e.g. log-rank) can be used to compare survival in different subgroups.
• However, when there are several explanatory variables, or when some of them are continuous, a regression method such as Cox regression is preferred.



Cox Proportional Hazards Model (cont’d)
• The hazard function for individual i is modeled as
hi(t) = h0(t) exp(βᵀxi)
h0(t): the baseline hazard function.
β: the regression coefficients.
xi: the covariates of interest.
• The baseline hazard is the hazard when all covariates are zero, and is left unspecified.
• Because h0(t) cancels when the hazards of two individuals are compared, the exponentiated regression parameters can be interpreted as hazard ratios.
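The hazard-ratio interpretation follows directly from the model equation. The short derivation below uses the slide's symbols; the subscript j for a second individual and the subscript k for a single covariate are notation added here for illustration.

```latex
\begin{align*}
\frac{h_i(t)}{h_j(t)}
  &= \frac{h_0(t)\exp(\beta^{T}x_i)}{h_0(t)\exp(\beta^{T}x_j)}
   = \exp\{\beta^{T}(x_i - x_j)\}, \\
\text{HR}_k &= \exp(\beta_k)
  \quad \text{when } x_i \text{ and } x_j \text{ differ by one unit in covariate } k \text{ only.}
\end{align*}
```

The ratio does not depend on t, which is the proportional hazards assumption that is tested later in the course.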
Cox Proportional Hazards Model (cont’d)

Interpretation of Hazard Ratio (HR)


• HR = 1: No effect
• HR < 1: Reduction in the hazard
• HR > 1: Increase in Hazard
• A covariate with hazard ratio > 1 (𝛽 > 0) is called a risk factor
• A covariate with hazard ratio < 1 (𝛽 < 0) is called a protective factor



Practical section for
Proportional Hazards Model
using Stata software



Cox regression in Stata Software
Stata code: stcox i.CatBwt
Stata outputs

Interpretation:
• At any given time, the hazard of death among underweight infants is about 2.79 times that among normal-weight infants.
• Equivalently, the hazard of death is about 179% higher among underweight infants than among normal-weight infants.



Cox regression in Stata Software(cont’d)
• We want to study the effect of sex, gravidity, family size and maternal age on infant survival.
Stata code: stcox i.CatGravida famsize momage i.sexChild
Stata outputs



Cox regression in Stata Software(cont’d)

Interpretation:
1. Holding the other variables constant, the hazard of death is 14.2% lower among female infants than among male infants.
2. For each one-person increase in family size, the hazard of death decreases by 11.2%, holding the other variables constant.
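Percentage statements like these come from the hazard ratio as (HR − 1) × 100. The Python sketch below shows the conversion; the function name is hypothetical, and the hazard ratios 0.858 and 0.888 are back-calculated here from the 14.2% and 11.2% quoted in the interpretation (the Stata output itself is not reproduced in these slides), so treat them as assumed values.

```python
def percent_change_in_hazard(hazard_ratio):
    """Percentage change in the hazard implied by a hazard ratio.

    Negative values are reductions, positive values are increases.
    """
    return (hazard_ratio - 1.0) * 100.0


# Assumed hazard ratios, back-calculated from the quoted percentages
print(round(percent_change_in_hazard(0.858), 1))  # sex: female vs male
print(round(percent_change_in_hazard(0.888), 1))  # per one-person increase in family size
print(round(percent_change_in_hazard(2.79), 1))   # underweight vs normal, from the earlier model
```

A hazard ratio of 2.79 therefore corresponds to a 179% increase in the hazard, which is the same statement as "2.79 times the hazard".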



Cox regression in Stata Software(cont’d)

Students class activity 4

1. Fit Cox regression for the independent variables Marital, Religions, arm circumference and parity

2. Interpret the result



Test of PH Assumption

Methods
1. Hazard plot (log-log plot of survival)
2. Correlation test (test using Schoenfeld residuals)



Test of PH Assumption (cont’d)

1. Hazard Plot (Log–log plot of survival)


• Check for parallel curves across groups for all categorical covariates (e.g. Treat)
• For continuous variables, create categories and then check for parallel log-log survival curves
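Why parallel curves indicate proportional hazards can be seen from the Cox model itself. In the sketch below, S_0(t) denotes the baseline survivor function, a symbol introduced here for illustration; the other symbols follow the Cox model slide.

```latex
\begin{align*}
S_i(t) &= S_0(t)^{\exp(\beta^{T}x_i)}, \\
\ln\{-\ln S_i(t)\} &= \beta^{T}x_i + \ln\{-\ln S_0(t)\}.
\end{align*}
```

Under the model, the log-log survival curves of two groups differ only by the constant \beta^{T}(x_i - x_j), so they should be roughly parallel when plotted against time or ln(time). Curves that cross or clearly converge suggest the assumption may not hold.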



Practical section for
Test of PH Assumption
using Hazard Plot
In Stata software



PH Assumption in Stata Software
Menu: Statistics > Survival analysis > Regression models > Graphically assess PH assumption

Stata code: stphplot, by( CatBwt )


Stata outputs
[Figure: log-log survival plot; x-axis: ln(analysis time) (0 to 6); curves: CatBwt = normal, CatBwt = underweight]

• The plot displays lines that are parallel, implying that the proportional-hazards assumption for body weight has not been violated.



Test of PH Assumption (cont’d)
2. Correlation test (test using Schoenfeld residuals)
• We can use the estat phtest command in Stata after the final model is fitted
• A significant p-value indicates violation of the PH assumption
• If the assumption is violated, we can remove the variables one by one and check the assumption again until it is satisfied
• Once we have identified the variable that violates the assumption, we can use one of the following:
✓ Stratified analysis
✓ Time-dependent Cox regression



Practical section for
Test of PH Assumption
using Correlation Test
in Stata software



PH Assumption in Stata Software
Hypothesis
Ho: PH assumption is not violated
Ha: PH assumption is violated
Stata code:
stcox i.CatGravida famsize momage i.sexChild
estat phtest
Stata outputs

We fail to reject Ho and conclude that there is no evidence that the PH assumption is violated.


Cox regression in Stata Software(cont’d)

Students class activity 5

1. Fit Cox regression for the independent variables Marital, Religions, arm circumference and parity

2. Test the PH assumption of the model



Students home take activities
1. Compute the life table with an interval of one month

2. Find the median survival time

3. Construct the KM survival curve and fit Cox regression for the variables marital status length (by creating the categories 30 to 40, 41 to 50 and greater than 51) and catgravidity

4. Test the PH assumption of the model

5. Interpret the result



Questions?
