
Survival Analysis

Statistical Analysis Using R Stephen Cox Midwest SETAC


stephen.cox@ttu.edu March 2009

Setting
• The time to an event is frequently an
important outcome (or endpoint)
• E.g.,
– Dosing studies designed to determine an LC50 allow you to
determine “the concentration that kills 50% of the individuals
within a specific time frame (frequently 48h)”
– A more rigorous approach would be to ask “what are the
combined effects of concentration and exposure duration on the
lifetimes (or time to death) of individuals?”

Approach
– Model time to event (commonly failure or
death)
• Unlike linear regression, survival analysis has a
dichotomous (binary) outcome
• Unlike logistic regression (which models the
probability of an event), survival analysis analyzes
the time to an event
– Specifically – the probability that the event does not
occur until after some specific time


Approach
– Time as a dependent variable is tricky!
• Non-normal
• Censoring
[Figure: follow-up timelines for Individuals 1 to 5 along a TIME axis from Start Study to End Study, with “Right” Censored observations labeled]

Example
Acute exposure of newborn rodents to cadmium (2 mg/kg
body weight). Preliminary trial, n=10. Time is measured in
days. (from Piegorsch and Bailer 1997, page 481)

6 individuals die, at days 1,3,4,4,6,8


4 animals develop other, unrelated problems and must be
removed (i.e., censored) at days 2,4,5,9
> ttt = c(1,3,4,4,6,8,2,4,5,9)
> ttt.status = c(1,1,1,1,1,1,0,0,0,0)   # 0 indicates censored observations

What is the effect of exposure on lifespan?
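As a hedged illustration (not part of the original slides), the two vectors above can be combined into a survival object with `Surv()` from the survival package, which the later slides rely on. The name `cadmium` is hypothetical; printing the object marks censored times with a trailing "+".

```r
library(survival)

# Cadmium example from the slide: times in days, status 1 = death, 0 = censored
ttt <- c(1, 3, 4, 4, 6, 8, 2, 4, 5, 9)
ttt.status <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

# 'cadmium' is a hypothetical name for the combined survival object
cadmium <- Surv(ttt, ttt.status)
print(cadmium)   # censored times are shown with a "+" suffix

# Simple bookkeeping: how many deaths and how many censored animals
n_deaths   <- sum(ttt.status == 1)
n_censored <- sum(ttt.status == 0)
```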



Survivor Function, S(t)


• Represents the probability that the “event” does not happen until
after some specific time
• Simplest form – just the proportion of individuals still alive at time t

However – this does not account for the fact that we have censored observations!

Survivor Function, S(t)
• Kaplan-Meier (Product-Limit) Estimator – a non-parametric
estimator that adjusts for censoring
d_i = # of deaths at t_i
n_i = # of organisms alive and uncensored immediately before t_i
(i.e., the number “at risk”)

> library(survival)
> sfit = survfit(Surv(ttt,ttt.status)~1, type="kaplan-meier")
> plot(sfit,xlab='time',ylab='S(t)')


[Figure: the Kaplan-Meier plot repeated, with the censored observations marked]
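To make the product-limit idea concrete, here is a minimal sketch that computes the estimate by hand for the cadmium data. It assumes the usual formula, S(t) = product over death times t_i <= t of (1 - d_i/n_i), and that an animal censored at a death time still counts as at risk at that time. The names `death_times`, `n_i`, `d_i`, and `km` are illustrative, not from the slides.

```r
# Cadmium data from the example slide (1 = death, 0 = censored)
ttt <- c(1, 3, 4, 4, 6, 8, 2, 4, 5, 9)
ttt.status <- c(1, 1, 1, 1, 1, 1, 0, 0, 0, 0)

# Distinct times at which at least one death occurs
death_times <- sort(unique(ttt[ttt.status == 1]))

# n_i: number at risk just before each death time
n_i <- sapply(death_times, function(t) sum(ttt >= t))

# d_i: number of deaths at each death time
d_i <- sapply(death_times, function(t) sum(ttt == t & ttt.status == 1))

# Product-limit estimate: running product of the conditional survival fractions
km <- cumprod(1 - d_i / n_i)

print(data.frame(time = death_times, n.risk = n_i, n.event = d_i, surv = km))
```

The resulting values (0.9, 0.7875, 0.5625, 0.375, 0.1875) should agree with the survival estimates reported by `summary()` on the `survfit` object.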

Correcting for Censoring
[Figure: estimated survivor curves, Kaplan-Meier vs. Uncorrected]


Hazard Function, h(t)


• Hazard function is the negative derivative of the log survivor function
with respect to time: h(t) = -d/dt log(S(t))
– Age-specific death rate when an individual is t years old
– mathematically convenient – more later

• Cumulative hazard function, H(t) = -log(S(t))
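The link between the hazard, the cumulative hazard, and the survivor function can be written out in a few lines. This is a sketch of the standard relations, assuming S(t) is differentiable with S(0) = 1 and writing f(t) = -dS(t)/dt for the lifetime density:

```latex
\begin{align*}
h(t) &= -\frac{d}{dt}\log S(t) = \frac{f(t)}{S(t)}, \\
H(t) &= \int_0^t h(u)\,du = -\log S(t), \\
S(t) &= \exp\{-H(t)\}.
\end{align*}
```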

Comparing Survival
Gehan dataset: remission time of leukemia patients.
Patients are split into two treatments. (from library MASS)

Is the treatment effective (i.e., is there a significant difference
in survival between groups)?
> library(MASS)
> data(gehan)
> gehan.surv = survfit(Surv(time,cens)~treat, data=gehan)
> plot(gehan.surv,lty=3:2,lwd=2,cex=2,
xlab = "time of remission (weeks)",
ylab="survival")
> legend(25,0.1,c("control","6-MP"),
lty=2:3,lwd=2)


Comparing Survival
• Log-Rank Test
– Based on each curve’s PL estimator
– Non-parametric approach
– aka Cox-Mantel Test or Mantel-Haenszel Test
> survdiff(Surv(time,cens)~treat, data=gehan)
Call:
survdiff(formula = Surv(time, cens) ~ treat, data = gehan)

               N Observed Expected (O-E)^2/E (O-E)^2/V
treat=6-MP    21        9     19.3      5.46      16.8
treat=control 21       21     10.7      9.77      16.8

Chisq= 16.8 on 1 degrees of freedom, p= 4.17e-05
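The p-value in this output comes from referring the chi-square statistic to a chi-square distribution with (number of groups - 1) degrees of freedom. A quick check, using the rounded statistic copied from the output above (so the result only approximately matches the printed p-value):

```r
chisq_stat <- 16.8   # rounded value copied from the survdiff output
df <- 2 - 1          # two treatment groups

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chisq_stat, df = df, lower.tail = FALSE)
print(p_value)
```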

Survivor Function, S(t)
We need a methodology that would allow us to model survival time more
generally and to allow for covariates (think regression models - in fact, if we did
not have to worry about censoring, we could just use a glm). To do this, one
approach is to use …

• Parametric Estimators – assume a known probability distribution of lifetimes


From basic stats, let f(t) be the probability density function (pdf) of lifetimes,
and F(t) be the probability distribution function (or, cumulative distribution
function, cdf) of lifetimes.

Survivor and hazard functions are just …
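The formulas that followed on the slide are not preserved in this text. The standard relations are S(t) = 1 - F(t) and h(t) = f(t)/S(t), and the sketch below evaluates them numerically. The exponential lifetime distribution with mean 5 is an illustrative assumption, not something taken from the slides.

```r
# Illustrative lifetime distribution: exponential with mean 5 (an assumption)
mean_life <- 5
t <- c(0.5, 1, 2, 4, 8)

f_t <- dexp(t, rate = 1 / mean_life)   # pdf, f(t)
F_t <- pexp(t, rate = 1 / mean_life)   # cdf, F(t)

S_t <- 1 - F_t        # survivor function
h_t <- f_t / S_t      # hazard function
H_t <- -log(S_t)      # cumulative hazard

print(data.frame(t, S_t, h_t, H_t))
```

For the exponential distribution the hazard is constant (here 1/5 = 0.2) and the cumulative hazard grows linearly in t.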


Survivor Function, S(t)


• Parametric Estimators –
Commonly used distributions …

Survivor Function, S(t)
• Parametric Estimators –
Exponential

> sregexp = survreg(Surv(ttt,ttt.status)~1, dist="exponential")
> curve(pweibull(x,scale = exp(coef(sregexp)), shape=1,lower=F),
from=0,to=8, xlab = "time", ylab ="S(t)", col='red',
main='Exponential')
> lines(sfit)

[Figure: "Exponential"; fitted S(t) against time (0 to 8) in red, with the Kaplan-Meier curve overlaid]


Survivor Function, S(t)


• Parametric Estimators –
Weibull

> sregwei = survreg(Surv(ttt,ttt.status)~1, dist="weibull")
> curve(pweibull(x,scale=exp(coef(sregwei)),
shape=1/sregwei$scale,lower=F),from=0,
to=8, xlab = "time", ylab ="S(t)",
col='red', main='Weibull')
> lines(sfit)

[Figure: "Weibull"; fitted S(t) against time (0 to 8) in red, with the Kaplan-Meier curve overlaid]

Comparing Survival (or Hazard)
• Two general classes of regression models
– Accelerated failure-time (AFT) models
• parametric
– Proportional hazards (PH) model
• can be parametric or semi-parametric (Cox)


Accelerated Failure Time (AFT) Models

where σ is a scale parameter, and the ε_j are assigned a parametric distribution
output_name = survreg(Surv(time_var,cens_var)~Xi, dist="distribution")
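The model equation that the "where" clause refers to is not preserved in this text. A sketch of the standard log-linear AFT form, which is the form `survreg` fits, is given below. The symbols T_j (survival time of individual j), X_ij (covariates), and β_i (regression coefficients) are assumed notation; only σ and ε_j appear in the original.

```latex
\log T_j = \beta_0 + \beta_1 X_{1j} + \dots + \beta_p X_{pj} + \sigma\,\varepsilon_j
```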

Back to the gehan example:

> gehanreg = survreg(Surv(time,cens)~treat, data=gehan, dist="weibull")
> gehanreg
Call:
survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "weibull")

Coefficients:
 (Intercept) treatcontrol
    3.515687    -1.267335

Scale= 0.7321944

Loglik(model)= -106.6 Loglik(intercept only)= -116.4


Chisq= 19.65 on 1 degrees of freedom, p= 9.3e-06
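One way to read this output: in an AFT model the coefficients act on log time, so exponentiating a coefficient gives a multiplicative "time ratio". The sketch below uses numbers copied from the printed fit; the variable names are illustrative. It also recomputes the likelihood-ratio chi-square from the two rounded log-likelihoods, so it gives 19.6 rather than the printed 19.65.

```r
# Values copied from the printed survreg fit above
b_control    <- -1.267335   # treatcontrol coefficient (log-time scale)
loglik_model <- -106.6      # rounded, as printed
loglik_null  <- -116.4      # rounded, as printed

# Time ratio: multiplicative effect of 'control' on remission time vs. 6-MP
time_ratio <- exp(b_control)

# Likelihood-ratio chi-square and its p-value on 1 degree of freedom
lr_stat <- 2 * (loglik_model - loglik_null)
p_value <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(time_ratio = time_ratio, lr_stat = lr_stat, p_value = p_value))
```

A time ratio of about 0.28 means that, under this model, remission times in the control group are roughly 0.28 times as long as those in the 6-MP group.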

Accelerated Failure Time (AFT) Models

> summary(gehanreg)

Call:
survreg(formula = Surv(time, cens) ~ treat, data = gehan, dist = "weibull")
              Value Std. Error     z        p
(Intercept)   3.516      0.252 13.96 2.61e-44
treatcontrol -1.267      0.311 -4.08 4.51e-05
Log(scale)   -0.312      0.147 -2.12 3.43e-02

Scale= 0.732

Weibull distribution
Loglik(model)= -106.6 Loglik(intercept only)= -116.4
Chisq= 19.65 on 1 degrees of freedom, p= 9.3e-06
Number of Newton-Raphson Iterations: 5
n= 42

> anova(gehanreg)
      Df Deviance Resid. Df    -2*LL    P(>|Chi|)
NULL  NA       NA        40 232.8108           NA
treat -1 19.65183        39 213.1590 9.291424e-06


Weibull Distribution
Assessing the adequacy of the Weibull distribution …
Plotted as log H(t) against log(time), the curves should be approximately linear!

> plot(gehan.surv,lty=2:3,lwd=2,cex=2,
fun="cloglog",xlim=c(1,40),
xlab = "time of remission(weeks)",
ylab="log H(t)")
> legend(2,0.5,c("control","6MP"),
lty=3:2)
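The reason linearity is the right check: under a Weibull model the log cumulative hazard is a straight line in log time. A short derivation, assuming the shape/scale parameterization used by `pweibull` (the symbols a for shape and b for scale are not from the slides):

```latex
\begin{align*}
S(t) &= \exp\{-(t/b)^{a}\}, \\
H(t) &= -\log S(t) = (t/b)^{a}, \\
\log H(t) &= a\,\log t - a\,\log b .
\end{align*}
```

On the `fun="cloglog"` plot each group should therefore trace a roughly straight line with slope a, and groups that share a common shape give roughly parallel lines.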

Accelerated Failure Time (AFT) Models
Leukemia survival times with two covariates. (from library MASS)
– white blood cell count (wbc)
– diagnostic test results (ag)
NOTE: No censored observations!

> library(MASS)
> data(leuk)
> leukfit = survfit(Surv(time)~ag, data=leuk)
> plot(leukfit,lty=3:2,lwd=2,cex=2,
xlab = "time",ylab="survival")
> legend(115,1,c("ag present","ag absent"),
lty=2:3,lwd=2)


Accelerated Failure Time (AFT) Models


• Because there were no censored observations, we could also approach the data
using a glm!
NOTE: the exponential distribution is a special case of the Gamma
distribution with dispersion=1
> leuk_glm = glm(time~ag*log(wbc), data=leuk, family=Gamma(link=log))
> summary(leuk_glm,dispersion=1)

Call:
glm(formula = time ~ ag * log(wbc), family = Gamma(link = log),
    data = leuk)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9921  -1.2116  -0.3269   0.2159   1.5647

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)          4.3435     1.9662   2.209   0.0272 *
agpresent            4.1347     2.5703   1.609   0.1077
log(wbc)            -0.1540     0.2027  -0.760   0.4472
agpresent:log(wbc)  -0.3278     0.2669  -1.228   0.2194
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for Gamma family taken to be 1)

Null deviance: 58.138 on 32 degrees of freedom
Residual deviance: 38.555 on 29 degrees of freedom
AIC: 301.74

Number of Fisher Scoring iterations: 11

Accelerated Failure Time (AFT) Models
> leuk_reg = survreg(Surv(time)~ag*log(wbc), data=leuk, dist="exponential")
> leuk_reg
Call:
survreg(formula = Surv(time) ~ ag * log(wbc), data = leuk, dist =
"exponential")

Coefficients:
(Intercept) agpresent log(wbc) agpresent:log(wbc)
4.3432709 4.1349385 -0.1540179 -0.3278114

Scale fixed at 1

Loglik(model)= -145.7 Loglik(intercept only)= -155.5


Chisq= 19.58 on 3 degrees of freedom, p= 0.00021
n= 33
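Because the leuk data have no censoring, the exponential AFT fit and a Gamma GLM with log link should give essentially the same point estimates. The sketch below puts them side by side, assuming the MASS and survival packages are installed. The names `fit_aft`, `fit_glm`, and `coef_compare` are illustrative.

```r
library(MASS)       # provides the leuk data
library(survival)   # provides Surv() and survreg()

# Exponential AFT fit; with no status argument, Surv(time) treats every time as an event
fit_aft <- survreg(Surv(time) ~ ag * log(wbc), data = leuk, dist = "exponential")

# Gamma GLM with log link; the point estimates do not depend on the dispersion
fit_glm <- glm(time ~ ag * log(wbc), data = leuk, family = Gamma(link = log))

# Side-by-side coefficients from the two fits
coef_compare <- cbind(survreg = coef(fit_aft), glm = coef(fit_glm))
print(round(coef_compare, 4))

# Largest absolute difference between the two sets of estimates
max_diff <- max(abs(coef_compare[, "survreg"] - coef_compare[, "glm"]))
```

Any remaining differences should be small and reflect the convergence tolerances of the two fitting routines.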


Accelerated Failure Time (AFT) Models


> summary(leuk_reg)

Call:
survreg(formula = Surv(time) ~ ag * log(wbc), data = leuk, dist =
"exponential")
                    Value Std. Error      z       p
(Intercept)         4.343      1.638  2.651 0.00802
agpresent           4.135      2.370  1.745 0.08097
log(wbc)           -0.154      0.168 -0.915 0.36000
agpresent:log(wbc) -0.328      0.246 -1.332 0.18298

Scale fixed at 1

NOTE: the interaction term is not significant, indicating consistent
effects of log(wbc) across groups.

Exponential distribution
Loglik(model)= -145.7 Loglik(intercept only)= -155.5
Chisq= 19.58 on 3 degrees of freedom, p= 0.00021
Number of Newton-Raphson Iterations: 5
n= 33

Accelerated Failure Time (AFT) Models
> leuk_reg = survreg(Surv(time)~ag+log(wbc), data=leuk, dist="exponential")
> summary(leuk_reg)

Call:
survreg(formula = Surv(time) ~ ag + log(wbc), data = leuk, dist =
"exponential")
             Value Std. Error     z        p
(Intercept)  5.815      1.263  4.60 4.15e-06
agpresent    1.018      0.364  2.80 5.14e-03
log(wbc)    -0.304      0.124 -2.45 1.44e-02

Scale fixed at 1

Exponential distribution
Loglik(model)= -146.5 Loglik(intercept only)= -155.5
Chisq= 17.82 on 2 degrees of freedom, p= 0.00014
Number of Newton-Raphson Iterations: 5
n= 33


Proportional Hazards Model


• Hazard function is modeled as a multiple of some
baseline hazard
– Baseline hazard can be specified as a fully parametric
model
• Requires assumptions to be met
• NOTE: the Weibull PH model turns out to be the same thing
as the Weibull AFT!
– Cox PH model
• The form of the baseline hazard is left unspecified
• Thus, provides a framework for modeling survival with
covariates but is “less parametric”
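The Weibull AFT/PH equivalence noted above can be made concrete. For a Weibull model, an AFT coefficient b with survreg scale σ corresponds to a log hazard ratio of -b/σ. The sketch below applies that standard conversion to the numbers printed for the gehan Weibull fit earlier in the deck; the variable names are illustrative.

```r
# Values copied from the Weibull survreg fit to the gehan data shown earlier
b_aft <- -1.267335   # treatcontrol coefficient on the log-time (AFT) scale
sigma <- 0.7321944   # survreg scale parameter

shape <- 1 / sigma          # Weibull shape implied by the survreg scale
b_ph  <- -b_aft / sigma     # equivalent log hazard ratio on the PH scale
hazard_ratio <- exp(b_ph)   # control relative to 6-MP

print(c(shape = shape, log_hr = b_ph, hazard_ratio = hazard_ratio))
```

A negative AFT coefficient (shorter times) becomes a positive log hazard ratio (higher hazard), here a hazard ratio of roughly 5.6 for control relative to 6-MP.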

Proportional Hazards Model
• Cox PH model,
output_name = coxph(Surv(time_var,cens_var)~Xi)
> leuk_cox = coxph(Surv(time)~ag+log(wbc), data=leuk)
> summary(leuk_cox)
Call:
coxph(formula = Surv(time) ~ ag + log(wbc), data = leuk)

n= 33
            coef exp(coef) se(coef)     z      p
agpresent -1.069     0.343    0.429 -2.49 0.0130
log(wbc)   0.368     1.444    0.136  2.70 0.0069

          exp(coef) exp(-coef) lower .95 upper .95
agpresent     0.343      2.913     0.148     0.796
log(wbc)      1.444      0.692     1.106     1.886

Rsquare= 0.377 (max possible= 0.994 )
Likelihood ratio test= 15.6 on 2 df, p=0.000401
Wald test = 15.1 on 2 df, p=0.000537
Score (logrank) test = 16.5 on 2 df, p=0.000263

See also cph in the Design library
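The hazard ratio and its confidence limits in this output can be reproduced from the coefficient and its standard error. A minimal sketch for the `agpresent` row, using the rounded values printed above (the variable names are illustrative):

```r
# Values copied from the coxph summary above (agpresent row)
coef_ag <- -1.069
se_ag   <- 0.429

# Hazard ratio: exponentiated coefficient
hr <- exp(coef_ag)

# 95% Wald interval, built on the log scale and then exponentiated
z  <- qnorm(0.975)
ci <- exp(coef_ag + c(-1, 1) * z * se_ag)

print(round(c(hazard_ratio = hr, lower = ci[1], upper = ci[2]), 3))
```

A hazard ratio of about 0.34 means that, at a given wbc, individuals with ag present have roughly one third the hazard of those with ag absent.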


Proportional Hazards Model


• Leuk data
RED: Kaplan-Meier (survival curves for individuals in the two groups)
BLACK: Cox PH (survival curves for individuals WITH AVERAGE WBC in
the two groups)
Some of the differences between the two groups were due to different
wbc!

> attach(leuk)
> plot(survfit(Surv(time)~ag), lty=2:3, log=T,
lwd=3, col='red')
> leuk_coxs = coxph(Surv(time)~strata(ag)+log(wbc),
data=leuk)
> lines(survfit(leuk_coxs),lty=2:3, lwd=3)
> legend(80,.8,c("ag absent", "ag present"),
lty=2:3, lwd=3)
