
The Cox model in R

Gardar Sveinbjornsson, Jongkil Kim, Yongsheng Wang


April 18, 2011

1 The data
The data is from an experimental study of recidivism of 432 male prisoners, who were observed for a year after
being released from prison. Half of the prisoners were randomly given financial aid when they were released.
The main focus of the study was whether financial aid kept the prisoners from being rearrested.
- week: week of arrest after release, or censoring time.
- arrest: the event indicator, equal to 1 for those arrested during the period of the study and 0 for those
who were not arrested.
- fin: a dummy variable, equal to 1 if the individual received financial aid after release from prison, and 0
if he did not; financial aid was a randomly assigned factor manipulated by the researchers.
- age: in years at the time of release.
- race: a dummy variable coded 1 for blacks and 0 for others.
- wexp: a dummy variable coded 1 if the individual had full-time work experience prior to incarceration
and 0 if he did not.
- mar: a dummy variable coded 1 if the individual was married at the time of release and 0 if he was not.
- paro: a dummy variable coded 1 if the individual was released on parole and 0 if he was not.
- prio: number of prior convictions.
- educ: education, a categorical variable, with codes 2 (grade 6 or less), 3 (grades 6 through 9), 4 (grades
10 and 11), 5 (grade 12), or 6 (some post-secondary).
- emp1 - emp52: dummy variables coded 1 if the individual was employed in the corresponding week of
the study and 0 otherwise.

2 Cox PH model for time-independent variables in R


2.1 Cox regression
We fit a naive Cox model to our data:
coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
mar + paro + prio + factor(educ), data = Rossi)

                   coef  exp(coef)  se(coef)       z       p
fin             -0.4027      0.669    0.1930  -2.087  0.0370
age             -0.0514      0.950    0.0222  -2.316  0.0210
race             0.3615      1.435    0.3122   1.158  0.2500
wexp            -0.1200      0.887    0.2135  -0.562  0.5700
mar             -0.4236      0.655    0.3822  -1.108  0.2700
paro            -0.0982      0.906    0.1959  -0.501  0.6200
prio             0.0794      1.083    0.0293   2.707  0.0068
factor(educ)3    0.5934      1.810    0.5196   1.142  0.2500
factor(educ)4    0.3284      1.389    0.5437   0.604  0.5500
factor(educ)5   -0.1210      0.886    0.6752  -0.179  0.8600
factor(educ)6   -0.4070      0.666    1.1233  -0.362  0.7200
We see that financial aid, age and number of prior convictions are the only variables that are statistically
significant at the 5% level.
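
The z and p columns follow directly from coef and se(coef). A minimal base R check, using only the numbers
printed above for fin, age and prio (the data frame name tab is ours, chosen for illustration):

```r
# Coefficients and standard errors copied from the printed table
tab <- data.frame(
  term = c("fin", "age", "prio"),
  coef = c(-0.4027, -0.0514, 0.0794),
  se   = c(0.1930, 0.0222, 0.0293)
)

tab$hr <- exp(tab$coef)            # hazard ratio, the exp(coef) column
tab$z  <- tab$coef / tab$se        # Wald z statistic
tab$p  <- 2 * pnorm(-abs(tab$z))   # two-sided p-value from the standard normal

print(tab, digits = 3)
```

Small differences in the last digit are expected, because the printed coefficients are already rounded.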

2.2 Adjusted survival function


# mod.allison is assumed to hold the Cox model fitted in Section 2.1
plot(survfit(mod.allison), ylim=c(.7, 1), xlab='Weeks',
     ylab='Proportion Not Rearrested')

[Figure: estimated survival function. x-axis: Weeks (0 to 50); y-axis: Proportion Not Rearrested (0.70 to 1.00).]

3 Model development
It is likely that we will have data for more covariates than we can reasonably expect to include in the model.
We must therefore decide on a method to select a subset.

3.1 Purposeful selection


Purposeful selection is a model-building method controlled by the data analyst. It consists of several steps:

1. We fit a multivariable model containing all variables that were significant in a univariable analysis at
the 20-25% level.
2. We use the p-values from the Wald statistic to remove variables from our model. We also confirm the
non-significance by a likelihood ratio test.
3. We check whether the removal has produced an important change in coefficients of other variables.
4. We check again all the variables that we removed.
5. We check for nonlinearity.
6. We look for interactions.
7. We check assumptions.
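
The first three steps can be sketched in R as follows. This is only an illustration under stated assumptions:
the data are simulated (not the Rossi data), the names d, x1, x2, x3, keep and weakest are hypothetical, and
the 0.25 screening threshold is one choice within the 20-25% range mentioned in step 1.

```r
library(survival)

# Hypothetical toy data, simulated only so that the sketch runs (not the Rossi data)
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5), x3 = rnorm(n))
event_time  <- rexp(n, rate = 0.05 * exp(0.7 * d$x1 + 0.5 * d$x2))  # x3 is pure noise
censor_time <- rexp(n, rate = 0.02)
d$time   <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

# Step 1: univariable screening, keeping variables with a Wald p-value below 0.25
candidates <- c("x1", "x2", "x3")
p_uni <- sapply(candidates, function(v) {
  fit <- coxph(as.formula(paste("Surv(time, status) ~", v)), data = d)
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})
keep <- candidates[p_uni < 0.25]

# Step 2: fit the multivariable model and find the variable with the largest Wald p-value
full <- coxph(as.formula(paste("Surv(time, status) ~", paste(keep, collapse = " + "))),
              data = d)
cf <- summary(full)$coefficients
weakest <- rownames(cf)[which.max(cf[, "Pr(>|z|)"])]

# Confirm with a likelihood ratio test: refit without that variable and compare
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))
lrt   <- 2 * (full$loglik[2] - tail(reduced$loglik, 1))
lrt_p <- pchisq(lrt, df = 1, lower.tail = FALSE)

# Step 3: relative change in the remaining coefficients after the removal
shared <- intersect(names(coef(full)), names(coef(reduced)))
delta  <- (coef(reduced)[shared] - coef(full)[shared]) / coef(full)[shared]
```

Whether a variable is actually removed remains the analyst's decision. The sketch only produces the Wald
p-values, the likelihood ratio p-value (lrt_p) and the coefficient changes (delta) that inform it.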

3.2 Stepwise selection


Stepwise selection is a mix of forward and backward selection. We can start with either an empty model
or a full model and add or remove predictors according to some criterion. We will use the AIC, which is
defined as follows:
AIC = 2k − 2 max(log-likelihood)
where k is the number of parameters in the model.
Stepwise selection with the step() or stepAIC() function in R gives us:
Step:  AIC=1327.35
Surv(week, arrest) ~ fin + age + mar + prio

         Df     AIC
<none>       1327.3
- mar     1  1327.7
- fin     1  1329.0
- age     1  1335.4
- prio    1  1336.2
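
To connect the AIC definition with this kind of output, here is a minimal sketch on simulated data (not the
Rossi data; d, x1, x2 and x3 are hypothetical names). For a Cox model the likelihood in the AIC is the partial
likelihood, which coxph stores as the second element of loglik.

```r
library(survival)

# Hypothetical toy data (not the Rossi data)
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
event_time  <- rexp(n, rate = 0.05 * exp(0.8 * d$x1))   # only x1 affects the hazard
censor_time <- rexp(n, rate = 0.02)
d$time   <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

full <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = d)

# AIC by hand: k parameters and the maximized log partial likelihood
k <- length(coef(full))
aic_by_hand <- 2 * k - 2 * full$loglik[2]

# Stepwise search starting from the full model; trace = 0 suppresses the step log
sel <- step(full, direction = "both", trace = 0)
formula(sel)

# Log partial likelihood implied by AIC = 1327.35 for the four-variable model above
(2 * 4 - 1327.35) / 2
```

The hand-computed value should agree with AIC(full). Read in the other direction, the last line shows that
the reported AIC of 1327.35 with k = 4 corresponds to a log partial likelihood of −659.675.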

3.3 Best subset selection


Best subset selection provides a way to check all possible models. Instead of using the AIC we look at
Mallows' C:

C = W + (p − 2q)

where p is the number of variables under consideration, q is the number of variables not included in the
subset model, and W = W(p) − W(p − q). Here W(p) is the Wald test statistic for the model containing all p
variables and W(p − q) denotes the Wald test statistic for the subset model.
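
As a rough illustration of the formula, C can be computed from two fitted coxph objects, each of which stores
its overall Wald statistic in the wald.test component. The helper mallows_c and the simulated data are
hypothetical, introduced only for this sketch.

```r
library(survival)

# Hypothetical toy data (not the Rossi data)
set.seed(4)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
event_time  <- rexp(n, rate = 0.05 * exp(0.8 * d$x1))
censor_time <- rexp(n, rate = 0.02)
d$time   <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

full    <- coxph(Surv(time, status) ~ x1 + x2 + x3, data = d)  # all p = 3 variables
sub_fit <- coxph(Surv(time, status) ~ x1, data = d)            # subset model, q = 2 left out

# Hypothetical helper implementing C = W + (p - 2q) with W = W(p) - W(p - q)
mallows_c <- function(full, sub, p, q) {
  W <- unname(full$wald.test) - unname(sub$wald.test)
  W + (p - 2 * q)
}

mallows_c(full, sub_fit, p = 3, q = 2)
mallows_c(full, full, p = 3, q = 0)   # full model against itself: W = 0, so C = p
```

A best subset search would evaluate this quantity for every candidate subset. The sketch shows a single
comparison only.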

4 Model diagnostics
- PH assumption

We will see that the PH assumption fails for the variable age. We therefore look at i) an interaction between
age and time and ii) age as a strata variable.

- Influential observations
For each covariate we look at how much the regression coefficients change if we remove one observation.
- Checking for nonlinearity
The martingale residual for individual i at time ti is

M̂i = δi − Ĥ(ti, x, β̂)

where δi is the event indicator, Ĥ(ti, x, β̂) is the estimated cumulative hazard for that individual and ti is
the time at the end of follow-up.

To detect nonlinearity we plot the martingale residuals against covariates.
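
The three checks above can be sketched with functions from the survival package. The data here are simulated
(not the Rossi data) and the object names are hypothetical. cox.zph tests the PH assumption using scaled
Schoenfeld residuals, the "dfbeta" residuals approximate the change in each coefficient when one observation
is removed, and the martingale residuals are plotted against a covariate with a smoother.

```r
library(survival)

# Hypothetical toy data (not the Rossi data)
set.seed(5)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
event_time  <- rexp(n, rate = 0.05 * exp(0.6 * d$x1 + 0.4 * d$x2))
censor_time <- rexp(n, rate = 0.02)
d$time   <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

fit <- coxph(Surv(time, status) ~ x1 + x2, data = d)

# PH assumption: one test per covariate plus a global test
zph <- cox.zph(fit)
print(zph$table)

# Influential observations: one row per observation, one column per coefficient
dfb <- residuals(fit, type = "dfbeta")
head(dfb[order(-abs(dfb[, 1])), , drop = FALSE])   # largest influence on the x1 coefficient

# Nonlinearity: martingale residuals against a covariate, with a lowess smoother
mart <- residuals(fit, type = "martingale")
plot(d$x1, mart, xlab = "x1", ylab = "Martingale residual")
lines(lowess(d$x1, mart), col = "red")
```

A roughly flat smoother suggests that the covariate can enter the model linearly. A clear curve suggests
that a transformation may be needed.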

5 The Cox model with time-dependent covariate


We treat weekly employment as a time-dependent predictor of time to rearrest.
The coxph function handles time-dependent covariates by requiring that each time period for an individual
appear as a separate observation, that is, as a separate row (or record) in the data set. In the Rossi data,
however, each individual appears as a single row, with the weekly employment indicators as 52 columns in the
data frame, named emp1 through emp52. The data must therefore first be reshaped into one row per person-week;
the model below is fitted to the reshaped data frame Rossi.2.
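
The reshaping from one row per person to one row per person-week can be sketched in base R. The tiny data
frame wide is hypothetical (three subjects, five weeks) and stands in for the Rossi data. The output columns
start, stop, arrest.time and employed mirror the names used in the coxph call.

```r
# Hypothetical wide data: 3 subjects, at most 5 weeks of follow-up (not the Rossi data)
wide <- data.frame(
  id     = 1:3,
  week   = c(3, 5, 2),    # week of arrest or censoring
  arrest = c(1, 0, 1),    # 1 = arrested, 0 = censored
  fin    = c(0, 1, 1),
  emp1 = c(0, 1, 0), emp2 = c(1, 1, 0), emp3 = c(1, 1, NA),
  emp4 = c(NA, 0, NA), emp5 = c(NA, 0, NA)
)

# One row per subject and week at risk, with the interval (start, stop]
long <- do.call(rbind, lapply(seq_len(nrow(wide)), function(i) {
  weeks <- seq_len(wide$week[i])
  data.frame(
    id          = wide$id[i],
    start       = weeks - 1,
    stop        = weeks,
    # the event can only occur in the subject's last observed week
    arrest.time = as.integer(weeks == wide$week[i] & wide$arrest[i] == 1),
    fin         = wide$fin[i],
    employed    = unlist(wide[i, paste0("emp", weeks)], use.names = FALSE)
  )
}))

long
```

Time-constant covariates such as fin are simply repeated on every row of a subject, while employed changes
from week to week.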
coxph(formula = Surv(start, stop, arrest.time) ~ fin + age +
mar + prio + employed, data = Rossi.2)

              coef  exp(coef)  se(coef)       z        p
fin        -0.3390      0.712    0.1904  -1.781  7.5e-02
age        -0.0460      0.955    0.0206  -2.233  2.6e-02
mar        -0.3612      0.697    0.3733  -0.967  3.3e-01
prio        0.0842      1.088    0.0278   3.034  2.4e-03
employed   -1.3290      0.265    0.2498  -5.320  1.0e-07

The time-dependent employment variable has an apparently huge effect: the hazard of rearrest is multiplied by
a factor of 0.265 (a decline of 73.5%) during weeks in which a person is employed.

5.1 Lagged variable


The large apparent effect of the time-dependent employment variable raises an issue of causality. Being in
prison stops you from working, so if we use employment in the same week in which the arrest occurs, we cannot
tell which of the two caused the other. It is therefore safer to use a lagged covariate, that is, to replace
the term in the model for working at time t with one for working at time t − 1. In other words, to predict
arrest in week t we use whether the former prisoner was working in the week before.

[Figure: two diagrams, one relating weekly employment at time t to arrest at time t, the other relating weekly
employment at time t − 1 to arrest at time t; the figure carries the label "Ambiguous causality".]
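
Constructing the lagged covariate amounts to shifting the employment indicator by one week within each
subject. A minimal base R sketch on hypothetical person-week data (not the Rossi data), assuming the rows are
already sorted by week within subject; the column name employed.lag is ours:

```r
# Hypothetical long-format data: one row per subject and week
long <- data.frame(
  id       = c(1, 1, 1, 2, 2, 2, 2),
  stop     = c(1, 2, 3, 1, 2, 3, 4),
  employed = c(0, 1, 1, 1, 1, 0, 0)
)

# Within each subject, week t receives the employment status of week t - 1
long$employed.lag <- ave(long$employed, long$id,
                         FUN = function(x) c(NA, head(x, -1)))

# The first week of each subject has no previous week, so it is dropped
lagged <- long[!is.na(long$employed.lag), ]
lagged
```

Dropping the first week removes one row per subject, so the lagged data set is smaller than the unlagged one
by the number of subjects.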

6 Final model
After we introduce weekly employment into the model, the marriage variable becomes non-significant, so we
remove it. We also choose to use age (through the categorized variable age.cat) as a strata variable for ease
of interpretation.

coxph(formula = Surv(start, stop, arrest.time) ~ fin + strata(age.cat) +
    prio + employed, data = Rossi.2)

n= 19377, number of events= 113

               coef  exp(coef)  se(coef)       z  Pr(>|z|)
fin        -0.33454    0.71567   0.19078  -1.754  0.079502 .
prio        0.08984    1.09400   0.02707   3.319  0.000902 ***
employed   -0.82758    0.43710   0.21583  -3.834  0.000126 ***
The estimated hazard ratio for receiving financial aid is 0.71567.

This means that subjects who received financial aid are rearrested at a rate about 28% lower than those who
did not, although this effect is significant only at the 10% level (p = 0.0795).
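
The percentage reduction and an approximate 95% confidence interval for the hazard ratio follow from the fin
row of the table. A short base R sketch using only the printed coefficient and standard error (the Wald
interval based on the normal approximation is our choice here, not something stated in the output):

```r
coef_fin <- -0.33454   # coef for fin, copied from the table
se_fin   <-  0.19078   # se(coef) for fin

hr <- exp(coef_fin)                  # hazard ratio, the exp(coef) column
pct_reduction <- 100 * (1 - hr)      # percentage reduction in the hazard of rearrest

# Wald 95% interval on the log scale, transformed back to the hazard ratio scale
ci <- exp(coef_fin + c(-1, 1) * qnorm(0.975) * se_fin)

round(c(hr = hr, pct_reduction = pct_reduction, lower = ci[1], upper = ci[2]), 3)
```

The interval contains 1, which agrees with the p-value of 0.0795 being above 0.05.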
