
Bachelor's thesis

Title of the bachelor's thesis

“Breast cancer survival”

Finding new markers for predicting breast cancer survival

Authors

Alexander Dessovic / Mag. Franz Eigner

Vienna, 19 November 2008

Degree programme code according to the student record sheet: A 033 551

Field of study according to the student record sheet: Statistics
Supervisor: Ao.Univ.-Prof.Mag.Dr. Georg Heinze
Statistical Consulting SS 08

Table of contents

1 Introduction
2 Description of Data
2.1 Data checks and modifications
2.2 Summary statistics
2.3 Censored Survival Data
2.3.1 5-year survival time
2.3.2 Follow-up time
3 Methods
3.1 Kaplan-Meier estimator
3.2 Cox Regression
4 Survival Analysis
4.1 Kaplan-Meier Curves
4.2 Proportional Hazards Cox Regressions
Testing assumptions of the Cox regression
4.2.1 Proportional hazards assumption (PH)
4.2.2 Testing interaction terms
4.2.3 Testing the linearity assumption
5 Literature


1 Introduction

In survival analysis, breast cancer survival is usually predicted using clinical variables and gene
expressions as independent factors. This paper aims to identify new gene expressions that can serve as
prognostic markers for breast cancer survival. On behalf of Prof. Dan Cacsire Castillo-Tong and
Prof. Zeilinger, six gene expressions (DDR2, EMP1, EMP2, EMP3, PMP22 and MKI67) were chosen as
candidate markers and correlated with disease-free survival, overall survival and tumor specific
survival of the patients. A candidate marker that proves satisfactory in our analyses could be used to
develop a new score to improve prognosis for breast cancer.

2 Description of Data

A total of 250 breast cancer patients from the Department of Obstetrics and Gynaecology, Medical
University of Vienna, were included in this study. Dates of diagnosis range from 03.03.1987 to
30.11.2001. The clinical variables chosen were histologic type (HISTOTYPE), tumor size (pT), degree of
spread to lymph nodes (pN), tumor grade (G) and age (AGE). In addition to the six gene expressions
described in the introduction, the gene expression of the estrogen receptor (ER) was quantified in the
tumor tissues of the patients. The analyses were carried out with the open-source statistical software
package R (R Dev. Core Team, 2008) and the commercial software package SAS® (SAS Institute Inc.,
2008).

2.1 Data checks and modifications

Simple plausibility checks were performed on the data. Patient 7763 was excluded from the data set
because the results of all gene expressions were missing. Patient 7363 was removed because the date of
recurrence was unknown. This leaves 248 patients for analysis.

Histograms were computed to check the distribution of the gene expressions graphically. The data
were log-transformed to base 2 to obtain an approximately symmetric, roughly normal distribution.


[Figure: Density histograms of log2(ddr2), log2(emp1), log2(emp2), log2(emp3), log2(pmp22), log2(ER) and log2(mki67), together with the untransformed mki67 values.]

To show why the logarithm is needed, the untransformed values of the gene expression MKI67 are also
plotted. Afterwards, each gene expression was transformed into a dichotomous variable by splitting it
at its median.
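The two preprocessing steps can be sketched as follows. This is a minimal illustration in Python, not the R/SAS code used for the study; the function names and the expression values are hypothetical.

```python
import math
from statistics import median

def log2_transform(values):
    """Return log2 of each strictly positive expression value."""
    return [math.log2(v) for v in values]

def dichotomize_at_median(values):
    """Code each value as 0 (<= median) or 1 (> median)."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]

# Hypothetical, right-skewed relative expression values (not study data).
raw = [0.02, 0.05, 0.10, 0.20, 0.80]
logged = log2_transform(raw)
groups = dichotomize_at_median(logged)
print(groups)
```

Because the logarithm is monotone, the median split gives the same groups whether it is applied to the raw or to the log-transformed values; the transformation matters for the histograms and for any analysis that uses the expression values on a continuous scale.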

2.2 Summary statistics

Histologic type
Invasive ductal carcinoma (IDC) 182
IDC and ILC 6
Invasive lobular carcinoma (ILC) 40
Medullary 5
Mucinous 6
Unknown 9
Total 248

For the analysis of survival times, categories had to be grouped because some subgroups contained
few cases. For histological type, the category IDC and ILC was classified as IDC, because the more
serious diagnosis is the one that matters for survival. Medullary, Mucinous and Unknown were
combined into a new category, Others and Unknown.
Tumor size
Mic 1
pT I 64
pT II 127
pT III 23
pT IV 14
Unknown 19
Total 248

The patient in category Mic was assigned to pT I. For the analysis, pT III and pT IV were pooled
and compared with the groups pT I and pT II.

Nodal status

pN0 95
pN1 123
Unknown 30
Total 248

Differentiation grade

G1 34
G2 122
G3 71
Unknown 21
Total 248

Recurrence of Disease
Recurrence of disease 109
No evidence of disease 139
Total 248

Survival
Alive at last observation 152
Death at last observation 96
Death as a result of disease 71
Death of other cause 16
Death of unknown cause 9
Total 248

Age (years)
Minimum 27.8
25% quantile 48.0
50% quantile 58.1
75% quantile 69.4
Maximum 89.6

For analysis, patients were divided into those aged 50 years or younger and those older than 50 years,
since menopause usually starts around this age. At the time of diagnosis, 31% of all patients were in
the younger group and 69% were older than 50 years.

Correlations

Gene expression values were grouped into values lower than or equal to the median and values greater
than the median, and then compared between groups defined by histopathologic data using the χ² test.
PMP22 seems to be strongly correlated with differentiation grade (G) and nodal status (pN). Gene
expression of EMP2 seems to be strongly correlated with G, pT and pN. A high correlation was also
found between ER and G. Furthermore, no significant difference in the level of PMP22 expression was
observed between patients aged 50 years or younger and patients older than 50 years. In addition,
correlations between gene expressions were estimated by Spearman's nonparametric correlation
coefficient. In general, our gene expressions are notably correlated with each other.
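For a dichotomized marker and a two-level clinical variable such as pN, the χ² test reduces to a test on a 2x2 table. The sketch below is a minimal Python illustration with hypothetical counts. It uses the Pearson statistic without continuity correction, which is an assumption, since the text does not say which variant was used.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts (not study data):
# rows = marker <= median / > median, columns = pN0 / pN1.
stat, p = chi_square_2x2(60, 40, 35, 65)
print(round(stat, 2), p)
```

Variables with more than two categories, such as G, require the general r x c form of the test with (r-1)(c-1) degrees of freedom, which this sketch does not cover.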


2.3 Censored Survival Data

As is typical in clinical and epidemiological studies, survival times are censored because of a type I
time restriction (Lagakos, 1979): the study continues until a prespecified time point (cut-off point).
The date of the event of interest is known precisely only for those subjects who experience the event
before the cut-off point. For the remaining subjects, it is only known that the time to the event is
greater than the observation time. This is referred to as “administrative censoring”, and the
incomplete data are called “right censored”. Besides the time restriction, incomplete data can also
result from patients who are lost to follow-up or who drop out of the study.

Three different survival times are used in the analyses:

- Disease-free survival time (DFS)

DFS was defined as the time elapsing from date of diagnosis to date of recurrence of disease
(event) or - in case of no recurrence - to date of last gynecological examination (censored).

- Overall survival time (OS)

OS was defined as the time elapsing from date of diagnosis to date of death (event) or - in case
of no death - to date of last observation (censored).

- Tumor specific survival time (TS)

TS follows the definition of OS except that patients who died of causes unrelated to breast
cancer were also treated as censored.
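The three definitions can be sketched as a small helper that turns one patient record into (time, event) pairs. This is an illustrative Python sketch; the function and argument names are hypothetical, and treating every death not flagged as tumor-related as censored for TS is an assumption made here for simplicity.

```python
from datetime import date

def years_between(start, end):
    """Elapsed time in years, using 365.25-day years."""
    return (end - start).days / 365.25

def endpoints(diagnosis, last_exam, last_obs,
              recurrence=None, death=None, death_from_cancer=False):
    """Return {endpoint: (time_in_years, event)}; event 1 = observed, 0 = censored."""
    dfs_end = recurrence if recurrence is not None else last_exam
    os_end = death if death is not None else last_obs
    return {
        "DFS": (years_between(diagnosis, dfs_end), int(recurrence is not None)),
        "OS": (years_between(diagnosis, os_end), int(death is not None)),
        # TS uses the same time as OS, but only tumor-related deaths count as events.
        "TS": (years_between(diagnosis, os_end),
               int(death is not None and death_from_cancer)),
    }

# Hypothetical patient: no recurrence, died of another cause 10 years after diagnosis.
example = endpoints(date(1990, 1, 1), date(1998, 1, 1), date(2000, 1, 1),
                    death=date(2000, 1, 1))
print(example)
```

For this hypothetical patient, the death is an event for OS but a censored observation for TS, while DFS is censored at the last gynecological examination.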


2.3.1 5-year survival time

Survival probabilities were estimated using the method of Kaplan and Meier (1958).

                   Survival time   prob.   lower CL   upper CL   number at risk
5-year survival    DFS             0.639   0.578      0.707      122
                   OS              0.754   0.700      0.812      156
                   TS              0.810   0.759      0.863      156
10-year survival   DFS             0.490   0.420      0.570      50
                   OS              0.589   0.524      0.663      71
                   TS              0.658   0.591      0.732      71
15-year survival   DFS             0.322   0.207      0.501      5
                   OS              0.400   0.290      0.553      7
                   TS              0.547   0.421      0.712      7

CL: confidence limit (95%)
prob.: probability to survive

The analyses show that the probability of recurrence within 5 years is about 36.1%, the probability of
death from any cause about 24.6% and the probability of death from the tumor about 19%. Over 10 and
15 years the probabilities of recurrence and death increase steadily, at an approximately constant rate
for disease-free and overall survival and at a diminishing rate for tumor-specific survival.

[Figure: Kaplan-Meier curves of cumulative survival against survival time (years, 0 to 14) for disease-free survival, overall survival and tumor specific survival.]


The median survival time for DFS, OS and TS is read off the estimated survival curve.

Survival time   Median survival time in years   Number at risk
DFS             9.8                             54
OS              13.5                            21
TS              -                               -

The median estimates the time beyond which 50% of the patients in the population under study are
expected to survive. As is evident from the graphs above, the tumor-specific survival curve does not
fall below 50%, so no median can be estimated for TS. The last patient in the study to die from the
tumor does so after 20.6 years, at a survival rate of 0.547.

2.3.2 Follow-up time

To obtain an additional indicator of the completeness of follow-up, the distribution of follow-up
time was evaluated using the “reverse Kaplan-Meier” method (KM-PF) (Schemper and Smith, 1996),
which is calculated in the same way as the Kaplan-Meier estimate of the survival function, but with
the meaning of the status indicator reversed. “Thus death (δ = 1) censors the true but unknown
observation time of an individual, and censoring (δ = 0) is an endpoint. The unobservable follow-up
time of a deceased patient is interpreted as the follow-up time that potentially would have been
obtained had that patient not died.” (Schemper and Smith, p.344)
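The reversal of the status indicator can be sketched in a few lines. The following Python sketch uses a plain product-limit function and hypothetical observation times; it is meant only to illustrate the idea and is not the code used for the study.

```python
def product_limit(times, events):
    """Kaplan-Meier estimate as a list of (time, S(time)) at each distinct event time."""
    surv, curve = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        surv *= (at_risk - d) / at_risk
        curve.append((t, surv))
    return curve

def reverse_km(times, death):
    """Follow-up distribution: the same estimator with the status indicator flipped."""
    return product_limit(times, [1 - d for d in death])

def median_time(curve):
    """Smallest time at which the curve reaches 0.5 or below (None if it never does)."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical observation times in years; death = 1 means the patient died.
times = [2.0, 3.5, 5.0, 6.0, 8.0, 9.5]
death = [1, 0, 1, 0, 0, 0]
follow_up = reverse_km(times, death)
print(median_time(follow_up))
```

In this toy data set the ordinary survival curve never reaches 0.5, so `median_time(product_limit(times, death))` returns `None`, whereas the reversed estimate does yield a median follow-up time.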

Follow-up distribution

Proportion followed-up   Time in years   Number at risk
75%                      6.6             134
50%                      9.8             74
25%                      12.5            34

[Figure: Proportion followed-up (0.0 to 1.0) against time in years (0 to 20).]


3 Methods

3.1 Kaplan-Meier estimator

The Kaplan-Meier estimator estimates the survival function from the survival data. It can be used to
measure the probability of surviving for a certain amount of time after biopsy. The estimated survival
function is assumed to be constant between successive distinct observed event times. For simplicity,
the explanation is restricted to the case where the event of interest is death.

Let S(t) be the probability that an individual from a given population will have a lifetime exceeding t.
For a sample of size N from this population, let the observed times until death of the N sample
members be

$$t_1 \le t_2 \le t_3 \le \dots \le t_N$$

Let T be the random variable that measures the time of death and let F(t) be its cumulative distribution
function. Then the survival function is given by:

$$S(t) = P[T > t] = 1 - P[T \le t] = 1 - F(t)$$

The Kaplan-Meier estimator is the nonparametric maximum likelihood estimate of the survival
function S(t). It is of the form

$$\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$

where $n_i$ is the number "at risk" just prior to time $t_i$, and $d_i$ is the number of deaths at time $t_i$. With
censoring, $n_i$ is the number of survivors less the number of losses. It is only those surviving cases that
are still being observed that are "at risk" of an (observed) death.
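As a small worked illustration of the product formula, suppose five patients are observed, with deaths at 2, 5 and 8 years and censored observations at 3 and 6 years. These times are hypothetical and are not taken from the study data.

```latex
\begin{align*}
\hat{S}(2) &= \frac{5-1}{5} = 0.8 \\
\hat{S}(5) &= \frac{5-1}{5}\cdot\frac{3-1}{3} = \frac{8}{15} \approx 0.533 \\
\hat{S}(8) &= \frac{8}{15}\cdot\frac{1-1}{1} = 0
\end{align*}
```

The censored observations at 3 and 6 years contribute no factor of their own. They only reduce the number at risk at the later death times (from 4 to 3 at 5 years, and from 2 to 1 at 8 years).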


3.2 Cox Regression

Cox regression models are a subclass of survival models. They are used in this paper to analyse
censored survival data and to identify differences in survival due to prognostic factors. The basic
model assumes that the hazard function for failure time T of an individual i with covariate vector
$x_i' = (x_{1i}, x_{2i}, \dots, x_{ki}, \dots, x_{Ki})$ is

$$h(t; x_i) = h_0(t) \exp(\beta' x_i), \qquad i = 1, \dots, N$$

The covariates are assumed to be constant in time and to have independent effects on the hazard rate.
The first factor, $h_0(t)$, is a function of time and is assumed to be the same for all subjects. Its form is
not specified by the Cox model. The second factor depends on the individual covariate vector, where
β is the unknown vector of effect parameters to be estimated. Cox (1972) introduced a method based
on partial likelihoods for estimating β, and hence the hazard ratio, without involving $h_0(t)$.

Although $h_0(t)$ can take any form, the hazard ratio between two individuals can be calculated
independently of $h_0(t)$.

$$\frac{h(t; x_1)}{h(t; x_2)} = \frac{h_0(t) \exp(\beta' x_1)}{h_0(t) \exp(\beta' x_2)} = \exp[\beta'(x_1 - x_2)]$$

The formula underlines the proportional hazards assumption: the failure rates of any two individuals
are proportional, because their ratio does not depend on time. Although the risk of dying can vary
over time, the risk ratio between two individuals is constant over the whole range of follow-up.
$h_0(t)$ can be interpreted as the hazard function of a subject whose covariates all equal zero; it is
therefore often termed the baseline hazard.

A second crucial assumption follows from using the exponential function to link the independent
covariates to the hazard. It leads to a multiplicative effect of a covariate on the hazard or, on the
scale of the logarithm of the hazard function, to an additive effect in the form of a constant distance
over time. This assumption will later be relaxed by using interaction terms.
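That the baseline hazard cancels from the hazard ratio can be checked numerically. The following Python sketch uses hypothetical coefficients and an arbitrary time-varying baseline hazard, neither of which comes from the study.

```python
import math

def hazard(t, x, beta, baseline):
    """Cox model hazard h(t; x) = h0(t) * exp(beta' x)."""
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

def hazard_ratio(x1, x2, beta):
    """exp(beta' (x1 - x2)); the baseline hazard does not appear."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x1, x2)))

def baseline(t):
    """Arbitrary increasing baseline hazard, chosen only for illustration."""
    return 0.02 + 0.01 * t

# Hypothetical coefficients for two dichotomous covariates.
beta = [0.7, -0.5]
x1, x2 = [1, 0], [0, 1]
ratios = [hazard(t, x1, beta, baseline) / hazard(t, x2, beta, baseline)
          for t in (1, 5, 10)]
print(ratios, hazard_ratio(x1, x2, beta))
```

The ratio is the same at every time point, although both individual hazards change with t. This is exactly the proportional hazards property.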


4 Survival Analysis

The association of gene expression groups with survival times was assessed by estimating survival
curves with the Kaplan-Meier method (1958). The curves were compared by the log-rank test of
Mantel-Haenszel (1959), and the differences were quantified by estimating relative risks (crude
hazard ratios) from univariate Cox regression analyses (1972), which are closely related to log-rank
tests. In addition, multivariable Cox regression analyses were used to evaluate gene expressions as
independent prognostic factors for DFS, OS and TS.

Estimates of the survival curve for censored data using the Kaplan-Meier method and the predicted
survivor function for a Cox proportional hazards model were computed with the function survfit in
the R package “survival” (Therneau et al., 2008). Cox proportional hazards regression models were
fitted with the function coxph from the same package. The Efron approximation (1977) was used to
calculate the parameter estimates instead of the more common Breslow method (1974), “as it is much
more accurate when dealing with tied death times, and it is as efficient computationally” (Therneau
et al., 2008).


4.1 Kaplan-Meier Curves

Kaplan-Meier Curves are plotted for all gene expressions using disease-free survival.

[Figure: Kaplan-Meier curves of cumulative disease-free survival against survival time (years, 0 to 14) for ddr2, emp1, emp2, emp3, pmp22 and mki67, each comparing expression ≤ median with expression > median.]

Comment: Plots were cut off at 180 months, that is, 15 years. Gene expression levels above the median
are shown by a solid line; levels lower than or equal to the median are shown by a dashed line.


The Kaplan-Meier curves for disease-free survival show only small differences in survival times
between low and high gene expression levels. These differences are not statistically significant in
univariate Cox regressions with 95% confidence intervals, as shown in the next section.

Kaplan-Meier curves for disease-free survival are plotted for all clinical variables except G.

[Figure: Kaplan-Meier curves of cumulative disease-free survival against survival time (years, 0 to 14) for pN (pN0, pN1), ER (≤ median, > median), age (≤ 50 years, > 50 years) and pT (pT1, pT2, pT3).]

Comment: Plots were cut off at a time level of 180 months, which are 15 years.

Differences in survival times between the lower and higher levels of pN and AGE appear to be
considerable. One has to keep in mind, however, that ordinary K-M curves reflect an unadjusted
analysis. Multivariable Cox regression allows us to estimate the effect of a parameter adjusted for
prognostic covariates.

4.2 Proportional Hazards Cox Regressions

Crude hazard ratios (relative risks) are calculated by exponentiating the estimated coefficients of the
univariate Cox regressions. A hazard ratio estimate of 1 means that the compared groups do not differ
in terms of survival, whereas a value lower than 1, for instance, indicates a lower risk for the group
with higher than median gene expression.
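The step from an estimated coefficient to a hazard ratio with confidence limits can be sketched as follows. The coefficient and standard error are hypothetical, and the Wald-type interval exp(coef ± 1.96 · se) is an assumption about how the reported 95% limits were obtained.

```python
import math

def hazard_ratio_ci(coef, se, z=1.96):
    """Hazard ratio and Wald-type 95% confidence limits from a Cox coefficient."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

# Hypothetical coefficient and standard error for a dichotomized marker.
hr, lower, upper = hazard_ratio_ci(0.75, 0.27)
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

A confidence interval that excludes 1 corresponds to a coefficient that differs significantly from zero at the 5% level under this approximation.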

Correlation of gene expressions with breast cancer

Marker   Outcome   Crude HR   CI          adj. HR   CI
DDR2     DFS       1.1        (0.8-1.7)   1.5*      (1.0-2.4)
         OS        1.0        (0.7-1.5)   1.3       (0.8-2.1)
         TS        1.1        (0.7-1.8)   1.5       (0.9-2.6)
EMP1     DFS       1.0        (0.7-1.5)   1.1       (0.7-1.7)
         OS        1.1        (0.8-1.7)   1.2       (0.7-1.9)
         TS        1.1        (0.7-1.7)   1.3       (0.8-2.2)
EMP2     DFS       1.1        (0.7-1.5)   1.3       (0.8-2.1)
         OS        1.0        (0.6-1.4)   0.9       (0.5-1.5)
         TS        1.0        (0.6-1.6)   1.2       (0.6-2.1)
EMP3     DFS       0.9        (0.6-1.4)   1.3       (0.8-2.0)
         OS        1.1        (0.8-1.7)   1.5       (0.9-2.3)
         TS        1.1        (0.7-1.7)   1.6       (0.9-2.7)
MKI67    DFS       1.1        (0.8-1.6)   1.1       (0.7-1.6)
         OS        1.1        (0.7-1.6)   1.0       (0.6-1.5)
         TS        1.1        (0.7-1.8)   1.0       (0.6-1.7)
PMP22    DFS       1.0        (0.7-1.5)   1.8*      (1.1-3.0)
         OS        1.2        (0.8-1.8)   2.1**     (1.2-3.6)
         TS        1.3        (0.8-2.1)   3.2**     (1.7-6.0)

** p<0.01
* p<0.05
HR: Hazard Ratio
adj. HR: HR in the multivariable Cox regression, adjusted for the clinical variables and ER
CI: Confidence Intervals (95%)

A marker is only important if it adds information to the survival prediction. We therefore adjust our
analyses for established markers, which can be clinical variables as well as gene expressions. It is
known that some variables are significant only when adjusted for other effects. Indeed, multivariable
Cox regressions reveal a weakly significant, independent effect of DDR2 on disease-free survival
time, adjusting for nodal status, differentiation grade, tumor size, age and ER gene expression. More
striking is the impact of PMP22 gene expression on all three survival outcomes. Patients with higher
PMP22 expression had shorter disease-free, overall and tumor specific survival times, whereas
patients with lower PMP22 expression had better survival. The other gene expressions did not show a
significant effect on any survival time in the multivariable case. Following these results, we
concentrated our analyses on the gene expression PMP22.

Correlation of different factors with breast cancer

Disease-free survival
                        Crude HR   CI          Adj. HR¹   CI
PMP22 expression        1.0        (0.7-1.5)   1.7*       (1.1-3.0)
Nodal Status            2.7**      (1.7-4.3)   2.5**      (1.5-4.0)
Differentiation Grade   1.1        (0.8-1.4)   1.1        (0.7-1.5)
Tumor Size              1.3        (1.0-1.8)   1.3        (1.0-1.8)
Age>50                  -          -           -          -
ER expression           0.6*       (0.4-0.9)   -          -

¹ stratified by ER≤median and ER>median

Overall survival
                        Crude HR   CI          Adj. HR    CI
PMP22 expression        1.2        (0.8-1.8)   2.1**      (1.2-3.6)
Nodal Status            2.4**      (1.5-3.9)   2.2**      (1.3-3.7)
Differentiation Grade   0.8        (0.6-1.1)   0.9        (0.6-1.3)
Tumor Size              1.6**      (1.1-2.1)   1.6**      (1.1-2.3)
Age>50                  0.8        (0.6-1.3)   0.9        (0.6-1.5)
ER expression           0.6*       (0.4-1.0)   0.4**      (0.3-0.7)

Tumor specific survival
                        Crude HR   CI          Adj. HR    CI
PMP22 expression        1.3        (0.8-2.1)   3.2**      (1.7-6.0)
Nodal Status            3.7**      (2.0-6.9)   3.3**      (1.7-6.3)
Differentiation Grade   1.0        (0.7-1.5)   1.1        (0.7-1.7)
Tumor Size              1.6*       (1.1-2.2)   1.6*       (1.1-2.4)
Age>50                  0.5**      (0.3-0.8)   0.6        (0.4-1.0)
ER expression           0.5**      (0.3-0.9)   0.4**      (0.2-0.7)

** p<0.01
* p<0.05
HR: Hazard Ratio
adj. HR: HR in the multivariable Cox regression, adjusted for the clinical variables and ER
CI: Confidence Intervals (95%)


The table above shows the prognostic value of PMP22 for all three survival outcomes, together with
the histological data and age of the patients. In the univariate Cox regression the level of PMP22 is
not correlated with any survival outcome, whereas in the multivariable Cox regression patients with a
higher expression level of PMP22 had significantly poorer disease-free survival than those with a
lower expression level (p=0.025). Patients with a higher than median expression level of PMP22 had a
1.7-fold (95% confidence interval, 1.1-3.0) higher risk of relapse than those with a lower than median
expression level. Similar results were obtained for overall survival (p=0.006) and, even more
markedly, for tumor specific survival (p<0.001). Patients with a higher than median PMP22
expression level had a 3.2-fold (95% confidence interval, 1.7-6.0) higher risk of dying from the
tumor.

An even larger independent effect on breast cancer survival was confirmed for nodal status, adjusting
for PMP22, tumor size, differentiation grade and age of the patient at diagnosis. Patients with
negative nodal status tended to experience much better survival than those with nodal involvement
(DFS: p<0.001, OS: p=0.003, TS: p<0.001).

A larger tumor size was correlated with poorer overall survival (p=0.007) and tumor specific survival
(p=0.018). An adverse but not statistically significant effect of tumor size on disease-free survival
was also found.

Older patients tended to have better overall and tumor specific survival, but this finding is not
statistically significant in the adjusted case (p=0.670 and p=0.058, respectively). A higher than
median gene expression of ER is correlated with better overall survival and better tumor specific
survival. The same holds for its effect on disease-free survival in the univariate case (p=0.021). As
we show in the next section, in the multivariable (adjusted) case the effect of ER on disease-free
survival appeared to vary with time, as revealed by a correlation of the Schoenfeld residuals with
time (p=0.003). We therefore stratified the multivariable Cox regression for disease-free survival by
ER. The remaining histological variable, differentiation grade, showed no significant prognostic
value in either the univariate or the multivariable case.

Differences between crude and adjusted hazard ratios can be due to correlations between the
examined variables. In this case these differences are quite large.


Testing assumptions of the Cox regression

To determine whether a fitted Cox regression model describes the data adequately, one has to check
its fundamental assumptions: (1) proportional hazards, (2) a multiplicative effect of the covariates on
the hazard and (3) linearity of the relationship between the log hazard and the covariates. Extensions
of the model that modify these characteristics are now presented. Assumption (1) can be relaxed by
stratification, assumption (2) by using interaction terms between the covariates and assumption (3) by
including natural splines.

4.2.1 Proportional hazards assumption (PH)

The proportional hazards assumption is crucial for Cox regression. It means that the ratio between
the hazards of two patient groups remains constant over the complete follow-up period. This implies
that Cox regression analysis computes a single relative risk, which should apply to all recurrence or
death times. One way to detect violations of the proportional hazards assumption formally is to test
the significance of an interaction of a covariate with time. A different approach is to test the slope of
partial residuals, as proposed by Schoenfeld (1980, cited after Marubini/Valsecchi, 1995, p. 244).
This approach has the advantage that one does not have to specify the form of the interaction term.
By partitioning both the time axis and the space of the covariate values, mutually exclusive categories
of failure times with associated covariates are formed. The method compares the number of events
observed with the number expected under the Cox model in each of the “cells” produced by this
partition.

The function cox.zph in the R package “survival” tests the proportional hazards assumption for a
Cox regression model by using scaled Schoenfeld residuals. These are calculated according to the
method of Grambsch and Therneau (1993), because they reflect the log hazard ratio function better
than ordinary Schoenfeld residuals and are, furthermore, on the scale of the regression coefficients.
Residuals are weighted by Grambsch and Therneau's "average variance" method. In detail, each
residual is scaled by premultiplying it by the inverse of a time-dependent variance matrix to obtain
estimates of time-varying coefficients.

The scaled Schoenfeld residual follows the formula

$$r_k^* = r_k^*(\beta) = V^{-1}(\beta, t_k)\, r_k(\beta)$$

where $V(\beta, t_k)$ is the weighted covariance matrix and $r_k(\beta)$ is the ordinary Schoenfeld residual at
event time $t_k$.


The plots are produced from the output of the cox.zph function. The time-dependent coefficient
Beta(t) shows how the effect of each covariate varies with time; the p-value for Beta(t) tests whether
the slope of the partial residuals differs from zero.

Disease-free survival

[Figure: Beta(t) plotted against time for pmp22, ER, G, pN and pT.]

“The solid line is a smoothing-spline fit to the plot, with the broken lines representing a ± 2-standard-error
band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards”
(Fox, 2002, p. 13).

DFS: Correlation coefficient of Schoenfeld residuals

         rho     p-value
PMP22    0.03    0.75
ER       0.33    <0.01
G        <0.01   0.98
pN       -0.19   0.07
pT       0.04    0.74


Overall survival

[Figure: Beta(t) plotted against time for ER, age, G, pN and pT.]

OS: Correlation coefficient of Schoenfeld residuals

         rho     p-value
PMP22    0.16    0.11
ER       0.04    0.74
Age      0.02    0.87
G        0.11    0.31
pN       -0.04   0.73
pT       0.14    0.25


Tumor specific survival

[Figure: Beta(t) plotted against time for ER, age, G, pN and pT.]

TS: Correlation coefficient of Schoenfeld residuals

         rho     p-value
PMP22    0.19    0.09
ER       0.12    0.35
Age      -0.06   0.65
G        <0.01   0.98
pN       0.07    0.62
pT       0.26    0.07

The assumption of proportional hazards appears to be supported for nearly all covariates and all
survival times. Strong evidence of non-proportional hazards appears only for ER in the disease-free
survival analysis.


Accommodating non-proportional hazards by Stratification

To correct for non-proportional hazards, a stratified Cox model is used. Stratification can be used if
non-proportional hazards are detected for a variable and the variable is not of interest in itself. The
model is extended by stratifying the data into subgroups, each identified by a level of the factor, and
applying the model:

$$h_m(t; x) = h_{0m}(t) \exp(\beta' x)$$

where the suffix m indicates the stratum ($m = 1, \dots, M$). This model assumes that individuals within
the m-th stratum who have different covariates still have proportional hazards, but individuals in
different strata are permitted to experience non-proportional hazards, because each stratum has a
different baseline hazard function.

A different approach to accommodating non-proportional hazards is to build interactions between
covariates and time into the Cox regression model. Such interactions are themselves time-dependent
covariates. Stratification, however, has the advantage that one does not have to assume a particular
form of interaction between the stratifying covariates and time. A disadvantage of stratification is
“the resulting inability to examine the effects of the stratifying covariates”, therefore “stratification
is most natural […] when the effect of the stratifying variable is not of direct interest” (Fox, p.14).

DFS: Without stratification

         exp(coef)   lower CL   upper CL   p-value
PMP22    1.83        1.1        3.0        0.015
pN       2.60        1.6        4.2        <0.001
pT       1.25        0.9        1.7        0.160
G        1.04        0.7        1.4        0.830
ER       0.49        0.3        0.8        0.003

DFS: With stratification

         exp(coef)   lower CL   upper CL   p-value
PMP22    1.73        1.1        2.8        0.025
pN       2.48        1.5        4.0        <0.001
pT       1.31        1.0        1.8        0.096
G        1.05        0.7        1.5        0.800

CL: confidence limit (95%)

The analyses show that stratifying by ER changes the coefficients only slightly. The estimated hazard
ratio for PMP22 fell from 1.8 to 1.7. The time-dependent effect of ER may therefore not have been
very large.

4.2.2 Testing interaction terms

If covariates are introduced into a Cox model without an interaction term, they are assumed to act
independently and multiplicatively on the hazard. Introducing an interaction term relaxes this
assumption. Because ignoring relevant interaction terms would misspecify the model, one has to test
for them. First, all interaction terms between PMP22 and the clinical variables and ER are analysed
for all survival times.
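What an interaction term does to the hazard ratio of a marker can be sketched as follows. The coefficients are hypothetical, and coding the second covariate as a numeric level (for example the pooled tumor-size groups 1 to 3) is an assumption made for illustration.

```python
import math

def marker_hr(b_marker, b_inter, level):
    """HR of high vs low marker expression at a given level of the other covariate.

    With a product term, the log hazard ratio of the marker is
    b_marker + b_inter * level instead of the constant b_marker.
    """
    return math.exp(b_marker + b_inter * level)

# Hypothetical coefficients (not estimates from the study).
for level in (1, 2, 3):
    print(level, round(marker_hr(-0.2, 0.5, level), 2))
```

Without interaction (`b_inter = 0`) the marker has the same hazard ratio at every level of the other covariate. With a non-zero interaction coefficient its effect changes from level to level, which is why the main-effect coefficient alone can no longer be read as the effect of the marker.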

p-values of the interaction terms

               DFS    OS     TS
PMP22 x pN     0.82   0.62   0.78
PMP22 x G      0.20   0.55   0.28
PMP22 x pT     0.24   0.03   0.13
PMP22 x age    -      0.29   0.65
PMP22 x ER     0.55   0.70   0.77

Only one interaction term between PMP22 and the other variables appears to be significant: PMP22
with tumor size (pT) in the overall survival analysis (p-value = 0.03). This interaction is analyzed
further by drawing Kaplan-Meier curves of PMP22 separately for each level of pT. Keep in mind that
these K-M curves are not adjusted for the other histological variables or for ER.


[Figure: Kaplan-Meier curves of overall survival by PMP22 expression (≤ median vs. > median) in four
panels: pT=1 (upper left), pT=2 (upper right), pT=3 (lower left) and all patients (lower right).
x-axis: survival time (years), 0 to 14; y-axis: cumulative survival, 0.0 to 1.0.]

Note that once the curves are drawn separately by tumor size, differences between the survival curves
due to PMP22 become apparent, at least for pT=2 and pT=3. By contrast, in the univariate case (graph
on the lower right) there seems to be no significant difference.

Analyses are extended to the adjusted case by integrating the interaction term into the multivariable
Cox regression for overall survival.

Interaction term PMP22 x pT: OS

without interaction
           coef    exp(coef)   lower CL   upper CL   p-value
PMP22      0.75    2.13        1.24       3.63       0.01
pT         0.47    1.60        1.14       2.26       0.01
ER         -0.82   0.44        0.27       0.74       <0.01
Age        -0.11   0.90        0.55       1.46       0.67
G          -0.11   0.90        0.63       1.29       0.57
pN         0.79    2.21        1.32       3.68       <0.01


with interaction
           coef    exp(coef)   lower CL   upper CL   p-value
PMP22      -0.90   0.41        0.08       1.95       0.26
pT         -0.07   0.93        0.51       1.70       0.82
ER         -0.80   0.45        0.27       0.75       <0.01
Age        -0.21   0.81        0.50       1.33       0.41
G          -0.08   0.93        0.64       1.34       0.68
pN         0.84    2.31        1.38       3.86       <0.01
PMP22:pT   0.80    2.23        1.07       4.64       0.03

CL … confidence limit (95%)
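The columns of such a table are linked: under the assumption that the reported limits are Wald intervals, exp(coef ± 1.96·SE), and that the p-values come from the corresponding Wald test, the standard error and p-value can be recovered from the coefficient and its confidence limits. The following Python sketch illustrates this for the interaction row; the function name is hypothetical and the Wald assumption is not stated in the text.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_from_ci(coef, lower_cl, upper_cl):
    """Back out SE, z statistic and two-sided p-value from a hazard-ratio CI.

    Assumes the limits are Wald limits, i.e. exp(coef -/+ Z_95 * SE).
    """
    se = (math.log(upper_cl) - math.log(lower_cl)) / (2.0 * Z_95)
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return se, z, p


# Row "PMP22:pT" of the model with interaction (rounded values from the table).
se, z, p = wald_from_ci(coef=0.80, lower_cl=1.07, upper_cl=4.64)
print(f"SE={se:.3f}, z={z:.2f}, p={p:.3f}")
```

With the rounded table entries this gives a p-value of roughly 0.03, in line with the value reported for PMP22:pT.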

After inclusion of the interaction term PMP22:pT, the coefficient of PMP22 alone is no longer
significant. To calculate the impact of PMP22 on overall survival one now has to take the interaction
term into account as well. For instance, the effect of PMP22 at pT=1 is obtained by adding the
interaction coefficient multiplied by pT (0.80 × 1) to the coefficient of PMP22 (-0.90), which results
in -0.10. Exponentiating this result delivers the (adjusted) hazard ratio, exp(-0.10) ≈ 0.90.

Impact of PMP22 with pT on overall survival

Overall survival
          coef    HR     lower CL   upper CL   p-value
pT=1      -0.10   0.90   0.4        2.3        0.830
pT=2      0.70    2.02   1.2        3.5        0.011
pT=3      1.51    4.51   1.8        11.1       0.001

HR … Hazard Ratio
CL … confidence limit (95%)

Because the interaction coefficient between PMP22 and pT is positive, a larger tumor size leads to a
larger interaction contribution and therefore to a larger impact of PMP22 on overall survival. For
pT=1, no effect of PMP22 can be detected. For pT=2 and pT=3, the hazard ratio of PMP22 rises to 2.02
and 4.51, respectively.
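The arithmetic behind these level-specific effects can be sketched in a few lines of Python. The coefficients are the rounded values from the table with interaction; the function names are chosen here for illustration only.

```python
import math

# Rounded coefficients from the overall survival model with interaction.
BETA_PMP22 = -0.90        # coefficient of PMP22
BETA_INTERACTION = 0.80   # coefficient of PMP22:pT


def pmp22_log_hazard_ratio(pt):
    """Log hazard ratio of PMP22 at tumor size level pt."""
    return BETA_PMP22 + BETA_INTERACTION * pt


def pmp22_hazard_ratio(pt):
    """Hazard ratio of PMP22 at tumor size level pt."""
    return math.exp(pmp22_log_hazard_ratio(pt))


for pt in (1, 2, 3):
    print(f"pT={pt}: coef={pmp22_log_hazard_ratio(pt):.2f}, "
          f"HR={pmp22_hazard_ratio(pt):.2f}")
```

Because the inputs are rounded to two decimals, the results (about 0.90, 2.01 and 4.48) differ slightly from the tabulated hazard ratios for pT=2 and pT=3, which were presumably computed from the unrounded estimates.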

Confidence intervals were calculated by refitting the Cox regression with pT recentred at the level of
interest, so that the interaction term vanishes at that level. For example, to obtain the confidence
interval for the effect of PMP22 at pT=1, pT was replaced by pT - 1 in the Cox regression. The
interaction term is then zero for pT=1, so the estimated coefficient of PMP22 also encompasses the
contribution of the interaction term and represents the whole effect of PMP22 at that level. This is
shown in the following, after the method of Figueiras et al. (1998).


The interaction term is eliminated for pT=1 by defining

pT_1 = pT - 1

The Cox regression with pT_1 in place of pT delivers the impact of PMP22 at pT=1 on overall survival.

              coef   exp(coef)   lower CL   upper CL   p-value
PMP22         -0.1   0.9         0.36       2.27       0.83
pT_1          -0.1   0.9         0.51       1.70       0.82
ER            -0.8   0.4         0.27       0.75       <0.01
age>50        -0.2   0.8         0.50       1.33       0.41
G             -0.1   0.9         0.64       1.34       0.68
pN            0.8    2.3         1.38       3.86       <0.01
PMP22:pT_1    0.8    2.2         1.08       4.64       0.03

CL … confidence limit (95%)
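Why recentring pT yields the whole effect of PMP22 can be seen from the linear predictor. The short derivation below uses notation that is assumed here and does not appear in the text: P stands for PMP22, T for pT, c for the level of interest, and β1, β2, β3 for the coefficients of PMP22, pT and PMP22:pT. The remaining covariates are omitted because recentring does not affect them.

```latex
\begin{align*}
\beta_1 P + \beta_2 T + \beta_3 P T
  &= \beta_1 P + \beta_2 (T - c) + \beta_3 P (T - c) + \beta_3 c P + \beta_2 c \\
  &= (\beta_1 + \beta_3 c)\, P + \beta_2 (T - c) + \beta_3 P (T - c) + \beta_2 c
\end{align*}
```

The constant β2·c is absorbed into the baseline hazard, so the coefficient of PMP22 in the recentred model equals β1 + β3·c, and its standard error directly gives the confidence interval at pT = c. With the rounded estimates and c = 1 this is -0.90 + 0.80 = -0.10, which matches the coefficient of PMP22 in the table above.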

Here the effect of PMP22 is clearly not significant. However, evaluating the other levels of pT in the
same way delivered a significant effect of PMP22 for pT=2 (p-value = 0.011, 95% CL 1.2-3.5) and for
pT=3 (p-value = 0.001, 95% CL 1.8-11.1). For the latter there could also be a positive correlation with
time on account of significant Schoenfeld residuals (p-value < 0.001). This would mean that the impact
of PMP22 for pT=3 is larger later in the observation period. The survival curves differ for pT=2 and
pT=3, and a positive correlation with time can be seen for pT=3.

[Figure: Beta(t) for PMP22:pT_3 plotted against time. x-axis: Time, 18 to 170; y-axis: -15 to 10.]

TS            Correlation coefficient   Schoenfeld residual
              rho                       p-value
PMP22         0.37                      <0.001
ER            0.05                      0.624
Age           -0.05                     0.673
G             0.12                      0.252
pN            <0.01                     1.000
pT            -0.26                     0.019
PMP22:pT_3    0.37                      0.001


Analyzing confounding of the effect of PMP22 by pN


Due to the high correlation between PMP22 and pN (see Chapter 2.2), Kaplan-Meier curves are
computed to analyze the confounding between PMP22 and pN.

[Figure: Kaplan-Meier curves by PMP22 expression in six panels. Rows: disease-free survival, overall
survival, tumor specific survival. Left column: four curves by PMP22 and nodal status (≤ median, pN0;
> median, pN0; ≤ median, pN1; > median, pN1). Right column: two curves by PMP22 only (≤ median;
> median). x-axis: survival time (years), 0 to 14; y-axis: cumulative survival, 0.0 to 1.0.]


The graphical presentation of the confounding between PMP22 and pN delivers results comparable with
those of the estimated Cox regressions. Looking at the 3 plots on the right side, which illustrate the
crude effect of PMP22 on the 3 outcomes, one can see that survival time does not seem to be correlated
with the level of PMP22, which is in line with the non-significant hazard ratios from the univariate
Cox regressions. Looking at the left side, we see how accounting for pN brings out differences in
survival times. Patients with negative nodal status again tend to experience better survival than those
with nodal involvement; however, PMP22 seems to explain additional differences in survival times. A
higher than median expression of PMP22 has a clearly negative effect on survival times, both for pN=0
and for pN=1. In the multivariable Cox regression we adjusted the effect of PMP22 not only for pN, but
also for differentiation grade, tumor size, age and ER expression. There, too, we obtained a significant
hazard ratio for all 3 survival time outcomes.

4.2.3 Testing the linearity assumption

“Nonlinearity – that is, an incorrectly specified functional form in the parametric part of the model – is
a potential problem in Cox regression as it is in linear and generalized linear models.” (Fox, 2002, p.
15) When the linear dependence of the log-hazard on the covariate is not believed to hold through its
entire range, one may extend the predictor to include a squared term to detect a possible departure
from the linear relationship. A more sophisticated approach to this problem consists in using a spline
function to model the relationship between log-hazard and predictors (Harrell et al., 1988; Durrleman
and Simon, 1989) (cit. after Marubini/Valsecchi, p. 195).

Using the function rcs from the R package “Design” (Harrell, 2008), a restricted cubic spline function
with linear tails (natural spline) for PMP22 is integrated into the model.
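To make the terms PMP22, PMP22' and PMP22'' in the following table more concrete, here is a minimal Python sketch of how a restricted cubic spline basis can be constructed. It is an illustration only, not the code used for the analyses: the function name and the knot positions are hypothetical, and the truncated-power form with normalisation by the squared knot range is assumed to correspond to what rcs computes. With k knots the basis has one linear and k-2 nonlinear terms, so four knots give three terms.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis: [x, x', x'', ...] (k-1 terms for k knots).

    The nonlinear terms are zero below the first knot and linear above the last.
    """
    t = sorted(knots)
    if len(t) < 3:
        raise ValueError("at least three knots are required")
    scale = (t[-1] - t[0]) ** 2   # puts nonlinear terms on the scale of x
    gap = t[-1] - t[-2]           # distance between the last two knots

    def cube_plus(u):
        """Truncated cubic: u**3 for u > 0, else 0."""
        return u ** 3 if u > 0 else 0.0

    terms = [x]
    for j in range(len(t) - 2):
        term = (
            cube_plus(x - t[j])
            - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
            + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap
        )
        terms.append(term / scale)
    return terms


# Hypothetical knot positions on the PMP22 expression scale (not from the data).
KNOTS = (-1.2, -0.6, -0.2, 0.3)

for value in (-1.5, -0.5, 0.0, 0.5):
    linear, nonlinear1, nonlinear2 = rcs_basis(value, KNOTS)
    print(f"x={value:+.1f}  PMP22={linear:+.3f}  "
          f"PMP22'={nonlinear1:.4f}  PMP22''={nonlinear2:.4f}")
```

If the coefficients of the nonlinear terms are zero, the model reduces to the ordinary linear effect of PMP22, which is what makes a joint test of these coefficients a test of linearity.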

           DFS                 OS                  TS
           coef    p-value     coef    p-value     coef    p-value
PMP22      0.98    0.319       1.04    0.371       0.90    0.515
PMP22'     -0.97   0.717       -0.15   0.961       1.23    0.734
PMP22''    0.80    0.938       -3.69   0.752       -8.33   0.545
ER         -0.07   0.368       -0.17   0.094       -0.23   0.081
Age        -       -           0.01    0.176       -0.01   0.187
G          -0.02   0.912       -0.17   0.336       -0.02   0.914
pN         0.94    <0.001      0.76    0.004       1.23    <0.001
pT         0.12    0.449       0.34    0.050       0.29    0.137


[Figure: log relative hazard from the spline models in three panels: disease-free survival time (DFS),
overall survival time (OS) and tumor specific survival time (TS). y-axis: log Relative Hazard;
x-axis range: -1.5 to 0.5.]

For each survival time one can test whether the model with the cubic spline function delivers a
significantly higher log-likelihood (LL) than the general model. The joint contribution of the cubic
spline coefficients to the likelihood is evaluated by applying the likelihood ratio (LR) test, which
gives the statistic:

$$Q_{LR} = -2\,(LL_{general} - LL_{cubic})$$

The statistic Q_LR is asymptotically distributed as chi-square with two degrees of freedom.

        LL_gen     LL_cubic   -2*(LL_gen-LL_cubic)   p-value
DFS     -431.25    -429.81    2.90                   0.235
OS      -371.25    -368.90    4.69                   0.096
TS      -281.41    -278.62    5.57                   0.062

The analyses show that, for all survival times, the assumption of a linear effect of PMP22 gene
expression cannot be rejected at a significance level of 5% when cubic splines are used as the
alternative.
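The likelihood ratio test can be reproduced from the reported log-likelihoods. The Python sketch below relies on the fact that, for exactly two degrees of freedom, the upper tail probability of the chi-square distribution is exp(-Q/2); the function and variable names are chosen here for illustration.

```python
import math


def likelihood_ratio_test(ll_general, ll_cubic):
    """Return (Q_LR, p-value) for a chi-square reference with 2 df."""
    q = -2.0 * (ll_general - ll_cubic)
    p = math.exp(-q / 2.0)  # chi-square upper tail; valid for 2 df only
    return q, p


# Log-likelihoods (general model, spline model) as reported in the table.
LOG_LIKELIHOODS = {
    "DFS": (-431.25, -429.81),
    "OS": (-371.25, -368.90),
    "TS": (-281.41, -278.62),
}

for outcome, (ll_gen, ll_cub) in LOG_LIKELIHOODS.items():
    q, p = likelihood_ratio_test(ll_gen, ll_cub)
    print(f"{outcome}: Q_LR={q:.2f}, p={p:.3f}")
```

Since the log-likelihoods are rounded to two decimals, the statistics obtained this way (about 2.88, 4.70 and 5.58) deviate from the tabulated ones in the second decimal; the p-values agree with the table to within rounding and all stay above 0.05.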


Conclusion

Six candidate markers were evaluated for their ability to predict survival of breast cancer patients. For
simplicity, the expression values of these markers were categorized (above median, below
median). The markers did not prove satisfactory in univariate analyses. However, in multivariable Cox
regressions, statistically significant associations were found between gene expression of PMP22 and
all of the analyzed survival times, which were disease-free survival, overall survival and tumor
specific survival. Further analyses showed that the significant effect of PMP22 in multivariable Cox
regressions seemed to be due to confounding by pN and, at least in the overall survival case, by pT.

5 Literature

Breslow, NE. (1974) Covariance analysis of censored survival data. Biometrics, 30: 89-99.

Cox, DR (1972) Regression models and life tables. J R Stat Soc B 34: 187-220.

Efron, B. (1977) The efficiency of Cox's likelihood function for censored data. J. Amer. Statist. Assoc.
72: 557-565.

Figueiras A, Domenech-Massons J M, Cadarso C (1998) Regression models: calculating the
confidence interval of effects in the presence of interactions. Statistics in Medicine, 17:
2099-2105.

Fox, J. (2002) Cox Proportional-Hazards Regression for Survival Data.
<http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-cox-regression.pdf>

Harrell, Frank E Jr (2008) Design: Design Package. R package version 2.1-2.
<http://biostat.mc.vanderbilt.edu/s/Design>, <http://biostat.mc.vanderbilt.edu/rms>

Kaplan EL and Meier P (1958) Nonparametric estimation for incomplete observations. J Am Stat
Assoc 53: 457-481.

Lagakos S. W. (1979) General right censoring and its impact on the analysis of survival data.
Biometrics, 35: 139-156.

Lam, P (2007) coxph: Cox Proportional Hazards Regression for Duration Dependent Variables, in
Kosuke Imai, Gary King and Olivia Lau, “Zelig: Everyone’s Statistical Software”
<http://gking.harvard.edu/zelig>

Mantel, N. and Haenszel, W. (1959) Statistical Aspects of the Analysis of Data from Retrospective
Studies of Disease. Journal of the National Cancer Institute, 22: 719-748.

Marubini E, Valsecchi MG (1995) Analysing survival data from clinical trials and observational
studies. Wiley.

R Development Core Team (2008) R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna. <http://www.R-project.org>

SAS Institute Inc. (2008) SAS for Windows, Version 9.2 SAS Institute Inc., Cary, NC, USA.

Schemper M, Smith TL (1996) A note on quantifying follow-up in studies of failure time. Control Clin
Trials 17: 343-346.

Therneau, T. M. (1999) A Package for Survival Analysis in S. Technical Report, Mayo Foundation.
<http://www.mayo.edu/hsr/people/therneau/survival.ps>

Therneau T M, Grambsch P M (2000) Modeling Survival Data: Extending the Cox Model, Springer.

Therneau T, ported to R by Lumley T (2008) survival: Survival analysis, including penalised likelihood.
R package version 2.34-1.
