
Case-Control Studies: Statistical Analysis

Greg Stoddard

December 16, 2010


University of Utah School of Medicine
Rothman claims, “Properly carried out, case-
control studies provide information that
mirrors what could be learned from a
cohort study, usually at considerably less
cost and time.”

[Rothman KJ, Epidemiology: An Introduction, 2002, p.73]

Goal: contrast the statistical approaches of the two study designs to verify Rothman’s claim.
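Diagrammatically,

Cohort study (start from exposure, follow to disease):
    E      ->  D or not-D
    not-E  ->  D or not-D

Case-control study (start from disease, look back to exposure):
    D      ->  E or not-E
    not-D  ->  E or not-E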
Data Layout,

Cohort Study
            E        Not-E      Total
D           a        b          n_D
Not-D       c        d          n_not-D
Total       N_E      N_not-E

Case-Control Study
            E        Not-E      Total
D           a        b          N_D
Not-D       c        d          N_not-D
Total       n_E      n_not-E

N = fixed, n = free to vary


Cohort Study
            E        Not-E      Total
D           a        b          n_D
Not-D       c        d          n_not-D
Total       N_E      N_not-E

incidence proportion = disease cases / persons at risk

Case-Control Study
            E        Not-E      Total
D           a        b          N_D
Not-D       c        d          N_not-D
Total       n_E      n_not-E

incidence proportion = (not estimable)
The incidence proportion not being
estimable is not much of a shortcoming.

Given a study’s inclusion/exclusion criteria, the incidence proportion does not actually apply to a very wide patient population, anyway.
The goal is not to estimate incidence, but
rather to assess an exposure-disease
association.

We can do that just fine with relative measures of effect, the risk ratio and odds ratio.
Cohort Study
            E        Not-E      Total
D           a        b          n_D
Not-D       c        d          n_not-D
Total       N_E      N_not-E

risk ratio = (a/N_E)/(b/N_not-E)
odds ratio = odds(D|E)/odds(D|not-E) = (a/c)/(b/d) = (ad)/(bc)

Case-Control Study
            E        Not-E      Total
D           a        b          N_D
Not-D       c        d          N_not-D
Total       n_E      n_not-E

exposure odds ratio = odds(E|D)/odds(E|not-D) = (a/b)/(c/d) = (ad)/(bc)

So, as long as either E or D is free to vary, you get the same
relative effect measure, the odds ratio, with both study designs.
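
As a quick check of that identity, here is a minimal Python sketch that computes the odds ratio both ways from one 2 x 2 table. The cell counts are hypothetical (they are not from any dataset in these slides), and the function names are only illustrative.

```python
def disease_odds_ratio(a, b, c, d):
    """Cohort view: odds(D|E) / odds(D|not-E) = (a/c) / (b/d)."""
    return (a / c) / (b / d)


def exposure_odds_ratio(a, b, c, d):
    """Case-control view: odds(E|D) / odds(E|not-D) = (a/b) / (c/d)."""
    return (a / b) / (c / d)


# Hypothetical cell counts: a, b = diseased (E, not-E); c, d = not diseased (E, not-E)
a, b, c, d = 30, 20, 70, 80

or_cohort = disease_odds_ratio(a, b, c, d)
or_case_control = exposure_odds_ratio(a, b, c, d)
cross_product = (a * d) / (b * c)

print(round(or_cohort, 3), round(or_case_control, 3), round(cross_product, 3))
```

All three quantities reduce to the cross-product ratio (ad)/(bc), which is why it does not matter which margin of the table was fixed by the design.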
Cohort Study
            E        Not-E      Total
D           a        b          n_D
Not-D       c        d          n_not-D
Total       N_E      N_not-E

$$\mathrm{RR} = \frac{a/N_E}{b/N_{\text{not-}E}} = \frac{a/(a+c)}{b/(b+d)}$$

If the disease is rare (<10% in both the E and Not-E groups), then a is small relative to c and b is small relative to d, so a + c ≈ c and b + d ≈ d.
Substituting,

$$\mathrm{RR} = \frac{a/(a+c)}{b/(b+d)} \approx \frac{a/c}{b/d} = \frac{ad}{bc} = \mathrm{OR}$$

So, the OR from a case-control study approximates the RR from a cohort study when the rare disease assumption is met.
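
To see how the approximation degrades as the disease becomes more common, the following sketch holds a hypothetical risk ratio fixed at 2.0 and computes the corresponding odds ratio at several baseline risks. The risk ratio and the baseline risks are illustrative assumptions, not values from the slides.

```python
def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Odds ratio implied by two risks (incidence proportions)."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


true_rr = 2.0                               # hypothetical risk ratio
baseline_risks = [0.01, 0.05, 0.10, 0.30]   # hypothetical risks in the Not-E group

odds_ratios = {}
for p0 in baseline_risks:
    p1 = true_rr * p0                       # risk in the E group
    odds_ratios[p0] = odds_ratio_from_risks(p1, p0)
    print(f"baseline risk {p0:.2f}: RR = {true_rr:.2f}, OR = {odds_ratios[p0]:.2f}")
```

With a baseline risk of 0.01 the OR is about 2.02, close to the RR; at 0.30 the OR is 3.5, which is far from it.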
Why the 10%, or 0.10, incidence proportion
is a good cutpoint for “rare disease” is
illustrated nicely in a figure published in:

Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280(19):1690-91.
Aside:
The formula in Zhang and Yu (1998) for converting an odds ratio to a risk ratio in cohort studies has been convincingly criticized as unreliable (Zou, 2004), so you should avoid using it.
[Zou G. A modified Poisson regression approach to
prospective studies with binary data. Am J Epidemiol
2004;159(7):702-706.]
Checking our progress

How far have we gotten in verifying that a case-control study can mirror what can be learned in a cohort study?

We have seen that the OR is the same in both study designs.

We have seen that the OR approximates the RR under the rare disease assumption, and so it has a straightforward interpretation.

However, cohort studies rarely use the odds ratio, nor do they use the risk ratio.

Instead, cohort studies use survival analysis.

Why?
Risk Ratio Analysis

This type of analysis ignores time-at-risk. That is, it assumes an equal follow-up time for every study subject.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0
Total       90     25                      110    20

The risk ratio uses only partial information from the complete data in the life table: the day-1 beginning N (50 and 50) and the total disease cases (25 and 20).
Risk ratio analysis data

               Exposed     Not-Exposed
Disease        25 (50%)    20 (40%)
Not-Disease    25          30
N              50          50

Risk Ratio = (25/50)/(20/50) = 1.25

Chi-square test, p = 0.31

Analyzing these data in this way, we do not demonstrate a significant effect. In fact, this crude RR underestimates each of the day-specific RR estimates.
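
The crude risk ratio and its chi-square test can be reproduced with a few lines of Python. This is a sketch that assumes the reported p-value is from a Pearson chi-square test without continuity correction; the slide does not say which variant was used.

```python
import math

# 2 x 2 table from the risk ratio analysis
exposed_cases, exposed_n = 25, 50
unexposed_cases, unexposed_n = 20, 50

risk_ratio = (exposed_cases / exposed_n) / (unexposed_cases / unexposed_n)

# Pearson chi-square for a 2 x 2 table (assumed: no continuity correction)
a, b = exposed_cases, unexposed_cases
c, d = exposed_n - a, unexposed_n - b
n = a + b + c + d
chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Upper-tail probability of a chi-square with 1 df: erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"RR = {risk_ratio:.2f}, chi-square = {chi_square:.2f}, p = {p_value:.2f}")
```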
Rate Ratio Analysis

Let’s see if we can do better with a rate ratio analysis. It uses a person-time denominator, so in that sense, it relaxes the equal time-at-risk assumption of the risk ratio analysis.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0
Total       90     25                      110    20

The rate ratio uses only partial information from the complete data in the life table: the total person-days (90 and 110) and the total disease cases (25 and 20).
Rate ratio analysis data

               Exposed     Not-Exposed
Disease        25 (50%)    20 (40%)
Person-Days    90          110

Rate Ratio = (25/90)/(20/110) = 1.53

Binomial probability mid-p exact test for person-time data, p = 0.080

Analyzing these data in this way, we almost demonstrate a significant effect. Again, this crude rate ratio underestimates each of the day-specific risk ratio (rate ratio) estimates.
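
The person-day denominators and the crude rate ratio follow directly from the life table. The sketch below assumes that every subject at risk at the beginning of a day contributes one person-day, which reproduces the totals of 90 and 110 shown above.

```python
# Life table columns, one entry per follow-up day
begin_n_exposed = [50, 30, 10]
cases_exposed = [5, 10, 10]
begin_n_unexposed = [50, 40, 20]
cases_unexposed = [2, 8, 10]

# Assumption: each subject at risk at the start of a day contributes one person-day
person_days_exposed = sum(begin_n_exposed)
person_days_unexposed = sum(begin_n_unexposed)

rate_exposed = sum(cases_exposed) / person_days_exposed
rate_unexposed = sum(cases_unexposed) / person_days_unexposed
rate_ratio = rate_exposed / rate_unexposed

print(person_days_exposed, person_days_unexposed, round(rate_ratio, 2))
```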
Inefficient Use of Time in Rate Ratio Analysis

The rate ratio analysis failed to convey the information in the life table because it considers only the ratio of cases to total person-time (mean time x N), without distinguishing times to event from times to censoring.
person-time = total time for subjects
= mean time x N
Suppose the individual times-at-risk for a
sample are: 10, 20, and 30. The person-
time is computed as:
PT = total time for subjects
= 10+20+30 = 60
which is equivalent to:
PT = mean time x N
= (10+20+30)/3 x 3 = 20 x 3 = 60
So, a rate ratio analysis would find the following two scenarios equal, even though Group B outperforms Group A (let x----x denote time):

Group A
x-------------------------------------x (censored)
x-----x (died)
x--------x (died)
x--------------------------------------------x (censored)

Group B
x-------------------------------------x (died)
x-----x (censored)
x--------x (censored)
x--------------------------------------------x (died)
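
The same point can be made numerically. In this sketch the follow-up times (37, 5, 8, and 44) are hypothetical lengths chosen only to resemble the bars above; the crude rate is identical in the two groups even though the deaths in Group B occur much later.

```python
# (follow-up time, died?) per subject; the times are hypothetical
group_a = [(37, False), (5, True), (8, True), (44, False)]
group_b = [(37, True), (5, False), (8, False), (44, True)]


def crude_rate(subjects):
    """Deaths divided by total person-time, ignoring when the deaths happened."""
    deaths = sum(1 for _, died in subjects if died)
    person_time = sum(time for time, _ in subjects)
    return deaths / person_time


def mean_time_to_death(subjects):
    """Average follow-up time among the subjects who died."""
    death_times = [time for time, died in subjects if died]
    return sum(death_times) / len(death_times)


rate_a, rate_b = crude_rate(group_a), crude_rate(group_b)
print(f"rate A = {rate_a:.4f}, rate B = {rate_b:.4f}")
print(f"mean time to death: A = {mean_time_to_death(group_a)}, B = {mean_time_to_death(group_b)}")
```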
Hazard Ratio Analysis (Survival Analysis)

This analysis uses time-at-risk in a very complete way, using all of the information in the life table.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0
Total       90     25                      110    20

From Cox regression, HR = 1.92, p = 0.032

The HR is identically the Mantel-Haenszel summary risk ratio.
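
The Mantel-Haenszel summary risk ratio can be computed by hand from the life table, treating each follow-up day as a stratum. This sketch uses the standard Mantel-Haenszel risk ratio formula; it does not fit a Cox model, so it only checks the summary risk ratio side of the statement.

```python
# One stratum per follow-up day:
# (day, begin N exposed, cases exposed, begin N unexposed, cases unexposed)
life_table = [
    (1, 50, 5, 50, 2),
    (2, 30, 10, 40, 8),
    (3, 10, 10, 20, 10),
]

numerator = 0.0
denominator = 0.0
for day, n_exposed, cases_exposed, n_unexposed, cases_unexposed in life_table:
    stratum_total = n_exposed + n_unexposed
    numerator += cases_exposed * n_unexposed / stratum_total
    denominator += cases_unexposed * n_exposed / stratum_total

mh_risk_ratio = numerator / denominator
print(f"Mantel-Haenszel summary risk ratio = {mh_risk_ratio:.2f}")
```

The result rounds to 1.92, matching the hazard ratio reported from the Cox regression.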
Aside

Showing a life table like this, and pointing out that the HR is just the weighted average of the day-specific risk ratios and so is a relative risk estimate, is a very clear way to explain the HR to a researcher.
Checking our progress

Recall, we are trying to verify that a case-control study can mirror what can be learned in a cohort study.

It appears, then, that we need to incorporate survival analysis into the case-control framework in order to keep up with what a cohort study can do.
It turns out we can use survival analysis in the case-control framework if we tweak the study design slightly.

The slight variant is called the case-cohort design (also called the density case-control design).
While presenting this design, I am going to
show some simulation results. In this way,
I can demonstrate that the case-cohort
design really does perform as well as a
cohort study design.
Dataset

The dataset comes from Breslow and Day:

[Breslow NE, Day NE. (1987). Statistical Methods in Cancer Research, Vol II: The Design and Analysis of Cohort Studies. Lyon, France: IARC.]

Men (n=679) employed in a nickel refinery in South Wales were investigated to determine whether the risk of developing carcinoma of the bronchi and nasal sinuses (ICD = 160), which studies in the 1930s had associated with the refining of nickel, was present in this cohort.
Modified Dataset

I also modified the dataset by duplicating the cases five times, to create a second dataset that does not meet the rare disease assumption.

Treating each dataset as the “population” and analyzing it, we know the answer that a case-control design sampling from this cohort is supposed to recover.
The population relative measures are:

Population Relative    Actual Dataset with        Augmented Dataset with
Effect Measure         almost rare disease        frequent disease
                       (3% in unexposed,          (15% in unexposed,
                       12% in exposed)            40% in exposed)
Odds Ratio             3.76                       3.76
Risk Ratio             3.43                       2.65
Rate Ratio             4.76                       3.87
Hazard Ratio           5.02                       4.19
Classical Case-Control Study (controls are sampled from the population controls only)

Using a 2:1 sampling ratio

            Exposed      Not exposed   Total
            to nickel    to nickel
Tumor       46           10            56     use all 56 cases
No Tumor    343          280           623    sample 56 x 2 controls
Total       389          290           679
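
The population odds ratio and risk ratio can be reproduced from this 2 x 2 table. The sketch below assumes that "duplicating the cases five times" means multiplying each tumor cell by 5 while leaving the no-tumor cells unchanged; under that assumption it matches the population values reported for both datasets. The rate ratio and hazard ratio need follow-up times and are not attempted here.

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio (ad)/(bc)."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Risk in exposed, a/(a+c), over risk in unexposed, b/(b+d)."""
    return (a / (a + c)) / (b / (b + d))


# a, b = tumor (exposed, not exposed); c, d = no tumor (exposed, not exposed)
actual = (46, 10, 343, 280)
# Assumption: the augmented dataset has five times as many cases in each tumor cell
augmented = (46 * 5, 10 * 5, 343, 280)

for label, table in [("actual", actual), ("augmented", augmented)]:
    print(f"{label}: OR = {odds_ratio(*table):.2f}, RR = {risk_ratio(*table):.2f}")
```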
Monte Carlo simulation: compute the OR from 1,000 samples to get the long-run average of the OR.

(Each sample keeps all 56 subjects from the tumor row of the population 2 x 2 table, and then randomly samples 112 subjects from the no-tumor row of the population 2 x 2 table.)
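
A minimal Python sketch of this simulation for the actual dataset is shown below. It is not the code used to produce the slides; the seed is arbitrary, and sampling is done without replacement from the 623 no-tumor subjects.

```python
import random

random.seed(2010)  # arbitrary seed so the sketch is repeatable

CASES_EXPOSED, CASES_UNEXPOSED = 46, 10          # tumor row (all cases are kept)
NONCASES_EXPOSED, NONCASES_UNEXPOSED = 343, 280  # no-tumor row
N_CONTROLS = 2 * (CASES_EXPOSED + CASES_UNEXPOSED)  # 2:1 sampling ratio
N_SAMPLES = 1000

# One entry per no-tumor subject: True = exposed to nickel
noncases = [True] * NONCASES_EXPOSED + [False] * NONCASES_UNEXPOSED

odds_ratios = []
for _ in range(N_SAMPLES):
    controls = random.sample(noncases, N_CONTROLS)  # sample without replacement
    controls_exposed = sum(controls)
    controls_unexposed = N_CONTROLS - controls_exposed
    odds_ratios.append(
        (CASES_EXPOSED * controls_unexposed) / (CASES_UNEXPOSED * controls_exposed)
    )

mean_or = sum(odds_ratios) / N_SAMPLES
print(f"mean OR over {N_SAMPLES} samples: {mean_or:.2f}")
```

The long-run average should land near the population odds ratio of 3.76, slightly above it because the mean of a ratio is biased upward.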
The simulation results are:
Classical case-control design (sample controls from no-tumor subjects only)

Population Relative    Actual Dataset with        Augmented Dataset with
Effect Measure         almost rare disease        frequent disease
                       (3% in unexposed,          (15% in unexposed,
                       12% in exposed)            40% in exposed)
Odds Ratio             3.76 (OR=3.81)             3.76 (OR=3.77)
Risk Ratio             3.43                       2.65
Rate Ratio             4.76                       3.87
Hazard Ratio           5.02                       4.19
Case-Cohort Study Design
- In this design, we keep the cases. Then,
we sample our controls from the total
row of the population 2 x 2 table.
- For those cases that get mixed in with
the controls, we set their status variable to
0, the control value.
- We then calculate the OR in the usual
way.
Case-Cohort Study (controls are sampled from the population row totals, which include both cases and controls)

Using a 2:1 sampling ratio

            Exposed      Not exposed   Total
            to nickel    to nickel
Tumor       46 (a)       10 (b)        56     use all 56 cases
No Tumor    343          280           623
Total       389 (c)      290 (d)       679    sample 56 x 2 controls

The odds ratio is then a direct calculation of the risk ratio.
OR = (a x kd)/(b x kc) = (kad)/(kbc) = (ad)/(bc), where k = (56 x 2)/679 is the fraction of the total row that is sampled
RR = (a/c)/(b/d) = (ad)/(bc) = OR
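
The same kind of Monte Carlo check can be sketched for the case-cohort design on the actual dataset. Here the 112 controls are drawn without replacement from the whole cohort of 679, so any sampled case simply counts as a control. The seed is arbitrary, and this is an illustration rather than the code behind the slides.

```python
import random

random.seed(2010)  # arbitrary seed so the sketch is repeatable

CASES_EXPOSED, CASES_UNEXPOSED = 46, 10      # tumor row (all cases are kept)
COHORT_EXPOSED, COHORT_UNEXPOSED = 389, 290  # total row (cases and non-cases)
N_CONTROLS = 2 * (CASES_EXPOSED + CASES_UNEXPOSED)  # 2:1 sampling ratio
N_SAMPLES = 1000

# One entry per cohort member: True = exposed to nickel
cohort = [True] * COHORT_EXPOSED + [False] * COHORT_UNEXPOSED

odds_ratios = []
for _ in range(N_SAMPLES):
    controls = random.sample(cohort, N_CONTROLS)  # sampled from the total row
    controls_exposed = sum(controls)
    controls_unexposed = N_CONTROLS - controls_exposed
    odds_ratios.append(
        (CASES_EXPOSED * controls_unexposed) / (CASES_UNEXPOSED * controls_exposed)
    )

mean_or = sum(odds_ratios) / N_SAMPLES
population_rr = (46 / 389) / (10 / 290)
print(f"mean OR = {mean_or:.2f}, population RR = {population_rr:.2f}")
```

The long-run average of the OR should fall close to the population risk ratio of 3.43 rather than the population odds ratio of 3.76.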
The simulation results are:
Case-cohort design (sample controls from total row of population 2 x 2 table)

Population Relative    Actual Dataset with        Augmented Dataset with
Effect Measure         almost rare disease        frequent disease
                       (3% in unexposed,          (15% in unexposed,
                       12% in exposed)            40% in exposed)
Odds Ratio             3.76                       3.76
Risk Ratio             3.43 (OR=3.48)             2.65 (OR=2.67)
Rate Ratio             4.76                       3.87
Hazard Ratio           5.02                       4.19
Case-Cohort Study Design

For the case-cohort design, the rare-disease assumption is not required for the OR to be an estimate of RR (Rothman and Greenland, 1998, p.110). We have demonstrated that to be the case.

[Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA.]
Case-Cohort Study Design
It is nice to be able to use the OR to directly
estimate RR, and not worry about the rare
disease assumption at all.
It comes with a price, however. Since your controls are now “messy”, with cases mixed in, you do not have as clear a signal for the effect, so statistical power is reduced. You need to sample additional controls to make up the difference (to get back to the power of the classical case-control study).
Case-Cohort Study Design With Risk Set Sampling

In this design, you again keep all of the cases.

You then, again, sample controls from the total row of the population 2 x 2 table (that is, from cases and controls). This time, however, you sample only from total-row subjects that have the same or longer time-at-risk. This is called risk set sampling.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0

In this design, we also use a type of “total row” sampling. That is, we select our controls from the “Beginning N” columns of the life table.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0

For the 5+2 cases that occurred on day 1, we sample our controls from the 50+50 persons still at risk on day 1.
            Exposed                        Non-Exposed
Follow-up   Begin  Disease  Day-Specific   Begin  Disease  Day-Specific   Day-Specific
day         N      Cases    Risk           N      Cases    Risk           Risk Ratio
1           50     5        0.10           50     2        0.04           2.5
2           30     10       0.33           40     8        0.20           1.7
3           10     10       1.00           20     10       0.50           2.0

For the 10+8 cases that occurred on day 2, we sample our controls from the 30+40 persons still at risk on day 2, and so on.
We do this by forming risk sets. For every case, we form a risk set that includes all subjects with an equal or longer follow-up time. Then, if we are using a 2:1 sampling ratio, we sample 2 controls from that risk set and match them with that case.

This is identical to sampling on the correct row from the Beginning N column, like we did above.
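
The risk set construction can be sketched in a few lines of Python. The toy cohort below is hypothetical (eight made-up subjects, not the nickel data), and the sketch only forms the matched sets; it does not fit any model. As in the design described above, other cases are eligible to be drawn as controls.

```python
import random

random.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical toy cohort: (subject id, follow-up time in days, is_case)
cohort = [
    (1, 1, True), (2, 1, False), (3, 2, True), (4, 2, False),
    (5, 2, True), (6, 3, False), (7, 3, True), (8, 3, False),
]
CONTROLS_PER_CASE = 2  # 2:1 sampling ratio

matched_sets = []
for case_id, case_time, is_case in cohort:
    if not is_case:
        continue
    # Risk set: every other subject with an equal or longer follow-up time
    risk_set = [
        (subject_id, time)
        for subject_id, time, _ in cohort
        if time >= case_time and subject_id != case_id
    ]
    controls = random.sample(risk_set, CONTROLS_PER_CASE)
    matched_sets.append({"case": case_id, "time": case_time, "controls": controls})

for matched in matched_sets:
    print(matched)
```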
We have already seen that the OR from a case-cohort study design directly estimates the RR.

We are now doing a version of the case-cohort approach for each row of the life table.

We know that the HR is just the summary RR across the rows of the life table.

If we use conditional logistic regression, then, to account for the row-specific matching, it would seem the OR should directly estimate the HR.
Let’s see if that is true.

This time in the simulation, we will take the OR from the conditional logistic regression, rather than calculate it from a 2 x 2 table like we did for the previous simulations.

The mean of the 1,000 conditional logistic regression ORs will be our estimate of the HR.
The simulation results are:
Case-cohort design with risk set sampling.

Population Relative    Actual Dataset with        Augmented Dataset with
Effect Measure         almost rare disease        frequent disease
                       (3% in unexposed,          (15% in unexposed,
                       12% in exposed)            40% in exposed)
Odds Ratio             3.76                       3.76
Risk Ratio             3.43                       2.65
Rate Ratio             4.76                       3.87
Hazard Ratio           5.02 (OR=5.42)             4.19 (OR=4.43)

We were close, but the estimates appear to be biased.

In practice, the analysis is done with risk set sampling followed by an actual Cox regression.

To adjust the standard error for the way the sampling was
done, there are three approaches:
Prentice
Self and Prentice
Barlow
In Stata,

Prentice:
stcascoh, alpha(.18) // risk set sampling
stcox nickel, robust

Self and Prentice:
stcascoh, alpha(.18) // risk set sampling with log weights (_wSelPre)
stcox nickel, robust offset(_wSelPre)

Barlow:
stcascoh, alpha(.18) // risk set sampling with log weights (_wBarlow)
stcox nickel, robust offset(_wBarlow)
The simulation results are:
Case-cohort design with risk set sampling (Prentice Method)

Population Relative    Actual Dataset with        Augmented Dataset with
Effect Measure         almost rare disease        frequent disease
                       (3% in unexposed,          (15% in unexposed,
                       12% in exposed)            40% in exposed)
Odds Ratio             3.76                       3.76
Risk Ratio             3.43                       2.65
Rate Ratio             4.76                       3.87
Hazard Ratio           5.02 (HR=5.08)             4.19

Estimates appear unbiased using this approach.
