Case-Control Studies: Statistical
Analysis
Greg Stoddard
December 16, 2010

University of Utah School of Medicine
Rothman claims, “Properly carried out, case-
control studies provide information that
mirrors what could be learned from a
cohort study, usually at considerably less
cost and time.”
[Rothman KJ, Epidemiology: An Introduction, 2002, p.73]
Goal: contrast the statistical approaches of the two study designs to verify Rothman’s claim.
Diagrammatically,

Cohort study
    E      ->  D  or  not-D
    not-E  ->  D  or  not-D

Case-Control Study
    D      ->  E  or  not-E
    not-D  ->  E  or  not-E
Data Layout,

Cohort Study
           E       Not-E
D          a       b         n_D
Not-D      c       d         n_not-D
           N_E     N_not-E

Case-Control Study
           E       Not-E
D          a       b         N_D
Not-D      c       d         N_not-D
           n_E     n_not-E

N = fixed , n = free to vary

Cohort Study
           E       Not-E
D          a       b         n_D
Not-D      c       d         n_not-D
           N_E     N_not-E
incidence proportion = disease cases / persons at risk

Case-Control Study
           E       Not-E
D          a       b         N_D
Not-D      c       d         N_not-D
           n_E     n_not-E
incidence proportion = (not estimable)
The incidence proportion not being estimable is not much of a shortcoming.
Given a study’s inclusion/exclusion criteria, the incidence proportion does not actually apply to a very wide patient population, anyway.
The goal is not to estimate incidence, but rather to assess an exposure-disease association.
We can do that just fine with relative measures of effect, the risk ratio and odds ratio.
Cohort Study
           E       Not-E
D          a       b         n_D
Not-D      c       d         n_not-D
           N_E     N_not-E
risk ratio = (a/N_E)/(b/N_not-E)
odds ratio = odds(D|E)/odds(D|not-E)
           = (a/c)/(b/d) = (ad)/(bc)

Case-Control Study
           E       Not-E
D          a       b         N_D
Not-D      c       d         N_not-D
           n_E     n_not-E
exposure odds ratio
           = odds(E|D)/odds(E|not-D)
           = (a/b)/(c/d) = (ad)/(bc)
So, as long as either E or D is free to vary, you get the same
relative effect measure, the odds ratio, with both study designs.
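As a quick numerical check of the identity above, here is a minimal Python sketch. The counts are hypothetical, chosen only for illustration; a, b, c, d follow the 2 x 2 layout used in the slides.

```python
def disease_odds_ratio(a, b, c, d):
    """Cohort view: odds(D|E) / odds(D|not-E) = (a/c) / (b/d)."""
    return (a / c) / (b / d)


def exposure_odds_ratio(a, b, c, d):
    """Case-control view: odds(E|D) / odds(E|not-D) = (a/b) / (c/d)."""
    return (a / b) / (c / d)


# Hypothetical counts: a, b = diseased (E, not-E); c, d = not diseased (E, not-E).
a, b, c, d = 25, 20, 25, 30

or_disease = disease_odds_ratio(a, b, c, d)
or_exposure = exposure_odds_ratio(a, b, c, d)
cross_product = (a * d) / (b * c)

print(or_disease, or_exposure, cross_product)
```

All three quantities agree, which is why it does not matter whether the row totals or the column totals were fixed by the design.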
           E       Not-E
D          a       b         n_D
Not-D      c       d         n_not-D
           N_E     N_not-E

$$RR = \frac{a/N_E}{b/N_{\text{not-}E}} = \frac{a/(a+c)}{b/(b+d)}$$
If the disease is rare (<10% in both E and Not-E groups), then a is small relative to c and b is small relative to d, so a + c ≈ c and b + d ≈ d.
Substituting,

$$RR = \frac{a/(a+c)}{b/(b+d)} \approx \frac{a/c}{b/d} = \frac{ad}{bc} = OR$$
So, OR from case-control study approximates RR from
cohort study, when the rare disease assumption is met.
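To see how the approximation degrades as the disease becomes more common, here is a small sketch. The true risk ratio of 2.0 and the baseline risks are hypothetical values picked for illustration, not figures from the slides.

```python
def odds_ratio_from_risks(risk_exposed, risk_unexposed):
    """Odds ratio implied by two incidence proportions."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


TRUE_RR = 2.0  # hypothetical risk ratio, held fixed
baseline_risks = [0.01, 0.05, 0.10, 0.30]  # hypothetical risks in the unexposed

implied_or = {}
for p0 in baseline_risks:
    p1 = TRUE_RR * p0  # risk in the exposed
    implied_or[p0] = odds_ratio_from_risks(p1, p0)
    print(f"unexposed risk {p0:.2f}: RR = {TRUE_RR:.2f}, OR = {implied_or[p0]:.2f}")
```

With a 1% baseline risk the OR is nearly the RR; at a 30% baseline risk the OR is far from it.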
Why the 10%, or 0.10, incidence proportion
is a good cutpoint for “rare disease” is
illustrated nicely in a figure published in:
Zhang J, Yu KF. What’s the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA 1998;280(19):1690-91.
Aside:
The formula in Zhang and Yu (1998) for
converting an odds ratio to a risk ratio in
cohort studies has been convincingly
criticized as unreliable (Zou, 2004), so you
should avoid using it.
[Zou G. A modified Poisson regression approach to
prospective studies with binary data. Am J Epidemiol
2004;159(7):702-706.]
Checking our progress
How far have we gotten in verifying that a case-control study can mirror what can be learned in a cohort study?
We have seen that the OR is the same in both study designs.
We have seen that the OR approximates the RR under the rare disease assumption, and so it has a straightforward interpretation.
However, cohort studies rarely use the odds ratio, nor do they use the risk ratio.
Instead, cohort studies use survival analysis.
Why?
Risk Ratio Analysis
This type of analysis ignores time-at-risk.
That is, it assumes an equal follow-up time for every study subject.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
Total       90      25                       110     20
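The risk ratio uses only partial information from the complete data in the life table (highlighted in blue on the original slide): the day 1 Begin N (50 and 50) and the Total disease cases (25 and 20).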
Risk ratio analysis data
              Exposed     Not-Exposed
Disease       25 (50%)    20 (40%)
Not-Disease   25          30
N             50          50

Risk Ratio = (25/50)/(20/50) = 1.25
Chi-square test, p = 0.31
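The numbers on this slide can be reproduced with a short sketch. It assumes the chi-square test is the Pearson statistic without continuity correction; the slides do not say which variant was used.

```python
import math


def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Ratio of incidence proportions, exposed over unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


def pearson_chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square for a 2 x 2 table and its 1-df p-value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# a, b = diseased (exposed, not exposed); c, d = not diseased.
rr = risk_ratio(25, 50, 20, 50)
chi2, p_value = pearson_chi_square_2x2(25, 20, 25, 30)
print(f"RR = {rr:.2f}, chi-square = {chi2:.2f}, p = {p_value:.2f}")
```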
Analyzing these data in this way, we do not demonstrate a significant effect. In fact, this crude RR underestimates each of the day-specific RR estimates.
Rate Ratio Analysis
Let’s see if we can do better with a rate ratio analysis. It uses a person-time denominator, so in that sense, it relaxes the equal time-at-risk assumption of the risk ratio analysis.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
Total       90      25                       110     20
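The rate ratio uses only partial information from the complete data in the life table (highlighted in blue on the original slide): the Total row, that is, the person-days (90 and 110) and the disease cases (25 and 20).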
Rate ratio analysis data
              Exposed     Not-Exposed
Disease       25 (50%)    20 (40%)
Person-Days   90          110

Rate Ratio = (25/90)/(20/110) = 1.53
Binomial probability mid-p exact test for person-time data, p = 0.080
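A minimal sketch of this calculation follows. It conditions on the total number of cases, so that under the null hypothesis the exposed cases are binomial with probability equal to the exposed share of person-time. The slides do not state whether p = 0.080 is one-sided or two-sided; the sketch computes the one-sided mid-p, which is an assumption about what was reported and should land in the neighborhood of that value.

```python
from math import comb


def rate_ratio(cases_exposed, pt_exposed, cases_unexposed, pt_unexposed):
    """Ratio of cases per unit of person-time, exposed over unexposed."""
    return (cases_exposed / pt_exposed) / (cases_unexposed / pt_unexposed)


def one_sided_mid_p(cases_exposed, pt_exposed, cases_unexposed, pt_unexposed):
    """Upper-tail mid-p: P(X > observed) + 0.5 * P(X = observed)."""
    total_cases = cases_exposed + cases_unexposed
    p_null = pt_exposed / (pt_exposed + pt_unexposed)

    def pmf(k):
        return comb(total_cases, k) * p_null**k * (1 - p_null) ** (total_cases - k)

    upper_tail = sum(pmf(k) for k in range(cases_exposed + 1, total_cases + 1))
    return upper_tail + 0.5 * pmf(cases_exposed)


irr = rate_ratio(25, 90, 20, 110)
mid_p = one_sided_mid_p(25, 90, 20, 110)
print(f"rate ratio = {irr:.2f}, one-sided mid-p = {mid_p:.3f}")
```

A two-sided version would double this tail probability.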
Analyzing these data in this way, we almost demonstrate a significant effect. Again, this crude rate ratio underestimates each of the day-specific risk ratio (rate ratio) estimates.
Inefficient Use of Time in Rate Ratio
Analysis
The reason the rate ratio analysis failed to convey the information in the life table is that it considers only the ratio of cases to average person-time, without distinguishing times to event from times to censoring.
person-time = total time for subjects
= mean time x N
Suppose the individual times-at-risk for a
sample are: 10, 20, and 30. The person-
time is computed as:
PT = total time for subjects
= 10+20+30 = 60
which is equivalent to :
PT = mean time x N
= (10+20+30)/3 x 3 = 20 x 3 = 60
So, a rate ratio analysis would find the
following two scenarios equal (even
though Group B outperforms Group A)
(let x----x denote time)
Group A
x-------------------------------------x (censored)
x-----x (died)
x--------x (died)
x--------------------------------------------x (censored)

Group B
x-------------------------------------x (died)
x-----x (censored)
x--------x (censored)
x--------------------------------------------x (died)
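The point can be checked numerically. In the sketch below the follow-up times (37, 5, 8, 44) are hypothetical stand-ins for the line lengths in the picture; only the died/censored pattern comes from the slide.

```python
def event_rate(follow_up):
    """Deaths per unit of person-time; follow_up is a list of (time, died) pairs."""
    deaths = sum(1 for _, died in follow_up if died)
    person_time = sum(time for time, _ in follow_up)
    return deaths / person_time


# Hypothetical times-at-risk; same four durations in both groups.
group_a = [(37, False), (5, True), (8, True), (44, False)]  # deaths come early
group_b = [(37, True), (5, False), (8, False), (44, True)]  # deaths come late

rate_a = event_rate(group_a)
rate_b = event_rate(group_b)
print(rate_a, rate_b)
```

Both groups have 2 deaths over 94 units of person-time, so the rates are identical and the rate ratio is 1 even though deaths occur much later in Group B.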
Hazard Ratio Analysis (Survival Analysis)
This analysis uses time-at-risk in a very complete way, using all of the information in the life table.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
Total       90      25                       110     20
From Cox regression, HR = 1.92, p = 0.032
The HR is identically the Mantel-Haenszel summary risk ratio.
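The Mantel-Haenszel summary risk ratio can be computed directly from the life table, treating each follow-up day as a stratum. This is a sketch of that standard formula, not of the Cox regression itself.

```python
# (day, begin_n_exposed, cases_exposed, begin_n_unexposed, cases_unexposed)
life_table = [
    (1, 50, 5, 50, 2),
    (2, 30, 10, 40, 8),
    (3, 10, 10, 20, 10),
]


def day_specific_risk_ratios(table):
    """Risk in the exposed divided by risk in the unexposed, one value per day."""
    return [(ce / ne) / (cu / nu) for _, ne, ce, nu, cu in table]


def mantel_haenszel_rr(table):
    """Summary RR: sum(ce * nu / n) / sum(cu * ne / n), with n = ne + nu per day."""
    numerator = sum(ce * nu / (ne + nu) for _, ne, ce, nu, cu in table)
    denominator = sum(cu * ne / (ne + nu) for _, ne, ce, nu, cu in table)
    return numerator / denominator


daily_rr = day_specific_risk_ratios(life_table)
summary_rr = mantel_haenszel_rr(life_table)
print([round(rr, 1) for rr in daily_rr], round(summary_rr, 2))
```

The summary value rounds to 1.92, matching the hazard ratio on the slide, and it sits inside the range of the day-specific risk ratios as a weighted average should.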
Aside
Showing a life table like this and pointing out that the HR is just the weighted average of the day-specific risk ratios, and so is a relative risk estimate, is a very clear way to explain the HR to a researcher.
Recall, we are trying to verify that a case-control study can mirror what can be learned in a cohort study.
It appears, then, that we need to incorporate survival analysis into the case-control framework in order to keep up with what a cohort study can do.
It turns out we can do this, that is, use survival analysis in the case-control framework, if we tweak the study design slightly.
The slight variant is called the case-cohort design (also called the density case-control design).
While presenting this design, I am going to
show some simulation results. In this way,
I can demonstrate that the case-cohort
design really does perform as well as a
cohort study design.
Dataset
The dataset comes from Breslow and Day
[Breslow NE, Day NE. (1987). Statistical Methods in Cancer Research, Vol II: The Design and Analysis of Cohort Studies. Lyon, France, IARC.]
Men (n=679) employed in a nickel refinery in South Wales were investigated to determine whether the risk of developing carcinoma of the bronchi and nasal sinuses (ICD = 160), which studies in the 1930s had associated with the refining of nickel, was present in this cohort.
Modified Dataset
I also modified the dataset, to create a second dataset that does not meet the rare disease assumption, by duplicating the cases five times.
Treating each dataset as the “population” and analyzing it, we know the answer that a case-control design sampling from this cohort is supposed to recover.
The population relative measures are:

Population         Actual Dataset        Augmented Dataset
Relative Effect    with almost rare      with frequent disease
Measure            disease (3% in        (15% in unexposed,
                   unexposed, 12% in     40% in exposed)
                   exposed)
Odds Ratio         3.76                  3.76
Risk Ratio         3.43                  2.65
Rate Ratio         4.76                  3.87
Hazard Ratio       5.02                  4.19
Classical Case-Control Study (controls are
sampled from the population controls only)
Using a 2:1 sampling ratio

            Exposed     Not exposed   Total
            to nickel   to nickel
Tumor       46          10            56      use all 56 cases
No Tumor    343         280           623     sample 56 x 2 controls
Total       389         290           679
Monte Carlo simulation, computing OR from 1,000 samples, to get long-run average of OR.
(Each sample keeps all 56 subjects from the tumor row of the population 2 x 2 table, and then randomly samples 112 subjects from the no-tumor row of the population 2 x 2 table.)
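The simulation just described can be sketched in a few lines of Python. This is an illustrative reimplementation, not the author's code: the seed and function name are assumptions, and only the actual (rare disease) dataset is simulated.

```python
import random


def simulate_classical_case_control(a, b, noncases_exposed, noncases_unexposed,
                                    controls_per_case=2, n_samples=1000, seed=1):
    """Mean OR over repeated samples of controls drawn from non-cases only."""
    rng = random.Random(seed)
    # 1 = exposed non-case, 0 = unexposed non-case
    pool = [1] * noncases_exposed + [0] * noncases_unexposed
    n_controls = controls_per_case * (a + b)
    odds_ratios = []
    for _ in range(n_samples):
        controls = rng.sample(pool, n_controls)  # sampling without replacement
        c = sum(controls)                        # exposed controls
        d = n_controls - c                       # unexposed controls
        odds_ratios.append((a * d) / (b * c))
    return sum(odds_ratios) / len(odds_ratios)


# Counts from the population 2 x 2 table: 46 and 10 cases, 343 and 280 non-cases.
population_or = (46 * 280) / (10 * 343)
mean_or_classical = simulate_classical_case_control(46, 10, 343, 280)
print(f"population OR = {population_or:.2f}, mean simulated OR = {mean_or_classical:.2f}")
```

The mean of a ratio tends to sit slightly above the population value, so a small upward offset in the simulated average is expected.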
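The simulation results are:
Classical case-control design (sample
controls from no-tumor subjects only)
(population value, with the mean simulated OR in parentheses)

Population Relative   Actual Dataset    Augmented Dataset
Effect Measure
Odds Ratio            3.76 (OR=3.81)    3.76 (OR=3.77)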
Case-Cohort Study Design
- In this design, we keep the cases. Then,
we sample our controls from the total
row of the population 2 x 2 table.
- For those cases that get mixed in with
the controls, we set their status variable to
0, the control value.
- We then calculate the OR in the usual
way.
Case-Cohort Study (controls are sampled
from the population row totals, which
include both cases and controls)
Using a 2:1 sampling ratio

            Exposed     Not exposed   Total
            to nickel   to nickel
Tumor       46 (a)      10 (b)        56      use all 56 cases
No Tumor    343         280           623
Total       389 (c)     290 (d)       679     sample 56 x 2 controls

The odds ratio is then a direct calculation of the risk ratio.
OR = (a x kd)/(b x kc) = (kad)/(kbc) = (ad)/(bc) , where k=(56x2)/679
RR = (a/c)/(b/d) = (ad)/(bc) = OR
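Case-cohort design (sample controls
from total row of population 2 x 2 table)
(population value, with the mean simulated OR in parentheses)

Population Relative   Actual Dataset    Augmented Dataset
Effect Measure
Risk Ratio            3.43 (OR=3.48)    2.65 (OR=2.67)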
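A sketch of the total-row sampling follows, run on both datasets. The augmented counts are an assumption: they are obtained by multiplying the case counts by five (230 and 50 cases), which reproduces the population OR of 3.76 and RR of 2.65 reported in the slides. The seed and names are also illustrative.

```python
import random


def simulate_case_cohort(a, b, total_exposed, total_unexposed,
                         controls_per_case=2, n_samples=1000, seed=1):
    """Mean OR when controls are drawn from the whole cohort (cases included)."""
    rng = random.Random(seed)
    cohort = [1] * total_exposed + [0] * total_unexposed  # 1 = exposed
    n_controls = controls_per_case * (a + b)
    odds_ratios = []
    for _ in range(n_samples):
        controls = rng.sample(cohort, n_controls)
        c = sum(controls)        # exposed subjects among sampled controls
        d = n_controls - c       # unexposed subjects among sampled controls
        odds_ratios.append((a * d) / (b * c))
    return sum(odds_ratios) / len(odds_ratios)


datasets = {
    # actual: 46 + 343 exposed, 10 + 280 unexposed
    "actual": dict(a=46, b=10, total_exposed=389, total_unexposed=290),
    # augmented (assumed): cases x 5, non-cases unchanged
    "augmented": dict(a=230, b=50, total_exposed=573, total_unexposed=330),
}

results = {}
for name, counts in datasets.items():
    population_rr = (counts["a"] / counts["total_exposed"]) / (
        counts["b"] / counts["total_unexposed"])
    results[name] = (population_rr, simulate_case_cohort(**counts))
    print(f"{name}: population RR = {results[name][0]:.2f}, "
          f"mean simulated OR = {results[name][1]:.2f}")
```

In both datasets the simulated OR tracks the population risk ratio rather than the population odds ratio, including the augmented dataset where the disease is not rare.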
For the case-cohort design, the rare-disease assumption is not required for the OR to be an estimate of RR (Rothman and Greenland, 1998, p.110). We have demonstrated that to be the case.
[Rothman KJ, Greenland S. (1998). Modern Epidemiology, 2nd ed. Philadelphia, PA.]
It is nice to be able to use the OR to directly estimate RR, and not worry about the rare disease assumption at all.
It comes with a price, however. Since your controls are now “messy”, with cases mixed in, you do not have as clear a signal for the effect, so statistical power is reduced. You need to sample additional controls to make up the difference (to get it back to the power of the classic case-control study).
Case-Cohort Study Design With Risk Set
Sampling
In this design, you again keep all of the cases.
You then, again, sample controls from the total row of the population 2 x 2 table (sampled from cases & controls). This time, however, for each case you sample from total row subjects who have the same or longer time-at-risk. This is called risk set sampling.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
In this design, we also use a type of “total row” sampling. That is, we select our controls from the “Begin N” columns of the life table.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
For the 5+2 cases that occurred on day 1, we sample our controls from the 50+50 persons still at risk on day 1.
            Exposed                          Non-Exposed
Follow-up   Begin   Disease   Day-Specific   Begin   Disease   Day-Specific   Day-Specific
day         N       Cases     Risk           N       Cases     Risk           Risk Ratio
1           50      5         0.10           50      2         0.04           2.5
2           30      10        0.33           40      8         0.20           1.7
3           10      10        1.00           20      10        0.50           2.0
For the 10+8 cases that occurred on day 2, we sample our controls from the 30+40 persons still at risk on day 2. …and so on.
We do this by forming risk sets. For every case,
we form a risk set that includes all subjects with
an equal or longer follow-up time. Then we
sample 2 controls from that risk set, if we are
using a 2:1 sampling ratio, that we match with
that case.
This is identical to sampling on the correct row from the Begin N column, like we did above.
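Forming risk sets and drawing matched controls can be sketched as follows. The eight-subject cohort, the field names, and the seed are all hypothetical; the sketch only illustrates the sampling rule (same or longer follow-up, 2 controls per case), not any particular software's implementation.

```python
import random


def sample_risk_sets(subjects, controls_per_case=2, seed=1):
    """For each case, sample controls from subjects with equal or longer follow-up."""
    rng = random.Random(seed)
    matched_sets = []
    for case in subjects:
        if not case["event"]:
            continue
        # Risk set: everyone still under follow-up at the case's event time,
        # excluding the case itself. Later cases are eligible as controls.
        risk_set = [s for s in subjects
                    if s["time"] >= case["time"] and s["id"] != case["id"]]
        k = min(controls_per_case, len(risk_set))
        matched_sets.append((case, rng.sample(risk_set, k)))
    return matched_sets


# Hypothetical cohort: follow-up time in days, event = 1 if disease occurred.
times = [1, 1, 2, 2, 3, 3, 3, 3]
events = [1, 0, 1, 0, 1, 0, 0, 0]
cohort = [{"id": i + 1, "time": t, "event": e}
          for i, (t, e) in enumerate(zip(times, events))]

matched_sets = sample_risk_sets(cohort)
for case, controls in matched_sets:
    print(f"case {case['id']} (day {case['time']}): "
          f"controls {[c['id'] for c in controls]}")
```

Each matched set (one case plus its sampled controls) then becomes one stratum in the matched analysis.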
We have already seen that the OR from a case-cohort study design directly estimates the RR.
We are now doing a version of the case-cohort approach for each row of the life table.
We know that the HR is just the summary RR across the rows of the life table.
If we use conditional logistic regression, then, to account for the row-specific matching, it would seem the OR should directly estimate the HR.
Let’s see if that is true.
This time in the simulation, we will take the OR from the conditional logistic regression, rather than calculate it from a 2 x 2 table like we did for the previous simulations.
The mean of the 1,000 conditional logistic regression ORs will be our estimate of the HR.
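Case-cohort design with risk set sampling.
(population value, with the mean simulated OR in parentheses)

Population Relative   Actual Dataset    Augmented Dataset
Effect Measure
Hazard Ratio          5.02 (OR=5.42)    4.19 (OR=4.43)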
We were close, but the estimates appear to be biased.
The way it is really done is to use risk set sampling followed by an actual Cox regression.
To adjust the standard error for the way the sampling was done, there are three approaches:
- Prentice
- Self and Prentice
- Barlow
In Stata,
Prentice:
    stcascoh, alpha(.18) // risk set sampling
    stcox nickel, robust
Self and Prentice:
    stcascoh, alpha(.18) // risk set sampling with log weights (_wSelPre)
    stcox nickel, robust offset(_wSelPre)
Barlow:
    stcascoh, alpha(.18) // risk set sampling with log weights (_wBarlow)
    stcox nickel, robust offset(_wBarlow)
Case-cohort design with risk set
sampling (Prentice Method)

Population Relative   Actual Dataset    Augmented Dataset
Effect Measure
Hazard Ratio          5.02 (HR=5.08)    4.19

Estimates appear unbiased using this approach.
