Case–Control Studies

There are two primary types of nonexperimental studies in epidemiology. The first, the cohort study (see Cohort Studies) (also called the follow-up study or incidence study), is a direct analog of the experiment. Different exposure groups are compared, but it is the investigator who selects subjects to observe, and classifies these subjects by exposure status, rather than assigning them to exposure groups. The second, the incident case–control study, or simply the case–control study, employs an extra step of sampling from the source population for cases. A cohort study includes all persons in the population at risk of becoming a study case. In contrast, a case–control study selects only a sample of those persons, and does so partly on the basis of their final disease status. Thus, by design, a person’s outcome influences their chance of becoming a subject in the case–control study. This extra sampling step can make a case–control study much more efficient than a cohort study of the same population, but it introduces a number of subtleties and avenues for bias that are absent in typical cohort studies.

Conventional wisdom about case–control studies is that they do not yield estimates of effect that are as valid as measures obtained from cohort studies. This thinking may reflect common misunderstandings in conceptualizing case–control studies, but it also reflects concern about quality of exposure information and biases in case or control selection. For example, if exposure information comes from interviews, then cases will have usually reported the exposure information after learning of their diagnosis, which can lead to errors in the responses that are related to the disease (recall bias). While it is true that recall bias does not occur in prospective cohort studies, neither does it occur in all case–control studies. Exposure information that is taken from records whose creation predated disease occurrence will not be subject to recall bias. Similarly, while a cohort study may log information on exposure for an entire source population at the outset of the study, it still requires tracing of subjects to ascertain exposure variation and outcomes, and the success of this tracing may be related to exposure. These concerns are analogous to case–control problems of loss of subjects with unknown exposure and to biased selection of controls and cases. Each study, whether cohort or case–control, must be considered on its own merits.

Conventional wisdom also holds that cohort studies are useful for evaluating the range of effects related to a single exposure, while case–control studies provide information only about the one disease that afflicts the cases. This thinking conflicts with the idea that case–control studies can be viewed simply as more efficient cohort studies. Just as one can choose to measure more than one disease outcome in a cohort study, it is possible to conduct a set of case–control studies nested within the same population using several disease outcomes as the case series. The case–cohort study (see the section titled “Case–Cohort Studies”) is particularly well suited to this task, allowing one control group to be compared with several series of cases. Whether or not the case–cohort design is the form of case–control study that is used, case–control studies do not have to be characterized as being limited with respect to the number of disease outcomes that can be studied.

For diseases that are sufficiently rare, cohort studies become impractical, and case–control studies become the only useful alternative. On the other hand, if exposure is rare, ordinary case–control studies are inefficient, and one must use methods that selectively recruit additional exposed subjects, such as special cohort studies or two-stage designs. If both the exposure and the outcome are rare, two-stage designs may be the only informative option, as they employ oversampling of both exposed and diseased subjects.

Ideally, a case–control study can be conceptualized as a more efficient version of a corresponding cohort study. Under this conceptualization, the cases in the case–control study are the same cases as would ordinarily be included in the cohort study. Rather than including all of the experience of the source population that gave rise to the cases (the study base), as would be the usual practice in a cohort design, controls are selected from the source population. The sampling of controls from the population that gave rise to the cases affords the efficiency gain of a case–control design over a cohort design. The controls provide an estimate of the prevalence of the exposure and covariates in the source population. When controls are selected from members of the population who were at risk for disease at the beginning of the study’s follow-up period, the case–control odds ratio (see Odds and Odds Ratio) estimates the risk ratio that would be obtained from a cohort
design. When controls are selected from members of the population who were noncases at the times that each case occurs, or otherwise in proportion to the person-time accumulated by the cohort, the case–control odds ratio estimates the rate ratio that would be obtained from a cohort design. Finally, when controls are selected from members of the population who were noncases at the end of the study’s follow-up period, the case–control odds ratio estimates the incidence odds ratio that would be obtained from a cohort design. With each control selection strategy, the odds ratio calculation is the same, but the measure of effect estimated by the odds ratio differs. Study designs that implement each of these control selection paradigms will be discussed after topics that are common to all designs.

Common Elements of Case–Control Studies

In a cohort study, the numerator and denominator of each disease frequency (incidence proportion, incidence rate, or incidence odds) are measured, which requires enumerating the entire population and keeping it under surveillance. A case–control study attempts to observe the population more efficiently by using a control series in place of complete assessment of the denominators of the disease frequencies. The cases in a case–control study should be the same people who would be considered cases in a cohort study of the same population.

Pseudofrequencies and the Odds Ratio

The primary goal for control selection is that the exposure distribution among controls be the same as it is in the source population of cases. The rationale for this goal is that, if it is met, we can use the control series in place of the denominator information in measures of disease frequency to determine the ratio of the disease frequency in exposed people relative to that among unexposed people. This goal will be met if we can sample controls from the source population such that the ratio of the number of exposed controls ($B_1$) to the total exposed experience of the source population is the same as the ratio of the number of unexposed controls ($B_0$) to the unexposed experience of the source population, apart from sampling error. For most purposes, this goal need only be followed within strata defined by the factors that are used for stratification in the analysis, such as factors used for restriction or matching.

Using person-time to illustrate, the goal requires that $B_1$ has the same ratio to the amount of exposed person-time ($T_1$) as $B_0$ has to the amount of unexposed person-time ($T_0$):

$$\frac{B_1}{T_1} = \frac{B_0}{T_0} \tag{1}$$

Here $B_1/T_1$ and $B_0/T_0$ are the control sampling rates – that is, the number of controls selected per unit of person-time. Suppose $A_1$ exposed cases and $A_0$ unexposed cases occur over the study period. The exposed and unexposed rates are then

$$I_1 = \frac{A_1}{T_1} \tag{2}$$

and

$$I_0 = \frac{A_0}{T_0} \tag{3}$$

We can use the frequencies of exposed and unexposed controls as substitutes for the actual denominators of the rates to obtain exposure-specific case–control ratios, or pseudorates:

$$\text{Pseudorate}_1 = \frac{A_1}{B_1} \tag{4}$$

and

$$\text{Pseudorate}_0 = \frac{A_0}{B_0} \tag{5}$$

These pseudorates have no epidemiologic interpretation by themselves. Suppose, however, that the control sampling rates $B_1/T_1$ and $B_0/T_0$ are equal to the same value $r$, as would be expected if controls are selected independently of exposure. If this common sampling rate $r$ is known, the actual incidence rates can be calculated by simple algebra, since apart from sampling error, $B_1/r$ should equal the amount of exposed person-time in the source population, and $B_0/r$ should equal the amount of unexposed person-time in the source population: $B_1/r = B_1/(B_1/T_1) = T_1$ and $B_0/r = B_0/(B_0/T_0) = T_0$. To get the incidence rates, we need only multiply each pseudorate by the common sampling rate, $r$.

If the common sampling rate is not known, which is often the case, we can still compare the sizes of
the pseudorates by division. Specifically, if we divide the pseudorate for exposed by the pseudorate for unexposed, we obtain

$$\frac{\text{Pseudorate}_1}{\text{Pseudorate}_0} = \frac{A_1/B_1}{A_0/B_0} = \frac{A_1/[(B_1/T_1)T_1]}{A_0/[(B_0/T_0)T_0]} = \frac{A_1/(r \cdot T_1)}{A_0/(r \cdot T_0)} = \frac{A_1/T_1}{A_0/T_0} \tag{6}$$

In other words, the ratio of the pseudorates for the exposed and unexposed is an estimate of the ratio of the incidence rates in the source population, provided that the control sampling rate is independent of exposure. Thus, using the case–control study design, one can estimate the incidence rate ratio in a population without obtaining information on every subject in the population. Similar derivations in the section titled “Variants of the Case–Control Design” show that one can estimate the risk ratio by sampling controls from those at risk for disease at the beginning of the follow-up period (case–cohort design) and that one can estimate the incidence odds ratio by sampling controls from the noncases at the end of the follow-up period (cumulative case–control design). With these designs, the pseudofrequencies correspond to the incidence proportions and incidence odds, respectively, multiplied by common sampling rates.

There is a statistical penalty for using a sample of the denominators, rather than measuring the person-time experience for the entire source population: the precision of the estimates of the incidence rate ratio from a case–control study is less than the precision from a cohort study of the entire population that gave rise to the cases (the source population). Nevertheless, the loss of precision that stems from sampling controls will be small if the number of controls selected per case is large. Furthermore, the loss is balanced by the cost savings of not having to obtain information on everyone in the source population. The cost savings might allow the epidemiologist to enlarge the source population and so obtain more cases, resulting in a better overall estimate of the incidence rate ratio, statistically and otherwise, than would be possible using the same expenditures to conduct a cohort study.

The ratio of the two pseudorates in a case–control study is usually written as $A_1 B_0 / A_0 B_1$, and is sometimes called the cross-product ratio. The cross-product ratio in a case–control study can be viewed as the ratio of cases to controls among the exposed subjects ($A_1/B_1$), divided by the ratio of cases to controls among the unexposed subjects ($A_0/B_0$). This ratio can also be viewed as the odds of being exposed among cases ($A_1/A_0$) divided by the odds of being exposed among controls ($B_1/B_0$), in which case it is termed the exposure odds ratio. While either interpretation will give the same result, viewing this odds ratio as the ratio of case–control ratios shows more directly how the control group substitutes for the denominator information in a cohort study and how the ratio of pseudofrequencies gives the same result as the ratio of the incidence rates, incidence proportion, or incidence odds in the source population, if sampling is independent of exposure.

Defining the Source Population

If the cases are a representative sample of all cases in a precisely defined and identified population, and the controls are sampled directly from this source population, the study is said to be population-based or a primary-base study. For a population-based case–control study, random sampling of controls may be feasible if a population registry exists or can be compiled. When random sampling from the source population of cases is feasible, it is usually the most desirable option.

Random sampling of controls does not necessarily mean that every person should have an equal probability of being selected to be a control. As explained above, if the aim is to estimate the incidence rate ratio, then we would employ longitudinal (density) sampling, in which a person’s control selection probability is proportional to the person’s time at risk. For example, in a case–control study nested within an occupational cohort, workers on an employee roster will have been followed for varying lengths of time, and a random sampling scheme should reflect this varying time to estimate the incidence rate ratio.

When it is not possible to identify the source population explicitly, simple random sampling is not feasible, and other methods of control selection must be used. Such studies are sometimes called studies of secondary bases, because the source population is identified secondarily to the definition of a case-finding mechanism. A secondary source population or secondary base is therefore a source population that is defined from (secondary to) a given case series.
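
The pseudorate argument and the density-sampling rule described in this section can be checked with a small numerical sketch. All counts, person-time totals, and the sampling rate below are hypothetical values chosen for illustration, and the function names are ours, not part of any library. Control counts are set to their expected values, so sampling error is ignored.

```python
from fractions import Fraction


def pseudorate(cases, controls):
    """Cases per control in one exposure group; not interpretable on its own."""
    return Fraction(cases, controls)


def cross_product_ratio(a1, a0, b1, b0):
    """Odds ratio A1*B0 / (A0*B1) from the case-control counts."""
    return Fraction(a1 * b0, a0 * b1)


def density_sampling_probabilities(person_time):
    """Control-selection probability proportional to each person's time at risk."""
    total = sum(person_time)
    return [Fraction(t, total) for t in person_time]


# Hypothetical source population (illustrative values, not from the text).
T1, T0 = 10_000, 40_000   # exposed and unexposed person-years
A1, A0 = 50, 100          # exposed and unexposed cases
r = Fraction(2, 1000)     # controls sampled per person-year in both groups

# Expected control counts when sampling is independent of exposure.
B1, B0 = int(r * T1), int(r * T0)   # 20 and 80

rate_ratio = Fraction(A1, T1) / Fraction(A0, T0)
pseudorate_ratio = pseudorate(A1, B1) / pseudorate(A0, B0)
exposure_odds_ratio = Fraction(A1, A0) / Fraction(B1, B0)

# All four quantities agree when the sampling rate does not depend on exposure.
print(rate_ratio, pseudorate_ratio, exposure_odds_ratio,
      cross_product_ratio(A1, A0, B1, B0))

# With r known, each pseudorate multiplied by r recovers the incidence rate.
print(pseudorate(A1, B1) * r == Fraction(A1, T1))

# A worker followed twice as long is twice as likely to be drawn as a control.
print(density_sampling_probabilities([2, 1, 1]))
```

Exact fractions are used so that the equalities hold without rounding. In a real study the observed control counts would vary around their expected values, and the odds ratio would equal the rate ratio only approximately.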
Consider a case–control study in which the cases are patients treated for severe psoriasis at the Mayo Clinic. These patients come to the Mayo Clinic from all corners of the world. What is the specific source population that gives rise to these cases? To answer this question, we would have to know exactly who would go to the Mayo Clinic, if he or she had severe psoriasis. We cannot enumerate this source population because many people in it do not know themselves that they would go to the Mayo Clinic for severe psoriasis, unless they actually developed severe psoriasis. This secondary source might be defined as a population spread around the world that constitutes those people who would go to the Mayo Clinic if they developed severe psoriasis. It is this secondary source from which the control series for the study would ideally be drawn. The challenge to the investigator is to apply eligibility criteria to the cases and controls so that there is good correspondence between the controls and this source population. For example, cases of severe psoriasis and controls might be restricted to those in counties within a certain distance of the Mayo Clinic, so that at least a geographic correspondence between the controls and the secondary source population can be assured. This restriction might, however, leave very few cases for study.

Unfortunately, the concept of a secondary base is often tenuously connected to underlying realities, and can be highly ambiguous. For the psoriasis example, whether a person would go to the Mayo Clinic depends on many factors that vary over time, such as whether the person is encouraged to go by their regular physicians and whether the person can afford to go. It is not clear, then, how or even whether one could precisely define the secondary base, let alone draw a sample from it; thus it is not clear one could ensure that controls were members of the base at the time of sampling. We therefore prefer to conceptualize and conduct case–control studies as starting with a well-defined source population, and then identify and recruit cases and controls to represent the disease and exposure experience of that population. When one takes a case series as a starting point instead, it is incumbent upon the investigator to demonstrate that a source population can be operationally defined to allow the study to be recast and evaluated relative to this source. Similar considerations apply when one takes a control series as a starting point, as is sometimes done [1].

Case Selection

Ideally, case selection will amount to a direct sampling of cases within a source population. Therefore, apart from random sampling, all people in the source population who develop the disease of interest are presumed to be included as cases in the case–control study. It is not always necessary, however, to include all cases from the source population. Cases, like controls, can be randomly sampled for inclusion in the case–control study, so long as this sampling is independent of the exposure under study within the strata defined by the stratification factors that are used in the analysis. Of course, if fewer than all cases are sampled, the study precision will be lower in proportion to the sampling fraction.

The cases identified in a single clinic or treated by a single medical practitioner are possible case series for case–control studies. The corresponding source population for the cases treated in a clinic comprises all people who would attend that clinic and would be recorded with the diagnosis of interest, if they had the disease in question. It is important to specify “if they had the disease in question” because clinics serve different populations for different diseases, depending on referral patterns and the reputation of the clinic in specific speciality areas. As noted above, without a precisely identified source population, it may be difficult or impossible to select controls in an unbiased fashion.

Control Selection

The definition of the source population determines the population from which controls are sampled. Ideally, control selection will amount to a direct sampling of people within the source population. On the basis of the principles explained above regarding the role of the control series, many general rules for control selection can be formulated. Two basic rules are as follows: (a) Controls should be selected from the same population – the source population – that gives rise to the study cases. If this rule cannot be followed, there needs to be solid evidence that the population supplying controls has an exposure distribution identical to that of the population that is the source of cases, which is a very stringent demand that is rarely demonstrable. (b) Within strata defined by factors that are used for stratification in the analysis, controls should be selected independently
of their exposure status, in that the sampling rate for controls ($r$ in the above discussion) should not vary with exposure.

If these rules and the corresponding case rule are met, then the ratio of pseudofrequencies will, apart from sampling error, equal the ratio of the corresponding measure of disease frequency in the source population. If the sampling rate is known, then the actual measures of disease frequency can also be calculated [2]. Wacholder et al. have elaborated on the principles of control selection in case–control studies [3–5].

When one wishes controls to represent person-time, sampling of the person-time should be constant across exposure levels. This requirement implies that the sampling probability of any person as a control should be proportional to the amount of person-time that person spends at risk of disease in the source population. For example, if in the source population one person contributes twice as much person-time during the study period as another person, the first person should have twice the probability of the second of being selected as a control.

This difference in probability of selection is automatically induced by sampling controls at a steady rate per unit time over the period in which cases occur (longitudinal, or density sampling), rather than by sampling all controls at a point in time (such as the start or end of the study). With longitudinal sampling of controls, a population member present for twice as long as another will have twice the chance of being selected.

If the objective of the study is to estimate a risk or rate ratio, it should be possible for a person to be selected as a control and yet remain eligible to become a case, so that person might appear in the study as both a control and a case. This possibility may sound paradoxical or wrong, but is, nevertheless, correct. It corresponds to the fact that in a cohort study, a case contributes to both the numerator and the denominator of the estimated incidence. If the controls are intended to represent person-time and are selected longitudinally, similar arguments show that a person selected as a control should remain eligible to be selected as a control again, and thus might be included in the analysis repeatedly as a control [6, 7].

Common Fallacies in Control Selection

In cohort studies, the study population is restricted to people at risk for the disease. Because they viewed case–control studies as if they were cohort studies done backwards, some authors argued that case–control studies ought to be restricted to those at risk for exposure (i.e., those with exposure opportunity). Excluding sterile women from a case–control study of an adverse effect of oral contraceptives and matching for duration of employment in an occupational study are examples of attempts to control for exposure opportunity. Such restrictions do not directly address validity issues, and can ultimately harm study precision by reducing the number of unexposed subjects available for study [8]. If the factor used for restriction (e.g., sterility) is unrelated to the disease, it will not be a confounder, and hence the restriction will yield no benefit to the validity of the estimate of effect. Furthermore, if the restriction reduces the study size, the precision of the estimate of effect will be reduced.

Another principle sometimes used in cohort studies is that the study cohort should be “clean” at start of follow-up, including only people who have never had the disease. Misapplying this principle to case–control design suggests that the control group ought to be “clean”, including only people who are healthy, for example. Illness arising after the start of the follow-up period is not reason to exclude subjects from a cohort analysis, and such exclusion can lead to bias. Similarly, controls with illness that arose after exposure should not be removed from the control series. Nonetheless, in studies of the relation between cigarette smoking and colorectal cancer, certain authors recommended that the control group should exclude people with colon polyps, because colon polyps are associated with smoking and are precursors of colorectal cancer [9]. But such exclusion reduces the prevalence of the exposure in the controls below that in the actual source population of cases, and hence biases the effect estimates upward [10].

Sources for Control Series

The methods suggested below for control sampling apply when the source population cannot be explicitly enumerated, so random sampling is not possible. These methods should only be implemented subject to the reservations about secondary bases described above.

Neighborhood Controls. If the source population cannot be enumerated, it may be possible to select
controls through sampling of residences. This method is not straightforward. Usually, a geographic roster of residences is not available, so a scheme must be devised to sample residences without enumerating them all. For convenience, investigators may sample controls who are individually matched to cases from the same neighborhood. That is, after a case is identified, one or more controls residing in the same neighborhood as that case are identified and recruited into the study. If neighborhood is related to exposure, the matching should be taken into account in the analysis.

Neighborhood controls are often used when the cases are recruited from a convenient source, such as a clinic or hospital. Such usage can introduce bias, however, for the neighbors selected as controls may not be in the source population of the cases. For example, if the cases are from a particular hospital, neighborhood controls may include people who would not have been treated at the same hospital had they developed the disease. If being treated at the hospital from which cases are identified is related to the exposure under study, then using neighborhood controls would introduce a bias. For any given study, the suitability of using neighborhood controls needs to be evaluated with regard to the study variables on which the research focuses.

Hospital- or Clinic-Based Controls. As noted above, the source population for hospital- or clinic-based case–control studies is not often identifiable, since it represents a group of people who would be treated in a given clinic or hospital if they developed the disease in question. In such situations, a random sample of the general population will not necessarily correspond to a random sample of the source population. If the hospitals or clinics that provide the cases for the study only treat a small proportion of cases in the geographic area, then referral patterns to the hospital or clinic are important to take into account in the sampling of controls. For these studies, a control series comprising patients from the same hospitals or clinics as the cases may provide a less biased estimate of effect than general-population controls (such as those obtained from case neighborhoods or by random-digit dialing). The source population does not correspond to the population of the geographic area, but only to the people who would seek treatment at the hospital or clinic were they to develop the disease under study. While the latter population may be difficult or impossible to enumerate or even define very clearly, it seems reasonable to expect that other hospital or clinic patients will represent this source population better than general-population controls. The major problem with any nonrandom sampling of controls is the possibility that they are not selected independently of exposure in the source population. Patients hospitalized with other diseases, for example, may be unrepresentative of the exposure distribution in the source population either because exposure is associated with hospitalization, or because the exposure is associated with the other diseases, or both. For example, suppose the study aims to evaluate the relation between tobacco smoking and leukemia using hospitalized cases. If controls are people hospitalized with other conditions, many of them will have been hospitalized for conditions associated with smoking. A variety of other cancers, as well as cardiovascular diseases and respiratory diseases, are related to smoking. Thus, a control series of people hospitalized for diseases other than leukemia would include a higher proportion of smokers than would the source population of the leukemia cases.

Limiting the diagnoses for controls to conditions for which there is no prior indication of an association with the exposure improves the control series. For example, in a study of smoking and hospitalized leukemia cases, one could exclude from the control series anyone who was hospitalized with a disease known to be related to smoking. Such an exclusion policy may exclude most of the potential controls, since cardiovascular disease by itself would represent a large proportion of hospitalized patients. Nevertheless, even a few common diagnostic categories should suffice to find enough control subjects, so that the exclusions will not harm the study by limiting the size of the control series. Indeed, in limiting the scope of eligibility criteria, it is reasonable to exclude categories of potential controls even on the suspicion that a given category might be related to the exposure. If wrong, the cost of the exclusion is that the control series becomes more homogeneous with respect to diagnosis and perhaps a little smaller. But if right, then the exclusion is important to the ultimate validity of the study.

On the other hand, an investigator can rarely be sure that an exposure is not related to a disease or to hospitalization for a specific diagnosis. Consequently, it would be imprudent to use only a single diagnostic category as a source of controls. Using a variety of
diagnoses has the advantage of potentially diluting These exclusion and inclusion criteria apply only to
the biasing effects of including a specific diagnostic the diagnosis that brought the person into the reg-
group that is related to the exposure. istry or database from which controls are selected.
Excluding a diagnostic category from the list of The history of an exposure-related disease should not
eligibility criteria for identifying controls is intended be a basis for exclusion. If, however, the exposure
simply to improve the representativeness of the con- directly affects the chance of entering the registry or
trol series with respect to the source population. database, the study will be subject to the Berksonian
Such an exclusion criterion does not imply that there bias mentioned earlier for hospital studies.
should be exclusions based on disease history [11].
For example, in a case–control study of smoking and
Other Considerations for Subject Selection
hospitalized leukemia patients, one might use hospi-
talized controls but exclude any who are hospitalized Representativeness. Some textbooks have stressed
because of cardiovascular disease. This exclusion the need for representativeness in the selection of
criterion for controls does not imply that leukemia cases and controls. The advice has been that cases
cases who have had cardiovascular disease should should be representative of all people with the disease
be excluded; only if the cardiovascular disease was a cause of the hospitalization, should the case be excluded. For controls, the exclusion criterion should only apply to the cause of the hospitalization used to identify the study subject. A person who was hospitalized because of a traumatic injury and who is thus eligible to be a control would not be excluded if he or she had previously been hospitalized for cardiovascular disease. The source population includes people who have had cardiovascular disease, and they should be included in the control series. Excluding such people would lead to an underrepresentation of smoking relative to the source population and produce an upward bias in the effect estimates.

If exposure directly affects hospitalization (for example, if the decision to hospitalize is in part based on exposure history), the resulting bias cannot be remedied without knowing the hospitalization rates, even if the exposure is unrelated to the study disease or the control diseases. This problem was in fact one of the first problems of hospital-based studies to receive detailed analysis [12], and is often called Berksonian bias.

Other Diseases. In many settings, especially in populations with established disease registries or insurance-claims databases, it may be most convenient to choose controls from people who are diagnosed with other diseases. The considerations needed for valid control selection from other diagnoses parallel those just discussed for hospital controls. It is essential to exclude any diagnoses known or suspected to be related to exposure, and better still to include only diagnoses for which there is some evidence to indicate they are unrelated to exposure.

and that controls should be representative of the entire nondiseased population. Such advice can be misleading. A case–control study may be restricted to any type of case that may be of interest: female cases, old cases, severely ill cases, cases that died soon after disease onset, mild cases, cases from Philadelphia, cases among factory workers, and so on. In none of these examples would the cases be representative of all people with the disease, yet, in each one, perfectly valid case–control studies are possible [13]. The definition of a case can be virtually anything that the investigator wishes.

Ordinarily, controls should represent the source population for cases, rather than the entire nondiseased population. The latter may differ vastly from the source population for the cases by age, race, sex (e.g., if the cases come from a Veterans Administration hospital), socioeconomic status, occupation, and so on, including the exposure of interest. One of the reasons for emphasizing the similarities rather than the differences between cohort and case–control studies is that numerous principles apply to both types of study, but are more evident in the context of cohort studies. In particular, many principles relating to subject selection apply identically to both types of study. For example, it is widely appreciated that cohort studies can be based on special cohorts, rather than on the general population. It follows that case–control studies can be conducted by sampling cases and controls from within those special cohorts. The resulting controls should represent the distribution of exposure across those cohorts, rather than the general population, reflecting the more general rule that controls should represent the source population of the cases in the study, not the general population.
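
The rule that controls should represent the source population of the cases, not the general population, can be illustrated with a small numerical sketch. All counts below are hypothetical and chosen only for illustration: a special cohort in which 40% of members are exposed and the risk ratio is 2, and a general nondiseased population in which only 20% are exposed. The function name is likewise an assumption of this sketch, not something taken from the article.

```python
def exposure_odds_ratio(cases_exposed, cases_unexposed,
                        controls_exposed, controls_unexposed):
    """Cross-product ratio: exposure odds in cases over exposure odds in controls."""
    return (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)


# Hypothetical special cohort (the source population of the cases):
# 4000 exposed and 6000 unexposed members, with risks of 2% and 1%,
# so the risk ratio in the cohort is 2 and the cases number 80 and 60.
cases_exposed, cases_unexposed = 80, 60

# Controls drawn as a 5% sample of the cohort members themselves
# reproduce the cohort's exposure distribution (4000:6000 -> 200:300).
or_source = exposure_odds_ratio(cases_exposed, cases_unexposed, 200, 300)

# Controls drawn instead from a hypothetical general population in which
# only 20% are exposed (100 exposed and 400 unexposed among 500 controls).
or_general = exposure_odds_ratio(cases_exposed, cases_unexposed, 100, 400)

print(f"controls from source population:  {or_source:.2f}")
print(f"controls from general population: {or_general:.2f}")
```

Under these assumed numbers, the controls taken from the source cohort return the cohort risk ratio of 2, whereas the general-population controls overstate it, solely because their exposure distribution differs from that of the population that gave rise to the cases.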

Comparability of Information. Some authors have recommended that information obtained about cases and controls should be of comparable or equal accuracy, to ensure nondifferentiality (equal distribution) of measurement errors [3]. The rationale for this principle is the notion that nondifferential measurement error biases the observed association toward the null, and so will not generate a spurious association, and that bias in studies with nondifferential error is more predictable than in studies with differential error.

The comparability-of-information principle is often used to guide selection of controls and collection of data. For example, it is the basis for using proxy respondents instead of direct interviews for living controls, whenever case information is obtained from proxy respondents. Unfortunately, in most settings, the arguments for the principle are logically unsound. For example, in a study that used proxy respondents for cases, use of proxy respondents for the controls might lead to greater bias than use of direct interviews with controls, even if measurement error is differential. The comparability-of-information principle is therefore applicable only under very limited conditions. In particular, it would seem to be useful only when confounders and effect modifiers are measured with negligible error, and when measurement error is reduced by using comparable sources of information. Otherwise, the effect of forcing comparability of information may be as unpredictable as the effect of using noncomparable information.

Timing of Classification and Diagnosis. The principles for classifying persons, cases, and person-time units in cohort studies according to exposure status also apply to cases and controls in case–control studies. If the controls are intended to represent person-time (rather than persons) in the source population, one should apply principles for classifying person-time to the classification of controls. In particular, principles of person-time classification lead to the rule that controls should be classified by their exposure status as of their selection time. Exposures accrued after that time should be ignored. The rule necessitates that information (such as exposure history) be obtained in a manner that allows one to ignore exposures accrued after the selection time. In a similar manner, cases should be classified as of time of diagnosis or disease onset, accounting for any built-in lag periods or induction-period hypotheses.

Variants of the Case–Control Design

Nested Case–Control Studies

Epidemiologists sometimes refer to specific case–control studies as nested case–control studies when the population within which the study is conducted is a fully enumerated cohort, which allows formal random sampling of cases and controls to be carried out. The term is usually used in reference to a case–control study conducted within a cohort study, in which further information (perhaps from expensive tests) is obtained on most or all cases, but for economy is obtained from only a fraction of the remaining cohort members (the controls). Nonetheless, many population-based case–control studies can be thought of as nested within an enumerated source population.

Case–Cohort Studies

The case–cohort study is a case–control study in which the source population is a cohort, and every person in this cohort has an equal chance of being included in the study as a control, regardless of how much time that person has contributed to the person-time experience of the cohort or whether the person developed the study disease. This is a logical way to conduct a case–control study when the effect measure of interest is the ratio of incidence proportions rather than a rate ratio, as is common in perinatal studies. The average risk (or incidence proportion) of falling ill during a specified period may be written as

$$R_1 = \frac{A_1}{N_1} \tag{7}$$

for the exposed subcohort and

$$R_0 = \frac{A_0}{N_0} \tag{8}$$

for the unexposed subcohort, where $R_1$ and $R_0$ are the incidence proportions among the exposed and unexposed, respectively, and $N_1$ and $N_0$ are the initial sizes of the exposed and unexposed subcohorts. (This discussion applies equally well to exposure variables with several levels, but, for simplicity, we consider only a dichotomous exposure.) Controls should be selected such that the exposure distribution among them will estimate without bias the exposure distribution in the source population. In a case–cohort study,
the distribution we wish to estimate is among the $N_1 + N_0$ cohort members, not among their person-time experience [14–16].

The objective is to select controls from the source cohort such that the ratio of the number of exposed controls ($B_1$) to the number of exposed cohort members ($N_1$) is the same as the ratio of the number of unexposed controls ($B_0$) to the number of unexposed cohort members ($N_0$), apart from sampling error:

$$\frac{B_1}{N_1} = \frac{B_0}{N_0} \tag{9}$$

Here, $B_1/N_1$ and $B_0/N_0$ are the control sampling fractions (the number of controls selected per cohort member). Apart from random error, these sampling fractions will be equal if controls have been selected independently of exposure.

We can use the frequencies of exposed and unexposed controls as substitutes for the actual denominators of the incidence proportions to obtain “pseudorisks”:

$$\text{Pseudorisk}_1 = \frac{A_1}{B_1} \tag{10}$$

and

$$\text{Pseudorisk}_0 = \frac{A_0}{B_0} \tag{11}$$

These pseudorisks have no epidemiologic interpretation by themselves. Suppose, however, that the control sampling fractions are equal to the same fraction, $f$. Then, apart from sampling error, $B_1/f$ should equal $N_1$, the size of the exposed subcohort; and $B_0/f$ should equal $N_0$, the size of the unexposed subcohort: $B_1/f = B_1/(B_1/N_1) = N_1$ and $B_0/f = B_0/(B_0/N_0) = N_0$. Thus, to get the incidence proportions, we need only multiply each pseudorisk by the common sampling fraction, $f$. If this fraction is not known, we can still compare the sizes of the pseudorisks by division:

$$\frac{\text{Pseudorisk}_1}{\text{Pseudorisk}_0} = \frac{A_1/B_1}{A_0/B_0} = \frac{A_1/[(B_1/N_1)N_1]}{A_0/[(B_0/N_0)N_0]} = \frac{A_1/(fN_1)}{A_0/(fN_0)} = \frac{A_1/N_1}{A_0/N_0} \tag{12}$$

In other words, the ratio of pseudorisks is an estimate of the ratio of incidence proportions (risk ratio) in the source cohort if control sampling is independent of exposure. Thus, using a case–cohort design, one can estimate the risk ratio in a cohort without obtaining information on every cohort member.

Thus far, we have implicitly assumed that there is no loss to follow-up or competing risks in the underlying cohort. If there are such problems, it is still possible to estimate risk or rate ratios from a case–cohort study, provided that we have data on the time spent at risk by the sampled subjects or we use certain sampling modifications [17]. These procedures require the usual assumptions for rate-ratio estimation in cohort studies, namely, that loss to follow-up and competing risks are either not associated with exposure or not associated with disease risk.

An advantage of the case–cohort design is that it facilitates conduct of a set of case–control studies from a single cohort, all of which use the same control group. Just as one can measure the incidence rate of a variety of diseases within a single cohort, one can conduct a set of simultaneous case–control studies using a single control group. A sample from the cohort is the control group needed to compare with any number of case groups. If matched controls are selected from people at risk at the time a case occurs (as in risk-set sampling, which is described in the section titled “Density Case–Control Studies”), the control series must be tailored to a specific group of cases. To have a single control series serve many case groups, another sampling scheme must be used. The case–cohort approach is a good choice in such a situation.

Wacholder has discussed the advantages and disadvantages of the case–cohort design relative to the usual type of case–control study [18]. One point to note is that, because of the overlap of membership in the case and control groups (controls who are selected may also develop disease and enter the study as cases), one will need to select more controls in a case–cohort study than in an ordinary case–control study with the same number of cases, if one is to achieve the same amount of statistical precision. Extra controls are needed because the statistical precision of a study is strongly determined by the numbers of distinct cases and noncases. Thus, if 20% of the source cohort members will become cases, and all cases will be included in the study, one will have to select 1.25 times as many controls as cases in a case–cohort study to ensure that there will be as many
controls who never become cases in the study. On average, only 80% of the controls in such a situation will remain noncases; the other 20% will become cases. Of course, if the disease is uncommon, the number of extra controls needed for a case–cohort study will be small.

Density Case–Control Studies

Earlier, we described how case–control odds ratios will estimate rate ratios if the control series is selected so that the ratio of the person-time denominators $T_1/T_0$ is validly estimated by the ratio of exposed to unexposed controls $B_1/B_0$. That is, to estimate rate ratios, controls should be selected so that the exposure distribution among them is, apart from random error, the same as it is among the person-time in the source population. Such control selection is called density sampling because it provides for estimation of relations among incidence rates, which have been called incidence densities.

If a subject’s exposure may vary over time, then a case’s exposure history is evaluated up to the time the disease occurred. A control’s exposure history is evaluated up to an analogous index time, usually taken as the time of sampling; exposure after the time of selection must be ignored. This rule helps ensure that the number of exposed and unexposed controls will be in proportion to the amount of exposed and unexposed person-time in the source population.

The time during which a subject is eligible to be a control should be the time in which that person is also eligible to become a case, should the disease occur. Thus, a person in whom the disease has already developed or who has died is no longer eligible to be selected as a control. This rule corresponds to the treatment of subjects in cohort studies. Every case that is tallied in the numerator of a cohort study contributes to the denominator of the rate until the time that the person becomes a case, when the contribution to the denominator ceases. One way to implement this rule is to choose controls from the set of people in the source population who are at risk of becoming a case at the time that the case is diagnosed. This set is sometimes referred to as the risk set for the case, and this type of control sampling is sometimes called risk-set sampling. Controls sampled in this manner are matched to the case with respect to sampling time; thus, if time is related to exposure, the resulting data should be analyzed as matched data [19]. It is also possible to conduct unmatched density sampling using probability sampling methods if one knows the time interval at risk for each population member. One then selects a control by sampling members with probability proportional to time at risk, and then randomly samples a time to measure exposure within the interval at risk.

As mentioned earlier, a person who is selected as a control and who remains in the study population at risk after selection should remain eligible to be selected once again as a control. Thus, although unlikely in typical studies, the same person may appear in the control group two or more times. Note, however, that including the same person at different times does not necessarily lead to exposure (or confounder) information being repeated, because this information may change with time. For example, in a case–control study of an acute epidemic of intestinal illness, one might ask about food ingested within the previous day or days. If a contaminated food item was a cause of the illness for some cases, then the exposure status of a case or control chosen 5 days into the study might well differ from what it would have been 2 days into the study when the subject might also have been included as a control.

Cumulative (“Epidemic”) Case–Control Studies

In some research settings, case–control studies may address a risk that ends before subject selection begins. For example, a case–control study of an epidemic of diarrheal illness after a social gathering may begin after all the potential cases have occurred (because the maximum induction time has elapsed). In such a situation, an investigator might select controls from that portion of the population that remains after eliminating the accumulated cases; that is, one selects controls from among noncases (those who remain noncases at the end of the epidemic follow-up).

Suppose that the source population is a cohort and that a fraction $f$ of both exposed and unexposed noncases are selected to be controls. Then the ratio of pseudofrequencies will be

$$\frac{A_1/B_1}{A_0/B_0} = \frac{A_1/[f(N_1 - A_1)]}{A_0/[f(N_0 - A_0)]} = \frac{A_1/(N_1 - A_1)}{A_0/(N_0 - A_0)} \tag{13}$$

which is the incidence odds ratio for the cohort. The latter ratio will provide a reasonable approximation to the rate ratio, provided that the proportions falling ill
in each exposure group during the risk period are low, that is, less than about 20%, and that the prevalence of exposure remains reasonably steady during the study period. If the investigator prefers to estimate the risk ratio rather than the incidence rate ratio, the study odds ratio can still be used [20], but the accuracy of this approximation is only about half as good as that of the odds ratio approximation to the rate ratio [21]. The use of this approximation in the cumulative design is the basis for the common and mistaken notion that a rare-disease assumption is needed to estimate risk ratios in all case–control studies.

Prior to the 1970s, the standard conceptualization of case–control studies involved the cumulative design, in which controls are selected from noncases at the end of a follow-up period. As discussed by numerous authors [19, 22, 23], density designs and case–cohort designs have several advantages outside of the acute epidemic setting, including potentially much less sensitivity to bias from exposure-related loss to follow-up.

Case-Specular and Case-Crossover Studies

When the exposure under study is defined by proximity to an environmental source (e.g., a power line), it may be possible to construct a specular (hypothetical) control for each case, by conducting a “thought experiment”. Either the case or the exposure source is imaginarily moved to another location that would be equally likely were there no exposure effect; the case exposure level under this hypothetical configuration is then treated as the (matched) “control” exposure for the case [24]. When the specular control arises by examining the exposure experience of the case outside of the time in which exposure could be related to disease occurrence, the result is called a case-crossover study.

The classic crossover study is a type of experiment in which two (or more) treatments are compared, as in any experimental study. In a crossover study, however, each subject receives both treatments, with one following the other. Preferably, the order in which the two treatments are applied is randomly chosen for each subject. Enough time should be allocated between the two administrations so that the effect of each treatment can be measured and can subside before the other treatment is given. A persistent effect of the first intervention is called a carryover effect. A crossover study is only valid to study treatments for which effects occur within a short induction period and do not persist, i.e., carryover effects must be absent, so that the effect of the second intervention is not intermingled with the effect of the first.

The case-crossover study is a case–control analogue of the crossover study [25]. For each case, one or more predisease or postdisease time periods are selected as matched “control” periods for the case. The exposure status of the case at the time of the disease onset is compared with the distribution of exposure status for that same person in the control periods. Such a comparison depends on the assumption that neither exposure nor confounders change with time in a systematic way.

Only a limited set of research topics are amenable to the case-crossover design. The exposure must vary over time within individuals, rather than stay constant. If the exposure does not vary within a person, then there is no basis for comparing exposed and unexposed time periods of risk within the person. Like the crossover study, the exposure must also have a short induction time and a transient effect; otherwise, exposures in the distant past could be the cause of a recent disease onset (a carryover effect).

Maclure [25] used the case-crossover design to study the effect of sexual activity on incident myocardial infarction. This topic is well suited to a case-crossover design because the exposure is intermittent and is presumed to have a short induction period for the hypothesized effect. Any increase in risk for a myocardial infarction from sexual activity is presumed to be confined to a short time following the activity. A myocardial infarction is an outcome well suited to this type of study because it is thought to be triggered by events close in time.

Each case in a case-crossover study is automatically matched with its control on all characteristics (e.g., sex and birth date) that do not change within individuals. A matched analysis of case-crossover data automatically adjusts for all such fixed confounders, whether or not they are measured. Control for measured time-varying confounders is possible using modeling methods for matched data. It is also possible to adjust case-crossover estimates for bias owing to time trends in exposure through use of longitudinal data from a nondiseased control group (case-time controls) [26]. Nonetheless, these trend adjustments themselves depend on additional
no-confounding assumptions and may introduce bias if those assumptions are not met [27].

Two-Stage Sampling

Another variant of the case–control study uses two-stage or two-phase sampling [28, 29]. In this type of study, the control series comprises a relatively large number of people (possibly everyone in the source population), from whom exposure information or perhaps some limited amount of information on other relevant variables is obtained. Then, for only a subsample of the controls, more detailed information is obtained on exposure or on other study variables that may need to be controlled in the analysis. More detailed information may also be limited to a subsample of cases. This two-stage approach is useful when it is relatively inexpensive to obtain the exposure information (e.g., by telephone interview), but the covariate information is more expensive to obtain (say, by laboratory analysis). It is also useful when exposure information already has been collected on the entire population (e.g., job histories for an occupational cohort), but covariate information is needed (e.g., genotype). This situation arises in cohort studies when more information is required than was gathered at baseline. This type of study requires special analytic methods to take full advantage of the information collected at both stages.

Case–Control Studies with Prevalent Cases

Case–control studies are sometimes based on prevalent cases rather than incident cases. When it is impractical to include only incident cases, it may still be possible to select existing cases of illness at a point in time. If the prevalence odds ratio in the population is equal to the incidence rate ratio, then the odds ratio from a case–control study based on prevalent cases can unbiasedly estimate the rate ratio. The conditions required for the prevalence odds ratio to equal the rate ratio are very strong, however, and a simple relation does not exist for age-specific ratios. If exposure is associated with duration of illness or migration out of the prevalence pool, then a case–control study based on prevalent cases cannot by itself distinguish exposure effects on disease incidence from the exposure association with disease duration or migration, unless the strengths of the latter associations are known. If the size of the exposed or the unexposed population changes with time or there is migration into the prevalence pool, the prevalence odds ratio may be further removed from the rate ratio. Consequently, it is always preferable to select incident rather than prevalent cases when studying disease etiology.

Prevalent cases are usually drawn in studies of congenital malformations. In such studies, cases ascertained at birth are prevalent because they have survived with the malformation from the time of its occurrence until birth. It would be etiologically more useful to ascertain all incident cases, including affected abortuses that do not survive until birth. Many of these, however, do not survive until ascertainment is feasible, and thus it is virtually inevitable that case–control studies of congenital malformations are based on prevalent cases. In this example, the source population comprises all conceptuses, and miscarriage and induced abortion represent emigration before the ascertainment date. Although an exposure will not affect duration of a malformation, it may very well affect risks of miscarriage and abortion.

Other situations in which prevalent cases are commonly used are studies of chronic conditions with ill-defined onset times and limited effects on mortality, such as obesity and multiple sclerosis, and studies of health services utilization.

Conclusion

Epidemiologic research employs a range of study designs, including both experimental and nonexperimental studies. Among nonexperimental studies, cohort designs are sometimes thought to be inherently less susceptible to bias than case–control designs. Nonetheless, most of the biases that are associated with case–control studies are not inherent to the design, nor are cohort studies immune from bias. For example, recall bias will not occur in a case–control study when exposure comes from records taken before disease onset, and selection bias can occur in cohort studies that suffer from loss to follow-up. No epidemiologic study is perfect, and this caution applies to cohort studies as well as case–control studies. A clear understanding of the principles of study design is essential for valid study design, conduct, and analysis, and for proper interpretation of results, regardless of the design.
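
As a closing numerical check on two of the design variants described in this article, the following sketch compares what the control series estimates under case–cohort sampling and under cumulative sampling. The cohort sizes, case counts, and sampling fraction are hypothetical values chosen for illustration, and expected control counts are used in place of a random draw so that the arithmetic is exact. The symbols follow the article: A for cases, N for subcohort sizes, B for controls, and f for the control sampling fraction.

```python
from fractions import Fraction

# Hypothetical cohort: subcohort sizes and numbers of cases (risks 20% and 10%).
N1, N0 = 1000, 2000
A1, A0 = 200, 200
f = Fraction(1, 10)  # assumed common control sampling fraction

# Case-cohort sampling: controls are drawn from all cohort members,
# whether or not they later become cases.
B1_cc, B0_cc = f * N1, f * N0
pseudorisk_ratio = (A1 / B1_cc) / (A0 / B0_cc)
risk_ratio = Fraction(A1, N1) / Fraction(A0, N0)

# Cumulative sampling: controls are drawn only from those who remain
# noncases at the end of follow-up.
B1_cum, B0_cum = f * (N1 - A1), f * (N0 - A0)
pseudofrequency_ratio = (A1 / B1_cum) / (A0 / B0_cum)
incidence_odds_ratio = Fraction(A1, N1 - A1) / Fraction(A0, N0 - A0)

# If 20% of sampled cohort members become cases, the number of controls
# must be inflated by 1 / (1 - 0.20) to keep the same number of noncases.
extra_control_factor = 1 / (1 - Fraction(1, 5))

print("case-cohort ratio:", pseudorisk_ratio, "risk ratio:", risk_ratio)
print("cumulative ratio:", pseudofrequency_ratio,
      "incidence odds ratio:", incidence_odds_ratio)
print("control inflation factor:", extra_control_factor)
```

With these assumed numbers the case–cohort ratio equals the risk ratio (2), while the cumulative design returns the incidence odds ratio (9/4), which drifts away from the risk ratio because the risks are not small. The inflation factor of 5/4 matches the 1.25 figure given in the discussion of case–cohort studies.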

Acknowledgment

This article is adapted from Modern Epidemiology, Third Edition, Rothman KJ, Greenland S, and Lash TL, eds., Lippincott Williams & Wilkins, 2008.

References

[1] Greenland, S. (1985). Control-initiated case-control studies, International Journal of Epidemiology 14, 130–134.
[2] Rothman, K.J. & Greenland, S. (1998). Modern Epidemiology, 2nd Edition, Lippincott, Philadelphia, Chapter 21.
[3] Wacholder, S., McLaughlin, J.K., Silverman, D.T. & Mandel, J.S. (1992). Selection of controls in case-control studies. I. Principles, American Journal of Epidemiology 135, 1019–1028.
[4] Wacholder, S., McLaughlin, J.K., Silverman, D.T. & Mandel, J.S. (1992). Selection of controls in case-control studies. II. Types of controls, American Journal of Epidemiology 135, 1029–1041.
[5] Wacholder, S., McLaughlin, J.K., Silverman, D.T. & Mandel, J.S. (1992). Selection of controls in case-control studies. III. Design options, American Journal of Epidemiology 135, 1042–1050.
[6] Lubin, J.H. & Gail, M.H. (1984). Biased selection of controls for case-control analyses of cohort studies, Biometrics 40, 63–75.
[7] Robins, J.M., Gail, M.H. & Lubin, J.H. (1986). More on biased selection of controls for case-control analyses of cohort studies, Biometrics 42, 293–299.
[8] Poole, C. (1986). Exposure opportunity in case-control studies, American Journal of Epidemiology 123, 352–358.
[9] Terry, M.B. & Neugut, A.I. (1998). Cigarette smoking and the colorectal adenoma-carcinoma sequence: a hypothesis to explain the paradox, American Journal of Epidemiology 147, 903–910.
[10] Poole, C. (1999). Controls who experienced hypothetical causal intermediates should not be excluded from case-control studies, American Journal of Epidemiology 150, 547–551.
[11] Lubin, J.H. & Hartge, P. (1984). Excluding controls: misapplications in case-control studies, American Journal of Epidemiology 120, 791–793.
[12] Berkson, J. (1946). Limitations of the application of 4-fold tables to hospital data, Biometrics Bulletin 2, 47–53.
[13] Cole, P. (1979). The evolving case-control study, Journal of Chronic Diseases 32, 15–27.
[14] Thomas, D.B. (1972). Relationship of oral contraceptives to cervical carcinogenesis, Obstetrics and Gynecology 40, 508–518.
[15] Kupper, L.L., McMichael, A.J. & Spirtas, R. (1975). A hybrid epidemiologic design useful in estimating relative risk, Journal of the American Statistical Association 70, 524–528.
[16] Miettinen, O.S. (1982). Design options in epidemiologic research: an update, Scandinavian Journal of Work, Environment & Health 8(Suppl. 1), 7–14.
[17] Flanders, W.D., DerSimonian, R. & Rhodes, P. (1990). Estimation of risk ratios in case-base studies with competing risks, Statistics in Medicine 9, 423–435.
[18] Wacholder, S. (1991). Practical considerations in choosing between the case-cohort and nested case-control design, Epidemiology 2, 155–158.
[19] Greenland, S. & Thomas, D.C. (1982). On the need for the rare disease assumption in case-control studies, American Journal of Epidemiology 116, 547–553.
[20] Cornfield, J. (1951). A method of estimating comparative rates from clinical data. Application to cancer of the lung, breast and cervix, Journal of the National Cancer Institute 11, 1269–1275.
[21] Greenland, S. (1987). Interpretation and choice of effect measures in epidemiologic analysis, American Journal of Epidemiology 125, 761–768.
[22] Sheehe, P.R. (1962). Dynamic risk analysis in retrospective matched-pair studies of disease, Biometrics 18, 323–341.
[23] Miettinen, O.S. (1976). Estimability and estimation in case-referent studies, American Journal of Epidemiology 103, 226–235.
[24] Zaffanella, L.E., Savitz, D.A., Greenland, S. & Ebi, K.L. (1998). The residential case-specular method to study wire codes, magnetic fields, and disease, Epidemiology 9, 16–20.
[25] Maclure, M. (1991). The case-crossover design: a method for studying transient effects on the risk of acute events, American Journal of Epidemiology 133, 144–153.
[26] Suissa, S. (1995). The case-time-control design, Epidemiology 6, 248–253.
[27] Greenland, S. (1996). Confounding and exposure trends in case-crossover and case-time-control designs, Epidemiology 7, 231–239.
[28] Walker, A.M. (1982). Anamorphic analysis: sampling and estimation for confounder effects when both exposure and disease are known, Biometrics 38, 1025–1032.
[29] White, J.E. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease, American Journal of Epidemiology 115, 119–128.

Related Articles

Absolute Risk Reduction
Epidemiology as Legal Evidence
History of Epidemiologic Studies

KENNETH J. ROTHMAN, SANDER GREENLAND AND TIMOTHY L. LASH
