Chapter 08
Case-Control Studies
8.1 Introduction
8.2 Identification of Cases and Controls
• Identification of Cases
• Selection of Controls
• Number of Controls per Case
8.3 Obtaining Information on Exposure
8.4 Data Analysis
• Dichotomous Exposure
• Multiple Levels of Exposure
• Matched-Pairs
• Matched-Tuples
8.5 Advanced Topic (Optional): Statistical Equivalence of the Case-Control Odds Ratios and
Source Population Relative Risks
• Incidence Density Sampling Justification (“Modern Epidemiology”)
• Cumulative Incidence Sampling Justification (“Traditional Epidemiology”)
8.1 Introduction
The two most common types of nonexperimental study designs in epidemiology are cohort
studies and case-control studies. The objective of both types of studies is to clarify the
relationship between antecedent exposures and subsequent disease. They differ, however, in the
way they select their study subjects.
Recall from previous chapters that cohort studies start by selecting disease-free subjects from a
source population. Individual study subjects are then classified according to their exposure status
before individual experiences are followed either prospectively or retrospectively over time.
Cohort
                    ┌ exposed individuals    → disease incidence ┐
Source population ──┤                                            ├ Compare incidence of disease
                    └ nonexposed individuals → disease incidence ┘
In contrast, case-control studies start by selecting diseased subjects (cases) and nondiseased
subjects (controls) from a source population. Prior exposures are ascertained in the cases and
controls. Thus, the primary distinction between case-control studies and cohort studies is the manner in
which subjects are selected for study.
Case-Control
                    ┌ cases    → prior exposure ┐
Source population ──┤                           ├ Compare odds of exposure
                    └ noncases → prior exposure ┘
Nevertheless, case-control studies and cohort studies bear many similarities. Both use observational data
to address longitudinal relationships between antecedent exposures and diseases in source populations. In
fact, a case-control study can be viewed as a variant of the cohort design in which the etiologic histories
of the case series provide the numerator information for the source population’s disease rates and the
control series provides information about the ratio of the person-time denominators that form the rates.
This can seem counterintuitive, but is explained further in §8.5 of this chapter. Much like a cohort study,
a case-control study is a comparison of rates in exposed and nonexposed groups. However, by studying only a
fraction of the noncases in the source population, the case-control method gains statistical efficiency
beyond that of the cohort approach. This gain in statistical efficiency is especially important when studying
rare diseases.
Because case-control studies do not enumerate the absolute sizes of the exposed and non-
exposed groups in the source population, they are unable to directly calculate absolute rates of
disease. They can, however, quantify the relative rate of disease associated with the exposure in
the form of a statistic known as the exposure odds ratio. Table 8.1 shows the notation we will
use when describing case-control data. In this table, A1 represents the number of exposed cases,
A0 represents the number of nonexposed cases, B1 represents the number of exposed controls, and
B0 represents the number of nonexposed controls.*
TABLE 8.1. 2-by-2 table notation for case-control studies.
Cases Controls
Exposure + A1 B1
Exposure – A0 B0
Cross-tabulated data from the 2-by-2 table are used to calculate this exposure odds ratio:
\[ \widehat{OR} = \frac{A_1 / A_0}{B_1 / B_0} \tag{8.1} \]
\[ \widehat{OR} = \frac{A_1 B_0}{A_0 B_1} \tag{8.2} \]
which is merely the cross-product ratio of table counts. The overhead hat symbol (“^”) in both
formulas designates them as sample odds ratios, which are statistical estimates of the “true” odds
ratio parameter (Chapter 9).
* Notice that in this notation, A and B represent the disease status of study participants and subscripts 1 and 0
represent their exposure status.
Because ORs from case-control studies are direct estimates of the Rate Ratio in the source
population, they have the same interpretation as RRs from cohort studies (§7.6): they quantify the
extent to which the exposure multiplies risk in the exposed group relative to that of the
nonexposed group. Thus, an OR of 1 indicates no association between the exposure and disease,
an OR of 2 indicates that the exposure doubles risk, an OR of 0.5 indicates that the exposure cuts
the risk in half, and so on.
Illustrative Example 8.1 (Odds Ratio, esophageal cancer and alcohol consumption). A case-
control study identified 200 cases of esophageal cancer in men from the regional hospitals of the
Ille-et-Vilaine (Brittany) region of France between January 1972 and April of 1974. Seven-
hundred seventy-five men were selected from electoral lists from this region to serve as controls
(Tuyns et al., 1977; Breslow & Day, 1980). Table 8.2 looks at the effect of alcohol consumption
on esophageal cancer risk with alcohol consumption dichotomized at less than or more than 80
grams per day. Ninety-six (48%) of the 200 cases are classified as exposed. In contrast, 109
(14%) of the 775 controls are classified as exposed. Thus, \(\widehat{OR} = \frac{(96)(666)}{(104)(109)} = 5.64\),
indicating 5.64 times the risk of esophageal cancer in the high alcohol consumer group relative to the low
alcohol consumer group.
TABLE 8.2. Data for Illustrative Example 8.1. Alcohol consumption and esophageal cancer.
Alcohol (grams/day)   Cases   Controls
≥ 80                  96      109
< 80                  104     666
Total                 200     775
Source: Tuyns et al. (1977).
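As a quick check on the arithmetic in Illustrative Example 8.1, the cross-product ratio can be computed in a few lines of Python. This is only a sketch: the function name `odds_ratio` is an illustrative choice and not part of any program mentioned in this chapter, and the nonexposed counts (104 cases, 666 controls) are obtained by subtracting the exposed counts from the column totals of Table 8.2.

```python
def odds_ratio(a1, a0, b1, b0):
    """Cross-product odds ratio for a 2-by-2 case-control table.

    a1, a0: exposed and nonexposed cases
    b1, b0: exposed and nonexposed controls
    """
    if a0 == 0 or b1 == 0:
        raise ValueError("odds ratio is undefined when A0 or B1 is zero")
    return (a1 * b0) / (a0 * b1)


# Table 8.2: nonexposed counts derived from the column totals.
cases_exposed, cases_total = 96, 200
controls_exposed, controls_total = 109, 775
or_hat = odds_ratio(
    cases_exposed,
    cases_total - cases_exposed,        # 104 nonexposed cases
    controls_exposed,
    controls_total - controls_exposed,  # 666 nonexposed controls
)
print(f"OR = {or_hat:.2f}")  # prints "OR = 5.64"
```

The guard against zero cells is a design choice for this sketch; small-sample corrections are a separate topic and are not applied here.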
Identification of Cases
The source population is defined as the population that gave rise to cases of the disease in the
study. You may think of this as the base for the study, or the study base.
Case-control studies will generally attempt to identify all the cases in the source population. This
is aided by use of a standard case definition, based on current understandings of underlying
pathologies and available technologies. The objective of the case definition is to aid in
identifying cases as accurately, efficiently, and uniformly as possible.
Case-control studies may restrict themselves to cases of recent onset (incident cases) or cases
that occurred at any time in the past (prevalent cases). It is preferable to study incident cases
when possible, because prevalence depends on both incidence and the duration of illness (§3.4),
and the duration of an illness depends on factors other than cause. There are times, however,
when we have no choice but to study prevalent cases. Use of prevalent cases is acceptable if the
duration of the illness is independent of its cause.
Selection of Controls
The purpose of the control series is to reflect the relative frequency of the exposure in the source
population. Valid selection of controls, therefore, relies on carefully defining the population that
gave rise to the cases.
The underlying source population in a case-control study can be either a closed population (i.e., a
cohort) or an open population. Case-control studies based in open populations may be referred to
as population-based case-control studies. Case-control studies based in cohorts are called
nested case–control studies. Nesting a case-control study in a cohort provides a well-defined
sampling frame for the selection of controls: a random sample of the cohort will serve as the
control series. Here is an example in which an HMO cohort served as the sampling frame for the
selection of controls.
Illustrative Example 8.2 (Selection of Controls, Vasectomy and Prostate Cancer). A case-
control study evaluated the relationship between vasectomy and prostate cancer in a health
maintenance organization (HMO) source population. Cases consisted of 175 histologically
confirmed instances of prostate cancer occurring between 1989 and 1991 that were treated in this
HMO system. The control series consisted of 258 similarly-aged men selected at random from
the HMO’s general membership rolls. Additional data were collected via questionnaire and
medical record review.
Table 8.3 presents cross-tabulated data from this study. The \(\widehat{OR} = \frac{(61)(165)}{(114)(93)} = 0.95\),
suggesting a scant negative association between the exposure and disease. However, since the odds ratio is so
close to 1, and we have not yet accounted for random and systematic errors in the data, the odds
ratio of 0.95 is reasonably interpreted as indicating no association between vasectomy and prostate
cancer.
TABLE 8.3. Data for Illustrative Example 8.2. Vasectomy and prostate cancer.
               Cases   Controls
Vasectomy +    61      93
Vasectomy –    114     165
Total          175     258
Source: Zhu et al. (1996).
The above example illustrates an important benefit of the case-control approach: the study was
completed with a modest total sample size of 175 + 258 = 433 study subjects. Because prostate
cancer occurs on the order of 150 new cases per 100,000 men per year, studying a cohort of
100,000 men for a year would still be unlikely to yield 175 incident cases. This demonstrates
the statistical efficiency of case-control studies when studying rare diseases.
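The efficiency argument above can be made concrete with a rough calculation. The sketch below assumes, purely for illustration, that the yearly case count in a hypothetical cohort of 100,000 men follows a Poisson distribution with the approximate rate quoted in the text; the Poisson model and the helper name `poisson_upper_tail` are assumptions of this example, not part of the original study.

```python
import math


def poisson_upper_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean), with each term computed in log space."""
    cdf_below_k = sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k)
    )
    return 1.0 - cdf_below_k


rate_per_person_year = 150 / 100_000   # approximate incidence quoted in the text
cohort_size = 100_000                  # hypothetical cohort
years = 1

expected_cases = rate_per_person_year * cohort_size * years
p_at_least_175 = poisson_upper_tail(175, expected_cases)

print(f"expected cases: {expected_cases:.0f}")
print(f"P(at least 175 cases): {p_at_least_175:.3f}")
```

Under these assumptions the cohort would be expected to yield about 150 cases, with only a small chance of reaching 175, whereas the case-control study needed just 433 subjects in total.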
Identifying the source population for open population case-control studies presents a slightly
greater challenge. If cases come from a particular set of hospitals or health care facilities, then
the catchment area for those health facilities may serve as the sampling frame for the selection
of controls. In Illustrative Example 8.1, for example, the electoral lists from the regions served
by the hospitals used to identify cases served as the sampling frame for the selection of
controls.
†
WinPEPI (Windows Programs for EPIdemiologists) is a free computer application for Windows computers that
calculates numerous epidemiologic statistics. The program can be downloaded from
[Link]/pepi4windows.
Dichotomous Exposure
Use of cross-tabulated 2-by-2 data to calculate the odds ratio was introduced in §8.1 of this
chapter. Recall that the odds ratio is a statistical estimate of the rate ratio in the underlying source
population. Thus, an odds ratio of 1 indicates no association between the exposure and disease,
an odds ratio significantly greater than 1 indicates a positive association, and an odds ratio
significantly less than 1 indicates a negative association. The farther the odds ratio is from a value
of 1 (either toward 0 or toward infinity), the stronger the association. Thus, the \(\widehat{OR}\) of 5.6 in
Illustrative Example 8.1 indicated a strong positive association between high alcohol
consumption and esophageal cancer. The \(\widehat{OR}\) of 0.95 in Illustrative Example 8.2 indicated a very
scant negative association (nearly no association) between prostatic cancer and prior vasectomy.
Small studies produce odds ratio estimates that are relatively imprecise, whereas large studies
provide precise estimates. The precision of a given odds ratio can be gauged by calculating its
confidence interval with an application such as WinPEPI’s Compare2 program (Abramson,
2011) or [Link]’s “Counts → Two by Two Table” application (Dean et al., 2011).
Figure 8.2 is a screenshot from OpenEpi’s “Counts → Two by Two Table” application with data
from Illustrative Example 8.1. Figure 8.3 is a screenshot of the application’s output. Note that the
95% confidence interval for the odds ratio is calculated in several different ways. The Exact
Mid-P method (Berry & Armitage, 1995) is highlighted, revealing that the data are consistent with
odds ratios between 3.992 and 7.947 with 95% confidence. Keep in mind that confidence
intervals address imprecision (random error) but do not address systematic sources of error.‡
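For readers who want to see how such an interval arises, here is a minimal sketch of one common large-sample approach, the Woolf (logit) interval. It is not the Exact Mid-P method highlighted above, so its limits should be close to, but not identical with, the values quoted from OpenEpi. The function name `woolf_ci` is illustrative, and the counts are those of Illustrative Example 8.1, with nonexposed counts obtained by subtraction from the column totals.

```python
import math


def woolf_ci(a1, a0, b1, b0, z=1.96):
    """Odds ratio with a Woolf (logit) confidence interval.

    z=1.96 gives an approximate 95% interval.
    """
    or_hat = (a1 * b0) / (a0 * b1)
    # Standard error of ln(OR): square root of the summed reciprocal cell counts
    se_log_or = math.sqrt(1 / a1 + 1 / a0 + 1 / b1 + 1 / b0)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


or_hat, lower, upper = woolf_ci(96, 104, 109, 666)
print(f"OR = {or_hat:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is built on the log scale, it is asymmetric around the point estimate, which is also true of the intervals reported by OpenEpi.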
Figure 8.2. Screen shot from OpenEpi’s 2-by-2 table application showing data from Illustrative
Example 8.1. [[Link]]
Figure 8.3. Output from OpenEpi’s 2-by-2 table application showing results for Illustrative
Example 8.1. [[Link]]
‡
Chapter 9 will discuss the difference between imprecision (random error) and bias (systematic error).
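TABLE 8.4. Alcohol consumption and esophageal cancer, multiple levels of exposure.
Alcohol (grams/day)   Cases   Controls
0 – 39                29      386
40 – 79               75      280
80 – 119              51      87
120+                  45      22
Total                 200     775
Source: Tuyns et al. (1977).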
When the exposure categories are nominal, selection of the referent (“nonexposed”) group is
arbitrary. When the exposure levels can be placed in rank order, as they are in Table 8.4, the least
exposed group serves as the referent group for each comparison. The notation shown in Table
8.5 is adopted, and the odds ratio associated with each level of exposure is:
\[ \widehat{OR}_i = \frac{A_i B_0}{A_0 B_i} \tag{8.3} \]
TABLE 8.5. Notation for Case–Control Studies with Multiple Levels of Exposure^a
Exposure Level   Cases   Controls
0                A0      B0
1                A1      B1
⋮                ⋮       ⋮
i                Ai      Bi
⋮                ⋮       ⋮
k                Ak      Bk
^a Exposure levels {i: 0, 1, …, k} are graded from low to high, where level 0 represents the least exposed group.
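Illustrative Example 8.3 (Multiple Levels of Exposure, Esophageal Cancer and Alcohol
Consumption). For the data in Table 8.4, \(\widehat{OR}_0 = 1\),§
\(\widehat{OR}_1 = \frac{(75)(386)}{(29)(280)} = 3.57\),
\(\widehat{OR}_2 = \frac{(51)(386)}{(29)(87)} = 7.80\), and
\(\widehat{OR}_3 = \frac{(45)(386)}{(29)(22)} = 27.23\).
§ By definition, the odds ratio in the referent group will always equal 1.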
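The level-specific odds ratios can be reproduced with a short Python sketch that compares each exposure level with the least exposed group. The variable and function names below are illustrative only; the counts are the cases and controls at each alcohol level in the Tuyns et al. (1977) data.

```python
# (label, cases, controls) for each exposure level in Table 8.4,
# ordered from least to most exposed; the first row is the referent.
table_8_4 = [
    ("0-39", 29, 386),
    ("40-79", 75, 280),
    ("80-119", 51, 87),
    ("120+", 45, 22),
]


def odds_ratios_by_level(rows):
    """Return [(label, OR_i)] with OR_i = (A_i * B_0) / (A_0 * B_i)."""
    _, a0, b0 = rows[0]
    return [(label, (a_i * b0) / (a0 * b_i)) for label, a_i, b_i in rows]


for label, or_i in odds_ratios_by_level(table_8_4):
    print(f"{label:>7} g/day: OR = {or_i:.2f}")
```

The odds ratios rise steadily with each step up in alcohol consumption, which is the pattern a test for trend is designed to detect.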
A test for trend for these data can be calculated with WinPEPI’s “Compare2 → Categorical Data
(2 x k data)” program or [Link]’s “Counts → Dose Response” application. Figure
8.4 shows the input from [Link]. Data are from Table 8.4. The test for trend derives a
P-value that is less than 0.000001 (Figure 8.5).
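To give a sense of what a test for trend computes, the sketch below implements the Cochran-Armitage chi-square for linear trend with equally spaced scores (0, 1, 2, 3). This is an assumption made for illustration: the programs named above may use a different variant (for example, the Mantel extension) or different scores, so the statistic need not match their output exactly. The function name `trend_chi_square` is hypothetical, and the counts are those of the four alcohol consumption levels in Illustrative Example 8.3.

```python
import math


def trend_chi_square(cases, controls, scores=None):
    """Cochran-Armitage chi-square for linear trend (1 df) and its P-value."""
    if scores is None:
        scores = list(range(len(cases)))  # assumed scores 0, 1, 2, ...
    totals = [a + b for a, b in zip(cases, controls)]
    n = sum(totals)
    p_case = sum(cases) / n

    # Observed minus expected score total among the cases
    t = sum(x * (a - m * p_case) for x, a, m in zip(scores, cases, totals))
    sum_x = sum(x * m for x, m in zip(scores, totals))
    sum_x2 = sum(x * x * m for x, m in zip(scores, totals))
    variance = p_case * (1 - p_case) * (sum_x2 - sum_x ** 2 / n)

    chi2 = t ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


chi2, p_value = trend_chi_square([29, 75, 51, 45], [386, 280, 87, 22])
print(f"chi-square for trend = {chi2:.1f}, P = {p_value:.2e}")
```

With these data the P-value is far below 0.000001, in line with the result described above.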
Figure 8.4. Screen shot from OpenEpi’s dose-response application for counts showing the data
from Illustrative Example 8.3. [[Link]]
Figure 8.5. Output from OpenEpi’s dose-response application for counts showing results for
Illustrative Example 8.3. [[Link]]
Matched-Pairs
Case–control studies may match cases to controls on extraneous factors that might otherwise
confound the comparison. For example, a case-control study may match cases to controls on age
to control for age differences between the exposed and nonexposed groups in the source population.
Table 8.6 shows notation for matched-pair case-control data. In this table, cell t represents the
number of matched pairs in which both the case and control pair-members are classified as
exposed, cell v represents the number of matched pairs in which the case pair-member is
classified as exposed and the control pair-member is classified as nonexposed, and so on. The
odds ratio for matched-pair data is then calculated with formula 8.4.
\[ \widehat{OR} = \frac{v}{u} \tag{8.4} \]
TABLE 8.6. Notation for matched-pair case-control data.
                        Case pair-member
Control pair-member     Exposed   Nonexposed
Exposed                 t         u
Nonexposed              v         w
TABLE 8.7. Continual Tampon Use and Toxic Shock Syndrome, Matched-Pairs
Case pair-member
Control pair-member Exposed Nonexposed
Exposed 33 1
Nonexposed 9 1
Source: Shands et al. (1980).
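The matched-pair calculation uses only the discordant pairs. The sketch below assumes the usual matched-pair estimator, the number of pairs in which only the case is exposed divided by the number in which only the control is exposed, and applies it to Table 8.7. The function name is illustrative.

```python
def matched_pair_odds_ratio(case_only_exposed, control_only_exposed):
    """Matched-pair odds ratio: ratio of the two discordant-pair counts (v / u)."""
    if control_only_exposed == 0:
        raise ValueError("undefined when no pairs have only the control exposed")
    return case_only_exposed / control_only_exposed


# Table 8.7 (rows: control pair-member, columns: case pair-member)
t, u = 33, 1   # control exposed:    case exposed, case nonexposed
v, w = 9, 1    # control nonexposed: case exposed, case nonexposed

print(f"OR = {matched_pair_odds_ratio(v, u):.1f}")  # prints "OR = 9.0"
```

The concordant pairs (t and w) do not enter the estimate, because pairs in which case and control share the same exposure status carry no information about the association.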
Matched-Tuples
Instead of matching a single control to each case, the study may choose to match two or more
controls per case. This will increase the statistical precision of odds ratio estimates. The odds
ratio when tuple matching is used can be calculated with this formula:
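\[ \widehat{OR} = \frac{\sum (\text{number of nonexposed controls in matched sets with an exposed case})}{\sum (\text{number of exposed controls in matched sets with a nonexposed case})} \tag{8.5} \]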
To illustrate this approach, let us consider an example that matched 3 controls per case.
Illustrative Example 8.5 (Matched-tuples, toxic shock syndrome and Rely brand
tampons). Amid the national publicity and press speculation that materialized about
tampon use and toxic shock syndrome in the summer of 1980, there was added concern
that recently introduced highly absorbent tampon brands such as Rely brand tampons
were causing the problem. In the fall of that year, the CDC launched a study to test the
hypothesis that one or more brands of tampons were more strongly associated with toxic
shock syndrome than were other brands (Schlech et al., 1982). Much like in Illustrative
Example 8.4, cases were asked to provide the names of female friends of approximately
the same age who lived in their geographic area to serve as controls. In this instance,
however, each case identified three friend controls. Each unit of observation now represents
a “four-tuple.”
One of the analyses from this study assessed the use of Rely brand tampons. Fourteen
sets of four-tuples were studied. Table 8.8 exhibits the standard way to display data of
this type. Table 8.9 exhibits an alternative method of display. Using the data in Table 8.9
we note \(\widehat{OR} = \frac{(1)(0) + (1)(1) + (5)(2) + (4)(3)}{(0)(3) + (1)(2) + (1)(1) + (1)(0)} = \frac{23}{3} = 7.67\), suggesting
a more than 7-fold risk of toxic shock syndrome associated with Rely brand tampon use.
** Results are shown here with three decimals to allow the reader to place them in the output. In technical reports,
results should be rounded to one or two decimal places.
To use WinPEPI to calculate results for matched case-control studies with multiple controls per
case, select PairsEtc’s program E. ‘Yes-no’ variable: compare subjects with 2 or more controls.
Enter the number of “Controls per case” in the appropriate field and then enter the data in the
format shown in Table 8.9 and Figure 8.8. Figure 8.9 reveals an odds ratio of 7.66 with a 95%
confidence interval of 1.61 to 36.59.
TABLE 8.8. Use of Rely Brand Tampons and Toxic Shock Syndrome among single brand
tampon users, Case-Control Data, Matched-Tuples
Case pair-member
Number of Exposed
Controls in Matched Set Exposed Nonexposed
3 of 3 1 0
2 of 3 1 1
1 of 3 5 1
0 of 3 4 1
Source: Schlech et al. (1982).
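The matched-tuple odds ratio can be reproduced from Table 8.8 with the sketch below. It assumes the estimator is the total number of nonexposed controls in sets with an exposed case divided by the total number of exposed controls in sets with a nonexposed case, which is what the Mantel-Haenszel estimator reduces to when every set has the same size. The function and variable names are illustrative and are not WinPEPI's.

```python
def matched_set_odds_ratio(exposed_case_sets, nonexposed_case_sets, controls_per_case):
    """Odds ratio for 1:M matched sets.

    Each dict maps "number of exposed controls in the set" to the
    number of matched sets with that pattern.
    """
    # Nonexposed controls in sets whose case is exposed
    numerator = sum(
        n_sets * (controls_per_case - n_exposed)
        for n_exposed, n_sets in exposed_case_sets.items()
    )
    # Exposed controls in sets whose case is nonexposed
    denominator = sum(
        n_sets * n_exposed
        for n_exposed, n_sets in nonexposed_case_sets.items()
    )
    return numerator / denominator


# Table 8.8: {exposed controls in set: number of sets}
exposed_case = {3: 1, 2: 1, 1: 5, 0: 4}
nonexposed_case = {3: 0, 2: 1, 1: 1, 0: 1}

or_hat = matched_set_odds_ratio(exposed_case, nonexposed_case, controls_per_case=3)
print(f"OR = {or_hat:.2f}")  # prints "OR = 7.67"
```

The result, 23/3, agrees with the hand calculation in Illustrative Example 8.5 and is very close to the 7.66 reported from WinPEPI.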
Table 8.9. Format for entering data into WinPEPI’s PAIRSetc program E. for matched
samples with multiple controls per case.
Case Status Number of Exposed Number of Counts needed for
Controls Sets calculations
Exposed Case 3 1 0 nonexposed controls
Figure 8.8. Screen shot from WinPEPI’s PairsEtc program E. showing partial input for
Illustrative Example 8.5.[[Link]]
Figure 8.9. Screen shot from WinPEPI’s PairsEtc program E. showing result from Illustrative
Example 8.5.[[Link]]
††
If the disease is rare, it is unlikely the same person would serve as both a case and control.
The second element of the proof requires a rare disease assumption. When the number of cases is
small relative to the size of the population, the incidence odds ratio is approximately equal to the
incidence proportion ratio: Incidence Odds Ratio = (A1/B1)/(A0/B0) ≈ (A1/N1)/(A0/N0) = Incidence
Proportion Ratio. Thus, the exposure odds ratio, which is stochastically equal to the incidence
odds ratio, is also approximately equal to the incidence proportion ratio (risk ratio).
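A small numerical illustration may help show why the rare disease assumption matters. The cohort counts below are entirely hypothetical and chosen only so that both scenarios have a risk ratio of exactly 2; the function name is illustrative.

```python
def odds_and_risk_ratio(a1, n1, a0, n0):
    """Incidence odds ratio and incidence proportion ratio for a closed cohort.

    a1, a0: cases among the exposed and nonexposed
    n1, n0: numbers of exposed and nonexposed cohort members
    """
    b1, b0 = n1 - a1, n0 - a0          # noncases
    odds_ratio = (a1 / b1) / (a0 / b0)
    risk_ratio = (a1 / n1) / (a0 / n0)
    return odds_ratio, risk_ratio


# Hypothetical cohorts with the same risk ratio (2.0) but different baseline risks.
for label, a1, a0 in [("rare", 20, 10), ("common", 4000, 2000)]:
    or_, rr = odds_and_risk_ratio(a1, 10_000, a0, 10_000)
    print(f"{label:>6}: risk ratio = {rr:.2f}, odds ratio = {or_:.2f}")
```

When the disease is rare, the noncases B are nearly as numerous as the whole group N, so the two ratios almost coincide. When the disease is common, the odds ratio lies farther from 1 than the risk ratio.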
Exercises
8.1 Influenza vaccination and primary cardiac arrest. A case-control study
examined the relationship between influenza vaccination and primary cardiac
arrest mortality in King County, Washington between October 1988 and July 1994
(Siscovick et al., 2000). Fatal cases of primary cardiac arrest (n = 315) were
identified from paramedic reports. Community controls (n = 549) were
identified using a random digit dialing technique. Spouses of cases and controls
were interviewed to ascertain who had received influenza vaccination during the
prior year.
(A) Explain why this is a case-control study and not a cohort study.
(B) Why was random digit dialing used to select controls?
(C) This study interviewed the spouses of cases to ascertain information about
cases because the disease was fatal. Why were spouses used to ascertain
information for the controls?
(D) Cross-tabulated results are shown in Table 8.10. Calculate the odds ratio
associated with vaccination and interpret the results.
TABLE 8.10. Case-control study of primary cardiac arrest and influenza vaccination.
                 Cases   Controls
Vaccinated       79      176
Not vaccinated   236     373
Total            315     549
8.2 Wynder and Graham, 1950. The May 27, 1950 issue of the Journal of the
American Medical Association contained two landmark studies on smoking and
lung cancer. One study was by Levin and co-workers. The other study, by
Wynder and Graham, is considered here. Table 8.11 contains data from the
Wynder and Graham study. Data are self-reported average use of tobacco during
the last 20 years. If a patient smoked for less than 20 years, their amount
smoked was reduced in proportion to the duration of smoking (e.g., a subject
who smoked 20 cigarettes per day for 10 years was classified as smoking 10
cigarettes daily). Pipe and cigar smoking were considered by arbitrarily counting
1 cigar or 2 pipefuls of tobacco to be the equivalent of 5 cigarettes. Calculate
the odds ratio associated with each level of smoking. Comment on your
findings.
TABLE 8.11. Case–Control Study with Multiple Levels of Exposure
Smoking Level^a   Cases   Controls
5                 123     64
4                 186     98
3                 213     274
2                 61      147
1                 14      82
0                 8       115
Total             605     780
^a Classification of smoking habits:
5 Chain smokers (35 cigarettes or more per day for at least 20 years)
4 Excessive smokers (21–34 cigarettes per day for more than 20 years)
3 Heavy smokers (16–20 cigarettes per day for more than 20 years)
2 Moderately heavy smokers (10–15 cigarettes per day for more than 20 years)
1 Light smokers (1–9 cigarettes per day for more than 20 years)
0 Nonsmoker (less than 1 cigarette per day for more than 20 years)
Source: Wynder and Graham (1950).
References
Abramson, J. H. (2011). WINPEPI updated: computer programs for epidemiologists, and their
teaching potential. Epidemiologic Perspectives & Innovations, 8(1), 1. [Link]
[Link]/content/8/1/1
Anscombe, F. J. (1956). On estimating binomial response relations. Biometrika, 43, 461–464.
Belmont Report. (1979). Ethical principles and guidelines for the protection of human subjects
of research. The National Commission for the Protection of Human Subjects of Biomedical
and Behavioral Research: The Belmont Report. Available:
[Link]
Berry, G., & Armitage, P. (1995). Mid-P confidence intervals: a brief review. The Statistician,
44(4), 417-423.
Breslow, N. E., & Day, N. E. (1980). Statistical Methods in Cancer Research. Volume 1 - The
Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.
CDC. (1980). Toxic shock syndrome—United States. MMWR, 29 (June, 27), 297–299.
CDC. (1992). Tampons and toxic shock syndrome. Washington, DC: Association of Teachers of
Preventive Medicine.
CDC. (1997). Toxic shock syndrome—United States. Editorial note—1997. MMWR, 46, 492–
495.
Cornfield, J. (1951). A method of estimating comparative rates from clinical data. Journal of
the National Cancer Institute, 11, 1269–1275.
Dean, A. G., Sullivan, K. M., & Soe, M. M. (2011). OpenEpi: Open Source
Epidemiologic Statistics for Public Health, Version 2.3.1. Retrieved 2011/07/12, from
[Link].
Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John
Wiley & Sons. See page 64, formula 5.20.
Gail, M., Williams, R., Byar, D. P., & Brown, C. (1976). How many controls? J Chronic Dis,
29(11), 723-731.
Jewell, N. P. (1986). On the bias of commonly used measures of association for 2 × 2 tables.
Biometrics, 42, 351–358.
Levin, M. L., Goldstein, H., & Gerhardt, P. R. (1950). Cancer and tobacco smoking. JAMA,
143(I), 336–338.
Mantel, N. (1980). Biased inferences in tests for average partial associations. American
Statistician, 34, 190-191.