Chance
It is a common practice to judge a result significant, if it is of such a magnitude that it
would have been produced by chance not more frequently than once in twenty trials. This
is an arbitrary, but convenient, level of significance for the practical investigator, but it
does not mean that he allows himself to be deceived once in every twenty experiments.
—Ronald Fisher
1929 (1)
(called the “null hypothesis”) that there is no difference. This traditional way of assessing the role of chance, associated with the familiar “P value,” has been popular since statistical testing was introduced at the beginning of the 20th century. The hypothesis testing approach leads to dichotomous conclusions: Either an effect is present or there is insufficient evidence to conclude an effect is present.

The other approach, called estimation, uses statistical methods to estimate the range of values that is likely to include the true value—of a rate, measure of effect, or test performance. This approach has gained popularity recently and is now favored by most medical journals, at least for reporting main effects, for reasons described below.

HYPOTHESIS TESTING

In the usual situation, the principal conclusions of a trial are expressed in dichotomous terms, such as a new treatment is either better or not better than usual care, corresponding to the results being either statistically significant (unlikely to be purely by chance) or not. There are four ways in which the statistical conclusions might relate to reality (Fig. 11.1).

Two of the four possibilities lead to correct conclusions: (i) The new treatment really is better, and that is the conclusion of the study; and (ii) the treatments really have similar effects, and the study concludes that a difference is unlikely.

False-Positive and False-Negative Statistical Results

There are also two ways of being wrong. The new treatment and usual care may actually have similar effects, but it is concluded that the new treatment is more effective. Error of this kind, resulting in a “false-positive” conclusion that the treatment is effective, is referred to as a type I error or α error, the probability of saying that there is a difference in treatment effects when there is not. On the other hand, the new treatment might be more effective, but the study concludes that it is not. This “false-negative” conclusion is called a type II error or β error—the probability of saying that there is no difference in treatment effects when there is. “No difference” is a simplified way of saying that the true difference is unlikely to be larger than a certain size, which is considered too small to be of practical consequence. It is not possible to establish that there is no difference at all between two treatments.

Figure 11.1 is similar to 2 × 2 tables comparing the results of a diagnostic test to the true diagnosis (see Chapter 8). Here, the “test” is the conclusion of a clinical trial based on a statistical test of results from the trial’s sample of patients. The “gold standard” for validity is the true difference in the treatments being compared—if it could be established, for example, by making observations on all patients with the illness or a large number of samples of these patients. Type I error is analogous to a false-positive test result, and type II error is analogous to a false-negative test result. In the absence of bias, random variation is responsible for the uncertainty of a statistical conclusion.

Because random variation plays a part in all observations, it is an oversimplification to ask whether chance is responsible for the results. Rather, it is a question of how likely random variation is to account for the findings under the particular conditions of the study. The probability of error due to random variation is estimated by means of inferential statistics, a quantitative science that, given certain assumptions about the mathematical properties of the data, is the basis for calculations of the probability that the results could have occurred by chance alone.
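The question “how likely is random variation to account for the findings?” can be made concrete with a small simulation. The sketch below (hypothetical numbers, Python standard library only) repeats a two-arm trial many times under the assumption that both treatments share the same true event rate, and counts how often chance alone produces a difference at least as large as a given observed one:

```python
import random

def chance_alone_produces(observed_diff, true_rate=0.20,
                          n_per_group=100, n_trials=10_000, seed=1):
    """Simulate trials in which both groups share the same true event
    rate, and return the proportion of trials in which random variation
    alone yields an absolute difference in event rates at least as
    large as observed_diff."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_trials):
        events_a = sum(rng.random() < true_rate for _ in range(n_per_group))
        events_b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if abs(events_a - events_b) / n_per_group >= observed_diff:
            extreme += 1
    return extreme / n_trials

# With 100 patients per group and a true event rate of 20% in both
# groups, a 10-percentage-point difference arises by chance alone in
# roughly 1 trial in 10.
print(chance_alone_produces(0.10))
```

This is exactly the thought experiment behind the P value discussed later in the chapter: the proportion of repetitions of a null trial that come out at least as extreme as the observed result.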
Statistics is a specialized field with its own jargon
(e.g., null hypothesis, variance, regression, power, and modeling) that is unfamiliar to many clinicians. However, leaving aside the genuine complexity of statistical methods, inferential statistics should be regarded by the non-expert as a useful means to an end. Statistical testing is a means by which the effects of random variation are estimated.

The next two sections discuss type I and type II errors and place hypothesis testing, as it is used to estimate the probabilities of these errors, in context.

Figure 11.1 ■ The relationship between the results of a statistical test and the true difference between two treatment groups. (Absent is a simplification. It really means that the true difference is not greater than a specified amount.) [The figure is a 2 × 2 table: rows are the conclusion of the statistical test (significant, not significant); columns are the true difference (present, absent). Significant/present and not significant/absent are correct conclusions; significant/absent is a type I (α) error; not significant/present is a type II (β) error.]

Concluding That a Treatment Works

Most statistics encountered in the medical literature concern the likelihood of a type I error and are
Chapter 11: Chance 177
expressed by the familiar P value. The P value is a quantitative estimate of the probability that differences in treatment effects in the particular study at hand could have happened by chance alone, assuming that there is in fact no difference between the groups. Another way of expressing this is that P is an answer to the question, “If there were no difference between treatment effects and the trial was repeated many times, what proportion of the trials would conclude that the difference between the two treatments was at least as large as that found in the study?”

In this presentation, P values are called Pα, to distinguish them from estimates of the other kind of error resulting from random variation, type II errors, which are referred to as Pβ. When a simple P is found in the scientific literature, it ordinarily refers to Pα. The kind of error estimated by Pα applies whenever one concludes that one treatment is more effective than another. If it is concluded that the Pα exceeds some limit (see below) so there is no statistical difference between treatments, then the particular value of Pα is not as relevant; in that situation, Pβ (probability of type II error) applies.

Dichotomous and Exact P Values

It has become customary to attach special significance to P values below 0.05 because it is generally agreed that a chance of <1 in 20 is a small enough risk of being wrong. A chance of 1 in 20 is so small, in fact, that it is reasonable to conclude that such an occurrence is unlikely to have arisen by chance alone. It could have arisen by chance, and 1 in 20 times it will, but it is unlikely.

Differences associated with Pα < 0.05 are called statistically significant. However, setting a cutoff point at 0.05 is entirely arbitrary. Reasonable people might accept higher values or insist on lower ones, depending on the consequences of a false-positive conclusion in a given situation. For example, one might be willing to accept a higher chance of a false-positive statistical test if the disease is severe, there is currently no effective treatment, and the new treatment is safe. On the other hand, one might be reluctant to accept a false-positive test if usual care is effective and the new treatment is dangerous or much more expensive. This reasoning is similar to that applied to the importance of false-positive and false-negative diagnostic tests (Chapter 8).

To accommodate various opinions about what is and is not unlikely enough, some researchers report the exact probabilities of P (e.g., 0.03, 0.07, 0.11), rather than lumping them into just two categories (≤0.05 or >0.05). Users are then free to apply their own preferences for what is statistically significant. However, P values >1 in 5 are usually reported as simply P > 0.20, because nearly everyone can agree that a probability of a type I error >1 in 5 is unacceptably high. Similarly, below very low P values (e.g., P < 0.001) chance is a very unlikely explanation for an observed difference, and little further information is conveyed by describing this chance more precisely.

Another approach is to accept the primacy of P ≤ 0.05 and describe results that come close to that standard with terms such as “almost statistically significant,” “did not achieve statistical significance,” “marginally significant,” or “a trend.” These value-laden terms suggest that the finding should have been statistically significant but for some annoying reason was not. It is better to simply state the result and exact P value (or point estimate and confidence interval, see below) and let the reader decide for him or herself how much chance could have accounted for the result.

Statistical Significance and Clinical Importance

A statistically significant difference, no matter how small the P, does not mean that the difference is clinically important. A P value of <0.0001, if it emerges from a well-designed study, conveys a high degree of confidence that a difference really exists but says nothing about the magnitude of that difference or its clinical importance. In fact, trivial differences may be highly statistically significant if a large enough number of patients are studied.

Example

The drug donepezil, a cholinesterase inhibitor, was developed for the treatment of Alzheimer disease. In a randomized controlled trial to establish whether the drug produced worthwhile improvements, 565 patients with Alzheimer disease were randomly allocated to donepezil or placebo (3). The statistical significance of some trial end points was impressive: Both the mini-mental state examination and the Bristol Activities of Daily Living Scale were statistically different at P < 0.0001. However, the actual differences were small, 0.8 on a 30-point scale for the mini-mental state examination and 1 on a 60-point scale for the Bristol Activities of Daily Living Scale. Moreover, other outcomes, which more closely represented the burden of illness and care of these patients, were similar in the
178 Clinical Epidemiology: The Essentials
donepezil and placebo groups. These included entering institutional care and progression of disability (both primary end points) as well as behavioral and psychological symptoms, caregiver psychopathology, formal care costs, unpaid caregiver time, and adverse events or death. The authors concluded that the benefits of donepezil were “below minimally relevant thresholds.”

On the other hand, very unimpressive P values can result from studies with strong treatment effects if there are few patients in the study.

Statistical Tests

Statistical tests are used to estimate the probability of a type I error. The test is applied to the data to obtain a numerical summary for those data called a test statistic. That number is then compared to a sampling distribution to come up with a probability of a type I error (Fig. 11.2). The distribution is under the null hypothesis, the proposition that there is no true difference in outcome between treatment groups. This device is for mathematical reasons, not because “no difference” is the working scientific hypothesis of the investigators conducting the study. One ends up rejecting the null hypothesis (concluding there is a difference) or failing to reject it (concluding that there is insufficient evidence in support of a difference). Note that not finding statistical significance is not the same as there being no difference. Statistical testing is not able to establish that there is no difference at all.

Some commonly used statistical tests are listed in Table 11.1. The validity of many tests depends on certain assumptions about the data; a typical assumption is that the data have a normal distribution. If the data do not satisfy these assumptions, the resulting P value may be misleading. Other statistical tests, called non-parametric tests, do not make assumptions about the underlying distribution of the data. A discussion of how these statistical tests are derived and calculated and of the assumptions on which they rest can be found in any biostatistics textbook.

Table 11.1
Some Statistical Tests Commonly Used in Clinical Research

Test: When Used

To Test the Statistical Significance of a Difference
Chi square (χ2): Between two or more proportions (when there are a large number of observations)
Fisher’s exact: Between two proportions (when there are a small number of observations)
Mann-Whitney U: Between two medians
Student t: Between two means
F test: Between two or more means

To Describe the Extent of Association
Regression coefficient: Between an independent (predictor) variable and a dependent (outcome) variable
Pearson’s r: Between two variables

To Model the Effects of Multiple Variables
Logistic regression: With a dichotomous outcome
Cox proportional hazards: With a time-to-event outcome

The chi-square (χ2) test for nominal data (counts) is more easily understood than most and can be used to illustrate how statistical testing works. The extent to which the observed values depart from what would have been expected if there were no treatment effect is used to calculate a P value.

Example

Cardiac arrest outside the hospital has a poor outcome. Animal studies suggested that hypothermia might improve neurologic outcomes. To test this hypothesis in humans, 77 patients who remained unconscious after resuscitation from out-of-hospital cardiac arrest were

Figure 11.2 ■ Statistical testing. [Diagram: Data → Statistical test → Test statistic → Compare to standard distribution → Estimate of probability that the observed value could be by chance alone.]
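The chi-square logic described above (observed counts compared with the counts expected under “no treatment effect”) can be sketched in a few lines. The 2 × 2 counts below are hypothetical, and the one-degree-of-freedom P value uses the identity P = erfc(√(χ²/2)), which holds because a chi-square variable with 1 df is a squared standard normal:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2 x 2 table [[a, b], [c, d]]:
    sums (observed - expected)^2 / expected over the four cells,
    where expected counts assume no difference between groups."""
    n = a + b + c + d
    cells = [(a, a + b, a + c), (b, a + b, b + d),
             (c, c + d, a + c), (d, c + d, b + d)]
    chi2 = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability for 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical trial: 30/100 events on treatment A, 20/100 on B.
chi2, p = chi_square_2x2(30, 70, 20, 80)   # chi2 ≈ 2.67, p ≈ 0.10
```

This is the uncorrected Pearson statistic; for tables with small counts, Fisher’s exact test in Table 11.1 is the usual alternative.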
trials have misrepresented the truth because these particular studies had the bad luck to turn out in a relatively unlikely way?

Example

One of the examples in Chapter 9 was a randomized controlled trial of the effects on cardiovascular outcomes of adding niacin versus placebo in patients with lipid abnormalities who were already taking statin drugs (5). It was a “negative” trial: Primary outcomes occurred in 16.4% of patients taking niacin and 16.2% in patients taking placebo, and the authors concluded that “there was no incremental clinical benefit from the addition of niacin to statin therapy.” The statistical question associated with this assertion is: How likely was it that the study found no benefit when there really is one? After all, there were only a few hundred cardiovascular outcome events in the study so the play of chance might have obscured treatment effects. Figure 11.3 shows time-to-event curves for the primary outcome in the two treatment groups. Patients in the niacin and control groups had remarkably similar curves throughout follow-up, making a protective effect of niacin implausible.

[Figure 11.3: Time-to-event curves for the primary outcome. Cumulative percent of patients with the primary outcome (y-axis, 0–50%) over years of follow-up (x-axis, 0–4) for niacin plus statin and placebo plus statin. Number at risk at years 0 through 4: niacin plus statin 1,718; 1,606; 1,366; 903; 428. Placebo plus statin 1,696; 1,581; 1,381; 910; 436.]

Visual presentation of negative results can be convincing. Alternatively, one can examine confidence intervals (see Point Estimates and Confidence Intervals, below) and learn a lot about whether the study was large enough to rule out clinically important differences if they existed.

Of course, reasons for false-negative results other than chance also need to be considered: biologic reasons such as too short follow-up or too small dose of niacin, as well as study limitations such as non-compliance and missed outcome events.

Type II errors have received less attention than type I errors for several reasons. They are more difficult to calculate. Also, most professionals simply prefer things that work and consider negative results unwelcome. Authors are less likely to submit negative studies to journals and when negative studies are reported at all, the authors may prefer to emphasize subgroups of patients in which treatment differences were found. Authors may also emphasize reasons other than chance to explain why true differences might have been missed. Whatever the reason for not considering the probability of a type II error, it is the main question that should be asked when the results of a study are interpreted as “no difference.”

HOW MANY STUDY PATIENTS ARE ENOUGH?

Suppose you are reading about a clinical trial that compares a promising new treatment to usual care
and finds no difference. You are aware that random variation can be the reason for whatever differences are or are not observed, and you wonder if the number of patients in this study is large enough to make chance an unlikely explanation for what was found. Alternatively, you may be planning to do such a study and have the same question. Either way, you need to understand how many patients would be needed to make a strong comparison of the effects of the two treatments.

Statistical Power

The probability that a study will find a statistically significant difference when a difference really exists is called the statistical power of the study. Power and Pβ are complementary ways of expressing the same concept.

Statistical power = 1 – Pβ

Power is analogous to the sensitivity of a diagnostic test. One speaks of a study being powerful when it has a high probability of detecting differences when treatments really do have different effects.

Estimating Sample Size Requirements

From the point of view of hypothesis testing of nominal data (counts), an adequate sample size depends on four characteristics of the study: the magnitude of the difference in outcome between treatment groups, Pα and Pβ (the probability of the false-positive and false-negative conclusions you are willing to accept), and the underlying outcome rate.

These determinants of adequate sample size should be taken into account when investigators plan a study, to ensure that the study will have enough statistical power to produce meaningful results. To the extent that investigators have not done this well, or some of their assumptions were found to be inaccurate, readers need to consider the same issues when interpreting study results.

Effect Size

Sample size depends on the magnitude of the difference to be detected. One is free to look for differences of any magnitude and of course one hopes to be able to detect even very small differences, but more patients are needed to detect small differences, everything else being equal. Therefore, it is best to ask, “What is a sufficient number of patients to detect the smallest degree of improvement that would be clinically meaningful?” On the other hand, if one is interested in detecting only very large differences between treated and control groups (i.e., strong treatment effects) then fewer patients need to be studied.

Type I Error

Sample size is also related to the risk of a type I error (concluding that treatment is effective when it is not). The acceptable probability for a risk of this kind is a value judgment. If one is prepared to accept the consequences of a large chance of falsely concluding that the treatment is effective, one can reach conclusions with fewer patients. On the other hand, if one wants to take only a small risk of being wrong in this way, a larger number of patients will be required. As discussed earlier, it is customary to set Pα at 0.05 (1 in 20) or sometimes 0.01 (1 in 100).

Type II Error

The chosen risk of a type II error is another determinant of sample size. An acceptable probability of this error is also a judgment that can be freely made and changed to suit individual tastes. Pβ is often set at 0.20, a 20% chance of missing true differences in a particular study. Conventional type II errors are much larger than type I errors, reflecting a higher value placed on being sure an effect is really present when it is said to be.

Characteristics of the Data

The statistical power of a study is also determined by the nature of the data. When the outcome is expressed by counts or proportions of events or time-to-event, its statistical power depends on the rate of events: The larger the number of events, the greater the statistical power for a given number of people at risk. As Peto et al. (6) put it:

In clinical trials of time to death (or of the time to some other particular “event”—relapse, metastasis, first thrombosis, stroke, recurrence, or time to death from a particular cause)—the ability of the trial to distinguish between the merits of two treatments depends on how many patients die (or suffer a relevant event), rather than on the number of patients entered. A study of 100 patients, 50 of whom die, is about as sensitive as a study with 1,000 patients, 50 of whom die.

If the data are continuous, such as blood pressure or serum cholesterol, power is affected by the variability of the measurements among patients.
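The four determinants just listed can be turned into an actual number with the standard normal-approximation formula for comparing two proportions. The sketch below assumes hypothetical event rates; the z-quantiles come from the Python standard library:

```python
import math
from statistics import NormalDist

def n_per_group(p1, p2, p_alpha=0.05, p_beta=0.20):
    """Approximate sample size per group to detect a difference between
    true event rates p1 and p2, with a two-sided type I error p_alpha
    and type II error p_beta (i.e., power = 1 - p_beta), using the
    usual normal-approximation formula for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - p_alpha / 2)   # about 1.96 for 0.05
    z_beta = NormalDist().inv_cdf(1 - p_beta)         # about 0.84 for 0.20
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a drop in event rate from 20% to 15% (a 25% relative
# reduction) with the conventional error rates takes about 900
# patients per group.
print(n_per_group(0.20, 0.15))
```

Everything else being equal, halving the absolute difference to be detected roughly quadruples the required sample size, which is why small treatment effects need large trials.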
Table 11.2
Determinants of Sample Size

Sample size varies according to:
Determined by the investigator: Effect size, Pα, Pβ
Determined by the data (means): Variability
Determined by the data (counts): Outcome rate

For most of the therapeutic questions encountered today, a surprisingly large sample size is required. The value of dramatic, powerful treatments, such as antibiotics for pneumonia or thyroid replacement for hypothyroidism, was established by clinical experience or studying a small number of patients, but such treatments come along rarely and many of them are already well established. We are left with diseases, many of which are chronic and have multiple, interacting causes, for which the effects of new treatments are generally small. This makes it especially important to plan clinical studies that are large enough to distinguish real from chance effects.

Figure 11.4 shows the relationship between sample size and treatment difference for several baseline rates of outcome events. Studies involving fewer than 100 patients have a poor chance of detecting statistically significant differences for even large treatment effects. Looked at another way, it is difficult to detect effect sizes of <25%. In practice, statistical power can be estimated by means of readily available formulas, tables, nomograms, computer programs, or Web sites.

Figure 11.4 ■ The number of people required in each of two treatment groups (of equal size), for various rates of outcome events in the untreated group, to have an 80% chance of detecting a difference (P = 0.05) in reduction in outcome event rates in treated relative to untreated patients. (Calculated from formula in Weiss NS. Clinical epidemiology. The study of the outcome of illness. New York: Oxford University Press; 1986.) [The graph plots number of people per group (y-axis) against proportional reduction in event rate, 0–100% (x-axis).]

POINT ESTIMATES AND CONFIDENCE INTERVALS

The effect size that is observed in a particular study (such as treatment effect in a clinical trial or relative risk in a cohort study) is called the point estimate of the effect. It is the best estimate from the study of the true effect size and is the summary statistic usually given the most emphasis in reports of research.

However, the true effect size is unlikely to be exactly that observed in the study. Because of random variation, any one study is likely to find a result higher or lower than the true value. Therefore, a summary measure is needed for the statistical precision of the point estimate, the range of values likely to encompass the true effect size.

Statistical precision is expressed as a confidence interval, usually the 95% confidence interval, around the point estimate. Confidence intervals are interpreted as follows: If the study is unbiased, there is a 95% chance that the interval includes the true effect size. The more narrow the confidence interval, the more certain one can be about the size of the true effect. The true value is most likely to be close to the point estimate, less likely to be near the outer limits of the interval, and could (5 times out of 100) fall outside these limits altogether. Statistical precision increases with the statistical power of the study.

Example

The Women’s Health Initiative included a randomized controlled trial of the effects of estrogen plus progestin on chronic disease outcomes in healthy postmenopausal women (9). Figure 11.5 shows relative risk and confidence
[Figure 11.6: The probability of detecting at least one event (y-axis, 0–1.0) as a function of the number of people observed, for underlying event risks of 1/100, 1/1,000, 1/10,000, and 1/100,000.]
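The curves summarized in Figure 11.6 follow from the binomial formula: the chance of seeing at least one event is 1 minus the chance of seeing none. The sketch below (standard library only) also inverts the formula to reproduce the rule of thumb, discussed under Detecting Rare Events, that observing about 3x people gives a good chance of detecting a 1/x event:

```python
import math

def prob_at_least_one_event(risk, n_observed):
    """Chance of observing at least one event when each of n_observed
    people independently has probability `risk` of the event."""
    return 1 - (1 - risk) ** n_observed

def people_needed(risk, detection_prob=0.95):
    """People to observe for the stated chance of seeing at least one
    event; with detection_prob = 0.95 this is roughly 3/risk, i.e.,
    the 'observe 3x people to detect a 1/x event' rule of thumb."""
    return math.ceil(math.log(1 - detection_prob) / math.log(1 - risk))

print(prob_at_least_one_event(1/1000, 3000))   # ≈ 0.95
print(people_needed(1/1000))                   # just under 3,000 people
```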
is no longer a need to estimate effect size, outcome event rates, and variability among patients because they are all known. Rather, attention should be directed to point estimates and confidence intervals. With them, one can see the range of values that are consistent with the results and whether the effect sizes of interest are within this range or are ruled out by the data. In the niacin study, summarized earlier as an example of a negative trial, the hazard ratio was 1.02 and the 95% confidence interval was 0.87 to 1.21, meaning that the results were consistent with a small degree of benefit or harm. Whether this matters depends on the clinical importance attached to a difference in rates as large as represented by this confidence interval.

DETECTING RARE EVENTS

It is sometimes important to know how likely a study is to detect a relatively uncommon event (e.g., 1/1,000), particularly if that event is severe, such as bone marrow failure or life-threatening arrhythmia. A great many people must be observed in order to have a good chance of detecting even one such event, much less to establish a relatively stable estimate of its frequency. For most clinical research, sample size is planned to be sufficient to detect main effects, the answer sought for the primary research question. Sample size is likely to be well short of the number needed to detect rare events such as uncommon side effects and complications. For that, a different approach, involving many more patients, is needed. An example is postmarketing surveillance of a drug, in which thousands of users are monitored for side effects.

Figure 11.6 shows the probability of detecting an event as a function of the number of people under observation. A rule of thumb is: To have a good chance of detecting a 1/x event, one must observe 3x people (15). For example, to detect at least one event if the underlying rate is 1/1,000, one would need to observe 3,000 people.

MULTIPLE COMPARISONS

The statistical conclusions of research have an aura of authority that defies challenge, particularly by non-experts. However, as many skeptics have suspected, it is possible to “lie with statistics” even if the research is well designed, the mathematics flawless, and the investigators’ intentions beyond reproach.

Statistical conclusions can be misleading because the strength of statistical tests depends on the number of research questions considered in the study and when those questions were asked. If many comparisons are made among the variables in a large set of data, the P value associated with each individual comparison is an underestimate of how often the result of
[Table fragment; the column headers were not recovered. Each row gives a subgroup, the number of patients, and event counts (%) in two comparison groups.]

Age
<65 yr | 1,714 | 19 (2.0) | 7 (0.7)
65 to <75 yr | 1,987 | 28 (2.7) | 24 (2.0)
≥75 yr | 1,897 | 66 (6.1) | 20 (2.0)

Sex
Female | 2,321 | 64 (4.9) | 25 (1.9)
Male | 3,277 | 49 (2.7) | 26 (1.4)

Estimated GFR
<50 mL/min | 1,198 | 36 (5.8) | 16 (2.5)
50 to <80 mL/min | 2,374 | 59 (4.5) | 22 (1.7)
≥80 mL/min | 2,021 | 18 (1.6) | 13 (1.1)

CHADS2 score
0–1 | 2,026 | 18 (1.6) | 10 (0.9)
2 | 1,999 | 40 (3.7) | 25 (2.1)
≥3 | 1,570 | 55 (6.3) | 16 (1.9)

Heart failure
No | 3,428 | 66 (3.6) | 28 (1.5)
Yes | 2,171 | 45 (3.8) | 23 (1.8)
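The hazard of examining many subgroups like those above can be quantified. If each of n independent comparisons is tested at alpha = 0.05 when no true differences exist, the chance that at least one comes out “statistically significant” by chance alone is 1 - (1 - alpha)^n. A sketch (real subgroup comparisons are often correlated, which this simple formula ignores):

```python
def chance_of_false_positive(n_comparisons, alpha=0.05):
    """Probability that at least one of n independent comparisons is
    'statistically significant' by chance alone when every null
    hypothesis is true."""
    return 1 - (1 - alpha) ** n_comparisons

# With 20 independent comparisons at alpha = 0.05, the chance of at
# least one false-positive finding is about 64%.
print(chance_of_false_positive(20))
```

This is one reason Table 11.4 asks how many subgroups were examined and whether the hypothesis was stated before the data were seen.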
Table 11.4
Guidelines for Deciding Whether Apparent Differences in Effects within Subgroups Are Real (a)

From the study itself:
• Is the magnitude of the observed difference clinically important?
• How likely is the effect to have arisen by chance, taking into account: the number of subgroups examined? the magnitude of the P value?
• Was a hypothesis that the effect would be observed made before its discovery (or was justification for the effect argued for after it was found)?
• Was it one of a small number of hypotheses?

From other information:
• Was the difference suggested by comparisons within rather than between studies?
• Has the effect been observed in other studies?
• Is there direct evidence that supports the existence of the effect?

(a) Adapted from Oxman AD, Guyatt GH. A consumer’s guide to subgroup analysis. Ann Intern Med 1992;116:78–84.

tend to be related to each other biologically (and as a consequence statistically), as is the case in the above example where stroke and systemic embolism are different manifestations of the same clinical phenomenon.

MULTIVARIABLE METHODS

Most clinical phenomena are the result of many variables acting together in complex ways. For example, coronary heart disease is the joint result of lipid abnormalities, hypertension, cigarette smoking, family history, diabetes, diet, exercise, inflammation, coagulation abnormalities, and perhaps personality. It is appropriate to try to understand these relationships by first examining relatively simple arrangements of the data, such as stratified analyses that show whether the effect of one variable is changed by the presence or absence of one or more of the other variables. It is relatively easy to understand the data when they are displayed in this way.

However, as mentioned in Chapter 7, it is usually not possible to account for more than a few variables using this method because there are not enough patients with each combination of characteristics to allow stable estimates of rates. For example, if 120 patients were studied, 60 in each treatment group, and just one additional dichotomous variable was taken into account, there would only be, at most, about 15 patients in each subgroup; if patients were unevenly distributed among subgroups, there would be even fewer in some.

What is needed then, in addition to tables showing multiple subgroups, is a way of examining the effects of several variables together. This is accomplished by multivariable modeling—developing a mathematical expression of the effects of many variables taken together. It is “multivariable” because it examines the effects of multiple variables simultaneously. It is “modeling” because it is a mathematical construct, calculated from the data based on assumptions about characteristics of the data (e.g., that the variables are all normally distributed or all have the same variance).

Mathematical models are used in two general ways in clinical research. One way is to study the independent effect of one variable on outcome while taking into account the effects of other variables that might confound or modify this relationship (discussed under multivariable adjustment in Chapter 5). The second way is to predict a clinical event by calculating the combined effect of several variables acting together (introduced in concept under Clinical Prediction Rules in Chapter 7).

The basic structure of a multivariable model is:

Outcome variable = constant + (β1 × variable1) + (β2 × variable2) + . . .,

where β1, β2, . . . are coefficients determined by the data, and variable1, variable2, . . . are the variables that might be related to outcome. The best estimates of the coefficients are determined mathematically and depend on the powerful calculating ability of modern computers.

Modeling is done in many different ways, but some elements of the process are basic.

1. Identify all the variables that might be related to the outcome of interest either as confounders or effect modifiers. As a practical matter, it may not be possible to actually measure all of them and the missing variables should be mentioned explicitly as a limitation.
2. If there are relatively few outcome events, the number of variables to be considered in the model might need to be reduced to a manageable size, usually no more than several. Often this is done by selecting variables that, when taken one at a time, are most strongly related to outcome. If a statistical criterion is used at this stage, it is usual to err on the side of including variables, for example, by choosing all variables showing an association
190 Clinical Epidemiology: The Essentials
with the outcome of interest at a cutoff level of P < 0.10. Evidence for the biologic importance of the variable is also considered in making the selection.
3. Models, like other statistical tests, are based on assumptions about the structure of the data. Investigators need to check whether these assumptions are met in their particular data.
4. As for the actual models, there are many kinds and many strategies that can be followed within models. All variables—exposure, outcome, and covariates—are entered in the model, with the order determined by the research question. For example, if some are to be controlled for in a causal analysis, they are entered in the model first, followed by the variable of primary interest. The model will then identify the independent effect of the variable of primary interest. On the other hand, if the investigator wants to make a prediction based on several variables, the relative strength of their association to the outcome variable is determined by the model.

Example

Gastric cancer is the second leading cause of cancer death in the world. Investigators in Europe analyzed data from a cohort recruited from 10 European countries to see whether alcohol was an independent risk factor for stomach cancer (19). They identified nine variables that were known risk factors or potential confounders of the association between the main exposure (alcohol consumption) and disease (stomach cancer): age, study center, sex, physical activity, education, cigarette smoking, diet, and body mass index, and, in a subset of patients, serologic evidence of Helicobacter pylori infection. As a limitation, they mentioned that they would have included salt but did not have access to data on salt intake. Modeling was with the Cox proportional hazards model, so they checked that the underlying assumption, that risk does not vary with time, was met. After adjustment for the other variables, heavy but not light alcohol consumption was associated with stomach cancer (hazard ratio 1.65, 95% confidence interval 1.06–2.58); beer was associated, but not wine or liquor.

Some commonly used kinds of models are logistic regression (for dichotomous outcome variables such as those that occur in case-control studies) and Cox proportional hazards models (for time-to-event studies).

Multivariable modeling is an essential strategy for dealing with the joint effects of multiple variables. There is no other way to adjust for or to include many variables at the same time. However, this advantage comes at a price. Models tend to be black boxes, and it is difficult to “get inside” them and understand how they work. Their validity is based on assumptions about the data that may not be met. They are clumsy at recognizing effect modification. An exposure variable may be strongly related to outcome yet not appear in the model because it occurs rarely—and there is little direct information on the statistical power of the model for that variable. Finally, model results are easily affected by quirks in the data, the result of random variation in the characteristics of patients from sample to sample. It has been shown, for example, that a model frequently identified a different set of predictor variables and produced a different ordering of variables on different random samples of the same dataset (20).

For these reasons, the models themselves cannot be taken as a standard of validity and must be validated independently. Usually, this is done by observing whether or not the results of a model predict what is found in another, independent sample of patients. The results of the first model are considered a hypothesis that is to be tested with new data. If random variation is mainly responsible for the results of the first model, it is unlikely that the same random effects will occur in the validating dataset, too. Other evidence for the validity of a model is its biologic plausibility and its consistency with simpler, more transparent analyses of the data, such as stratified analyses.

BAYESIAN REASONING

An altogether different approach to the information contributed by a study is based on Bayesian inference. We introduced this approach in Chapter 8, where we applied it to the specific case of diagnostic testing. Bayesian inference begins with prior belief about the answer to a research question, analogous to pretest probability of a diagnostic test. Prior belief is based on everything known about the answer up to the point when new information is contributed by a study. Then, Bayesian inference asks how much the results of the new study change that belief.
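This updating can be made concrete with a minimal sketch (the numbers are invented for illustration) using the odds form of Bayes' theorem, the same arithmetic used for diagnostic tests: prior odds multiplied by a likelihood ratio that summarizes the strength and direction of the new result give posterior odds.

```python
# Sketch of Bayesian updating (invented numbers, not from the text):
# prior belief, expressed as odds, is multiplied by a likelihood ratio
# summarizing the strength and direction of the new study's result.

def to_odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def to_prob(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

def update(prior_prob, likelihood_ratio):
    """Posterior probability after incorporating one new result."""
    return to_prob(to_odds(prior_prob) * likelihood_ratio)

# A weak prior belief (20%) is reversed by a single strong positive
# study (likelihood ratio 9), as described below.
print(round(update(0.20, 9.0), 2))   # → 0.69

# A strong prior (90%) is barely moved by a weakly negative study
# (likelihood ratio 0.8).
print(round(update(0.90, 0.8), 2))   # → 0.88
```

The hypothetical prior probabilities and likelihood ratios here are illustrative only; assigning such numbers in practice is the hard part, as discussed below.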
Chapter 11: Chance 191
Some aspects of Bayesian inference are compelling. Individual studies do not take place in an information vacuum; rather, they are in the context of all other information available at the time. Starting each study from the null hypothesis—that there is no effect—is unrealistic because something is already known about the answer to the question before the study is even begun. Moreover, results of individual studies change belief in relation to both their scientific strengths and the direction and magnitude of their results. For example, if all preceding studies were negative and the next one, which is of comparable strength, is found to be positive, then an effect is still unlikely. On the other hand, a weak prior belief might be reversed by a single strong study. Finally, with this approach it is not so important whether a small number of hypotheses are identified beforehand, and multiple comparisons are not as worrisome. Instead, prior belief depends on the plausibility of the assertion, not on whether the assertion was established before or after the study was begun.

Although Bayesian inference is appealing, so far it has been difficult to apply because of poorly developed ways of assigning numbers to prior belief and to the information contributed by a study. Two exceptions are in cumulative summaries of research evidence (Chapter 13) and in diagnostic testing, in which “belief” is prior probability and the new information is expressed as a likelihood ratio. However, Bayesian inference is the conceptual basis for qualitative thinking about cause (see Chapter 12).
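The concern raised earlier, that associations "discovered" by screening many candidate variables tend not to reappear in independent data, can be illustrated with a small simulation. The sample sizes and variable counts below are invented for illustration; they are not from any study cited in this chapter.

```python
# Hypothetical simulation (invented parameters): the outcome and all 20
# candidate variables are pure noise, yet screening the derivation
# sample for the most strongly correlated variable still yields an
# apparently substantial association. Re-measured in an independent
# validation sample, the association shrinks toward zero.
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

random.seed(0)                      # fixed seed, deterministic output
N, N_VARS, TRIALS = 50, 20, 200
derive_rs, valid_rs = [], []
for _ in range(TRIALS):
    # derivation sample: every candidate is independent of the outcome
    y1 = [random.gauss(0, 1) for _ in range(N)]
    xs1 = [[random.gauss(0, 1) for _ in range(N)] for _ in range(N_VARS)]
    best = max(range(N_VARS), key=lambda j: abs(correlation(xs1[j], y1)))
    derive_rs.append(abs(correlation(xs1[best], y1)))
    # the selected variable re-measured in an independent sample is,
    # by construction, just fresh noise
    y2 = [random.gauss(0, 1) for _ in range(N)]
    x2 = [random.gauss(0, 1) for _ in range(N)]
    valid_rs.append(abs(correlation(x2, y2)))

print("mean derivation |r|:", round(statistics.fmean(derive_rs), 2))
print("mean validation |r|:", round(statistics.fmean(valid_rs), 2))
```

On a typical run the selected variable's average correlation is substantially larger in the derivation sample than in validation, even though no variable has any real relation to the outcome, which is exactly why a model fitted to one sample is only a hypothesis until tested in new data.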
Review Questions
Read the following and select the best response.

11.1. A randomized controlled trial of thrombolytic therapy versus angioplasty for acute myocardial infarction finds no difference in the main outcome, survival to discharge from hospital. The investigators explored whether this was also true for subgroups of patients defined by age, number of vessels affected, ejection fraction, comorbidity, and other patient characteristics. Which of the following is not true about this subgroup analysis?

A. Examining subgroups increases the chance of a false-positive (misleading statistically significant) result in one of the comparisons.
B. Examining subgroups increases the chance of a false-negative finding in one of these subgroups, relative to the main result.
C. Subgroup analyses are bad scientific practice and should not be done.
D. Reporting results in subgroups helps clinicians tailor information in the study to individual patients.

11.2. A new drug for hyperlipidemia was compared with placebo in a randomized controlled trial of 10,000 patients. After 2 years, serum cholesterol (the primary outcome) was 238 mg/dL in the group receiving the new drug and 240 mg/dL in the group receiving placebo (P < 0.001). Which of the following best describes the meaning of the P value in this study?

A. Bias is unlikely to account for the observed difference.
B. The difference is clinically important.
C. A difference as big or bigger than what was observed could have arisen by chance one time in 1,000.
D. The results are generalizable to other patients with hyperlipidemia.
E. The statistical power of this study was inadequate.

11.3. In a well-designed clinical trial of treatment for ovarian cancer, remission rate at 1 year is 30% in patients offered a new drug and 20% in those offered a placebo. The P value is 0.4. Which of the following best describes the interpretation of this result?

A. Both treatments are effective.
B. Neither treatment is effective.
C. The statistical power of this study is 60%.
D. The best estimate of treatment effect size is 0.4.
E. There is insufficient information to decide whether one treatment is better than the other.
REFERENCES

1. Fisher R. In: Proceedings of the Society for Psychical Research, 1929. Quoted in Salsburg D. The Lady Tasting Tea. New York: Henry Holt and Co; 2001.
2. Johnson AF. Beneath the technological fix: outliers and probability statements. J Chronic Dis 1985;38:957–961.
3. Courtney C, Farrell D, Gray R, et al. for the AD2000 Collaborative Group. Long-term donepezil treatment in 565 patients with Alzheimer’s disease (AD2000): randomized double-blind trial. Lancet 2004;363:2105–2115.
4. Bernard SA, Gray TW, Buist MD, et al. Treatment of comatose survivors of out-of-hospital cardiac arrest with induced hypothermia. N Engl J Med 2002;346:557–563.
5. The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.
6. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585–612.
7. Lind J. A Treatise on Scurvy. Edinburgh: Sands, Murray and Cochran; 1753. Quoted by Thomas DP. J R Soc Med 1997;80:50–54.
8. Weinstein SJ, Yu K, Horst RL, et al. Serum 25-hydroxyvitamin D and risks of colon and rectal cancer in Finnish men. Am J Epidemiol 2011;173:499–508.
9. Rossouw JE, Anderson GL, Prentice RL, et al. for the Women’s Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women’s Health Initiative randomized controlled trial. JAMA 2002;288:321–333.
10. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515–517.
11. Mai PL, Wideroff L, Greene MH, et al. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study. Public Health Genomics 2010;13:495–503.
12. Venge P, Johnson N, Lindahl B, et al. Normal plasma levels of cardiac troponin I measured by the high-sensitivity cardiac troponin I access prototype assay and the impact on the diagnosis of myocardial ischemia. J Am Coll Cardiol 2009;54:1165–1172.
13. McCormack K, Scott N, Go PMNYH, et al. Laparoscopic techniques versus open techniques for inguinal hernia repair. Cochrane Database Syst Rev 2003;(1):CD001785.
14. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200–206.
15. Sackett DL, Haynes RB, Gent M, et al. Compliance. In: Inman WHW, ed. Monitoring for Drug Safety. Lancaster, UK: MTP Press; 1980.
16. Armitage P. Importance of prognostic factors in the analysis of data from clinical trials. Control Clin Trials 1981;1:347–353.
17. Hunter DJ, Kraft P. Drinking from the fire hose—statistical issues in genomewide association studies. N Engl J Med 2007;357:436–439.
18. Connolly SJ, Eikelboom J, Joyner C, et al. Apixaban in patients with atrial fibrillation. N Engl J Med 2011;364:806–817.
19. Duell EJ, Travier N, Lujan-Barroso L, et al. Alcohol consumption and gastric cancer risk in the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Am J Clin Nutr 2011;94:1266–1275.
20. Diamond GA. Future imperfect: the limitations of clinical prediction models and the limits of clinical prediction. J Am Coll Cardiol 1989;14:12A–22A.