
Chapter 11

Chance
It is a common practice to judge a result significant, if it is of such a magnitude that it
would have been produced by chance not more frequently than once in twenty trials. This
is an arbitrary, but convenient, level of significance for the practical investigator, but it
does not mean that he allows himself to be deceived once in every twenty experiments.
—Ronald Fisher
1929 (1)

KEY WORDS: Hypothesis testing, Estimation, Statistically significant, Type I (α) error, Type II (β) error, Inferential statistics, Statistical testing, P value, Statistical significance, Statistical tests, Null hypothesis, Non-parametric statistics, Two-tailed, One-tailed, Statistical power, Sample size, Point estimate, Statistical precision, Confidence interval, Multiple comparisons, Multivariable modeling, Bayesian reasoning

Learning from clinical experience, whether during formal research or in the course of patient care, is impeded by two processes: bias and chance. As discussed in Chapter 1, bias is systematic error, the result of any process that causes observations to differ systematically from the true values. Much of this book has been about where bias might lurk, how to avoid it when possible, and how to control for it and estimate its effects when bias is unavoidable.

On the other hand, random error, resulting from the play of chance, is inherent in all observations. It can be minimized but never avoided altogether. This source of error is called "random" because, on average, it is as likely to result in observed values being on one side of the true value as on the other.

Many of us tend to underestimate the importance of bias relative to chance when interpreting data, perhaps because statistics are quantitative and appear so definitive. We might say, in essence, "If the statistical conclusions are strong, a little bit of bias can't do much harm." However, when data are biased, no amount of statistical elegance can save the day. As one scholar put it, perhaps taking an extreme position, "A well designed, carefully executed study usually gives results that are obvious without a formal analysis and if there are substantial flaws in design or execution a formal analysis will not help" (2).

In this chapter, we discuss chance mainly in the context of controlled clinical trials because it is the simplest way of presenting the concepts. However, statistics are an element of all clinical research, whenever one makes inferences about populations based on information obtained from samples. There is always a possibility that the particular sample of patients in a study, even though selected in an unbiased way, might not be similar to the population of patients as a whole. Statistics help estimate how well observations on samples approximate the true situation.

TWO APPROACHES TO CHANCE

Two general approaches are used to assess the role of chance in clinical observations.

One approach, called hypothesis testing, asks whether an effect (difference) is present or is not by using statistical tests to examine the hypothesis (called the "null hypothesis") that there is no difference.
This traditional way of assessing the role of chance, associated with the familiar "P value," has been popular since statistical testing was introduced at the beginning of the 20th century. The hypothesis testing approach leads to dichotomous conclusions: Either an effect is present or there is insufficient evidence to conclude an effect is present.

The other approach, called estimation, uses statistical methods to estimate the range of values that is likely to include the true value—of a rate, measure of effect, or test performance. This approach has gained popularity recently and is now favored by most medical journals, at least for reporting main effects, for reasons described below.

HYPOTHESIS TESTING

In the usual situation, the principal conclusions of a trial are expressed in dichotomous terms, such as a new treatment is either better or not better than usual care, corresponding to the results being either statistically significant (unlikely to be purely by chance) or not. There are four ways in which the statistical conclusions might relate to reality (Fig. 11.1). Two of the four possibilities lead to correct conclusions: (i) The new treatment really is better, and that is the conclusion of the study; and (ii) the treatments really have similar effects, and the study concludes that a difference is unlikely.

CONCLUSION OF             TRUE DIFFERENCE
STATISTICAL TEST          Present               Absent
Significant               Correct               Type I (α) error
Not significant           Type II (β) error     Correct

Figure 11.1 ■ The relationship between the results of a statistical test and the true difference between two treatment groups. (Absent is a simplification. It really means that the true difference is not greater than a specified amount.)

False-Positive and False-Negative Statistical Results

There are also two ways of being wrong. The new treatment and usual care may actually have similar effects, but it is concluded that the new treatment is more effective. Error of this kind, resulting in a "false-positive" conclusion that the treatment is effective, is referred to as a type I error or α error, the probability of saying that there is a difference in treatment effects when there is not. On the other hand, the new treatment might be more effective, but the study concludes that it is not. This "false-negative" conclusion is called a type II error or β error—the probability of saying that there is no difference in treatment effects when there is. "No difference" is a simplified way of saying that the true difference is unlikely to be larger than a certain size, which is considered too small to be of practical consequence. It is not possible to establish that there is no difference at all between two treatments.

Figure 11.1 is similar to 2 × 2 tables comparing the results of a diagnostic test to the true diagnosis (see Chapter 8). Here, the "test" is the conclusion of a clinical trial based on a statistical test of results from the trial's sample of patients. The "gold standard" for validity is the true difference in the treatments being compared—if it could be established, for example, by making observations on all patients with the illness or a large number of samples of these patients. Type I error is analogous to a false-positive test result, and type II error is analogous to a false-negative test result. In the absence of bias, random variation is responsible for the uncertainty of a statistical conclusion.

Because random variation plays a part in all observations, it is an oversimplification to ask whether chance is responsible for the results. Rather, it is a question of how likely random variation is to account for the findings under the particular conditions of the study. The probability of error due to random variation is estimated by means of inferential statistics, a quantitative science that, given certain assumptions about the mathematical properties of the data, is the basis for calculations of the probability that the results could have occurred by chance alone.

Statistics is a specialized field with its own jargon (e.g., null hypothesis, variance, regression, power, and modeling) that is unfamiliar to many clinicians. However, leaving aside the genuine complexity of statistical methods, inferential statistics should be regarded by the non-expert as a useful means to an end. Statistical testing is a means by which the effects of random variation are estimated.

The next two sections discuss type I and type II errors and place hypothesis testing, as it is used to estimate the probabilities of these errors, in context.

Concluding That a Treatment Works

Most statistics encountered in the medical literature concern the likelihood of a type I error and are expressed by the familiar P value.
The P value is a quantitative estimate of the probability that differences in treatment effects in the particular study at hand could have happened by chance alone, assuming that there is in fact no difference between the groups. Another way of expressing this is that P is an answer to the question, "If there were no difference between treatment effects and the trial was repeated many times, what proportion of the trials would conclude that the difference between the two treatments was at least as large as that found in the study?"

In this presentation, P values are called Pα, to distinguish them from estimates of the other kind of error resulting from random variation, type II errors, which are referred to as Pβ. When a simple P is found in the scientific literature, it ordinarily refers to Pα. The kind of error estimated by Pα applies whenever one concludes that one treatment is more effective than another. If it is concluded that the Pα exceeds some limit (see below) so there is no statistical difference between treatments, then the particular value of Pα is not as relevant; in that situation, Pβ (probability of type II error) applies.

Dichotomous and Exact P Values

It has become customary to attach special significance to P values below 0.05 because it is generally agreed that a chance of <1 in 20 is a small enough risk of being wrong. A chance of 1 in 20 is so small, in fact, that it is reasonable to conclude that such an occurrence is unlikely to have arisen by chance alone. It could have arisen by chance, and 1 in 20 times it will, but it is unlikely.

Differences associated with Pα < 0.05 are called statistically significant. However, setting a cutoff point at 0.05 is entirely arbitrary. Reasonable people might accept higher values or insist on lower ones, depending on the consequences of a false-positive conclusion in a given situation. For example, one might be willing to accept a higher chance of a false-positive statistical test if the disease is severe, there is currently no effective treatment, and the new treatment is safe. On the other hand, one might be reluctant to accept a false-positive test if usual care is effective and the new treatment is dangerous or much more expensive. This reasoning is similar to that applied to the importance of false-positive and false-negative diagnostic tests (Chapter 8).

To accommodate various opinions about what is and is not unlikely enough, some researchers report the exact probabilities of P (e.g., 0.03, 0.07, 0.11), rather than lumping them into just two categories (≤0.05 or >0.05). Users are then free to apply their own preferences for what is statistically significant. However, P values >1 in 5 are usually reported as simply P > 0.20, because nearly everyone can agree that a probability of a type I error >1 in 5 is unacceptably high. Similarly, below very low P values (e.g., P < 0.001) chance is a very unlikely explanation for an observed difference, and little further information is conveyed by describing this chance more precisely. Another approach is to accept the primacy of P ≤ 0.05 and describe results that come close to that standard with terms such as "almost statistically significant," "did not achieve statistical significance," "marginally significant," or "a trend." These value-laden terms suggest that the finding should have been statistically significant but for some annoying reason was not. It is better to simply state the result and exact P value (or point estimate and confidence interval, see below) and let the reader decide for him or herself how much chance could have accounted for the result.

Statistical Significance and Clinical Importance

A statistically significant difference, no matter how small the P, does not mean that the difference is clinically important. A P value of <0.0001, if it emerges from a well-designed study, conveys a high degree of confidence that a difference really exists but says nothing about the magnitude of that difference or its clinical importance. In fact, trivial differences may be highly statistically significant if a large enough number of patients are studied.

Example

The drug donepezil, a cholinesterase inhibitor, was developed for the treatment of Alzheimer disease. In a randomized controlled trial to establish whether the drug produced worthwhile improvements, 565 patients with Alzheimer disease were randomly allocated to donepezil or placebo (3). The statistical significance of some trial end points was impressive: Both the mini-mental state examination and the Bristol Activities of Daily Living Scale were statistically different at P < 0.0001. However, the actual differences were small, 0.8 on a 30-point scale for the mini-mental state examination and 1 on a 60-point scale for the Bristol Activities of Daily Living Scale. Moreover, other outcomes, which more closely represented the burden of illness and care of these patients, were similar in the donepezil and placebo groups.
These included entering institutional care and progression of disability (both primary end points) as well as behavioral and psychological symptoms, caregiver psychopathology, formal care costs, unpaid caregiver time, and adverse events or death. The authors concluded that the benefits of donepezil were "below minimally relevant thresholds."

On the other hand, very unimpressive P values can result from studies with strong treatment effects if there are few patients in the study.

Statistical Tests

Statistical tests are used to estimate the probability of a type I error. The test is applied to the data to obtain a numerical summary for those data called a test statistic. That number is then compared to a sampling distribution to come up with a probability of a type I error (Fig. 11.2). The distribution is under the null hypothesis, the proposition that there is no true difference in outcome between treatment groups. This device is for mathematical reasons, not because "no difference" is the working scientific hypothesis of the investigators conducting the study. One ends up rejecting the null hypothesis (concluding there is a difference) or failing to reject it (concluding that there is insufficient evidence in support of a difference). Note that not finding statistical significance is not the same as there being no difference. Statistical testing is not able to establish that there is no difference at all.

Data → Statistical test → Test statistic → Compare to standard distribution → Estimate of probability that the observed value could be by chance alone

Figure 11.2 ■ Statistical testing.

Some commonly used statistical tests are listed in Table 11.1. The validity of many tests depends on certain assumptions about the data; a typical assumption is that the data have a normal distribution. If the data do not satisfy these assumptions, the resulting P value may be misleading. Other statistical tests, called non-parametric tests, do not make assumptions about the underlying distribution of the data. A discussion of how these statistical tests are derived and calculated and of the assumptions on which they rest can be found in any biostatistics textbook.

Table 11.1
Some Statistical Tests Commonly Used in Clinical Research

To test the statistical significance of a difference:
  Chi-square (χ²): between two or more proportions (when there are a large number of observations)
  Fisher's exact: between two proportions (when there are a small number of observations)
  Mann-Whitney U: between two medians
  Student t: between two means
  F test: between two or more means

To describe the extent of association:
  Regression coefficient: between an independent (predictor) variable and a dependent (outcome) variable
  Pearson's r: between two variables

To model the effects of multiple variables:
  Logistic regression: with a dichotomous outcome
  Cox proportional hazards: with a time-to-event outcome

The chi-square (χ²) test for nominal data (counts) is more easily understood than most and can be used to illustrate how statistical testing works. The extent to which the observed values depart from what would have been expected if there were no treatment effect is used to calculate a P value.

Example

Cardiac arrest outside the hospital has a poor outcome. Animal studies suggested that hypothermia might improve neurologic outcomes. To test this hypothesis in humans, 77 patients who remained unconscious after resuscitation from out-of-hospital cardiac arrest were randomized to cooling (hypothermia) or usual care (4).
The primary outcome was survival to hospital discharge with relatively good neurologic function.

Observed Rates: Survival with Good Neurological Function

                 Yes    No    Total
Hypothermia       21    22     43
Usual care         9    25     34
Total             30    47     77

Success rates were 49% in the patients treated with hypothermia and 26% in the patients on usual care. How likely would it be for a study of this size to observe a difference in rates as great as this or greater if there was in fact no difference in effectiveness? That depends on how far the observed results depart from what would have been expected if the treatments were of similar effectiveness and only random variation accounted for the different rates. If treatment had no effect on outcome, applying the success rate for the patients as a whole (30/77 = 39%) to the number of patients in each treatment group gives the expected number of successes in each group:

Expected Rates: Success

                 Yes      No      Total
Hypothermia     16.75    26.25     43
Usual care      13.25    20.75     34
Total           30       47        77

The χ² statistic is the square of the differences between observed and expected divided by expected, summed over all four cells:

    χ² = Σ (Observed number − Expected number)² / Expected number

The magnitude of the χ² statistic is determined by how different all of the observed numbers are from what would have been expected if there were no treatment effect. Because they are squared, it does not matter whether the observed rates exceed or fall short of the expected. By dividing the squared difference in each cell by the expected number, the difference is adjusted for the number of patients in that cell. The χ² statistic for these data is:

    χ² = (21 − 16.75)²/16.75 + (9 − 13.25)²/13.25 + (22 − 26.25)²/26.25 + (25 − 20.75)²/20.75 = 4.0

This χ² is then compared to a table (available in books and computer programs) relating χ² values to probabilities for that number of cells to obtain the probability of a χ² of 4.0. It is intuitively obvious that the larger the χ², the less likely chance is to account for the observed differences. The resulting P value for a chi-square test statistic of 4.0 and a 2 × 2 table was 0.046, which is the probability of a false-positive conclusion that the treatments had different effects. That is, the study results meet the conventional criterion for statistical significance, P ≤ 0.05.
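This arithmetic is easy to verify with a few lines of code. The sketch below is ours, not part of the original example; it assumes Python with the numpy and scipy libraries available, and it reproduces the expected counts, the χ² of 4.0, and the P value of 0.046:

    # Chi-square test for the hypothermia trial's 2 x 2 table.
    import numpy as np
    from scipy.stats import chi2_contingency

    observed = np.array([[21, 22],   # hypothermia: success, failure
                         [9, 25]])   # usual care: success, failure

    # correction=False gives the uncorrected chi-square used in the text;
    # Yates' continuity correction would yield a larger P value.
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)

    print(expected)        # [[16.75 26.25] [13.25 20.75]], as in the table
    print(round(chi2, 1))  # 4.0
    print(round(p, 3))     # 0.046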
When using statistical tests, the usual approach is to test for the probability that an intervention is either more or less effective than another to a statistically important extent. In this situation, testing is called two-tailed, referring to both tails of a bell-shaped curve describing the random variation in differences between treatment groups of equal value, where the two tails of the curve include statistically unlikely outcomes favoring one or the other treatment. Sometimes there are compelling reasons to believe that one treatment could only be better or worse than the other, in which case one-tailed testing is used, where all of the type I error (5%) is in one of the tails, making it easier to reach statistical significance.

Concluding That a Treatment Does Not Work

Some trials are unable to conclude that one treatment is better than the other. The risk of a false-negative result is particularly large in studies with relatively few patients or outcome events. The question then arises: How likely is a false-negative result (type II or β error)? Could the "negative" findings in such trials have misrepresented the truth because these particular studies had the bad luck to turn out in a relatively unlikely way?
Example

One of the examples in Chapter 9 was a randomized controlled trial of the effects on cardiovascular outcomes of adding niacin versus placebo in patients with lipid abnormalities who were already taking statin drugs (5). It was a "negative" trial: Primary outcomes occurred in 16.4% of patients taking niacin and 16.2% in patients taking placebo, and the authors concluded that "there was no incremental clinical benefit from the addition of niacin to statin therapy." The statistical question associated with this assertion is: How likely was it that the study found no benefit when there really is one? After all, there were only a few hundred cardiovascular outcome events in the study so the play of chance might have obscured treatment effects. Figure 11.3 shows time-to-event curves for the primary outcome in the two treatment groups. Patients in the niacin and control groups had remarkably similar curves throughout follow-up, making a protective effect of niacin implausible.

Figure 11.3 ■ Example of a "negative" trial. Time-to-event curves of the cumulative percent of patients with the primary outcome over 4 years of follow-up were remarkably similar for the niacin plus statin and placebo plus statin groups. Number at risk at years 0 through 4: niacin plus statin 1,718, 1,606, 1,366, 903, and 428; placebo plus statin 1,696, 1,581, 1,381, 910, and 436. (Redrawn with permission from The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.)

Visual presentation of negative results can be convincing. Alternatively, one can examine confidence intervals (see Point Estimates and Confidence Intervals, below) and learn a lot about whether the study was large enough to rule out clinically important differences if they existed.

Of course, reasons for false-negative results other than chance also need to be considered: biologic reasons such as too short follow-up or too small dose of niacin, as well as study limitations such as noncompliance and missed outcome events.

Type II errors have received less attention than type I errors for several reasons. They are more difficult to calculate. Also, most professionals simply prefer things that work and consider negative results unwelcome. Authors are less likely to submit negative studies to journals and when negative studies are reported at all, the authors may prefer to emphasize subgroups of patients in which treatment differences were found. Authors may also emphasize reasons other than chance to explain why true differences might have been missed. Whatever the reason for not considering the probability of a type II error, it is the main question that should be asked when the results of a study are interpreted as "no difference."

HOW MANY STUDY PATIENTS ARE ENOUGH?

Suppose you are reading about a clinical trial that compares a promising new treatment to usual care and finds no difference.
You are aware that random variation can be the reason for whatever differences are or are not observed, and you wonder if the number of patients in this study is large enough to make chance an unlikely explanation for what was found. Alternatively, you may be planning to do such a study and have the same question. Either way, you need to understand how many patients would be needed to make a strong comparison of the effects of the two treatments.

Statistical Power

The probability that a study will find a statistically significant difference when a difference really exists is called the statistical power of the study. Power and Pβ are complementary ways of expressing the same concept.

    Statistical power = 1 − Pβ

Power is analogous to the sensitivity of a diagnostic test. One speaks of a study being powerful when it has a high probability of detecting differences when treatments really do have different effects.

Estimating Sample Size Requirements

From the point of view of hypothesis testing of nominal data (counts), an adequate sample size depends on four characteristics of the study: the magnitude of the difference in outcome between treatment groups, Pα and Pβ (the probability of the false-positive and false-negative conclusions you are willing to accept), and the underlying outcome rate.

These determinants of adequate sample size should be taken into account when investigators plan a study, to ensure that the study will have enough statistical power to produce meaningful results. To the extent that investigators have not done this well, or some of their assumptions were found to be inaccurate, readers need to consider the same issues when interpreting study results.

Effect Size

Sample size depends on the magnitude of the difference to be detected. One is free to look for differences of any magnitude and of course one hopes to be able to detect even very small differences, but more patients are needed to detect small differences, everything else being equal. Therefore, it is best to ask, "What is a sufficient number of patients to detect the smallest degree of improvement that would be clinically meaningful?" On the other hand, if one is interested in detecting only very large differences between treated and control groups (i.e., strong treatment effects) then fewer patients need to be studied.

Type I Error

Sample size is also related to the risk of a type I error (concluding that treatment is effective when it is not). The acceptable probability for a risk of this kind is a value judgment. If one is prepared to accept the consequences of a large chance of falsely concluding that the treatment is effective, one can reach conclusions with fewer patients. On the other hand, if one wants to take only a small risk of being wrong in this way, a larger number of patients will be required. As discussed earlier, it is customary to set Pα at 0.05 (1 in 20) or sometimes 0.01 (1 in 100).

Type II Error

The chosen risk of a type II error is another determinant of sample size. An acceptable probability of this error is also a judgment that can be freely made and changed to suit individual tastes. Pβ is often set at 0.20, a 20% chance of missing true differences in a particular study. Conventional type II errors are much larger than type I errors, reflecting a higher value placed on being sure an effect is really present when it is said to be.

Characteristics of the Data

The statistical power of a study is also determined by the nature of the data. When the outcome is expressed by counts or proportions of events or time-to-event, its statistical power depends on the rate of events: The larger the number of events, the greater the statistical power for a given number of people at risk. As Peto et al. (6) put it:

In clinical trials of time to death (or of the time to some other particular "event"—relapse, metastasis, first thrombosis, stroke recurrence, or time to death from a particular cause)—the ability of the trial to distinguish between the merits of two treatments depends on how many patients die (or suffer a relevant event), rather than on the number of patients entered. A study of 100 patients, 50 of whom die, is about as sensitive as a study with 1,000 patients, 50 of whom die.

If the data are continuous, such as blood pressure or serum cholesterol, power is affected by the degree to which patients vary among themselves. The greater the variation from patient to patient with respect to the characteristic being measured, the more difficult it is to be confident that the observed differences (or lack of difference) between groups is not because of this variation, rather than a true treatment effect.
Table 11.2
Determinants of Sample Size

Sample size varies according to:
  Determined by the investigator: effect size, Pα, Pβ
  Determined by the data type: 1/variability (means) or 1/outcome rate (counts)
In designing a study, investigators choose the smallest treatment effect that is clinically important (larger treatment effects will be easier to detect) and the type I and type II errors they are willing to accept. They also obtain estimates of outcome event rates or variation among patients. It is possible to design studies that maximize power for a given sample size—such as by choosing patients with a high event rate or similar characteristics—as long as they match the research question.

Interrelationships

The relationships among the four variables that together determine an adequate sample size are summarized in Table 11.2. The variables can be traded off against one another. In general, for any given number of patients in the study, there is a trade-off between type I and type II errors. Everything else being equal, the more one is willing to accept one kind of error, the less it will be necessary to risk the other. Neither kind of error is inherently worse than the other. It is, of course, possible to reduce both type I and type II errors if the number of patients is increased, outcome events are more frequent, variability is decreased, or a larger treatment effect is sought.
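To make these trade-offs concrete, here is a minimal sketch of a conventional sample size calculation for comparing two proportions. It is ours, not from the book; it assumes Python with the statsmodels package, and the outcome rates of 20% and 15% are invented for illustration:

    # Patients per group needed to detect a drop in event rate from 20% to
    # 15%, with a two-tailed P(alpha) of 0.05 and 80% power (P(beta) = 0.20).
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.20, 0.15)  # standardized effect size
    n_per_group = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative='two-sided')
    print(round(n_per_group))  # roughly 900 per group with these assumptions

Because the required sample size grows roughly with the inverse square of the effect size, halving the difference to be detected roughly quadruples the number of patients needed, which is why small treatment effects demand large trials.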
For conventional levels of Pα and Pβ, the relationship between the size of treatment effect and the number of patients needed for a trial is illustrated by the following examples. One represents a situation in which a relatively small number of patients was sufficient, and the other is one in which a very large number of patients was too small.

Example

A Small Sample Size That Was Adequate

For many centuries, scurvy, a vitamin C deficiency syndrome causing bleeding into gums and joints progressing to disability and death, was a huge problem for sailors. While aboard ship in 1747, James Lind studied the treatment of scurvy in one of the earliest controlled trials of treatment (7). He divided 12 badly affected sailors, who were confined to the ship's sick bay, into 6 treatment groups of 2 sailors each. Treatments were oranges and lemons, cider, elixir of vitriol, vinegar, seawater, or nutmeg. Except for the treatment, the sailors remained in the same place and ate the same food. Six days after treatment was begun, the two sailors given oranges and lemons, which contained ample amounts of vitamin C, were greatly improved: One was fit for duty and one was well enough to nurse the others. Sailors given other treatments were not improved. The P value (not calculated by Lind) for 2 of 2 successes versus 10 of 10 failures was 0.02 (assuming that Lind believed beforehand that oranges and lemons were the intervention of interest and that the other five interventions were inactive controls).
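The P value quoted for Lind's trial can be reproduced with Fisher's exact test, the small-sample test listed in Table 11.1. The sketch below is ours, not from the book, and assumes Python with scipy:

    # Fisher's exact test: 2 of 2 improved on citrus versus 0 of 10 improved
    # on the other five treatments combined.
    from scipy.stats import fisher_exact

    table = [[2, 0],    # oranges and lemons: improved, not improved
             [0, 10]]   # other treatments: improved, not improved

    # One-tailed, matching the assumption that citrus was the intervention
    # of interest and the other treatments were inactive controls.
    _, p = fisher_exact(table, alternative='greater')
    print(round(p, 2))  # 0.02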
A Large Sample Size That Was Inadequate

Low serum vitamin D levels may be a risk factor for colorectal cancer, but results of studies of this question have been inconsistent with each other. In 1985, investigators established (originally, for another purpose) a cohort of 29,133 Finnish men aged 50 to 69 and measured their baseline serum vitamin D levels (8). Incident colon and rectal cancers were identified from the National Cancer Registry. After 12 years of follow-up, 239 colon cancers and 192 rectal cancers developed in the cohort. After adjustment for confounders, serum vitamin D levels were positively associated with colon cancer incidence and inversely associated with rectal cancer incidence, but neither of these findings was statistically significant.
For most of the therapeutic questions encountered today, a surprisingly large sample size is required. The value of dramatic, powerful treatments, such as antibiotics for pneumonia or thyroid replacement for hypothyroidism, was established by clinical experience or studying a small number of patients, but such treatments come along rarely and many of them are already well established. We are left with diseases, many of which are chronic and have multiple, interacting causes, for which the effects of new treatments are generally small. This makes it especially important to plan clinical studies that are large enough to distinguish real from chance effects.

Figure 11.4 shows the relationship between sample size and treatment difference for several baseline rates of outcome events. Studies involving fewer than 100 patients have a poor chance of detecting statistically significant differences for even large treatment effects. Looked at another way, it is difficult to detect effect sizes of <25%. In practice, statistical power can be estimated by means of readily available formulas, tables, nomograms, computer programs, or Web sites.

Figure 11.4 ■ The number of people required in each of two treatment groups (of equal size), for various rates of outcome events in the untreated group (0.50, 0.20, and 0.05), to have an 80% chance of detecting a difference (P < 0.05) in reduction in outcome event rates in treated relative to untreated patients. (Calculated from formula in Weiss NS. Clinical Epidemiology: The Study of the Outcome of Illness. New York: Oxford University Press; 1986.)

POINT ESTIMATES AND CONFIDENCE INTERVALS

The effect size that is observed in a particular study (such as treatment effect in a clinical trial or relative risk in a cohort study) is called the point estimate of the effect. It is the best estimate from the study of the true effect size and is the summary statistic usually given the most emphasis in reports of research.

However, the true effect size is unlikely to be exactly that observed in the study. Because of random variation, any one study is likely to find a result higher or lower than the true value. Therefore, a summary measure is needed for the statistical precision of the point estimate, the range of values likely to encompass the true effect size.

Statistical precision is expressed as a confidence interval, usually the 95% confidence interval, around the point estimate. Confidence intervals are interpreted as follows: If the study is unbiased, there is a 95% chance that the interval includes the true effect size. The narrower the confidence interval, the more certain one can be about the size of the true effect. The true value is most likely to be close to the point estimate, less likely to be near the outer limits of the interval, and could (5 times out of 100) fall outside these limits altogether. Statistical precision increases with the statistical power of the study.

Example

The Women's Health Initiative included a randomized controlled trial of the effects of estrogen plus progestin on chronic disease outcomes in healthy postmenopausal women (9). Figure 11.5 shows relative risk and confidence intervals for four of these outcomes: stroke, hip fracture, breast cancer, and endometrial cancer.
The four illustrate various possibilities for how confidence intervals are interpreted. Estrogen plus progestin was a risk factor for stroke; the best estimate of this risk is the point estimate, a relative risk of 1.41, but the data are consistent with a relative risk as low as 1.07 or as high as 1.85. Estrogen plus progestin protected against hip fracture, preventing as much as 65% and as little as 2% of fractures. That is, the data are consistent with very little benefit, although substantial benefit is likely and even larger benefits are consistent with the results. Although risk of breast cancer is likely to be increased, the data are consistent with no effect (the lower end of the confidence interval includes a relative risk of 1.0). Finally, the study is not very informative for endometrial cancer. Confidence intervals are very wide, so not only was there no clear risk or benefit, but also the estimate of risk was so imprecise that substantial risk or benefit remained possible.

Figure 11.5 ■ Example of confidence intervals. The relative risk and confidence intervals for four outcomes (stroke, hip fracture, breast cancer, and endometrial cancer) in the Women's Health Initiative: a randomized controlled trial of estrogen plus progestin in healthy postmenopausal women. (Data from Writing Group for the Women's Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women. JAMA 2002;288:321–333.)

Statistical significance at the 0.05 level can be obtained from 95% confidence intervals. If the point corresponding to no effect (i.e., a relative risk of 1 or a treatment difference of 0) falls outside the 95% confidence intervals for the observed effect, the results are statistically significant at the 0.05 level. If the confidence intervals include this point, the results are not statistically significant.

Confidence intervals have advantages over P values. They put the emphasis where it belongs, on the size of the effect. Confidence intervals help the reader to see the range of plausible values and so to decide whether an effect size they regard as clinically meaningful is consistent with or ruled out by the data (10). They also provide information about statistical power. If the confidence interval is relatively wide and barely includes the value corresponding to no effect, readers can see that low power might have been the reason for the negative result. On the other hand, if the confidence interval is narrow and includes no effect, a large effect is ruled out.

Point estimates and confidence intervals are used to characterize the statistical precision of any rate (incidence and prevalence), diagnostic test performance, comparisons of rates (relative and attributable risks), and other summary statistics. For example, studies have shown that 7.0% (95% confidence interval, 5.2–9.4) of adults have a clinically important family history of prostate cancer (11); that the sensitivity of a high-sensitivity cardiac troponin assay (at the optimal cutoff point) for acute coronary syndrome was 84.8% (95% confidence interval 82.8–86.6) (12); and that return to usual activity after inguinal hernia repair was shorter for laparoscopic than open surgery (hazard ratio 0.56, 95% confidence interval 0.51–0.61) (13).
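To show where such intervals come from, here is a minimal sketch of the usual log-scale calculation of a 95% confidence interval for a relative risk. It is ours, the counts are invented (they are not data from the studies cited above), and it assumes Python with scipy:

    # 95% confidence interval for a relative risk, computed on the log scale.
    import math
    from scipy.stats import norm

    a, n1 = 60, 1000   # events / total in the exposed group (hypothetical)
    b, n0 = 40, 1000   # events / total in the unexposed group (hypothetical)

    rr = (a / n1) / (b / n0)                 # point estimate: 1.50
    se = math.sqrt(1/a - 1/n1 + 1/b - 1/n0)  # standard error of log(RR)
    z = norm.ppf(0.975)                      # 1.96 for 95% coverage

    low = math.exp(math.log(rr) - z * se)
    high = math.exp(math.log(rr) + z * se)
    print(f"RR = {rr:.2f}, 95% CI {low:.2f} to {high:.2f}")  # 1.02 to 2.22

Because the lower limit of this hypothetical interval just excludes 1.0, the result would be statistically significant at the 0.05 level, the correspondence described above.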
Confidence intervals have become the usual way of reporting the main results of clinical research because of their many advantages over the hypothesis testing (P value) approach. P values are still used because of tradition and as a convenience when many results are reported and it would not be feasible to include confidence intervals for all of them.

Statistical Power after a Study Is Completed

Earlier in the chapter, we discussed how calculation of statistical power based on the hypothesis testing approach is performed before a study is undertaken to ensure that enough patients will be entered to have a good chance of detecting a clinically meaningful effect if one is present. However, after the study is completed, this approach is less relevant (14).
There is no longer a need to estimate effect size, outcome event rates, and variability among patients because they are all known. Rather, attention should be directed to point estimates and confidence intervals. With them, one can see the range of values that are consistent with the results and whether the effect sizes of interest are within this range or are ruled out by the data. In the niacin study, summarized earlier as an example of a negative trial, the hazard ratio was 1.02 and the 95% confidence interval was 0.87 to 1.21, meaning that the results were consistent with a small degree of benefit or harm. Whether this matters depends on the clinical importance attached to a difference in rates as large as represented by this confidence interval.

DETECTING RARE EVENTS

It is sometimes important to know how likely a study is to detect a relatively uncommon event (e.g., 1/1,000), particularly if that event is severe, such as bone marrow failure or life-threatening arrhythmia. A great many people must be observed in order to have a good chance of detecting even one such event, much less to establish a relatively stable estimate of its frequency. For most clinical research, sample size is planned to be sufficient to detect main effects, the answer sought for the primary research question. Sample size is likely to be well short of the number needed to detect rare events such as uncommon side effects and complications. For that, a different approach, involving many more patients, is needed. An example is postmarketing surveillance of a drug, in which thousands of users are monitored for side effects.

Figure 11.6 ■ The probability of detecting one event according to the rate of the event (risk = 1/100, 1/1,000, 1/10,000, or 1/100,000) and the number of people observed (1,000 to 1,000,000 per treatment group). (Redrawn with permission from Guess HA, Rudnick SA. Use of cost effectiveness analysis in planning cancer chemoprophylaxis trials. Control Clin Trials 1983;4:89–100.)

Figure 11.6 shows the probability of detecting an event as a function of the number of people under observation. A rule of thumb is: To have a good chance of detecting a 1/x event, one must observe 3x people (15). For example, to detect at least one event if the underlying rate is 1/1,000, one would need to observe 3,000 people.
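The arithmetic behind this rule of thumb, which the text does not spell out, is binomial: if the event rate is p, the chance of seeing at least one event among n people is 1 − (1 − p)^n. With p = 1/x and n = 3x, this is 1 − (1 − 1/x)^(3x), which is approximately 1 − e^(−3), or about 0.95; "a good chance" here therefore means roughly a 95% chance.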
MULTIPLE COMPARISONS

The statistical conclusions of research have an aura of authority that defies challenge, particularly by non-experts. However, as many skeptics have suspected, it is possible to "lie with statistics" even if the research is well designed, the mathematics flawless, and the investigators' intentions beyond reproach.

Statistical conclusions can be misleading because the strength of statistical tests depends on the number of research questions considered in the study and when those questions were asked. If many comparisons are made among the variables in a large set of data, the P value associated with each individual comparison is an underestimate of how often the result of that comparison, among the others, is likely to arise by chance.
As implausible as it might seem, the interpretation of the P value from a single statistical test depends on the context in which it is done. To understand how this might happen, consider the following example.

Example

Suppose a large study has been done in which there are multiple subgroups of patients and many different outcomes. For instance, it might be a clinical trial of the value of a treatment for coronary artery disease for which patients are in 12 clinically meaningful subgroups (e.g., one-, two-, and three-vessel disease; good and bad ventricular function; the presence or absence of arrhythmias; and various combinations of these), and four outcomes are considered (e.g., death, myocardial infarction, heart failure, and angina). Suppose also that there are no true associations between treatment and outcome in any of the subgroups and for any of the outcomes. Finally, suppose that the effects of treatment are assessed separately for each subgroup and for each outcome—a process that involves a great many comparisons; in this case, 12 subgroups multiplied by 4 outcomes = 48 comparisons. With a Pα = 0.05, 1 in 20 of these comparisons (in this case, 2 to 3) is likely to be statistically significant by chance alone. In the general case, if 20 comparisons are made, on average, 1 would be found to be statistically significant; if 100 comparisons are made, about 5 would be likely to be statistically significant, and so on. Thus, when a great many comparisons have been made, a few will be found that are strong enough, because of random variation, to appear statistically significant even if no true associations between variables exist in nature. As the saying goes, "If you torture the data long enough, they will confess!"
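The expected counts in this example follow directly from the meaning of Pα (a standard calculation, not spelled out in the original): with 48 independent comparisons each tested at 0.05, the expected number of false-positive results is 48 × 0.05 = 2.4, and the chance of at least one false-positive result is 1 − 0.95^48, or about 91%.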
This phenomenon is referred to as the multiple comparisons problem. Because of this problem, the strength of evidence from clinical research depends on how focused its questions were at the outset. Unfortunately, when the results of research are presented, it is not always possible to know how many comparisons really were made. Sometimes, interesting findings have been selected from a larger number of uninteresting ones that are not mentioned. This process of deciding after the fact what is and is not important about a mass of data can introduce considerable distortion of reality. Table 11.3 summarizes how this misleading situation can arise.

Table 11.3
How Multiple Comparisons Can Be Misleading

1. Make multiple comparisons within a study.
2. Apply tests of statistical significance to each comparison.
3. Find a few comparisons that are "interesting" (statistically significant).
4. Build an article around one of these interesting findings.
5. Do not mention the context of the individual comparison (how many questions were examined and which was considered primary before the data was examined).
6. Construct a post hoc argument for the plausibility of the isolated finding.

How can the statistical effects of multiple comparisons be taken into account when interpreting research? Although ways of adjusting P values have been proposed, probably the best advice is to be aware of the problem and to be cautious about accepting positive conclusions of studies in which multiple comparisons were made. As put by Armitage (16):

If you dredge the data sufficiently deeply and sufficiently often, you will find something odd. Many of these bizarre findings will be due to chance. I do not imply that data dredging is not an occupation for honorable persons, but rather that discoveries that were not initially postulated as among the major objectives of the trial should be treated with extreme caution.

A special case of multiple comparisons occurs when data in a clinical trial are examined repeatedly as they accrue, to assure that the trial is stopped as soon as there is an answer, regardless of how long the trial was planned to run. If this is done, as it often is for ethical reasons, the final P value is usually adjusted for the number of looks at the data. There is a statistical incentive to keep the number of looks to a minimum. In any case, if the accruing data are examined repeatedly, it will be more difficult to reach statistical significance after the multiple looks at the data are taken into account.
Another special case is genome-wide association studies, in which more than 500,000 single-nucleotide polymorphisms may be examined in cases and controls (17). A common way to manage multiple comparisons, to divide the usual P value of 0.05 by the number of comparisons, would require a genome-wide level of statistical significance of 0.0000001 (10⁻⁷), which would be difficult to achieve because sample sizes in these studies are constrained and relative risks are typically small. Because of this, belief in results of genome-wide association studies relies on the consistency and strength of associations across many studies.

SUBGROUP ANALYSIS

It is tempting to go beyond the main results of a study to examine results within subgroups of patients with characteristics that might be related to treatment effect (i.e., to look for effect modification, as discussed in Chapter 5). Because characteristics present at the time of randomization are randomly allocated into treatment groups, the consequence of subgroup analysis is to break the trial as a whole into a set of smaller randomized controlled trials, each with a smaller sample size.

Example

Atrial fibrillation is treated with anticoagulants, vitamin K antagonists in high-risk patients, to prevent stroke. Investigators studied stroke prevention in patients with atrial fibrillation who could not take vitamin K antagonists. They randomly assigned these patients to a new anticoagulant, apixaban, or to aspirin (18). There were 164 outcome events and the hazard ratio was 0.45 (95% confidence interval 0.32–0.62) favoring apixaban. Figure 11.7 shows the hazard ratio for stroke in subgroups, with point estimates indicated by boxes (with their size proportional to subgroup size) and confidence intervals. A total of 42 subgroups were reported, 21 for one outcome (stroke and systemic embolism, shown) and the same 21 for another outcome (major bleeding, not shown). The hazard ratios for most subgroups are around the hazard ratio for the study as a whole (shown by the vertical dashed line) and although some were higher or lower, none was statistically significant. This analysis suggests that effect modification was not present, but this conclusion was limited by low statistical precision in subgroups. Multiple comparisons, leading to a false-positive finding in subgroups, would have been an issue if there had been a statistically significant effect in one or more of the subgroups.
Characteristic               No. of     Aspirin, no. of    Apixaban, no. of
                             patients   events (%/yr)      events (%/yr)
Overall                      5,599      113 (3.7)          51 (1.6)
Age
  <65 yr                     1,714      19 (2.0)           7 (0.7)
  65 to <75 yr               1,987      28 (2.7)           24 (2.0)
  ≥75 yr                     1,897      66 (6.1)           20 (2.0)
Sex
  Female                     2,321      64 (4.9)           25 (1.9)
  Male                       3,277      49 (2.7)           26 (1.4)
Estimated GFR
  <50 mL/min                 1,198      36 (5.8)           16 (2.5)
  50 to <80 mL/min           2,374      59 (4.5)           22 (1.7)
  ≥80 mL/min                 2,021      18 (1.6)           13 (1.1)
CHADS2 score
  0–1                        2,026      18 (1.6)           10 (0.9)
  2                          1,999      40 (3.7)           25 (2.1)
  ≥3                         1,570      55 (6.3)           16 (1.9)
Prior stroke or TIA
  No                         4,835      80 (3.0)           41 (1.5)
  Yes                        764        33 (8.3)           10 (2.5)
Study aspirin dose
  <162 mg daily              3,602      85 (4.3)           39 (1.9)
  ≥162 mg daily              1,978      27 (2.4)           12 (1.1)
Previous VKA use
  Yes                        2,216      52 (4.2)           17 (1.4)
  No                         3,383      61 (3.3)           34 (1.8)
Patient refused VKA
  No                         3,506      73 (3.8)           35 (1.8)
  Yes                        2,092      40 (3.4)           16 (1.45)
Heart failure
  No                         3,428      66 (3.6)           28 (1.5)
  Yes                        2,171      45 (3.8)           23 (1.8)

Hazard ratios with apixaban (95% CI) were plotted for each subgroup on a scale from 0.05 (apixaban better) to 4.00 (aspirin better).

Figure 11.7 ■ A subgroup analysis from a randomized controlled trial of the effectiveness of apixaban versus aspirin on stroke and systemic embolism in patients with atrial fibrillation. GFR (glomerular filtration rate) is a measure of kidney function. CHADS2 score is a prediction rule for the risk of embolism in patients with atrial fibrillation. (Redrawn with permission from Connolly SJ, Eikelboom J, Joyner C, et al. Apixaban in patients with atrial fibrillation. N Engl J Med 2011;364:806–817.)
Subgroup analyses tell clinicians about effect modification so they can tailor their care of individual patients as closely as possible to study results in patients like them. However, subgroup analyses incur risks of misleading results because of the increased chance of finding effects in a particular subgroup that are not present, in the long run, in nature, that is, finding false-positive results because of multiple comparisons.

In practice, the effects of multiple comparisons may not be as extreme when treatment effects in the various subgroups are not independent of each other. To the extent that the variables are related to each other, rather than independent, the risk of false-positive findings is lessened. In the atrial fibrillation and anticoagulant example, age and prior stroke are components of the CHADS2 score (a metric for risk of stroke), but the three are treated as separate subgroups.

Another danger is coming to a false-negative conclusion. Within subgroups defined by certain kinds of patients or specific kinds of outcomes, there are fewer patients than for the study as a whole, often too few to rule out false-negative findings. Studies are, after all, designed to have enough patients to answer the main research question with sufficient statistical power. They are ordinarily not designed to have sufficient statistical power in subgroups, where the number of patients and outcome events is smaller. Guidelines for deciding whether a finding in a subgroup is real are summarized in Table 11.4.

Table 11.4
Guidelines for Deciding Whether Apparent Differences in Effects within Subgroups Are Real*

From the study itself:
• Is the magnitude of the observed difference clinically important?
• How likely is the effect to have arisen by chance, taking into account: the number of subgroups examined? the magnitude of the P value?
• Was a hypothesis that the effect would be observed made before its discovery (or was justification for the effect argued for after it was found)?
• Was it one of a small number of hypotheses?

From other information:
• Was the difference suggested by comparisons within rather than between studies?
• Has the effect been observed in other studies?
• Is there direct evidence that supports the existence of the effect?

*Adapted from Oxman AD, Guyatt GH. A consumer's guide to subgroup analysis. Ann Intern Med 1992;116:78–84.

Multiple Outcomes

Another version of multiple looks at the data is to report multiple outcomes—different manifestations of effectiveness, intermediate outcomes, and harms. Usually this is handled by naming one of the outcomes primary and the others secondary and then being more guarded about conclusions for the secondary outcomes. As with subgroups, outcomes tend to be related to each other biologically (and as a consequence statistically), as is the case in the above example where stroke and systemic embolism are different manifestations of the same clinical phenomenon.

MULTIVARIABLE METHODS

Most clinical phenomena are the result of many variables acting together in complex ways. For example, coronary heart disease is the joint result of lipid abnormalities, hypertension, cigarette smoking, family history, diabetes, diet, exercise, inflammation, coagulation abnormalities, and perhaps personality. It is appropriate to try to understand these relationships by first examining relatively simple arrangements of the data, such as stratified analyses that show whether the effect of one variable is changed by the presence or absence of one or more of the other variables. It is relatively easy to understand the data when they are displayed in this way.

However, as mentioned in Chapter 7, it is usually not possible to account for more than a few variables using this method because there are not enough patients with each combination of characteristics to allow stable estimates of rates. For example, if 120 patients were studied, 60 in each treatment group, and just one additional dichotomous variable was taken into account, there would only be, at most, about 15 patients in each subgroup; if patients were unevenly distributed among subgroups, there would be even fewer in some.

What is needed then, in addition to tables showing multiple subgroups, is a way of examining the effects of several variables together. This is accomplished by multivariable modeling—developing a mathematical expression of the effects of many variables taken together. It is "multivariable" because it examines the effects of multiple variables simultaneously. It is "modeling" because it is a mathematical construct, calculated from the data based on assumptions about characteristics of the data (e.g., that the variables are all normally distributed or all have the same variance).

Mathematical models are used in two general ways in clinical research. One way is to study the independent effect of one variable on outcome while taking into account the effects of other variables that might confound or modify this relationship (discussed under multivariable adjustment in Chapter 5). The second way is to predict a clinical event by calculating the combined effect of several variables acting together (introduced in concept under Clinical Prediction Rules in Chapter 7).

The basic structure of a multivariable model is:

    Outcome variable = constant + (β1 × variable1) + (β2 × variable2) + . . . ,

where β1, β2, . . . are coefficients determined by the data, and variable1, variable2, . . . are the variables that might be related to outcome. The best estimates of the coefficients are determined mathematically and depend on the powerful calculating ability of modern computers.
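As an illustration of this structure, the sketch below fits a logistic regression model, one of the model types described later in this section. It is ours, not from the book; it assumes Python with numpy and statsmodels, and the variables and data are invented:

    # Fitting a multivariable model: outcome = constant + b1*age + b2*smoker,
    # on the log-odds scale, using logistic regression.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 500
    age = rng.normal(60, 10, n)      # hypothetical predictor variable
    smoker = rng.integers(0, 2, n)   # hypothetical predictor variable

    # Simulate a dichotomous outcome whose log odds depend on both variables.
    log_odds = -7 + 0.08 * age + 0.9 * smoker
    outcome = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

    X = sm.add_constant(np.column_stack([age, smoker]))  # constant + variables
    fit = sm.Logit(outcome, X).fit(disp=0)
    print(fit.params)      # estimated constant and coefficients (the betas)
    print(fit.conf_int())  # 95% confidence interval for each coefficient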
Modeling is done in many different ways, but some elements of the process are basic.

1. Identify all the variables that might be related to the outcome of interest either as confounders or effect modifiers. As a practical matter, it may not be possible to actually measure all of them and the missing variables should be mentioned explicitly as a limitation.
2. If there are relatively few outcome events, the number of variables to be considered in the model might need to be reduced to a manageable size, usually no more than several. Often this is done by selecting variables that, when taken one at a time, are most strongly related to outcome. If a statistical criterion is used at this stage, it is usual to err on the side of including variables, for example, by choosing all variables showing an association with the outcome of interest at a cutoff level of P < 0.10. Evidence for the biologic importance of the variable is also considered in making the selection.
with the outcome of interest at a cutoff level of P < 0.10 (a minimal sketch of this screening step appears after the list). Evidence for the biologic importance of the variable is also considered in making the selection.
3. Models, like other statistical tests, are based on assumptions about the structure of the data. Investigators need to check whether these assumptions are met in their particular data.
4. As for the actual models, there are many kinds, and many strategies can be followed within them. All variables—exposure, outcome, and covariates—are entered in the model, with the order determined by the research question. For example, if some are to be controlled for in a causal analysis, they are entered into the model first, followed by the variable of primary interest. The model will then identify the independent effect of the variable of primary interest. On the other hand, if the investigator wants to make a prediction based on several variables, the relative strength of their association with the outcome variable is determined by the model.
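Here is the promised sketch of the screening step in item 2. The data are simulated, and a simple correlation test stands in for whatever univariable test suits the actual data types:

    import numpy as np
    from scipy import stats

    # Hypothetical dataset: screen candidate variables one at a time
    # and keep those associated with the outcome at P < 0.10.
    rng = np.random.default_rng(1)
    n = 150
    outcome = rng.normal(0, 1, n)
    candidates = {
        "age": 0.4 * outcome + rng.normal(0, 1, n),      # truly related
        "smoking": 0.3 * outcome + rng.normal(0, 1, n),  # truly related
        "eye_color": rng.normal(0, 1, n),                # unrelated noise
    }

    kept = []
    for name, values in candidates.items():
        r, p = stats.pearsonr(values, outcome)  # univariable association
        print(f"{name}: P = {p:.3f}")
        if p < 0.10:
            kept.append(name)
    print("variables carried into the model:", kept)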
Some commonly used kinds of models are logistic regression (for dichotomous outcome variables such as those that occur in case-control studies) and Cox proportional hazards models (for time-to-event studies).
Example

Gastric cancer is the second leading cause of cancer death in the world. Investigators in Europe analyzed data from a cohort recruited from 10 European countries to see whether alcohol was an independent risk factor for stomach cancer (19). They identified nine variables that were known risk factors or potential confounders of the association between the main exposure (alcohol consumption) and disease (stomach cancer): age, study center, sex, education, cigarette smoking, diet, body mass index, physical activity, and, in a subset of patients, serologic evidence of Helicobacter pylori infection. As a limitation, they mentioned that they would have included salt but did not have access to data on salt intake. Modeling was with the Cox proportional hazards model, so they checked that the underlying assumption, that risk does not vary with time, was met. After adjustment for the other variables, heavy but not light alcohol consumption was associated with stomach cancer (hazard ratio 1.65, 95% confidence interval 1.06–2.58); beer was associated, but not wine or liquor.
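To show what fitting such a model looks like in practice, here is a minimal sketch using the Python lifelines package. The small Rossi recidivism dataset that ships with lifelines stands in for a cohort like the one in the example; this is not the investigators' actual analysis.

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    # A packaged time-to-event dataset: 'week' is follow-up time,
    # 'arrest' is the event indicator, and the remaining columns are
    # candidate risk factors.
    rossi = load_rossi()

    cph = CoxPHFitter()
    cph.fit(rossi, duration_col="week", event_col="arrest")
    cph.print_summary()  # hazard ratios with 95% confidence intervals

    # Analogous to the check described in the example: test whether the
    # proportional hazards assumption (risk does not vary with time) holds.
    cph.check_assumptions(rossi)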
Multivariable modeling is an essential strategy for dealing with the joint effects of multiple variables. There is no other way to adjust for or to include many variables at the same time. However, this advantage comes at a price. Models tend to be black boxes, and it is difficult to "get inside" them and understand how they work. Their validity is based on assumptions about the data that may not be met. They are clumsy at recognizing effect modification. An exposure variable may be strongly related to outcome yet not appear in the model because it occurs rarely—and there is little direct information on the statistical power of the model for that variable. Finally, model results are easily affected by quirks in the data, the results of random variation in the characteristics of patients from sample to sample. It has been shown, for example, that a model frequently identified a different set of predictor variables and produced a different ordering of variables on different random samples of the same dataset (20).
For these reasons, the models themselves cannot be taken as a standard of validity and must be validated independently. Usually, this is done by observing whether the results of a model predict what is found in another, independent sample of patients. The results of the first model are considered a hypothesis that is to be tested with new data. If random variation is mainly responsible for the results of the first model, it is unlikely that the same random effects will occur in the validating dataset, too. Other evidence for the validity of a model is its biologic plausibility and its consistency with simpler, more transparent analyses of the data, such as stratified analyses.
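The instability reported in reference 20 is easy to reproduce in miniature. In the sketch below, every candidate variable is pure noise, yet a simple P < 0.10 screen "selects" a different set of variables in each random sample:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    n, n_vars = 100, 10

    for sample in range(1, 4):
        y = rng.normal(size=n)             # outcome: pure noise
        X = rng.normal(size=(n, n_vars))   # candidates: pure noise
        picked = [j for j in range(n_vars)
                  if stats.pearsonr(X[:, j], y)[1] < 0.10]
        print(f"sample {sample}: 'selected' variables = {picked}")

Which variables appear varies from sample to sample, even though none is truly related to the outcome, which is exactly why a model must be tested on independent data.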
BAYESIAN REASONING

An altogether different approach to the information contributed by a study is based on Bayesian inference. We introduced this approach in Chapter 8, where we applied it to the specific case of diagnostic testing. Bayesian inference begins with prior belief about the answer to a research question, analogous to the pretest probability of a diagnostic test. Prior belief is based on everything known about the answer up to the point when new information is contributed by a study. Then, Bayesian inference asks how much the results of the new study change that belief.
Some aspects of Bayesian inference are compelling. Individual studies do not take place in an information vacuum; rather, they are in the context of all other information available at the time. Starting each study from the null hypothesis—that there is no effect—is unrealistic because something is already known about the answer to the question before the study is even begun. Moreover, results of individual studies change belief in relation to both their scientific strengths and the direction and magnitude of their results. For example, if all preceding studies were negative and the next one, which is of comparable strength, is found to be positive, then an effect is still unlikely. On the other hand, a weak prior belief might be reversed by a single strong study. Finally, with this approach it is not so important whether a small number of hypotheses are identified beforehand, and multiple comparisons are not as worrisome. Rather, prior belief depends on the plausibility of the assertion rather than on whether the assertion was established before or after the study was begun.

Although Bayesian inference is appealing, so far it has been difficult to apply because of poorly developed ways of assigning numbers to prior belief and to the information contributed by a study. Two exceptions are in cumulative summaries of research evidence (Chapter 13) and in diagnostic testing, in which "belief" is prior probability and the new information is expressed as a likelihood ratio. However, Bayesian inference is the conceptual basis for qualitative thinking about cause (see Chapter 12).
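In the diagnostic-testing case, where the numbers are well developed, the updating is simple arithmetic: convert the prior probability to odds, multiply by the likelihood ratio, and convert back to a probability. A minimal sketch:

    def update_belief(prior_prob, likelihood_ratio):
        """Bayes' theorem in odds form: posterior odds = prior odds x LR."""
        prior_odds = prior_prob / (1 - prior_prob)
        posterior_odds = prior_odds * likelihood_ratio
        return posterior_odds / (1 + posterior_odds)

    # A 30% prior probability, updated by evidence with a likelihood
    # ratio of 4, becomes a posterior probability of about 63%.
    print(round(update_belief(0.30, 4), 2))  # 0.63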
Review Questions
Read the following and select the best response.

11.1. A randomized controlled trial of thrombolytic therapy versus angioplasty for acute myocardial infarction finds no difference in the main outcome, survival to discharge from hospital. The investigators explored whether this was also true for subgroups of patients defined by age, number of vessels affected, ejection fraction, comorbidity, and other patient characteristics. Which of the following is not true about this subgroup analysis?

A. Examining subgroups increases the chance of a false-positive (misleading statistically significant) result in one of the comparisons.
B. Examining subgroups increases the chance of a false-negative finding in one of these subgroups, relative to the main result.
C. Subgroup analyses are bad scientific practice and should not be done.
D. Reporting results in subgroups helps clinicians tailor information in the study to individual patients.

11.2. A new drug for hyperlipidemia was compared with placebo in a randomized controlled trial of 10,000 patients. After 2 years, serum cholesterol (the primary outcome) was 238 mg/dL in the group receiving the new drug and 240 mg/dL in the group receiving placebo (P < 0.001). Which of the following best describes the meaning of the P value in this study?

A. Bias is unlikely to account for the observed difference.
B. The difference is clinically important.
C. A difference as big or bigger than what was observed could have arisen by chance one time in 1,000.
D. The results are generalizable to other patients with hyperlipidemia.
E. The statistical power of this study was inadequate.

11.3. In a well-designed clinical trial of treatment for ovarian cancer, remission rate at 1 year is 30% in patients offered a new drug and 20% in those offered a placebo. The P value is 0.4. Which of the following best describes the interpretation of this result?

A. Both treatments are effective.
B. Neither treatment is effective.
C. The statistical power of this study is 60%.
D. The best estimate of treatment effect size is 0.4.
E. There is insufficient information to decide whether one treatment is better than the other.
11.4. In a cohort study, vitamin A intake was found to be a risk factor for hip fracture in women. The relative risk (highest quintile versus lowest quintile) was 1.48, and the 95% confidence interval was 1.05 to 2.07. Which of the following best describes the meaning of this confidence interval?

A. The association is not statistically significant at the P < 0.05 level.
B. A strong association between vitamin A intake and hip fracture was established.
C. The statistical power of this study is 95%.
D. There is a 95% chance that a range of relative risks as low as 1.05 and as high as 2.07 includes the true risk.
E. Bias is an unlikely explanation for this result.

11.5. Which of the following is the best reason for calling P ≤ 0.05 "statistically significant"?

A. It definitively rules out a false-positive conclusion.
B. It is an arbitrarily chosen but useful rule of thumb.
C. It rules out a type II error.
D. It is a way of establishing a clinically important effect size.
E. Larger or smaller P values do not provide useful information.

11.6. Which of the following is the biggest advantage of multivariable modeling?

A. Models can control for many variables simultaneously.
B. Models do not depend on assumptions about the data.
C. There is a standardized and reproducible approach to modeling.
D. Models make stratified analyses unnecessary.
E. Models can control for confounding in large randomized controlled trials.

11.7. A trial randomizes 10,000 patients to two treatment groups of similar size, one offered chemoprevention and the other usual care. How frequently must a side effect of chemoprevention occur for the study to have a good chance of observing at least one such side effect?

A. 1/5,000
B. 1/10,000
C. 1/15,000
D. 1/20,000
E. 1/30,000

11.8. Which of the following is least related to the statistical power of a study with a dichotomous outcome?

A. Effect size
B. Type I error
C. Rate of outcome events in the control group
D. Type II error
E. The statistical test used

11.9. Which of the following best characterizes the application of Bayesian reasoning to a clinical trial?

A. Prior belief in the comparative effectiveness of treatment is guided by equipoise.
B. The results of each new study change belief in treatment effect from what it was before the study.
C. Bayesian inference is an alternative way of calculating a P value.
D. Bayesian reasoning is based, like inferential statistics, on the null hypothesis.
E. Bayesian reasoning depends on a well-defined hypothesis before the study is begun.

11.10. In a randomized trial of intensive glucose lowering in type 2 diabetes, death rate was higher in the intensively treated patients: hazard ratio 1.22 (95% confidence interval 1.01–1.46). Which of the following is not true about this study?

A. The results are consistent with almost no effect.
B. The best estimate of treatment effect is a hazard ratio of 1.22.
C. If a P value were calculated, the results would be statistically significant at the 0.05 level.
D. A P value would provide as much information as the confidence interval.
E. The results are consistent with 46% higher death rates in the intensively treated patients.

Answers are in Appendix A.
REFERENCES

1. Fisher R, in Proceedings of the Society for Psychical Research, 1929; quoted in Salsburg D. The Lady Tasting Tea. New York: Henry Holt and Co; 2001.
2. Johnson AF. Beneath the technological fix: outliers and probability statements. J Chronic Dis 1985;38:957–961.
3. Courtney C, Farrell D, Gray R, et al., for the AD2000 Collaborative Group. Long-term donepezil treatment in 565 patients with Alzheimer's disease (AD2000): randomized double-blind trial. Lancet 2004;363:2105–2115.
4. Bernard SA, Gray TW, Buist MD, et al. Treatment of comatose survivors of out-of-hospital cardiac arrest with induced hypothermia. N Engl J Med 2002;346:557–563.
5. The AIM-HIGH Investigators. Niacin in patients with low HDL cholesterol levels receiving intensive statin therapy. N Engl J Med 2011;365:2255–2267.
6. Peto R, Pike MC, Armitage P, et al. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. I. Introduction and design. Br J Cancer 1976;34:585–612.
7. Lind J. A Treatise on Scurvy. Edinburgh: Sands, Murray and Cochran; 1753. Quoted by Thomas DP. J Royal Society Med 1997;80:50–54.
8. Weinstein SJ, Yu K, Horst RL, et al. Serum 25-hydroxyvitamin D and risks of colon and rectal cancer in Finnish men. Am J Epidemiol 2011;173:499–508.
9. Rossouw JE, Anderson GL, Prentice RL, et al., for the Women's Health Initiative Investigators. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the Women's Health Initiative randomized controlled trial. JAMA 2002;288:321–333.
10. Braitman LE. Confidence intervals assess both clinical significance and statistical significance. Ann Intern Med 1991;114:515–517.
11. Mai PL, Wideroff L, Greene MH, et al. Prevalence of family history of breast, colorectal, prostate, and lung cancer in a population-based study. Public Health Genomics 2010;13:495–503.
12. Venge P, Johnson N, Lindahl B, et al. Normal plasma levels of cardiac troponin I measured by the high-sensitivity cardiac troponin I access prototype assay and the impact on the diagnosis of myocardial ischemia. J Am Coll Cardiol 2009;54:1165–1172.
13. McCormack K, Scott N, Go PMNYH, et al. Laparoscopic techniques versus open techniques for inguinal hernia repair. Cochrane Database Syst Rev 2003;(1):CD001785.
14. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med 1994;121:200–206.
15. Sackett DL, Haynes RB, Gent M, et al. Compliance. In: Inman WHW, ed. Monitoring for Drug Safety. Lancaster, UK: MTP Press; 1980.
16. Armitage P. Importance of prognostic factors in the analysis of data from clinical trials. Control Clin Trials 1981;1:347–353.
17. Hunter DJ, Kraft P. Drinking from the fire hose—statistical issues in genomewide association studies. N Engl J Med 2007;357:436–439.
18. Connolly SJ, Eikelboom J, Joyner C, et al. Apixaban in patients with atrial fibrillation. N Engl J Med 2011;364:806–817.
19. Duell EJ, Travier N, Lujan-Barroso L, et al. Alcohol consumption and gastric cancer risk in the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort. Am J Clin Nutr 2011;94:1266–1275.
20. Diamond GA. Future imperfect: the limitations of clinical prediction models and the limits of clinical prediction. J Am Coll Cardiol 1989;14:12A–22A.