ILLUSTRATION BY DAVID PARKINS
When was the last time you heard a seminar speaker claim there was 'no difference' between two groups because the difference was 'statistically non-significant'?

If your experience matches ours, there's a good chance that this happened at the last talk you attended. We hope that at least someone in the audience was perplexed if, as frequently happens, a plot or table showed that there actually was a difference.

How do statistics so often lead scientists to deny differences that those not educated in statistics can plainly see? For several generations, researchers have been warned that a statistically non-significant result does not 'prove' the null hypothesis (the hypothesis that there is no difference between groups or no effect of a treatment on some measured outcome)1. Nor do statistically significant results 'prove' some other hypothesis. Such misconceptions have famously warped the literature with overstated claims and, less famously, led to claims of conflicts between studies where none exists.

We have some proposals to keep scientists from falling prey to these misconceptions.

PERVASIVE PROBLEM
Let's be clear about what must stop: we should never conclude there is 'no difference' or 'no association' just because a P value is larger than a threshold such as 0.05
21 March 2019 | Vol 567 | Nature | 305
© 2019 Springer Nature Limited. All rights reserved.
COMMENT
or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

For example, consider a series of analyses of unintended effects of anti-inflammatory drugs2. Because their results were statistically non-significant, one set of researchers concluded that exposure to the drugs was "not associated" with new-onset atrial fibrillation (the most common disturbance to heart rhythm) and that the results stood in contrast to those from an earlier study with a statistically significant outcome.

Now, let's look at the actual data. The researchers describing their statistically non-significant results found a risk ratio of 1.2 (that is, a 20% greater risk in exposed patients relative to unexposed ones). They also found a 95% confidence interval that spanned everything from a trifling risk decrease of 3% to a considerable risk increase of 48% (P = 0.091; our calculation). The researchers from the earlier, statistically significant, study found the exact same risk ratio of 1.2. That study was simply more precise, with an interval spanning from 9% to 33% greater risk (P = 0.0003; our calculation).

It is ludicrous to conclude that the statistically non-significant results showed "no association", when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us (see 'Beware false conclusions').

These and similar errors are widespread. Surveys of hundreds of articles have found that statistically non-significant results are interpreted as indicating 'no difference' or 'no effect' in around half (see 'Wrong interpretations' and Supplementary Information).

In 2016, the American Statistical Association released a statement in The American Statistician warning against the misuse of statistical significance and P values. The issue also included many commentaries on the subject. This month, a special issue in the same journal attempts to push these reforms further. It presents more than 40 papers on 'Statistical inference in the 21st century: a world beyond P < 0.05'. The editors introduce the collection with the caution "don't say 'statistically significant'"3. Another article4 with dozens of signatories also calls on authors and journal editors to disavow those terms.

We agree, and call for the entire concept of statistical significance to be abandoned.

"Eradicating categorization will help to halt overconfident claims, unwarranted declarations of 'no difference' and absurd statements about 'replication failure'."

We are far from alone. When we invited others to read a draft of this comment and sign their names if they concurred with our message, 250 did so within the first 24 hours. A week later, we had more than 800 signatories — all checked for an academic affiliation or other indication of present or past work in a field that depends on statistical modelling (see the list and final count of signatories in the Supplementary Information). These include statisticians, clinical and medical researchers, biologists and psychologists from more than 50 countries and across all continents except Antarctica. One advocate called it a "surgical strike against thoughtless testing of statistical significance" and "an opportunity to register your voice in favour of better scientific practices".

We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis5.

QUIT CATEGORIZING
The trouble is human and cognitive more than it is statistical: bucketing results into 'statistically significant' and 'statistically non-significant' makes people think that the items assigned in that way are categorically different6–8. The same problems are likely to arise under any proposed statistical alternative that involves dichotomization, whether frequentist, Bayesian or otherwise.

Unfortunately, the false belief that crossing the threshold of statistical significance is enough to show that a result is 'real' has led scientists and journal editors to privilege such results, thereby distorting the literature. Statistically significant estimates are biased upwards in magnitude and potentially to a large degree, whereas statistically non-significant estimates are biased downwards in magnitude. Consequently, any discussion that focuses on estimates chosen for their significance will be biased. On top of this, the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.

The pre-registration of studies and a commitment to publish all results of all analyses can do much to mitigate these issues. However, even results from pre-registered studies can be biased by decisions invariably left open in the analysis plan9. This occurs even with the best of intentions.

Again, we are not advocating a ban on P values, confidence intervals or other statistical measures — only that we should
[Chart. Source: V. Amrhein et al.]
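The "our calculation" P values in the drug example can be checked with a short script. This is a sketch that assumes the reported 95% intervals were built by normal theory on the log scale (the usual construction for risk ratios); the interval endpoints 0.97 to 1.48 and 1.09 to 1.33 are read off the percentages quoted in the text.

```python
import math

def p_value_from_ratio_ci(rr, lo, hi):
    """Recover the two-sided P value for a risk ratio from its 95% CI.

    Assumes the CI was computed as exp(log(rr) +/- 1.96 * SE),
    i.e. normal theory on the log scale.
    """
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
    z = math.log(rr) / se
    # Two-sided P value from the standard normal CDF, via erf.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Study described as 'not associated': RR 1.2, 95% CI 0.97 to 1.48
p1 = p_value_from_ratio_ci(1.2, 0.97, 1.48)
# Earlier 'significant' study: identical RR 1.2, 95% CI 1.09 to 1.33
p2 = p_value_from_ratio_ci(1.2, 1.09, 1.33)
print(round(p1, 3), round(p2, 4))  # ≈ 0.091 and 0.0003, matching the text
```

The same point estimate with a narrower interval simply yields a larger z and a smaller P; nothing about the two studies conflicts.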
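The claim above that statistically significant estimates are biased upwards in magnitude can be illustrated with a small simulation, a sketch not taken from the article: the true effect of 1.0 and standard error of 1.0 are arbitrary illustrative choices. Repeatedly "run" the same study and keep only the replications that cross the conventional threshold.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 1.0  # hypothetical true effect
SE = 1.0           # hypothetical standard error of each study's estimate

# Simulate many replications of the same study.
estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]

# Keep only results that cross the conventional threshold (|z| > 1.96).
significant = [e for e in estimates if abs(e / SE) > 1.96]

print(statistics.mean(estimates))    # close to the true effect, 1.0
print(statistics.mean(significant))  # noticeably larger: the significance filter
```

Averaging all replications recovers the true effect; averaging only the "significant" ones overstates it, which is why a literature that privileges threshold-crossing results is biased.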
Whether a P value is small or large, caution [...]
[...] and data tabulation will be more detailed [...]