
STATISTICAL ERRORS

P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume.

BY REGINA NUZZO

For a brief moment in 2010, Matt Motyl was on the brink of scientific glory: he had discovered that extremists quite literally see the world in black and white.

The results were “plain as day”, recalls Motyl, a psychology PhD student at the University of Virginia in Charlottesville. Data from a study of nearly 2,000 people seemed to show that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists. “The hypothesis was sexy,” he says, “and the data provided clear support.” The P value, a common index for the strength of evidence, was 0.01 — usually interpreted as ‘very significant’. Publication in a high-impact journal seemed within Motyl’s grasp.

But then reality intervened. Sensitive to controversies over reproducibility, Motyl and his adviser, Brian Nosek, decided to replicate the study. With extra data, the P value came out as 0.59 — not even close to the conventional level of significance, 0.05. The effect had disappeared, and with it, Motyl’s dreams of youthful fame1.

It turned out that the problem was not in the data or in Motyl’s analyses. It lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume. “P values are not doing their job, because they can’t,” says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois, and a frequent critic of the way statistics are used.

For many scientists, this is especially worrying in light of the reproducibility concerns. In 2005, epidemiologist John Ioannidis of Stanford University in California suggested that most published findings are false2; since then, a string of high-profile replication problems has forced scientists to rethink how they evaluate results.

At the same time, statisticians are looking for better ways of thinking about data, to help scientists to avoid missing important information or acting on false alarms. “Change your statistical philosophy and all of a sudden different things become important,” says Steven Goodman, a physician and statistician at Stanford. “Then ‘laws’ handed down from God are no longer handed down from God. They’re actually handed down to us by ourselves, through the methodology we adopt.”

OUT OF CONTEXT

P values have always had critics. In their almost nine decades of existence, they have been likened to mosquitoes (annoying and impossible to swat away), the emperor’s new clothes (fraught with obvious problems that everyone ignores) and the tool of a “sterile intellectual rake” who ravishes science but leaves it with no progeny3. One researcher suggested rechristening the methodology “statistical hypothesis inference testing”3, presumably for the acronym it would yield.

The irony is that when UK statistician Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look.

PROBABLE CAUSE

A P value measures whether an observed result can be attributed to chance. But it cannot answer a researcher’s real question: what are the odds that a hypothesis is correct? Those odds depend on how strong the result was and, most importantly, on how plausible the hypothesis is in the first place.

Before the experiment, the plausibility of the hypothesis — the odds of it being true — can be estimated from previous experiments, conjectured mechanisms and other expert knowledge. A measured P value of 0.05 is conventionally deemed ‘statistically significant’; a value of 0.01 is considered ‘very significant’. After the experiment, a small P value can make a hypothesis more plausible, but the difference may not be dramatic. Three examples:

                                       Chance of a real effect
                                 Before the     After P = 0.05   After P = 0.01
                                 experiment
THE LONG SHOT (19-to-1 against)      5%              11%              30%
THE TOSS-UP (1-to-1 odds)           50%              71%              89%
THE GOOD BET (9-to-1 in favour)     90%              96%              99%

Source: T. Sellke et al. Am. Stat. 55, 62–71 (2001)

The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a ‘null hypothesis’ that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil’s advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.
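For readers who think in code, Fisher’s recipe maps directly onto a permutation test. The sketch below is a minimal illustration — the groups, values and sample sizes are invented, not Motyl’s data — of the logic just described: assume the null hypothesis, shuffle the group labels, and count how often chance alone produces a difference at least as extreme as the observed one.

```python
# Minimal sketch of Fisher's recipe as a permutation test.
# All numbers are illustrative, not data from any real study.
import random

group_a = [4.1, 5.0, 4.8, 5.6, 4.9, 5.2]   # hypothetical scores, group A
group_b = [4.0, 4.3, 3.9, 4.5, 4.2, 4.1]   # hypothetical scores, group B

def mean(xs):
    return sum(xs) / len(xs)

observed = abs(mean(group_a) - mean(group_b))

# Null world: group labels are meaningless, so shuffle them and see how
# often random relabelling yields a difference at least as extreme.
pooled = group_a + group_b
n_extreme, n_shuffles = 0, 100_000
for _ in range(n_shuffles):
    random.shuffle(pooled)
    a, b = pooled[:len(group_a)], pooled[len(group_a):]
    if abs(mean(a) - mean(b)) >= observed:
        n_extreme += 1

p_value = n_extreme / n_shuffles   # the P value: how often chance suffices
print(f"P = {p_value:.4f}")
```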
For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher’s bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.

But while the rivals feuded — Neyman called some of Fisher’s work mathematically “worse than useless”; Fisher called Neyman’s approach “childish” and “horrifying [for] intellectual freedom in the west” — other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher’s easy-to-calculate P value into Neyman and Pearson’s reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as ‘statistically significant’, for example. “The P value was never meant to be used the way it’s used today,” says Goodman.

WHAT DOES IT ALL MEAN?

One result is an abundance of confusion about what the P value means4. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is.

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation5, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result6,7. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.
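These conversions can be made concrete in a few lines of code. The sketch below assumes the ‘minimum Bayes factor’ bound from the figure’s source (Sellke et al.): for P below 1/e, the evidence a P value can muster against the null hypothesis is at most −e·P·ln(P). Combining that bound with the prior odds of a real effect reproduces the floors quoted above (11% for P = 0.01 and 29% for P = 0.05 at even prior odds) and the post-experiment numbers in ‘Probable cause’.

```python
# Rule-of-thumb conversion from a P value to a minimum false-alarm
# probability, using the Sellke et al. minimum Bayes factor -e*P*ln(P).
import math

def false_alarm_floor(p, prior_real):
    """Lower bound on P(no real effect | data) for a given P value."""
    min_bf = -math.e * p * math.log(p)          # valid for p < 1/e
    prior_odds_null = (1 - prior_real) / prior_real
    post_odds_null = min_bf * prior_odds_null   # Bayes' rule, in odds form
    return post_odds_null / (1 + post_odds_null)

for label, prior in [("long shot", 0.05), ("toss-up", 0.50), ("good bet", 0.90)]:
    for p in (0.05, 0.01):
        floor = false_alarm_floor(p, prior)
        print(f"{label:9s} (prior {prior:.0%}), P = {p:.2f}: "
              f"false alarm >= {floor:.0%}, real effect <= {1 - floor:.0%}")
```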
Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed8 that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking, ‘How much of an effect is there?’, not ‘Is there an effect?’”

(NATURE.COM — For more on statistics, see: go.nature.com/xlj9lr)


Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall p-value would be less than .05”, and “She is a p-hacker, she always monitors data while it is being collected.”

Such practices have the effect of turning discoveries from exploratory studies — which should be treated with scepticism — into what look like sound confirmations but vanish on replication. Simonsohn’s simulations have shown9 that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%. P-hacking is especially likely, he says, in today’s environment of studies that chase small effects hidden in noisy data. It is tough to pin down how widespread the problem is, but Simonsohn has the sense that it is serious. In an analysis10, he found evidence that many published psychology papers report P values that cluster suspiciously around 0.05, just as would be expected if researchers fished for significant P values until they found one.
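A toy simulation makes the mechanism concrete. The sketch below is not a reproduction of Simonsohn’s simulations9 — the particular freedoms here (a second outcome measure, one peek at ‘promising’ data before collecting more) are invented for illustration — but both groups are drawn from the same distribution, so every ‘discovery’ is a false positive, and even this mild hacking pushes the rate well beyond the nominal 5%. Stacking more such choices is how the rate can approach 60%.

```python
# Toy P-hacking simulation: the null is true, yet flexible analysis
# 'finds' effects far more often than 5% of the time.
import math, random

def p_two_sample(a, b):
    """Two-sided z-test P value (unit variance, as the data are simulated)."""
    z = (sum(a)/len(a) - sum(b)/len(b)) / math.sqrt(1/len(a) + 1/len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def one_hacked_study(n=20, extra=10):
    def draws(k):
        return [random.gauss(0, 1) for _ in range(k)]
    # Two outcome measures per group, all pure noise.
    for a, b in ((draws(n), draws(n)), (draws(n), draws(n))):
        p = p_two_sample(a, b)
        if p < 0.05:
            return True                    # report whichever measure 'worked'
        if p < 0.10:                       # 'promising'? peek, then add data
            a += draws(extra)
            b += draws(extra)
            if p_two_sample(a, b) < 0.05:
                return True
    return False

trials = 20_000
rate = sum(one_hacked_study() for _ in range(trials)) / trials
print(f"false-positive rate with mild P-hacking: {rate:.1%}")   # >> 5%
```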
NUMBERS GAME

Despite the criticisms, reform has been slow. “The basic framework of statistics has been virtually unchanged since Fisher, Neyman and Pearson introduced it,” says Goodman. John Campbell, a psychologist now at the University of Minnesota in Minneapolis, bemoaned the issue in 1982, when he was editor of the Journal of Applied Psychology: “It is almost impossible to drag authors away from their p-values, and the more zeroes after the decimal point, the harder people cling to them”11. In 1989, when Kenneth Rothman of Boston University in Massachusetts started the journal Epidemiology, he did his best to discourage P values in its pages. But he left the journal in 2001, and P values have since made a resurgence.

Ioannidis is currently mining the PubMed database for insights into how authors across many fields are using P values and other statistical evidence. “A cursory look at a sample of recently published papers,” he says, “is convincing that P values are still very, very popular.”

Any reform would need to sweep through an entrenched culture. It would have to change how statistics is taught, how data analysis is done and how results are reported and interpreted. But at least researchers are admitting that they have a problem, says Goodman. “The wake-up call is that so many of our published findings are not true.” Work by researchers such as Ioannidis shows the link between theoretical statistical complaints and actual difficulties, says Goodman. “The problems that statisticians have predicted are exactly what we’re now seeing. We just don’t yet have all the fixes.”

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.
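A minimal sketch of what that reporting looks like, on invented data and using a normal approximation for the interval: the estimate itself, its 95% confidence interval and a standardized effect size, instead of a bare verdict of ‘significant’.

```python
# Report the effect, not just the verdict: difference in means,
# 95% confidence interval and Cohen's d. All data are illustrative.
from statistics import NormalDist, mean, stdev

group_a = [5.48, 5.60, 5.71, 5.32, 5.55, 5.49, 5.62, 5.44]
group_b = [5.64, 5.70, 5.58, 5.81, 5.66, 5.59, 5.75, 5.63]

diff = mean(group_b) - mean(group_a)              # the effect itself
se = (stdev(group_a)**2 / len(group_a) + stdev(group_b)**2 / len(group_b)) ** 0.5
z = NormalDist().inv_cdf(0.975)                   # ~1.96 for a 95% interval
ci = (diff - z * se, diff + z * se)

pooled_sd = ((stdev(group_a)**2 + stdev(group_b)**2) / 2) ** 0.5
cohens_d = diff / pooled_sd                       # standardized effect size

print(f"difference = {diff:.2f} points, "
      f"95% CI [{ci[0]:.2f}, {ci[1]:.2f}], d = {cohens_d:.2f}")
```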
Many statisticians also advocate replacing the P value with methods that take advantage of Bayes’ rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity — something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
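At its heart this is a one-line update rule: posterior odds equal prior odds times the likelihood ratio of the evidence. The sketch below, with invented numbers, shows a plausibility being revised as successive pieces of evidence arrive — supportive evidence raises it, unsupportive evidence lowers it, and a long-shot prior keeps one striking result from looking like certainty.

```python
# Bayes' rule in odds form: belief revised by each piece of evidence.
# The prior and likelihood ratios below are invented for illustration.
def update(prior, likelihood_ratio):
    """Posterior P(H) from prior P(H) and P(evidence|H) / P(evidence|not H)."""
    post_odds = (prior / (1 - prior)) * likelihood_ratio
    return post_odds / (1 + post_odds)

belief = 0.10                        # a long-shot hypothesis to start with
for lr in (3.0, 3.0, 0.5):           # two supportive results, one against
    belief = update(belief, lr)
    print(f"evidence with likelihood ratio {lr}: P(hypothesis) -> {belief:.2f}")
```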
Others argue for a more ecumenical approach, encouraging researchers to try multiple methods on the same data set. Stephen Senn, a statistician at the Centre for Public Health Research in Luxembourg City, likens this to using a floor-cleaning robot that cannot find its own way out of a corner: any data-analysis method will eventually hit a wall, and some common sense will be needed to get the process moving again. If the various methods come up with different answers, he says, “that’s a suggestion to be more creative and try to find out why”, which should lead to a better understanding of the underlying reality.

Simonsohn argues that one of the strongest protections for scientists is to admit everything. He encourages authors to brand their papers ‘P-certified, not P-hacked’ by including the words: “We report how we determined our sample size, all data exclusions (if any), all manipulations and all measures in the study.” This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or ‘preregistered replication’, says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

More broadly, researchers need to realize the limits of conventional statistics, Goodman says. They should instead bring into their analysis elements of scientific judgement about the plausibility of a hypothesis and study limitations that are normally banished to the discussion section: results of identical or similar experiments, proposed mechanisms, clinical knowledge and so on. Statistician Richard Royall of Johns Hopkins Bloomberg School of Public Health in Baltimore, Maryland, said that there are three questions a scientist might want to ask after a study: ‘What is the evidence?’ ‘What should I believe?’ and ‘What should I do?’ One method cannot answer all these questions, Goodman says: “The numbers are where the scientific discussion should start, not end.” ■ SEE EDITORIAL P. 131

Regina Nuzzo is a freelance writer and an associate professor of statistics at Gallaudet University in Washington DC.

1. Nosek, B. A., Spies, J. R. & Motyl, M. Perspect. Psychol. Sci. 7, 615–631 (2012).
2. Ioannidis, J. P. A. PLoS Med. 2, e124 (2005).
3. Lambdin, C. Theory Psychol. 22, 67–90 (2012).
4. Goodman, S. N. Ann. Internal Med. 130, 995–1004 (1999).
5. Goodman, S. N. Epidemiology 12, 295–297 (2001).
6. Goodman, S. N. Stat. Med. 11, 875–879 (1992).
7. Gorroochurn, P., Hodge, S. E., Heiman, G. A., Durner, M. & Greenberg, D. A. Genet. Med. 9, 325–331 (2007).
8. Cacioppo, J. T., Cacioppo, S., Gonzaga, G. C., Ogburn, E. L. & VanderWeele, T. J. Proc. Natl Acad. Sci. USA 110, 10135–10140 (2013).
9. Simmons, J. P., Nelson, L. D. & Simonsohn, U. Psychol. Sci. 22, 1359–1366 (2011).
10. Simonsohn, U., Nelson, L. D. & Simmons, J. P. J. Exp. Psychol. http://dx.doi.org/10.1037/a0033242 (2013).
11. Campbell, J. P. J. Appl. Psych. 67, 691–700 (1982).

