Statistical Errors
For a brief moment in 2010, Matt Motyl was … It turned out that the problem was not in … Goodman, a physician and statistician at Stanford …
150 | Nature | Vol. 506 | 13 February 2014
© 2014 Macmillan Publishers Limited. All rights reserved
[Figure: PROBABLE CAUSE. A P value measures whether an observed result can be attributed to chance. But it cannot answer a researcher's real question: what are the odds that a hypothesis is correct? Those odds depend on how strong the result was and, most importantly, on how plausible the hypothesis is in the first place. Legend: chance of real effect versus chance of no real effect. R. Nuzzo; source: T. Sellke et al. Am. Stat. 55, 62–71 (2001).]
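The conversions behind the 'Probable cause' figure come from its source, Sellke et al., whose bound says that for P < 1/e the odds in favour of the null hypothesis can be no smaller than −e·p·ln(p) times the prior odds. A minimal Python sketch, assuming 50:50 prior odds of a real effect (the function name and default are ours, not from the article), reproduces the figure's rule-of-thumb numbers:

```python
import math

def min_false_alarm_prob(p, prior_null=0.5):
    """Lower bound on the probability that a 'significant' result is a
    false alarm, given P value p, via the -e*p*ln(p) bound of
    Sellke, Bayarri & Berger (Am. Stat. 55, 62-71; 2001).
    Valid for p < 1/e; prior_null is the prior probability of no real effect.
    """
    assert 0 < p < 1 / math.e
    # Best-case Bayes factor for the alternative: posterior odds of the
    # null are at least prior odds times -e * p * ln(p).
    bayes_factor_bound = -math.e * p * math.log(p)
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds_null = prior_odds * bayes_factor_bound
    return posterior_odds_null / (1 + posterior_odds_null)

print(round(min_false_alarm_prob(0.01), 3))  # 0.111 -> "at least 11%"
print(round(min_false_alarm_prob(0.05), 3))  # 0.289 -> "at least 29%"
```

A lower prior probability of a real effect (a more implausible hypothesis) pushes these minimum false-alarm probabilities higher still, which is the figure's central point.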
old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a ‘null hypothesis’ that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil’s advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested Fisher, the greater the likelihood that the straw-man null hypothesis was false.

For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. This movement was spearheaded in the late 1920s by Fisher’s bitter rivals, Polish mathematician Jerzy Neyman and UK statistician Egon Pearson, who introduced an alternative framework for data analysis that included statistical power, false positives, false negatives and many other concepts now familiar from introductory statistics classes. They pointedly left out the P value.

But while the rivals feuded — Neyman called some of Fisher’s work mathematically “worse than useless”; Fisher called Neyman’s approach “childish” and “horrifying [for] intellectual freedom in the west” — other researchers lost patience and began to write statistics manuals for working scientists. And because many of the authors were non-statisticians without a thorough understanding of either approach, they created a hybrid system that crammed Fisher’s easy-to-calculate P value into Neyman and Pearson’s reassuringly rigorous rule-based system. This is when a P value of 0.05 became enshrined as ‘statistically significant’, for example. “The P value was never meant to be used the way it’s used today,” says Goodman.

WHAT DOES IT ALL MEAN?

One result is an abundance of confusion about what the P value means4. Consider Motyl’s study about political extremists. Most scientists would look at his original P value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong. The P value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backwards and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place. To ignore this would be like waking up with a headache and concluding that you have a rare brain tumour — possible, but so unlikely that it requires a lot more evidence to supersede an everyday explanation such as an allergic reaction. The more implausible the hypothesis — telepathy, aliens, homeopathy — the greater the chance that an exciting finding is a false alarm, no matter what the P value is. (For more on statistics, see go.nature.com/xlj9lr.)

These are sticky concepts, but some statisticians have tried to provide general rule-of-thumb conversions (see ‘Probable cause’). According to one widely used calculation5, a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a P value of 0.05 raises that chance to at least 29%. So Motyl’s finding had a greater than one in ten chance of being a false alarm. Likewise, the probability of replicating his original result was not 99%, as most would assume, but something closer to 73% — or only 50%, if he wanted another ‘very significant’ result6,7. In other words, his inability to replicate the result was about as surprising as if he had called heads on a coin toss and it had come up tails.

Critics also bemoan the way that P values can encourage muddled thinking. A prime example is their tendency to deflect attention from the actual size of an effect. Last year, for example, a study of more than 19,000 people showed8 that those who meet their spouses online are less likely to divorce (p < 0.002) and more likely to have high marital satisfaction (p < 0.001) than those who meet offline (see Nature http://doi.org/rcg; 2013). That might have sounded impressive, but the effects were actually tiny: meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale. To pounce on tiny P values and ignore the larger question is to fall prey to the “seductive certainty of significance”, says Geoff Cumming, an emeritus psychologist at La Trobe University in Melbourne, Australia. But significance is no indicator of practical relevance, he says: “We should be asking,
‘How much of an effect is there?’, not ‘Is there an effect?’”

Perhaps the worst fallacy is the kind of self-deception for which psychologist Uri Simonsohn of the University of Pennsylvania and his colleagues have popularized the term P-hacking; it is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. “P-hacking,” says Simonsohn, “is trying multiple things until you get the desired result” — even unconsciously. It may be the first statistical term to rate a definition in the online Urban Dictionary, where the usage examples are telling: “That finding seems to have been obtained through p-hacking, the authors dropped one of the conditions so that the overall …

… how statistics is taught, how data analysis is done and how results are reported and interpreted. But at least researchers are admitting that they have a problem, says Goodman. “The wake-up call is that so many of our published findings are not true.” Work by researchers such as Ioannidis shows the link between theoretical statistical complaints and actual difficulties, says Goodman. “The problems that statisticians have predicted are exactly what we’re now seeing. We just don’t yet have all the fixes.”

… in the study.” This disclosure will, he hopes, discourage P-hacking, or at least alert readers to any shenanigans and allow them to judge accordingly.

A related idea that is garnering attention is two-stage analysis, or ‘preregistered replication’, says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these …
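Simonsohn's definition of P-hacking, trying multiple things until you get the desired result, can be illustrated with a small simulation. This is a sketch, not an analysis from the article: the group sizes, the choice of five outcome measures and the z-test approximation are our assumptions. Even when no real effect exists anywhere, testing several outcomes and reporting only the best one pushes the false-alarm rate well above the nominal 5%.

```python
import math
import random

def two_sided_p(xs, ys):
    """Approximate two-sided P value for a difference in two group means
    (z test with sample variances; adequate for this n = 20 illustration)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))

random.seed(1)
n_studies, n_outcomes, hits = 2000, 5, 0
for _ in range(n_studies):
    # Null world: both groups are drawn from the same distribution, yet the
    # "researcher" tests five outcome measures and keeps the smallest P value.
    best_p = min(
        two_sided_p([random.gauss(0, 1) for _ in range(20)],
                    [random.gauss(0, 1) for _ in range(20)])
        for _ in range(n_outcomes)
    )
    hits += best_p < 0.05
print(hits / n_studies)  # well above the nominal 0.05
```

The same inflation arises from any multiplied flexibility: dropping conditions, trying different exclusions, or checking the data repeatedly, which is exactly why the disclosure and preregistration ideas above aim to pin the analysis down in advance.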