Current Medicine Research and Practice 8 (2018) 222–229

Contents lists available at ScienceDirect

Current Medicine Research and Practice

journal homepage: www.elsevier.com/locate/cmrp

Review Article

Why the p-value is under fire?

R.L. Sapra a, *, Samiran Nundy b
a Biostatistician, Sir Ganga Ram Hospital, New Delhi, India
b Department of Surgical Gastroenterology & Liver Transplantation, Sir Ganga Ram Hospital, New Delhi, India

article info

Article history:
Received 4 September 2018
Accepted 15 October 2018
Available online 19 October 2018

abstract

For almost a century after its introduction, the p-value remains the most frequently used inferential tool of statistical science for steering research in various scientific domains. This ubiquitous, powerful statistic is itself now under fire, being surrounded by numerous controversies. We here review some of the important papers which highlight the prevailing myths, misunderstandings and controversies about this statistic. We also discuss recent developments made by the American Statistical Association (ASA) in interpreting the p-value and guiding researchers to avoid confusion. Our paper is based on a search of selective databases and we do not claim it to be an exhaustive review. It specifically aims to help medical researchers/professionals who have little background in this contentious statistic and have been chasing it indiscriminately in publishing significant findings.

© 2018 Sir Ganga Ram Hospital. Published by Elsevier, a division of RELX India, Pvt. Ltd. All rights reserved.

1. Introduction

The formulation and philosophy of hypothesis testing known today was largely developed between 1915 and 1933 by three scientists: R.A. Fisher, J. Neyman and E.S. Pearson. Two schools of thought for the testing of hypotheses emerged from their philosophy, popularly known as 'Fisherian' and 'Neyman-Pearsonian'. The distinction between the two approaches is that the former sets the null hypothesis and tests for its rejection using a significance level, whereas the latter sets the alternative hypothesis along with the a priori effect size and two types of errors (Type I: rejecting the null hypothesis when there is no effect; Type II: not rejecting it when there is an effect), and tests for acceptance of the alternative hypothesis.1,2 Besides this, there were many other basic philosophical differences between the two schools. Lehmann and his coworker argued that the two theories were complementary rather than contradictory. However, Perezgonzalez argued that Fisher's is the closest approach to null hypothesis significance testing (NHST); it is also the philosophy underlying common statistical packages such as SPSS. There are other approaches to hypothesis testing, such as Bayesian hypothesis testing (Lindley, 1965)3 and Wald's (1950)4 decision theory, which are not popular among medical researchers owing to their computational complexity.

The genesis of the p-value is credited to Sir Ronald Fisher,5 a statistician and biologist who devised this statistic to test the null hypothesis. The null hypothesis is a statement of no difference between interventions (or between an intervention and non-intervention) or of no relationship between variables. For example, a researcher might want to compare the effects of two drugs. One can set a hypothesis prior to beginning the study that 'Drug A and Drug B have the same efficacy' or 'There is no difference between the efficacy of Drug A and B'. The researcher then conducts an experiment, taking care of all mandatory requirements such as randomization and blinding, records the data, and under certain statistical assumptions computes the probability (p) of finding a result at least as extreme as the observed result, assuming the null hypothesis is true.6 Fisher advocated 5% as the standard level for this probability (with 1% as a more stringent alternative).1 Thus, if this probability is found to be ≤ 0.05, the researcher rejects the test (null) hypothesis. Fisher perhaps did not anticipate that his seemingly simple statistic would later be misunderstood and misused extensively by researchers for NHST, and would be surrounded by a number of controversies. The major controversy is that it is largely misunderstood and hence misinterpreted. Some researchers argue that there is something wrong with the p-value itself, whereas others argue that its weak statistical standard (0.05) leads to a reproducibility crisis of findings. Some critics have questioned its arbitrary cut-off (usually 0.05) and its dichotomous nature, resulting in widespread publication bias in the scientific literature. A few have even challenged its validity and argued for alternatives. There are critics who complain of its 'dancing attitude'.

* Corresponding author.
E-mail address: saprarl@gmail.com (R.L. Sapra).

https://doi.org/10.1016/j.cmrp.2018.10.003
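The Drug A versus Drug B workflow described above (state the null hypothesis, randomize, collect data, compute p) can be sketched in a few lines. This is a minimal illustration on simulated data: the group sizes, means and the choice of a two-sample t-test are our own assumptions for demonstration, not details taken from any study cited here.

```python
# Hypothetical illustration of the NHST workflow: test the null hypothesis
# "Drug A and Drug B have the same efficacy" on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Simulated efficacy scores for two randomized groups. Both are drawn from
# the same distribution, so the null hypothesis is actually true here.
drug_a = rng.normal(loc=50.0, scale=10.0, size=30)
drug_b = rng.normal(loc=50.0, scale=10.0, size=30)

# Two-sample t-test: p is the probability of a result at least as extreme
# as the one observed, computed under the assumption that H0 is true.
t_stat, p_value = stats.ttest_ind(drug_a, drug_b)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Fisher's convention: reject H0 only if p <= 0.05.
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```

Note that the p-value is computed assuming H0 is true; it says nothing, by itself, about the probability that H0 is true.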
To share the flavor of the debate, we give below quotes or titles of some of the important papers, out of a large number published in high-impact journals:

1. "Mindless Statistics",7
2. "The Difference Between 'Significant' and 'Not Significant' is not Itself Statistically Significant",8
3. "A Dirty Dozen: Twelve P-Value Misconceptions",9
4. "Replication and p Intervals: p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better",10 "It's science's dirtiest secret: The 'scientific method' of testing hypotheses by statistical analysis stands on a flimsy foundation.",11
5. "Null Misinterpretation in Statistical Testing and its Impact on Health Risk Assessment",12
6. "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant",13
7. "Bad Statistical Practice in Pharmacology (and Other Basic Biomedical Disciplines): You Probably Don't Know P",14
8. "Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative",15
9. "The primary product of a research inquiry is one or more measures of effect size, not P values",16
10. "Why most published research findings are false?",17
11. "The Statistical Crisis in Science [online]",18
12. "Statistical techniques for testing hypotheses … have more flaws than Facebook's privacy policies.",19
13. "P values are not doing their job, because they can't",20 says Stephen Ziliak, an economist at Roosevelt University in Chicago, Illinois,
14. "P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume",20
15. "The fickle P value generates irreproducible results",21
16. "Do Not Over (P) Value Your Research Article",22
17. "P-value shake-up proposed".23

The widespread criticism of the inferential statistic used in NHST among the scientific community is thus a matter of serious concern. Some of these observations led David Trafimow, Editor of the Basic and Applied Social Psychology (BASP) journal, to take a radical step and announce a ban of the p-value in future publications submitted to BASP. He says, "Null hypothesis significance testing has been shown to be logically invalid and to provide little information about the actual likelihood of either the null or experimental hypothesis".24 The journal now prefers strong descriptive statistics, including effect sizes, instead of p-values. Its editorial board also made two other major policy decisions, in favor of publishing papers with negative results or "null effects" and those that contradicted already published work.25

These developments concerning null hypothesis significance testing, and the rising criticism and concerns surrounding NHST, built up pressure on the American Statistical Association (ASA) to clarify its position on the p-value to avoid its further misinterpretation, misuse and large-scale confusion among the scientific fraternity. The ASA took up this challenge and finally, in January 2016, released a formal statement clarifying several widely agreed principles underlying the proper use and interpretation of the p-value. Below we discuss "what is wrong" and "what is right" with this inferential, yet highly controversial, statistic.

2. Misunderstandings or misinterpretations of p-value

A major concern expressed by critics is that NHST is misunderstood by many of those who use it.6 The p-value's inferential meaning has been widely and often wildly misconstrued during the last eight decades, and this has been pointed out in a large number of papers and books.9 A paper by Goodman,9 "A Dirty Dozen: Twelve P-Value Misconceptions", reviews a dozen common misinterpretations and their possible consequences. His observations regarding the interpretation of the p-value are relevant when he says, "It is not the fault of researchers that the p-value is difficult to interpret correctly. The man who introduced it as a formal research tool, the statistician and geneticist, R.A. Fisher, could not explain exactly its inferential meaning". A recent study by Greenland et al26 provides a detailed list of 25 misinterpretations of the p-value, confidence intervals, and power, and also guidelines for improving statistical interpretation and reporting. We advise readers to refer to these papers for clarification of their doubts regarding null hypothesis significance testing and the interpretation of their results. The most common misinterpretations of the p-value are:

i) The p-value is the probability that the null hypothesis is true when p > 0.05
ii) 1 minus the p-value is the probability that the alternative hypothesis is true
iii) When p ≤ 0.05, the null hypothesis is false or should be rejected
iv) When the p-value is > 0.05 there is no effect
v) A statistically significant result is also medically significant

The basic reason for misinterpretation of the p-value is the lack of a researcher's understanding of its inferential capability. The p-value is the probability of an observed or more extreme result, assuming that the null hypothesis is true. For example, if a researcher wants to test whether a new drug is better than an old one, he applies the t-statistic and gets a 't-value' of, say, 2.0. Then he looks for the probability of getting a result of 2.0 or more extreme (t ≥ 2.0). This value, shown under the green area in the figure, is that probability and is called the p-value. It is automatically calculated by the software. If the p-value is found to be less than or equal to the cut-off (usually p ≤ 0.05), he straightaway rejects the null hypothesis. However, rejecting the null hypothesis does not imply that it is false. Similarly, when p > 0.05 it is incorrect to say there is 'no effect' or that the 'null hypothesis is true', because we did not know a priori whether the test hypothesis was true or false. That is why we had to conduct the experiment to find out. The p-value was calculated on the assumption that it was true. When p ≤ 0.05, the safer inferential statement we can make is that there is at most a 5% probability that our results are compatible with the null hypothesis of 'no effect'. The ASA's recent interpretation of the p-value clearly says, "P-values can indicate how incompatible the data are with a specified statistical model".27 Here the statistical model also includes the null hypothesis. The lower the p-value, the greater the statistical incompatibility of the data with the null hypothesis, if the underlying assumptions used to calculate the p-value hold (Fig. 1).

Fig. 1. P-value.
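The tail-area computation just described can be reproduced directly. In this sketch the t-value of 2.0 comes from the text, while the 30 degrees of freedom are an assumed example (in practice they depend on the sample sizes); the survival function of scipy's t-distribution gives the area in the tail.

```python
# Sketch of how software computes the p-value from a test statistic: the
# tail area of the reference distribution beyond the observed value.
# df = 30 is an assumed example, not a value from the paper.
from scipy import stats

t_observed = 2.0
df = 30  # assumed degrees of freedom (sample-size dependent)

# One-sided p: P(T >= 2.0) under H0, the shaded tail area in Fig. 1.
p_one_sided = stats.t.sf(t_observed, df)

# Two-sided p: P(|T| >= 2.0), the usual default in statistical packages.
p_two_sided = 2 * stats.t.sf(t_observed, df)

print(f"one-sided p = {p_one_sided:.4f}")  # ~0.027
print(f"two-sided p = {p_two_sided:.4f}")  # ~0.055
```

Note how the two-sided p is double the one-sided tail area, which is why the same t-value can land on either side of the 0.05 threshold depending on how the hypothesis was framed.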
The definition of the p-value does not in any way permit us to calculate the reverse conditional probability, i.e. the probability of a hypothesis given the data. Thus, for example, with a p-value of 0.02 it is wrong to say that there is a 98% probability that the alternative hypothesis is true. The p-value does not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.27

If we critically examine, word by word, the definition of the p-value, it never makes any mention of the strength of an effect or relationship. Thus, if we draw inferences about the strength of relationships or effects on the basis of the magnitude of the p-value, it is simply over-expectation from this poor statistic and certainly beyond its capability. A lower p-value does not ensure that the results will also be medically significant or reflect a higher strength of a relationship or effect. It measures a probability that judges evidence against the null hypothesis.

The p-value is a data-dependent statistic and depends upon the sample size as well, decreasing as the sample size increases. Thus, normally a huge sample size leads to a low p-value unless there is absolutely no effect. One of the most cited examples in the literature in this context pertains to the study recommending aspirin for preventing myocardial infarction (MI), based on more than 22,000 subjects.28 The results indicated a statistically highly significant reduction in MI (p < 0.00001), but surprisingly revealed very small effect sizes (a risk difference of 0.77% with r² = 0.001). However, based on significant 'p-values' aspirin was recommended for MI reduction29 without attention to the extremely low effect sizes, and later on these recommendations were modified. Therefore, one should be cautious that extremely large studies may be more likely to find a formally statistically significant difference even for a trivial effect that may not be medically or biologically meaningful, or different from the null.30–32

The p-value has been used and misused extensively in the scientific literature without an understanding of its meaning. It was developed to test inferential evidence against a hypothesis which we do not know to be true or false. Many researchers have an obsession about this, thinking that 'p-values' will add value to their publications, and therefore include more 'p-values' than necessary in their studies. Statistical software has simplified and automated this task. But each p-value requires a test hypothesis followed by interpretation. It is better to clearly define the hypothesis at the beginning of the study, with a few specific questions whose answers one seeks from the study by applying the 'p-values'. There is absolutely no need to estimate 'p-values' for each and every comparison or association unless they are directly or indirectly associated with the evidence for the hypothesis(es) whose answer one is seeking. NHST essentially requires setting up, with great caution, a test hypothesis prior to the conduct of the study. The test hypothesis should be such that the researcher has doubts about it and does not know its answer. For example, a researcher may be interested in knowing 'Is there any difference in the taste when milk is poured into tea decoction or when tea decoction is added to milk?' The null hypothesis here is that the two methods of preparing tea do not have any effect on its taste. Here, we do not know whether the null hypothesis is true or false.

Kuffner and Walker33 say, "The p-value is innocent, but woefully misunderstood". But at the same time they question the validity of the p-value and ask how a p-value, which is data dependent, can form a valid formal 'decision rule', unlike a type I error which is fixed in advance.

3. Reproducibility crisis and the p-value

A steadily rising concern about the reproducibility of results has been witnessed among the scientific community, especially during the last decade. Recently there has been a doubt among researchers34–37 that it is difficult to replicate published results. However, the nomenclature (reproducibility, replicability, reliability, robustness, and generalizability) as well as the criteria for checking it vary among researchers. "The lexicon of reproducibility to date has been multifarious and ill-defined".38 There is no clear-cut definition of reproducibility as of today. Perhaps the simplest definition is that if some other researcher uses the same methods then he or she should get similar results and can draw the same conclusions.39 This definition was used by the journal Nature while conducting an online survey of its readers on reproducibility. Nature's survey of 1,576 researchers revealed that more than 70% of them had tried and failed to reproduce another scientist's experiments, and more than half had failed to reproduce their own experiments. To generate awareness about irreproducibility, the journal Nature has recently started publishing a series entitled "Challenges in Irreproducible Research" and the "Reproducibility Initiative", a project intended to identify and reward reproducible research.21

The lack of reproducibility of results questions the validity of ongoing as well as earlier research programs and shakes the confidence of their targeted audience, including project-sponsoring authorities. Generally it is a myth amongst researchers that a small p-value has a higher probability of producing significant results if the experiment is replicated. It was shown long ago by Goodman40 that even if the effect is a real one, the probability of repeating a statistically significant result is substantially lower than expected. Goodman40 and others41–46 have shown that 'p-values' overstate the evidence against the null hypothesis, and that is why we sometimes fail to get significant results. The high rate of non-replication is a consequence of the convenient, ill-founded strategy of basing conclusions on a single study assessed on a p-value less than 0.05.47–49 One may ask the question, "Is there something wrong with the p-value, which is generally misunderstood, or are there some other factors, such as poor study designs, faulty statistical analysis, and scientific misconduct, contributing to non-reproducibility?" Johnson50 shows that there is something wrong with the p-value itself and argues mathematically that the use of the 0.05 standard accounts for most of the problems of non-reproducibility in science, even more than other factors such as bias and scientific misconduct. He advocates lowering the level of significance from 0.05 to 0.005 or 0.001, as very few studies fail to replicate results that are based on p-values of 0.005 or smaller. It is hard to believe that nearly one-quarter of studies concluded on the basis of the commonly used statistical threshold (0.05) may be false.51

Let us understand what is wrong with the p-value and why it does not ensure reproducibility. The p-value is estimated from data comprising random variables, hence it is a random variable itself.52–54 It is surprising that the p-value, being the most popular statistic among researchers for testing evidence against the null hypothesis, is never reported with a standard error or confidence interval, unlike other statistics. Indeed, it is not possible to do so for p: unlike statistics that estimate unobservable population parameters, the p-value is a property of the sample, and there is no 'true p-value' in the larger population.55 The random behavior of the p-value is generally neglected. This huge sampling variability in the p-value has been called the 'dance of the p-values'.56 The major cause of the lack of repeatability is this wide sample-to-sample variability of the p-value, which is not considered.21,57 A recent study in genomics suggests that over-interpretation of very significant, but highly variable, p-values is an important factor contributing to the incidence of non-reproducible studies.58 The p-value is only as reliable as the sample from which it is calculated21 and inherits its vagueness from the uncertainty of the estimates from which it is calculated.59
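The 'dance of the p-values' described above is easy to see in a toy simulation (all numbers here are invented for illustration): the same experiment, with the same true effect, is repeated twenty times, and only the random samples change.

```python
# Toy illustration of the 'dance of the p-values': identical replications
# of one experiment give wildly different p-values, even though the true
# effect never changes. Effect size and sample size are assumed values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
true_effect, n = 0.5, 32          # moderate effect, modest sample size

p_values = []
for _ in range(20):               # 20 identical replications
    control = rng.normal(loc=0.0, scale=1.0, size=n)
    treated = rng.normal(loc=true_effect, scale=1.0, size=n)
    p_values.append(stats.ttest_ind(treated, control).pvalue)

print(f"min p = {min(p_values):.4f}, max p = {max(p_values):.2f}")
significant = sum(p <= 0.05 for p in p_values)
print(f"{significant}/20 replications crossed the 0.05 threshold")
```

With a design like this, the p-values typically spread over orders of magnitude, with replications falling on both sides of the 0.05 threshold, which is exactly why a single p-value is a poor predictor of what a replication will show.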
A recent story published by Nosek60 and co-researchers, including Motyl, a psychology PhD student, is an eye-opener for all of us who are somehow stuck on drawing inferences solely from the unpredictable p-value. They conducted a study investigating the embodiment of political extremism on 1,979 participants and completed the study with the statement that political moderates saw shades of grey more accurately than did either left-wing or right-wing extremists (p = 0.01). However, before writing and submitting the manuscript they had a chance to look at two recent articles highlighting the possibility that research practices spuriously inflate the presence of positive results in the published literature.13,61 Triggered by these findings, the investigators replicated the experiment with 1,300 participants, ensuring a statistical power of 0.995 to detect an effect of the original effect size at a significance level of 5%. To their great surprise, the effect disappeared (p = 0.59). These controversial results discouraged the authors from submitting their manuscript for publication. This true story clearly depicts the slippery nature of p-values and raises serious concern about irreproducibility among the scientific community. If every researcher did what Nosek et al.60 did, i.e. repeated their experiments before sending manuscripts for publication, perhaps some of them might have met the same fate. It won't be wrong to say that some researchers might be genuinely interested in replicating their experiments and verifying results, but in the mad race to publish sooner, for placement in jobs or profile-raising, they opt not to do so. However, it is also true that sometimes constraints of resources, or unnecessarily placing subjects at risk from an intervention, do not permit researchers to do this. The authors of the story truly admit: "When incentives favour novelty over replication, false results persist in the literature unchallenged". A detailed quantitative assessment of non-reproducibility among psychological studies, based on replications of 100 experimental and correlational studies published in three psychology journals using high-powered designs and original materials, indicated a fall of sixty-one percentage points in significant findings in the replications (36%) compared with the originals (97%).62

Ioannidis has a pessimistic view about the reliability of publications and asks, "Why most published research findings are false?"17 This thought-provoking statement made by the author is not based on whimsical grounds but on simulation studies. The author says, "for most study designs and settings, it is more likely for a research claim to be false than true". These observations certainly compel us to ask: should we believe our past research claims, concluded merely on the basis of a single study, without verifying and replicating the experiment? We cannot undo what has occurred in the past. Certainly, some precautions can be taken in ongoing and future research programs. A recent study reported large-scale non-reproducibility (89%) in preclinical cancer research studies based on some 53 selected key papers and suggested some useful guidelines, including 'tags' that indicate whether the key findings of a seminal paper were confirmed.63 The scientific community is shaken by the recent reports of the huge proportion of non-reproducibility in peer-reviewed preclinical studies, and certain guidelines have been formulated for their publication in the magazine Science.64

A recent study59 argues that degrading p-values into significant and non-significant contributes to making studies irreproducible, or to making them seem irreproducible. The authors of the study further emphasize interpreting p-values cautiously, as graded evidence against the null hypothesis, making use of effect sizes and interval estimates.

A serious concern about reproducibility is raised in the introduction to the ASA's statement on p-values; however, the statement does not say what strategy should be adopted to check the prevailing high level of non-reproducibility in the scientific literature. But it can be checked to a great extent if researchers understand the meaning of p-values in the new perspective and adhere to the statement's 4th principle, which requires full reporting and transparency for drawing proper inferences.

4. P-hacking, publication bias and false positives

"P-hacking", "data dredging" or "data manipulation" is a malpractice followed by researchers, knowingly or unknowingly, in which they keep fiddling with the data, i.e. selecting or discarding data or trying various statistical tools, until non-significant results become significant.13,65 The background behind this practice is publication bias, popularly known as the "file drawer effect": studies with non-significant or negative results are sent to the file drawer and consequently have lower publication rates compared to significant ones. There is strong evidence that journals, especially those with higher impact factors, disproportionately publish significant results.66–68 A recent study conducted on three top orthopedic journals clearly indicated evidence of p-hacking in the orthopedic literature.69 The authors used frequency plots of p-curves for these journals and revealed statistical evidence of p-hacking.

The dichotomous nature of NHST using the p-value (if p ≤ 0.05, reject the null hypothesis, otherwise not) is also a driving force for p-hacking, contributing to the publication of false positives; consequently most manipulations of results, if any, are done by researchers when they get p-values slightly above the threshold value (0.05). This threshold value of 0.05 is itself under criticism. It is not feasible to assess the extent of p-hacking, as it depends upon the integrity of the researcher. An indirect way of detecting it is to look for an increase in the relative frequency of p-values just below 0.05, where we expect the signal of p-hacking to be the strongest.69 P-hacking can also be tested for during meta-analysis; its effect seems to be weak relative to the real effect sizes and probably does not drastically alter scientific consensuses drawn from meta-analyses.70

P-hacking promotes the publication of false positives and probably amounts to scientific misconduct. Its infection spreads fast and remains undetected. It may adversely affect other related ongoing or future research programs in interpreting or reviewing their results, drawing conclusions and making policy decisions. The main culprit that promotes p-hacking is the degrees of freedom available to the researcher. The issue of this freedom has been discussed by a number of authors.13,61,71–80 The undisclosed flexibility or degrees of freedom13 enjoyed by researchers in the course of collecting and analyzing data (excluding and including certain observations, adding or removing variables, applying alternative statistical tools) ultimately takes them down the path of significant results by getting p ≤ 0.05. A recent review article critically presents p-hacking in relation to degrees of freedom and develops an extensive list of 34 degrees of freedom that researchers have in formulating hypotheses, and in designing, running, analyzing, and reporting psychological research.81 P-hacking is not a problem with the p-value per se. Researchers get trapped into p-hacking in order to report strong or significant results, leading to publication bias. Simmons et al13 offer a practical solution to the flexibility-ambiguity problem by proposing six requirements for authors and four guidelines for reviewers.

The recent ASA statement on the p-value, discussed in the last section of this paper, unequivocally discourages the widespread and ongoing practice of p-hacking by making full reporting and transparency by the author mandatory for drawing a proper inference. It further adds, in the explanation of its 4th principle: "Cherry picking promising findings, also known by such terms as 'data dredging', 'significance chasing', 'significance questing',
‘selective inference’, and ‘p-hacking’, leads to a spurious excess of be tested. However, the ASA's interpretation of the third principle
statistically significant results in the published literature and cautions the users while making decisions purely based on the
should be vigorously avoided”.27 Amrhein et al.59 strongly argue threshold value and emphatically say that the p-value alone cannot
for removing fixed significance threshold to overcome the problem ensure that a decision is correct or incorrect and researchers should
of p-hacking and emphasize interpreting ‘p-values’ as graded consider contextual factors into play to derive scientific inferences.
measures of the strength of evidence against the null hypothesis. It is hardly two years since the ASA statement was released. But a
huge concern is being witnessed among the methodologists about
5. Weak and dichotomous threshold of p-values the threshold of the p-value. Recently a group of 72 methodologists
argued that the prevalent threshold of 0.05 was too high and pro-
Most of the research findings are concluded on the threshold posed its lowering to 0.005 for claiming statistical significance for
value of p i.e. if p is below 0.05 results are considered “statistically new discoveries.85 However, their proposal met a mixed response
significant” otherwise not. No doubt, this arbitrary cutoff has made with strong endorsement in some circles and concerns in others.86
the task easier for the researchers in concluding their findings but
at the same time has led to scientifically dubious practice of 6. The ASA's statement on p-values: context, process, and
regarding significant findings as more valuable, reliable and purpose
reproducible.6 This arbitrary cutoff (usually 0.05) has been ques-
tioned primarily on two grounds e one that many researchers feel Null Hypothesis Significance Testing (NHST) has been in practice
that this cutoff is too high and the other that it promotes p-hacking for almost nine decades since 1925 and a subject of debate with
leading to generating false positives. The weak evidence provided numerous controversies. However, unfailing rising criticism and
by p-values between 0.01 and 0.05 has been recently explored by concerns surrounding the NHST during the past few years certainly
Colquhoun82 in terms of false positive risks and he proved that built up pressure on the American Statistical Association (ASA) to
when p ¼ 0.05, the odds in favour of there being a real effect (given clarify its position on the p-value to avoid its further misinterpre-
by the likelihood ratio) are about 3: 1 and these odds are very low tation, misuse and large scale confusion among the scientific fra-
compared to 19 to 1 inferred from the p-value. Further to achieve a ternity. ASA is the world's largest body of statisticians and the
false positive risk of 5%, one would need to observe p ¼ 0.00045 for oldest continuously operating professional science society in USA.
a prior probability of a real effect of 0.1.82 Stimulated by a number of observations and several intriguing
In an another recent communication researchers also argued that thought provoking articles published in recent past such as “It's
there is weak evidence when the p-value is 0.05 and this threshold science's dirtiest secret: The ‘scientific method’ of testing hypoth-
should be lowered to 0.005 for the social and biomedical sciences.83 eses by statistical analysis stands on a flimsy foundation.“,11
In other scientific fields such as particle physics and DNA studies “numerous deep flaws in null hypothesis significance testing”,87
scientists have asked for a p-value as low as 0.0000003 and for a “statistical techniques for testing hypotheses …have more flaws
genome-wide association study a p-value of 5  10-8.23 No doubt, than Facebook's privacy policies.“,19 “Scientific Method: Statistical
lowering the p-value raises the evidence against the null hypothesis Errors”,88 “Why do so many colleges and grad schools teach
and reduces false positives. But at the same time this requires a p ¼ 0.05?” & “Why do so many people still use p ¼ 0.05?“, questions
higher sample size for rejecting the null hypothesis which may be a posed to the American Statistical Association (ASA) discussion
limitation in many medical experiments and may even involve a forum by George Cobb, Professor Emeritus of Mathematics and
higher risk besides increasing the project cost. The other major Statistics at Mount Holyoke College, “Psychology journal bans P
problem in setting the lower thresholds p-value is that it enhances values”89 and the issues related reproducibility crisis, the ASA
the chances of a false negative which is popularly known as a type II Board decided to take up the challenge of developing a policy
error i.e. we may conclude that effects do not exist where in reality statement on p-values and statistical significance. Based on the
they do. Consequences of getting such negative results have a higher widespread consensus in the statistical community, ASA developed
likelihood of the “file drawer effect” and have a lower publication a formal statement clarifying several widely agreed upon principles
rate. In other words keeping lower thresholds may deprive users the underlying the proper use and interpretation of the p-value. The
benefits of new drugs or technologies. Executive Committee of the ASA took almost three months, held
There is another think-tank which questions even the validity of discussions at length and finally approved the statement on January
the p-value for NHST and favours the use of Bayesian methods over 29, 2016 by defining the p-value along with six guiding principles. It
the p-value. “The family of Bayesian methods has been well is worth mentioning here that the ASA has not previously taken
developed over many decades now, but somehow we are stuck to positions on specific matters of statistical practice.27 ASA's state-
using frequentist approaches”.84 The basic reason being that they ment has been written in very simple language avoiding the tech-
are computationally complex compared to traditional ‘frequentist nical terms such as alternative hypothesis, statistical power and
methods’ and require technical expertise. errors associated with hypothesis testing etc.; and can be easily
The problem arising as a result of threshold values of ‘p’ can be understood even by who don't have statistical background.
overcome to a greater extent if we adhere to recent statement is-
sued by the American Statistical Association with respect to’ p- 7. What is a p-value?
values’ given here under the section ‘ASA's statement on p-values’.
The third principle clearly says “Scientific conclusions and business “Informally, a p-value is the probability under a specified sta-
or policy decisions should not be based only on whether a p-value tistical model that a statistical summary of the data (e.g., the
passes a specific threshold”. Furthermore, principles 4 and 5 of the sample mean difference between two compared groups) would be
said statement are straight forward in defining the limitations of equal to or more extreme than its observed value”.
the p-value that “it does not measure the size of an effect or the
importance of a result “ and “By itself, a p-value does not provide a 7.1. Principles
good measure of evidence regarding a model or hypothesis”. It is
worth mentioning here that the ASA's statement does not say The statement issued by the ASA includes following six guiding
anything about what should be the value of threshold. Perhaps it principles along with its explanation citing examples wherever
has left it to the requirement of the researchers and hypothesis to required:
1. "P-values can indicate how incompatible the data are with a specified statistical model"
2. "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone"
3. "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold"
4. "Proper inference requires full reporting and transparency"
5. "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result"
6. "By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis".

We are not reproducing here the interpretation that the ASA statement includes with each of the principles; readers are advised to look these up online. As far as the contents of the statement are concerned, Wasserstein & Lazar, who were associated with its development, clearly admit: "Nothing in the ASA statement is new. Statisticians and others have been sounding the alarm about these matters for decades, to little avail". Wellek,90 while critically evaluating the current p-value controversy, feels that the principles included in the ASA statement provide much guidance for the good practice of handling and interpreting p-values and hypothesis tests. However, he strongly criticizes the 6th principle of the statement and questions the consensus on the precise meaning of the concept of evidence, i.e., what is a good measure of evidence and what is a bad one. While discussing the problem of reproducibility of results and misinterpretation of the p-value, Colquhoun feels that the ASA statement includes "what you should not do, but failed to say what you should do".82 We are of the opinion that the statement leaves most of the job to the discretion of researchers while interpreting results and making decisions. But at the same time it has come out with a clear definition and interpretation of the p-value, which had been misunderstood over the past several decades by many researchers whose over-expectations from this value caused damage to science. The statement will definitely help in reducing researchers' over-dependency on p-values in concluding their findings.

With the ASA guidelines, the dichotomy of the p-value has been made porous so as to enable the researcher to see the truth on either side of the threshold. This threshold has been a major cause of augmenting false positives in the scientific literature. Ron Wasserstein says: "In the post p < 0.05 era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy. (As a start to that thinking, if p-values are reported, we would see their numeric value rather than an inequality (p = 0.0168 rather than p < 0.05)). All of the assumptions made that contribute information to inference should be examined, including the choices made regarding which data is analyzed and how. In the post p < 0.05 era, sound statistical analysis will still be important, but no single numerical value, and certainly not the p-value, will substitute for thoughtful statistical and scientific reasoning".91 Amrhein and his co-researchers59 share the same opinion and suggest: "i) do not use the word 'significant' and do not deny an observed effect if the p-value is relatively large, and ii) discuss how strong we judge the evidence, and how practically important the effect is".

It is over two years since the ASA's statement was released and duly highlighted by the media as well. Our personal observations are that the majority of medical researchers in India and other countries may not be aware of these newly formulated guidelines or, if aware, might not have read them. A year after its release, Matthews et al92 commented, "Yet a year on, it is not clear that the ASA's statement has had any substantive effect at all". They further add that business goes on as usual even in leading journals like The Lancet or the Proceedings of the National Academy of Sciences. It is now time to sensitize researchers to this issue, which forms the inferential basis of most of our research findings on a massive scale, through workshops, conferences and teaching classes. A proactive role is expected from statisticians all over the world in this scientific endeavor. Such ventures will help in changing the conventional thinking of researchers, reviewers and editors in understanding the meaning of the p-value, its limitations and its worth. In this context the scientific community must keep in mind the concluding remark made in the ASA statement that "No single index should substitute for scientific reasoning". Therefore, we should also be open to other alternatives such as Bayesian inference tools. No doubt, these tools appear to be computationally complex, but so far no alternative tool has been shown to be superior to the p-statistic. Abandoning the p-value is not a solution to the problem. What is much more needed is an understanding of the pulse of the data while using this statistic for testing the evidence for a hypothesis.

8. Conclusions

Much of the criticism that has been prevailing around the p-value is because of researchers' own lack of understanding of the statistic, its limitations and worth, and their over-expectations from this value. Most researchers today are ignorant of the behavior of p-values, also known as the 'dance of the p-values', which is also responsible for the irreproducibility crisis. It is now time to sensitize researchers to this issue, which forms the inferential basis of most of our research findings on a massive scale, through workshops, conferences and teaching classes. We believe the recent policy statement of the ASA will have a far-reaching effect in checking the publication of false positives and curtailing the "file drawer effect". We hope this brief essay will be helpful to those who may be interested in knowing 'Why the p-value is under fire?'

Potential conflicts of interest

The authors have nothing to disclose.

Acknowledgements

We sincerely acknowledge Dr. S.R. Bhat, Former Principal Scientist, NRCPB, IARI campus, New Delhi (India) for going through the draft manuscript critically and for his valuable suggestions.

References

1. Lehmann EL. The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? J Am Stat Assoc. 1993;88:1242-1249.
2. Perezgonzalez JD. Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front Psychol. 2015;6:223. https://doi.org/10.3389/fpsyg.2015.00223.
3. Lindley DV. Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 1: Probability; Part 2: Inference. Cambridge: Cambridge University Press; 1965.
4. Wald A. Statistical Decision Functions. New York, NY: Wiley; 1950.
5. Fisher RA. Statistical Methods for Research Workers. London: Oliver & Boyd; 1925.
6. Nickerson RS. Null hypothesis significance testing: a review of an old and continuing controversy. Psychol Methods. 2000;5:241-301.
7. Gigerenzer G. Mindless statistics. J Soc Econ. 2004;33:587-606.
8. Gelman A, Stern HS. The difference between 'significant' and 'not significant' is not itself statistically significant. Am Statistician. 2006;60:328-331.
9. Goodman SN. A dirty dozen: twelve P-value misconceptions. Semin Hematol. 2008;45:135-140.
10. Cumming G. Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspect Psychol Sci. 2008;3:286-300.
11. Siegfried T. Odds are, it's wrong: science fails to face the shortcomings of statistics. Sci News. 2010;177:26. https://wattsupwiththat.com/2010/03/20/.
12. Greenland S. Null misinterpretation in statistical testing and its impact on health risk assessment. Prev Med. 2011;53:225-228.
13. Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychol Sci. 2011;22:1359-1366.
14. Lew MJ. Bad statistical practice in pharmacology (and other basic biomedical disciplines): you probably don't know P. Br J Pharmacol. 2012;166:1559-1567.
15. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22:364-368.
16. Cohen J. Things I have learned (so far). Am Psychol. 1990;45:1304-1312.
17. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2:e124.
18. Gelman A, Loken E. The statistical crisis in science [online]. Am Sci. 2014;102:567-606.
19. Siegfried T. To make science better, watch out for statistical flaws. Science News Context Blog; 2014. https://www.sciencenews.org/blog/context/make-science-better-watch-out-statistical-flaws (accessed 18 April 2018).
20. Nuzzo R. P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Nature. 2014;506:150-152.
21. Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle P value generates irreproducible results. Nat Methods. 2015;12:179-185.
22. Thomas LE, Pencina MJ. Do not over (P) value your research article. JAMA Cardiol. 2016;1:1055. https://doi.org/10.1001/jamacardio.2016.3827.
23. Chawla DS. P-value shake-up proposed. Nature. 2017;548:16-17.
24. Trafimow D. Editorial. Basic Appl Soc Psychol. 2014;36:1-2.
25. Trafimow D, Marks M. Editorial. Basic Appl Soc Psychol. 2015;37:1-2. https://doi.org/10.1080/01973533.2015.1012991.
26. Greenland S, Senn SJ, Rothman KJ, et al. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016;31:337-350.
27. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Statistician. 2016;70:129-133. https://doi.org/10.1080/00031305.2016.1154108.
28. Bartolucci AA, Tendera M, Howard G. Meta-analysis of multiple primary prevention trials of cardiovascular events using aspirin. Am J Cardiol. 2011;107:1796-1801.
29. Sullivan GM, Feinn R. Using effect size - or why the P value is not enough. J Grad Med Educ. 2012;4:279-282.
30. Lindley DV. A statistical paradox. Biometrika. 1957;44:187-192.
31. Bartlett MS. A comment on D.V. Lindley's statistical paradox. Biometrika. 1957;44:533-534.
32. Senn SJ. Two cheers for P-values. J Epidemiol Biostat. 2001;6:193-204.
33. Kuffner TA, Walker SG. Why are p-values controversial? Am Statistician. 2018. https://doi.org/10.1080/00031305.2016.1277161.
34. Woolston C. A blueprint to boost reproducibility of results. Nature. 2014;513. https://doi.org/10.1038/nature.2014.16222.
35. Mobley A, Linder SK, Braeuer R, Ellis LM, Zwelling LA. A survey on data reproducibility in cancer research provides insights into our limited ability to translate findings from the laboratory to the clinic. PLoS One. 2013;8:e63221.
36. Russell JF. If a job is worth doing, it is worth doing twice: researchers and funding agencies need to put a premium on ensuring that results are reproducible. Nature. 2013;496:7.
37. Bohannon J. Replication effort provokes praise - and 'bullying' charges. Science. 2014;344:788-789.
38. Goodman SN, Fanelli D, Ioannidis JPA. What does research reproducibility mean? Sci Transl Med. 2016;8:341ps12.
39. Editorial. Reality check on reproducibility. Nature. 2016;533:437. https://doi.org/10.1038/533437a.
40. Goodman SN. A comment on replication, P-values and evidence. Stat Med. 1992;11:875-879.
41. Berger J. Are P-values reasonable measures of accuracy? In: Francis IS, Manly BFJ, Lam FC, eds. Pacific Statistics Congress. Amsterdam: North-Holland/Elsevier; 1986.
42. Berger J, Sellke T. Testing a point null hypothesis: the irreconcilability of P values and evidence. J Am Stat Assoc. 1987;82:112-139.
43. Good I. Good Thinking: The Foundations of Probability and Its Applications. University of Minnesota Press; 1983:129.
44. Pratt J. Bayesian interpretation of standard inference statements. J Roy Stat Soc B. 1965;27:169-203.
45. Edwards W, Lindman H, Savage L. Bayesian statistical inference for psychological research. Psychol Rev. 1963;70:193-242.
46. Diamond G, Forrester J. Clinical trials and statistical verdicts: probable grounds for appeal. Ann Intern Med. 1983;98:385-394.
47. Sterne JA, Davey Smith G. Sifting the evidence - what's wrong with significance tests? BMJ. 2001;322:226-231.
48. Wacholder S, Chanock S, Garcia-Closas M, El Ghormli L, Rothman N. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst. 2004;96:434-442.
49. Risch NJ. Searching for genetic determinants in the new millennium. Nature. 2000;405:847-856.
50. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci U S A. 2013;110:19313-19317.
51. Hayden EC. Weak statistical standards implicated in scientific irreproducibility. Nature. 2013. https://doi.org/10.1038/nature.2013.14131.
52. Hung HMJ, O'Neill RT, Bauer P, Kohne K. The behavior of the P-value when the alternative hypothesis is true. Biometrics. 1997;53:11-22.
53. Sackrowitz H, Samuel-Cahn E. P values as random variables - expected P values. Am Statistician. 1999;53:326-331.
54. Murdoch DJ, Tsai YL, Adcock J. P-values are random variables. Am Statistician. 2008;62:242-243.
55. Cumming G. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis. New York: Routledge; 2012.
56. Cumming G. The new statistics: why and how. Psychol Sci. 2014;25:7-29.
57. Boos DD, Stefanski LA. P-value precision and reproducibility. Am Statistician. 2011;65:213-221.
58. Lazzeroni LC, Lu Y, Belitskaya-Levy I. P-values in genomics: apparent precision masks high uncertainty. Mol Psychiatr. 2014;19:1336-1340.
59. Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ. 2017;5:e3544. https://doi.org/10.7717/peerj.3544.
60. Nosek BA, Spies JR, Motyl M. Scientific utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspect Psychol Sci. 2012;7:615-631.
61. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychol Sci. 2012;23:524-532.
62. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349:aac4716. https://doi.org/10.1126/science.aac4716.
63. Begley CG, Ellis LM. Drug development: raise standards for preclinical cancer research. Nature. 2012;483:531-533.
64. McNutt M. Reproducibility. Science. 2014;343:229.
65. Gadbury GL, Allison DB. Inappropriate fiddling with statistical analyses to obtain a desirable p-value: tests to detect its presence in published literature. PLoS One. 2012;7:e46363.
66. Begg CB, Berlin JA. Publication bias: a problem in interpreting medical data. J R Stat Soc Ser A. 1988;151:419-463.
67. Fanelli D. Negative results are disappearing from most disciplines and countries. Scientometrics. 2012;90:891-904.
68. Song F, Eastwood AJ, Gilbody S, Duley L, Sutton AJ. Publication and related biases. Health Technol Assess. 2000;4:1-115.
69. Bin Abd Razak HR, Ang JGE, Attal H, Allen JC. P-hacking in orthopaedic literature: a twist to the tail. J Bone Joint Surg Am. 2016;98:e91. https://doi.org/10.2106/JBJS.16.00479.
70. Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD. The extent and consequences of P-hacking in science. PLoS Biol. 2015;13:e1002106.
71. Kriegeskorte N, Simmons WK, Bellgowan PSF, Baker CI. Circular analysis in systems neuroscience: the dangers of double dipping. Nat Neurosci. 2009;12:535-540.
72. Nieuwenhuis S, Forstmann BU, Wagenmakers EJ. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011;14:1105-1107.
73. Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas HLJ. Why psychologists must change the way they analyze their data: the case of psi: comment on Bem. J Pers Soc Psychol. 2011;100:426-432.
74. Bakker M, van Dijk A, Wicherts JM. The rules of the game called psychological science. Perspect Psychol Sci. 2012;7:543-554.
75. Chambers CD. Registered reports: a new publishing initiative at Cortex. Cortex. 2013;49:609-610.
76. Francis G. Replication, statistical consistency, and publication bias. J Math Psychol. 2013;57:153-169.
77. Bakker M, Wicherts JM. Outlier removal, sum scores, and the inflation of the type I error rate in independent samples t tests: the power of alternatives and recommendations. Psychol Methods. 2014;19:409-427.
78. Simonsohn U, Nelson LD, Simmons JP. p-curve and effect size: correcting for publication bias using only significant results. Perspect Psychol Sci. 2014;9:666-681.
79. Steegen S, Tuerlinckx F, Gelman A, Vanpaemel W. Increasing transparency through a multiverse analysis. Perspect Psychol Sci. 2016;11:702-712.
80. van Aert RCM, Wicherts JM, van Assen MALM. Conducting meta-analyses based on p-values: reservations and recommendations for applying p-uniform and p-curve. Perspect Psychol Sci. 2016;11:713-729.
81. Wicherts JM, Veldkamp CL, Augusteijn HE, Bakker M, van Aert RC, van Assen MA. Degrees of freedom in planning, running, analyzing, and reporting psychological studies: a checklist to avoid p-hacking. Front Psychol. 2016;7:1832.
82. Colquhoun D. The reproducibility of research and the misinterpretation of p-values. R Soc Open Sci. 2017;4:171085. https://doi.org/10.1098/rsos.171085.
83. Benjamin D, Berger J, Johannesson M, et al. Redefine statistical significance. PsyArXiv preprint; 2017. http://osf.io/preprints/psyarxiv/mky9j.
84. Ioannidis JP, Tarone R, McLaughlin JK. The false-positive to false-negative ratio in epidemiologic studies. Epidemiology. 2011;22:450-456.
85. Benjamin DJ, Berger JO, Johnson VE, et al. Redefine statistical significance. Nat Hum Behav. 2018;2:6-10.
86. Ioannidis JP. Viewpoint: the proposal to lower p-value thresholds to .005. J Am Med Assoc. 2018;319:1429-1430.
87. Phys.org Science News Wire. The problem with p values: how significant are they, really? 2013. http://phys.org/wire-news/145707973/the-problem-with-p-values-how-significant-are-they-really.html.
88. Nuzzo R. Scientific method: statistical errors. Nature. 2014;506:150-152. http://www.nature.com/news/scientific-method-statistical-errors-1.14700.
89. Woolston C. Psychology journal bans P values. Nature. 2015;519. https://doi.org/10.1038/519009f.
90. Wellek S. A critical evaluation of the current "p-value controversy". Biom J. 2017;00:1-19.
91. Wasserstein R. American Statistical Association seeks "post p < 0.05" era. Debunking Denialism; 2016. https://debunkingdenialism.com/2016/03/08/american-statistical-association-seek-post-p-0-05-era/.
92. Matthews R, Wasserstein R, Spiegelhalter D. The ASA's p-value statement, one year on. Significance. 2017;14:38-41.