
A mini-literature-review:

What has been said about


Null Hypothesis Significance Test (NHST)

Dr. Gang Xie (John)


Statistics Support Officer, Quantitative Consulting Unit,
Office of Research Services and Graduate Studies,
Charles Sturt University, NSW, Australia
Website: http://www.csu.edu.au/qcu

19/06/2023 Copyright © Charles Sturt University 2021


References:
[1] David J. Sheskin (2011), Handbook of Parametric and Nonparametric Statistical Procedures (5th
Edition), CRC Press, Taylor & Francis Group.
[2] George W. Snedecor & William G. Cochran (1989), Statistical Methods (8th Edition), Iowa State
University Press / AMES.
[3] Morris H. DeGroot (1986), Probability and Statistics (2nd Edition), Addison-Wesley publishing
company.
[4] George Casella & Roger L. Berger (2002), Statistical Inference (2nd Edition), Duxbury, Thomson
Learning.
[5] Raymond Hubbard and M.J. Bayarri (2003), Confusion over Measures of Evidence (p’s) versus Errors
(α’s) in Classical Statistical Testing, The American Statistician, Vol. 57, No. 3, pp. 171-182.
[6] Gerd Gigerenzer (2004), Mindless statistics, The Journal of Socio-Economics 33, 587-606.
[7] Raymond Hubbard & J. Scott Armstrong (2006), Why We Don’t Really Know What Statistical
Significance Means: Implications for Educators, Journal of Marketing Education, Vol. 28 No. 2, 114-120.
[8] David Salsburg (2001), The Lady Tasting Tea: How statistics revolutionized science in the twentieth
century, W.H. Freeman and Company.

19/06/2023 Australian and New Zealand Statistical Virtual Conference, 05-09 July 2021 (ANZSC 2021) 2
References (continued):
[9] Raymond Hubbard, Brian D. Haig, and Rahul A. Parsa (2019), The Limited Role of Formal Statistical Inference in
Scientific Inference, The American Statistician, Vol. 73, No. S1, 91-98.
[10] Gerd Gigerenzer (1998), We need statistical thinking, not statistical rituals, Behavioral and Brain Sciences, 21:2,
199-200.
[11] Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019), Moving to a World Beyond “p<0.05”, The
American Statistician, Vol. 73, No. S1, 1-19: Editorial.
[12] Gerald J. Hahn and William Q. Meeker (1993), Assumptions for Statistical Inference, The American Statistician,
Vol. 47, No. 1, pp. 1-11.
[13] Geoff Cumming & Robert Calin-Jageman (2017), Introduction to The New Statistics: Estimation, Open Science,
& Beyond, Routledge, Taylor & Francis Group.
[14] David A. Freedman (1991), Statistical Models and Shoe Leather, Sociological Methodology, Vol. 21, pp. 291-313.
[15] Frank Yates (1951), The Influence of Statistical Methods for Research Workers on the Development of the
Science of Statistics, Journal of The American Statistical Association, 46(253), pp. 19-34.
[16] Hadley Wickham & Garrett Grolemund (2017), R for Data Science: Import, Tidy, Transform, Visualize, and
Model Data, O’Reilly Media, Inc.
[17] William G. Cochran and Gertrude M. Cox (1957), Experimental Designs (2nd Edition), John Wiley & Sons, Inc.

References (continued):
[18] E.L. Lehmann and Joseph P. Romano (2005), Testing Statistical Hypotheses (3rd Edition), Springer.
[19] George E.P. Box (1976), Science and Statistics, Journal of the American Statistical Association
December 1976, Volume 71, Number 356, pages 791-799.
[20] Valentin Amrhein, Sander Greenland, Blake McShane (2019), Retire statistical significance, Nature,
Vol. 567, 305: Comment.
[21] Christopher Tong (2019), Statistical Inference Enables Bad Science; Statistical Thinking Enables Good
Science, The American Statistician, Vol. 73, No. S1, 246-261.
[22] Jacob Cohen (1994), The Earth Is Round (p < .05), American Psychologist, Vol. 49, No. 12, 997-1003.
[23] Vincent S. Staggs (2019), Why statisticians are abandoning statistical significance, (Guest Editorial
for) Research in Nursing & Health, 42:159-160.
[24] Gerd Gigerenzer and Julian N. Marewski (2014), Surrogate Science: The Idol of a Universal Method
for Scientific Inference, Journal of Management, Vol. 41, No. 2, 421-440: Editorial Commentary.
[25] Christian P. Robert (2016), Bayesian Testing of Hypotheses, Presentation for CRiSM workshop on
Contemporary Issues in Hypothesis Testing, 15-16 Sep. 2016, Warwick University, UK; online link:
https://warwick.ac.uk/fac/sci/statistics/crism/workshops/hypothesistesting/robert.pdf (accessed on
10/03/2021)

Outline of this presentation
• The concept/definition/nature of statistical inference
• Fisher’s Significance Testing versus Neyman-Pearson’s Hypothesis Testing
• Null Hypothesis Significance Test (NHST) as the Hybrid Testing Paradigm
• Fundamental concerns regarding the practice of NHST
• Textbooks sell (NHST as) a unified approach to statistical inference
• What we can/should do regarding the practice of NHST
• Your take-home messages
• References

The concept/definition/nature of statistical inference
• “The field of statistics can be divided into two general areas, descriptive statistics and
inferential statistics.”[1] “The process of making statements about the population from the
results of samples is called statistical inference.” [2]
• “Inferential statistics employs data in order to draw inferences (i.e., derive conclusions) or
make predictions. Typically, in inferential statistics sample data are employed to draw
inferences about one or more populations from which the samples have been derived.”[1]
• Estimation and Tests of Hypotheses are two major components (or two basic areas) in
statistical inference [1][2][3][4].
• “The goal of a hypothesis test is to decide, based on a sample from the population, which
of two complementary hypotheses is true.” [4] “Testing and confidence intervals are closely
related in a mathematical sense but the two techniques have different purposes.” [2]
• “A test statistic is evaluated in reference to a sampling distribution, which is a theoretical
probability distribution of all the possible values the test statistic can assume if one were to
conduct an infinite number of studies employing a sample size equal to that used in the
study. The probabilities for a sampling distribution are based on the assumption that each
of the samples is randomly drawn from the population it represents.” [1]
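The quoted definition of a sampling distribution lends itself to a small simulation. The sketch below (Python, with made-up population parameters; not from the slides) draws many samples of a fixed size and shows that the resulting sample means scatter around μ with the theoretical standard error σ/√n:

```python
import math
import random
import statistics

random.seed(42)

# Hypothetical population: normal with mu = 50, sigma = 10 (made-up numbers).
MU, SIGMA, N, N_STUDIES = 50.0, 10.0, 25, 10_000

# One simulated "study" = one random sample of size N; its mean is one draw
# from the sampling distribution of the sample mean.
sample_means = [
    statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    for _ in range(N_STUDIES)
]

# Theory: the sample mean has mean MU and standard error SIGMA / sqrt(N).
se_theory = SIGMA / math.sqrt(N)
se_simulated = statistics.stdev(sample_means)

print(f"mean of sample means: {statistics.fmean(sample_means):.2f}")
print(f"theoretical SE: {se_theory:.2f}, simulated SE: {se_simulated:.2f}")
```

The simulated spread of the sample means closely matches σ/√n, which is exactly the "theoretical probability distribution" the quotation describes.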

A summary of what we intend to do with statistical inference

[Diagram] Our research questions and the defined target population (parameters: μ, σ, etc.) sit at the
centre; from the population, many random samples (sample 1, sample 2, …, sample i) could be drawn.
My (random) sample is obtained via randomisation, replication, and blocking. Sample statistics serve as
the parameter estimates (point estimates and/or interval estimates), which feed statistical models, such
as regression, and/or the associated hypothesis tests.
Extra slides:

Reference [8]: One of the book cover comments reads

“Salsburg’s book is the story of statistical


theory in the twentieth century, its time of
triumph, and of the mathematical/scientific
geniuses who made it happen. He writes
with both experience and insight, and with
a happy lack of technical barriers between
the reader and his subject.”
- Brad Efron, Professor of Statistics, Stanford University

https://www.amazon.com.au/Lady-Tasting-Tea-David-Salsburg/dp/0805071342

Extra slides:

https://en.wikipedia.org/wiki/Karl_Pearson
https://en.wikipedia.org/wiki/Ronald_Fisher
https://en.wikipedia.org/wiki/William_Sealy_Gosset

Extra slides:

https://en.wikipedia.org/wiki/Jerzy_Neyman
https://en.wikipedia.org/wiki/Egon_Pearson
https://en.wikipedia.org/wiki/Andrey_Kolmogorov

Extra slides:

https://www.britannica.com/biography/Karl-Pearson
https://www.researchgate.net/publication/8524006_Two_Approaches_to_Etiology_The_Debate_Over_Smoking_and_Lung_Cancer_in_the_1950s/figures?lo=1
https://towardsdatascience.com/what-can-an-octopus-tell-us-about-the-biggest-debate-in-statistical-theory-f017295d781f

Fisher’s Significance Testing versus Neyman-Pearson’s Hypothesis
Testing
• The currently most popular statistical inference device, the Null Hypothesis Significance Test
(NHST), is the result of an anonymous merger of two fundamentally different statistical testing
paradigms into one seemingly unified, universal testing procedure built around the so-called
‘statistical significance’ [5][6][7]. The two paradigms are Fisher’s evidence-oriented significance
testing (concerning inferential interpretations about the truth of a nullifiable hypothesis) and
Neyman-Pearson’s behaviour-oriented hypothesis testing (a decision-making procedure for
choosing between two competing hypotheses).
• “The level of significance shown by a p value in a Fisherian significance test refers to the
probability of observing data this extreme (or more so) under a null hypothesis. This data-
dependent p value plays an epistemic role by providing a measure of inductive evidence
against Ho in single experiments. This is very different from the significance level denoted
by α in a Neyman-Pearson hypothesis test. With Neyman-Pearson, the focus is on
minimizing Type II, or β, errors (i.e., false acceptance of a null hypothesis) subject to a
bound on Type I, or α, errors (i.e., false rejections of a null hypothesis). Moreover, this
error minimization applies only to long-run repeated sampling situations, not to individual
experiments, and is a prescription for behaviors, not a means of collecting evidence.” [5]
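The contrast in the quotation can be made concrete with a toy calculation (Python; all numbers are made up for illustration). The Neyman-Pearson quantities (α, the rejection region, the power against a specific alternative) are fixed before the data arrive and are properties of the test, while the Fisherian p value can only be computed afterwards and is a property of the data:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical setting: Ho: mu = 100 vs Ha: mu = 105, known sigma = 15, n = 36.
mu0, mu1, sigma, n = 100.0, 105.0, 15.0, 36
se = sigma / math.sqrt(n)                       # 2.5

# Neyman-Pearson: fix alpha BEFORE seeing data; derive the rejection region
# and the power of the test against Ha. These are properties of the TEST.
alpha = 0.05
z_crit = 1.645                                  # one-sided 5% critical value
cutoff = mu0 + z_crit * se                      # reject Ho if xbar > cutoff
power = 1.0 - phi((cutoff - mu1) / se)          # P(reject | Ha true)

# Fisher: the p value is computed AFTER seeing data and read as a graded
# measure of evidence against Ho. It is a property of the DATA.
xbar = 104.2                                    # observed sample mean
p_value = 1.0 - phi((xbar - mu0) / se)

print(f"N-P: reject Ho iff xbar > {cutoff:.2f}; power vs mu=105: {power:.3f}")
print(f"Fisher: observed xbar = {xbar}, p = {p_value:.4f}")
```

Note that the p value plays no role in the N-P quantities, and α plays no role in the p value; this independence is exactly the point Hubbard and Bayarri [5] make.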

Fisher’s Significance Testing versus Neyman-Pearson’s Hypothesis
Testing
• “Unfortunately, to develop a mathematical approach to hypothesis testing that was internally
consistent, Neyman had to deal with a problem that Fisher had swept under the rug. This is a
problem that continues to plague hypothesis testing, in spite of Neyman’s neat, purely
mathematical solution. It is a problem in the application of statistical methods to science in
general. In its more general form, it can be summed up in the question: What is meant by
probability in real life?” [8]
• “Fisher suggested in his book Statistical Methods and Scientific Inference that the final decision
about what p-value would be significant should depend upon the circumstances. I used the word
suggested, because Fisher is never quite clear on how he would use p-values. He only presents
examples.” [8]
• “The N-P theory of hypothesis testing, … (It) is not a theory of statistical inference at all.” [7]
• “Fisher’s views on the nature of populations, probability, and random sampling were confusing.”
[9]
• “Students should learn why Neyman believed that null hypothesis testing can be ‘worse than
useless’ in a mathematical sense (e.g., when the power is less than alpha), and why Fisher thought
that Neyman’s concept of Type II error reflects a ‘mental confusion’ between technology and
science.” [10]
Null Hypothesis Significance Test (NHST) – The Hybrid Testing
Paradigm
• As typically presented in textbooks, one form of NHST practice under the hybrid testing paradigm is
carried out roughly as follows: “The investigator specifies the null (Ho) and alternative (Ha)
hypotheses, the Type I error rate/significance level, α, and (supposedly) calculates the power
of the test (e.g., a z test). These steps are congruent with N-P orthodoxy. Next, the test statistic
is computed, and in an effort to have one’s cake and eat it too, a p value is determined.
Statistical significance is then established by using the problematic p < α criterion; if p < α, a result
is deemed statistically significant, and if p > α, it is not.” [7]
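The hybrid recipe described above can be laid out step by step. The sketch below (made-up numbers, a plain two-tailed one-sample z test) shows how a pre-set α (the N-P step) and a data-dependent p value (the Fisherian step) get combined via the criticized p < α criterion:

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical data for a two-tailed one-sample z test of Ho: mu = 0.
mu0, sigma, n, xbar = 0.0, 1.0, 100, 0.23
alpha = 0.05                        # N-P step: fixed in advance

z = (xbar - mu0) / (sigma / math.sqrt(n))
p = 2.0 * (1.0 - phi(abs(z)))       # Fisherian step: data-dependent p value

# The hybrid verdict: the p < alpha criterion criticized in [5] and [7].
verdict = "statistically significant" if p < alpha else "not significant"
print(f"z = {z:.2f}, p = {p:.4f} -> {verdict}")
```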
• However, this NHST practice is unjustifiable because it is an odd compromise between Fisher’s and
Neyman-Pearson’s logic that violates both. “…, if a researcher is interested in the ‘measure of
evidence’ provided by the p value, we see no use in also reporting the error probabilities, since
they do not refer to any property that the p value has. (In addition, the appropriate interpretation
of p values as a measure of evidence against the null hypothesis is not clear. …) Likewise, if the
researcher is concerned with error probabilities the specific p value is irrelevant.” [5] This, in turn,
creates an irreconcilable conflict in interpreting NHST testing outcomes: do they apply to a single
experiment/survey situation, or to a long-run repeated sampling situation?

Null Hypothesis Significance Test (NHST) – The Hybrid Testing
Paradigm
• For researchers in the social sciences, NHST practice often takes the form of what Gigerenzer [6]
terms ‘the null ritual’: “1. Set up a statistical null hypothesis of ‘no mean
difference’ or ‘zero correlation.’ Don’t specify the predictions of your research hypothesis
or of any alternative substantive hypothesis. 2. Use 5% as a convention for rejecting the
null. If significant, accept your research hypothesis. Report the results as p < 0.05, p <
0.01, or p < 0.001 (whichever comes next to the obtained p-value). 3. Always perform this
procedure.” [6]
• “This first step is inconsistent with Neyman-Pearson theory; it does not specify an
alternative statistical hypothesis, α, β, and the sample size. The second step, making a yes-
no decision, is consistent with Neyman-Pearson theory, except that the level should not be
fixed by convention but by thinking about α, β, and the sample size. Fisher (1955) and
many statisticians after him (see Perlman and Wu, 1999), in contrast, argued that unlike in
quality control, yes-no decisions have little role in science; rather, scientists should
communicate the exact level of significance.” [6]

Null Hypothesis Significance Test (NHST) – The Hybrid Testing Paradigm
• “The basic differences are this: For Fisher, the exact level of significance is a property of the data,
that is, a relation between a body of data and a theory. For Neyman and Pearson, α is a property
of the test, not of the data. In Fisher’s Design, if the result is significant, you reject the null;
otherwise you do not draw any conclusion. The decision is asymmetric. In Neyman-Pearson
theory, the decision is symmetric. Level of significance and α are not the same thing.” [6]
• Gigerenzer [6] further used a Freudian psyche analogy to graphically depict the unconscious conflict
between statistical ideas in the minds of researchers related to the null ritual (NHST) practice.
The Unconscious Conflict:
• Superego (Neyman-Pearson): two or more hypotheses; alpha and beta determined before the
experiment; compute sample size; no statements about the truth of hypotheses …
• Ego (Fisher): null hypothesis only; significance level computed after the experiment; beta ignored;
sample size by rule of thumb; gets papers published but left with feeling of guilt
• Id (Bayes): desire for probabilities of hypotheses

There are fundamental concerns regarding the practice of NHST
• “…, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect.
Therefore, a label of statistical significance does not mean or imply that an association or effect is
highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the
association or effect being improbable, absent, false, or unimportant. Yet the dichotomization into
‘significant’ and ‘not significant’ is taken as an imprimatur of authority on these characteristics.”[11]

• “To be clear, the problem is not that of having only two labels. Results should not be trichotomized,
or indeed categorized into any number of groups, based on arbitrary p-value thresholds. Similarly, we
need to stop using confidence intervals as another means of dichotomizing (based on whether a null
value falls within the interval). And, to preclude a reappearance of this problem elsewhere, we must
not begin arbitrarily categorizing other statistical measures (such as Bayes factors).”[11]
• The “use of formal statistical inference mandates adherence to two basic requirements. First, the
population must be well defined, finite, and unchanging. Second, a random sampling procedure is
imperative in the selection of elements. In methods texts this sounds as uncomplicated as following
the steps in a basic recipe.”[9] However, in practice, most of us would agree (or as pointed out in
literature) that unable (or not) to define a study population or unable to draw a random sample(s)
from an ever defined population is a common case rather than an exception. [1][6][8][12][13]
There are fundamental concerns regarding the practice of NHST
• “Thus, typically (although there are exceptions), the ideal sample to employ in research is a random
sample.” “In point of fact, it would be highly unusual to find an experiment that employed a truly
random sample. Pragmatic and/or ethical factors make it literally impossible in most instances to
obtain random samples for research.” [1]
• “Generally, replication and prediction of new results provide a harsher and more useful validation
regime than statistical testing of many models on one data set. Fewer assumptions are needed, there is
less chance of artifact, more kinds of variation can be explored, and alternative explanations can be
ruled out. Indeed, taken to the extreme, developing a model by specification tests just comes back to
curve fitting-with a complicated set of constraints on the residuals.” [14]
• In his 'Statistical Modeling, Causal Inference, and Social Science' blog Andrew Gelman (the first author
of Bayesian Data Analysis) wrote: "...if you’re going to give advice in a statistics book about data
collection, random sampling, random assignment of treatments, etc., you should also talk about
repeating the entire experiment. ...I don’t know that I’ve ever seen a statistics textbook recommend
repeating the experiment as a general method, in the same way they recommend random sampling,
random assignment, etc." (https://statmodeling.stat.columbia.edu/2020/02/14/42221/ , accessed on 18/11/2020)
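Gelman’s point about repeating the entire experiment is easy to demonstrate by simulation. In this sketch (made-up effect size and sample size, not from the blog post), a real effect exists, yet the p value of an identical experiment swings wildly across replications:

```python
import math
import random
import statistics

random.seed(1)

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical setting: a real effect (mu = 0.4, sigma = 1), n = 25 per study,
# two-tailed one-sample z test of Ho: mu = 0, repeated 1000 times.
MU, SIGMA, N, REPS = 0.4, 1.0, 25, 1000
se = SIGMA / math.sqrt(N)

p_values = []
for _ in range(REPS):                       # repeat the ENTIRE experiment
    xbar = statistics.fmean(random.gauss(MU, SIGMA) for _ in range(N))
    z = xbar / se
    p_values.append(2.0 * (1.0 - phi(abs(z))))

significant = sum(p < 0.05 for p in p_values) / REPS
print(f"min p = {min(p_values):.4f}, max p = {max(p_values):.3f}")
print(f"share of replications with p < 0.05 (the power): {significant:.2f}")
```

Roughly half the replications are “significant” and half are not, even though the effect is real in every one; a single experiment’s p value tells us little about what the next identical experiment will show.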

There are fundamental concerns regarding the practice of NHST
• “The emphasis on tests of significance, and the consideration of the results of each experiment in
isolation, have had the unfortunate consequence that scientific workers have often regarded the
execution of a test of significance on an experiment as the ultimate objective. Results are significant
or not significant and that is the end of it.” [15]
• “Unlike its statistical inference counterpart, the concept of scientific inference defies reduction to a
series of allegedly neat and tidy methodological steps whose dutiful observance renders the output
“science.” Believing otherwise is wishful thinking.” [9]
• “So scientific inferences are made within a dynamic context of what we believe we know, and hope
to know, about our world and beyond. Importantly, these scientific conjectures and judgments do
not often involve applications of formal statistical inference.” [9]
• “Hypothesis Generation Versus Hypothesis Confirmation: As soon as you use an
observation twice, you’ve switched from confirmation to exploration. This is necessary
because to confirm a hypothesis you must use data independent of the data that you
used to generate the hypothesis. Otherwise you will be overoptimistic. There is
absolutely nothing wrong with exploration, but you should never sell an exploratory
analysis as a confirmatory analysis because it is fundamentally misleading.” [16]
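Wickham and Grolemund’s rule can be enforced mechanically with a hold-out split. The sketch below (an illustration with simulated data, not from [16]) sets aside a confirmatory half before any exploration begins:

```python
import random

random.seed(7)

# A hypothetical dataset of 200 observations (made-up numbers).
data = [random.gauss(0.0, 1.0) for _ in range(200)]

# Split ONCE, before any analysis: one half for generating hypotheses
# (exploration), the other half held out for confirming them.
random.shuffle(data)
explore, confirm = data[:100], data[100:]

# Hypotheses may be generated freely from `explore`; each hypothesis is then
# tested ONLY on `confirm`. Reusing `explore` for the test would be an
# exploratory analysis sold as a confirmatory one.
print(len(explore), len(confirm))
```

The design choice is that the split happens exactly once, up front; any observation used to suggest a hypothesis is never reused to test it.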

Classic textbooks sell a unified approach to statistical inference [1][2][3][4][17][18]
As noted by Salsburg, “While Snedecor did not contribute much in the way of original research,
he was a great synthesizer. In the 1930s, he produced a textbook, Statistical Methods, first in
mimeographed form and finally published in 1940, which became the preeminent text in the field.”
“In the 1970s, a review of citations in published scientific articles from all areas of science showed
that Snedecor’s Statistical Methods was the most frequently cited book.” [8] (Snedecor died in 1974)
The 8th edition of the book [2] with Cochran as the co-author was published in 1989 (Cochran died in 1980).

“He (William Cochran) joined Gertrude Cox in teaching experimental design courses (there were
now several such courses), and together in 1950 they wrote a textbook on the subject, entitled
Experimental Designs.” “The Science Citation Index publishes lists of citations from scientific journals
each year. The Index is printed in small print with the citations ranged in five columns; Cochran and
Cox’s book usually takes up at least one full column each year.” [8] ([17]@1957, 1950, two editions)

“Erich Lehmann …, in 1959, wrote a definitive textbook on the subject of hypothesis testing,
which remains the most complete description of Neyman-Pearson hypothesis testing in the
literature.” [8] ([18]@2005, 1986, 1959, three editions)

Classic textbooks sell a unified approach to statistical inference [1][2][3][4][17][18]
• “The two-sided 95% confidence interval for μ consists precisely
of those values of μo for μ that would result in failing to reject
the hypothesis using a 5% two-tailed test on the sample
evidence. According to 5.2.1, Ho: μ = μo fails to be rejected by
a 5% two-tailed test whenever

μo − 1.96σ/√n < x̄  and  μo + 1.96σ/√n > x̄

The inequalities can be rearranged as follows:
μo < x̄ + 1.96σ/√n  and  μo > x̄ − 1.96σ/√n
Or, as a single expression: x̄ − 1.96σ/√n < μo < x̄ + 1.96σ/√n
This is the two-sided 95% confidence interval for μ.
The confidence interval provides an assessment of how
accurately we know μ, while the test indicates whether μ could
have the value μo.” [2] (page 66)
However, the real, fundamental concern is how to interpret the
test result in real-life terms [8]. For example, see the snapshot
about confidence intervals on the right, copied from [13].
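The duality quoted from [2] can be checked numerically. A small sketch (Python; made-up values of σ, n, and x̄) confirms that a value μo is rejected by the 5% two-tailed z test exactly when it lies outside the two-sided 95% confidence interval:

```python
import math

# Illustrative known-sigma z setting (made-up numbers): sigma = 10, n = 25,
# observed sample mean xbar = 52.
sigma, n, xbar = 10.0, 25, 52.0
se = sigma / math.sqrt(n)                      # 2.0

# Two-sided 95% confidence interval for mu: xbar +/- 1.96 * sigma / sqrt(n).
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se    # (48.08, 55.92)

def rejected(mu0):
    """5% two-tailed z test of Ho: mu = mu0 (sigma known)."""
    return abs(xbar - mu0) / se > 1.96

# Duality: mu0 is rejected at the 5% level exactly when it lies outside the CI.
for mu0 in (47.0, 48.1, 52.0, 55.9, 57.0):
    assert rejected(mu0) == (not lo <= mu0 <= hi)

print(f"95% CI for mu: ({lo:.2f}, {hi:.2f})")
```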

What we can/should do regarding the practice of NHST
• “Research workers, therefore, have to accustom themselves to the fact that in many branches
of research the really critical experiment is rare, and that it is frequently necessary to combine
the results of numbers of experiments dealing with the same issue in order to form a satisfactory
picture of the true situation. This is particularly true of agricultural field trials, where in general the
effects of the treatments are found to vary with soil and meteorological conditions. In consequence
it is absolutely essential to repeat the experiment at different places and in different years if results
of any general validity or interest are to be obtained.” [15]

• “To Box, this one experiment is part of a stream of experiments.


The data from this experiment are compared to data from other
experiments. The previous knowledge is then reconsidered in
terms of both the new experiment and new analysis of the old
experiments. The scientists never cease to return to older studies
to refine their interpretation of them in terms of the new studies.” [8]
• Figure (Figure A in [19]) title: The advancement of Learning;
subtitle for the top panel, An Iteration Between Theory and Practice;
subtitle for the bottom panel, A Feedback Loop.

What we can/should do regarding the practice of NHST
• We should accept the fact that uncertainty exists everywhere in research and statistical methods do not
rid data of their uncertainty. Better measures, more sensitive designs, and larger samples all increase
the rigor of research. Accepting uncertainty helps us be modest and leads us to be thoughtful.
[9][11][20][21][22] Statistically thoughtful researchers begin above all else with clearly expressed
objectives, and they consider not one but a multitude of data analysis techniques. [11][20][21]
• “Thoughtful research prioritizes sound data production by putting energy into the careful planning,
design, and execution of the study (Tong 2019).” [11]
• “Thoughtful research considers the scientific context and prior evidence.” [11]
• “Being thoughtful in our approach to research will lead us to be open in our design, conduct, and
presentation of it.” [11]
• “Being open goes hand in hand with being modest.” “The nexus of openness and modesty is to report
everything while at the same time not concluding anything from a single study with unwarranted
certainty.” [11]
• “Be modest by encouraging others to reproduce your work. Of course, for it to be reproduced readily,
you will necessarily have been thoughtful in conducting the research and open in presenting it.” [11]

What we can/should do regarding the practice of NHST
• As quoted by Tong in [21], “the then-Editor of Science, and now President of the U.S. National
Academy of Science, Marcia McNutt, stated:
At Science, the paradigm is changing. We’re talking about asking authors, ‘Is this
hypothesis testing or exploratory?’ An exploratory study explores new questions rather
than tests an existing hypothesis. But scientists have felt that they had to disguise an
exploratory study as hypothesis testing and that is totally dishonest. I have no problem
with true exploratory science. That is what I did most of my career. But it is important that
scientists call it as such and not try to pass it off as something else. If the result is
important and exciting, we want to publish exploratory studies, but at the same time make
clear that they are generally statistically underpowered, and need to be reproduced.”
• Reporting and interpreting point and interval estimates should become routine practice, and researchers
should seek ways to quantify, visualize, and interpret the potential for error [6][13][20]. “Whatever the statistics
show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not
just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical.
Factors such as background evidence, study design, data quality and understanding of underlying
mechanisms are often more important than statistical measures such as P values or intervals.” [20]
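As a minimal illustration of estimation-centred reporting (made-up data; the t critical value for df = 9 is hard-coded rather than computed), one can report the point estimate with its interval rather than a bare significance verdict:

```python
import math
import statistics

# Hypothetical sample of paired differences (e.g., reaction times in ms).
diffs = [12.0, -3.0, 8.0, 15.0, 6.0, -1.0, 9.0, 11.0, 4.0, 7.0]

n = len(diffs)
mean = statistics.fmean(diffs)
se = statistics.stdev(diffs) / math.sqrt(n)

# Report an estimate with its uncertainty, rather than a bare p value.
t_crit = 2.262                              # t(0.975, df = 9)
ci = (mean - t_crit * se, mean + t_crit * se)

print(f"mean difference: {mean:.1f} ms, 95% CI: ({ci[0]:.1f}, {ci[1]:.1f}) ms")
```

The interval conveys both the estimated size of the effect and how precisely it is known, which a lone “p < 0.05” cannot do.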
Conclusions (your take-home messages):
• Statisticians who have studied the issue agree that NHST is an unjustified practice that
originated from confusions in the early history of modern statistics. [5-12][23][24]
• The term “statistically significant” – don’t say it and don’t use it! In other words, the
concept of “statistical significance” should be abandoned entirely in statistical
analysis practice. [6][7][9][11][20][21]
• It is researchers’ desire to substitute intellectual capital for labour [14]. However, a magic
recipe to replace NHST for statistical inference does not exist, and we should not expect
one to exist. [6][9][19][21][22][24][25]
• “Most scientific research is exploratory in nature.” Formal statistical inference can
play only a limited role in scientific inference; it is statistical thinking that enables good
science. [9-16][21][22][24]
• Context is king in statistics [6][13][14][16][19][20], and the seven-word principle
for performing statistical analysis without reference to statistical significance is:
“Accept uncertainty. Be thoughtful, open, and modest.” [11]
Thank You for your attention
and
Comments or questions are welcome

