
Statistical Society of Australia and New Zealand Statistical Association, 9 July, 2020: SSA + NZSA mini-virtual conference

Please Join Us - Say ‘No’ to Null Hypothesis Significance Testing

Authors: Gang (John) Xie¹, Jason White¹
Affiliations: ¹ Research Office, Charles Sturt University, Wagga Wagga, NSW, Australia

Introduction
In 2019 the American Statistical Association made a call to “stop using the term ‘statistically significant’ entirely” [1]. Through this poster, we would like to deliver the
following (already known but unduly ignored) messages to as many researchers as possible, as another push towards “moving to a world beyond ‘p < 0.05’”.
The Null Hypothesis Significance Testing (NHST) paradigm as practised today should hardly have a place in statistical inference for scientific research, because NHST is
logically indefensible, technically flawed, and practically damaging. The central logic of NHST may be stated as: if an observation is rare under the null
hypothesis, then it can be regarded as evidence against the null hypothesis. However, this logic is not defensible [2][3]. Among many other technical fallacies,
NHST typically dichotomizes a continuous test statistic or measure (e.g., p-value < 0.05) to reach a rejection/acceptance conclusion, which mostly contains
information that is not what researchers are actually looking for [1][2]. Statistical inference in scientific research should, in essence, focus on questions such as “what is
the best estimate of the treatment effect?” and “what is the magnitude of the uncertainty associated with that estimate?”. The question “is the treatment effect
significant?” is not really a statistical question [2][4]. As a matter of fact, the application of NHST has become a cult of statistical significance, which does
not and cannot provide (but is unfortunately misused as providing) the relevant, substantive information needed to answer scientific research questions [3][4].

A typical setting of the NHST paradigm
• Specify H0 and H1.
• Calculate the p-value or a 95% confidence interval (CI).
• Reject H0 if p < 0.05 or if the CI does not contain zero, so that H1 is accepted; otherwise accept H0.
• By accepting H0 we assume/conclude that the difference or treatment effect found in the sample data is due to chance; otherwise, we claim that the difference/effect is ‘real’ or ‘true’.
• Related quantities: Type I error (the α value); Type II error (the β value); statistical power (i.e., 1 − β); sample size; effect size.

What we intend to do with statistical inference: obtain the best estimate(s) of the treatment effect(s) and the magnitude of the uncertainty associated with the estimate(s).
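The decision rule listed above can be sketched in a few lines of code. This is a minimal illustration only: it uses a large-sample normal approximation for the 95% interval (z = 1.96), and the two samples are hypothetical values, not data from the poster.

```python
# Sketch of the NHST decision rule: estimate a mean difference, form an
# approximate 95% interval, then dichotomize on whether it contains zero.
import statistics

def nhst_decision(sample_a, sample_b, z=1.96):
    """Dichotomize a mean-difference estimate, as the NHST paradigm does.

    Uses a large-sample normal approximation (an assumption for illustration).
    """
    diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    se = (statistics.variance(sample_a) / len(sample_a)
          + statistics.variance(sample_b) / len(sample_b)) ** 0.5
    ci = (diff - z * se, diff + z * se)        # approximate 95% interval
    reject_h0 = not (ci[0] <= 0.0 <= ci[1])    # the dichotomization step
    return diff, ci, reject_h0

# Hypothetical samples for illustration only
a = [12.1, 11.8, 13.0, 12.4, 11.5, 12.9, 12.2, 11.7]
b = [10.2, 10.9, 9.8, 10.5, 11.1, 10.0, 10.7, 10.3]
diff, ci, reject = nhst_decision(a, b)
print(f"difference = {diff:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), reject H0: {reject}")
```

Note how the continuous information (the estimate and its interval) is collapsed into a single binary verdict at the last step; this is exactly the dichotomization the poster argues against.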

The NHST paradigm, which came into being around the 1950s, is actually a hybrid of R.A. Fisher’s logic of significance testing and J. Neyman and E.S. Pearson’s logic of hypothesis testing; it gradually became the sine qua non of scientific research [2][3].

[Figure: A hypothetical example of NHST using G*Power 3.1]

What is wrong with NHST


“…, no p-value can reveal the plausibility, presence, truth, or importance of an association or effect. Therefore, a label of statistical significance does not mean or
imply that an association or effect is highly probable, real, true, or important. Nor does a label of statistical nonsignificance lead to the association or effect being
improbable, absent, false, or unimportant. Yet the dichotomization into ‘significant’ and ‘not significant’ is taken as an imprimatur of authority on these characteristics.
… To be clear, the problem is not that of having only two labels. Results should not be trichotomized, or indeed categorized into any number of groups, based on
arbitrary p-value thresholds. Similarly, we need to stop using confidence intervals as another means of dichotomizing (based on whether a null value falls within the
interval). And, to preclude a reappearance of this problem elsewhere, we must not begin arbitrarily categorizing other statistical measures (such as Bayes factors).”
(quoted from [1])
A hypothetical ANOVA case (N = 30, balanced design):
Scenario 1, the three treatment samples are drawn from the same population: Normal(mean = 12, sd = 3.4).
Scenario 2, the samples are drawn from three different populations: Normal(mean = 9.9, sd = 3.4) for trt1; Normal(mean = 11.5, sd = 3.4) for trt2; Normal(mean = 14.3, sd = 3.4) for trt3.

• In general, Pr(Data|H is true) ≠ Pr(H is true|Data). Therefore, NHST does not tell us what we want to know: what we are looking for is Pr(H is true|Data), but NHST provides us with the p-value (i.e., Pr(Data|H is true)). Assuming Pr(H0) = Pr(H1) = 0.5, by Bayes’ theorem we have Pr(H0 is true|p ≥ α) = (1 − α)/[(1 − α) + β] and Pr(H0 is true|p < α) = α/[(1 − β) + α].
• Statistical analysis results based on a single set of sample data can hardly provide confirmatory evidence about scientific findings [6][7].
• Given that all other things are correct (e.g., data quality, model specification, assumption conditions, etc.), the calculation of a p-value depends on at least three elements: the raw effect measure (e.g., mean difference, correlation, odds ratio, etc.); the variance of the effect; and the sample size [7]. In particular, increasing the sample size always decreases the p-value, other things being unchanged [5].
• In real-life cases, however, H0 (i.e., no difference, everything equal) is almost always false, so that the Type I error α = 0 [5][6]. In these cases the calculation of a p-value is conceptually inappropriate, which is the fundamental reason for the large-sample-size dilemma (with a sufficiently large sample size, you can make any result statistically significant! [5]).
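The Bayes-theorem posteriors above are straightforward to compute. The sketch below evaluates both expressions; the particular values α = 0.05 and β = 0.20 (i.e., power 0.8) are illustrative assumptions, not figures from the poster.

```python
# Posterior probability of H0 under equal priors Pr(H0) = Pr(H1) = 0.5:
#   Pr(H0 is true | p >= alpha) = (1 - alpha) / [(1 - alpha) + beta]
#   Pr(H0 is true | p <  alpha) = alpha / [(1 - beta) + alpha]

def posterior_h0_given_nonsig(alpha, beta):
    """Pr(H0 is true | p >= alpha): the 'non-significant' outcome."""
    return (1 - alpha) / ((1 - alpha) + beta)

def posterior_h0_given_sig(alpha, beta):
    """Pr(H0 is true | p < alpha): the 'significant' outcome."""
    return alpha / ((1 - beta) + alpha)

# Illustrative (assumed) values: alpha = 0.05, power = 0.8 so beta = 0.2
alpha, beta = 0.05, 0.20
print(posterior_h0_given_nonsig(alpha, beta))  # ≈ 0.826
print(posterior_h0_given_sig(alpha, beta))     # ≈ 0.059
```

Even at these conventional error rates, a “non-significant” result leaves H0 only about 83% probable, which illustrates why the binary verdict is a poor substitute for the posterior probability researchers actually want.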

Moving to a world beyond ‘p < 0.05’: statistical data analysis without significance testing

[Figure: Analysis of a hypothetical ANOVA case]

• Context is king in statistics! “Whatever the statistics show, it is fine to suggest reasons for your results, but discuss a range of potential explanations, not just favoured ones. Inferences should be scientific, and that goes far beyond the merely statistical. Factors such as background evidence, study design, data quality and understanding of underlying mechanisms are often more important than statistical measures such as P values or intervals.” (quoted from [7])
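The hypothetical ANOVA case described earlier can be reproduced with a short simulation. This is a sketch under stated assumptions: 10 observations per treatment (for N = 30, balanced), a fixed random seed for reproducibility, and the one-way F statistic computed by hand; none of these implementation details come from the poster itself.

```python
# Simulate the two scenarios of the hypothetical ANOVA case (N = 30, balanced)
# and compute the one-way ANOVA F statistic from first principles.
import random
import statistics

def anova_f(groups):
    """One-way ANOVA F statistic: between-group vs within-group mean squares."""
    k = len(groups)                              # number of treatment groups
    n = sum(len(g) for g in groups)              # total sample size
    grand_mean = statistics.mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

random.seed(1)  # assumed seed, for reproducibility only
# Scenario 1: all three treatment samples from the same Normal(12, 3.4) population
scen1 = [[random.gauss(12.0, 3.4) for _ in range(10)] for _ in range(3)]
# Scenario 2: samples from Normal(9.9, 3.4), Normal(11.5, 3.4), Normal(14.3, 3.4)
scen2 = [[random.gauss(mu, 3.4) for _ in range(10)] for mu in (9.9, 11.5, 14.3)]
print(f"F (scenario 1) = {anova_f(scen1):.2f}")
print(f"F (scenario 2) = {anova_f(scen2):.2f}")
```

Running the simulation repeatedly (with different seeds) shows the point of the example: in Scenario 1 any “significant” F is purely a false alarm, while in Scenario 2 the F statistic tends to be larger because the population means genuinely differ.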

A valid statistical data analysis should:

• Define your research question(s) (with reference to a defined target population and sampling/data-collection scheme) and perform your data analysis as planned; provide the point estimates (e.g., mean, median, proportion, correlation, etc.) and interval estimates (e.g., standard deviation, range) of the treatment effects by performing descriptive (numerical and/or graphical) statistical analysis, and interpret your results from the subject-matter context’s perspective [7][8].
• If a statistical inference analysis (e.g., a regression model fitted to the sample data) is also performed, report the point estimate/effect size and the associated interval estimate (e.g., a 95% confidence/credible interval or standard deviation) or the associated p-value, and justify that the fitted model is a good representation/summary of the observed data [4][7][8].
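The descriptive summary recommended above can be as simple as the sketch below: report point estimates (mean, median) and interval estimates (standard deviation, range) rather than a significance verdict. The data values are hypothetical.

```python
# Descriptive summary of one treatment group: point and interval estimates
# instead of a reject/accept decision.
import statistics

def describe(sample):
    """Point estimates (mean, median) and interval estimates (sd, range)."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "sd": statistics.stdev(sample),
        "range": (min(sample), max(sample)),
    }

trt = [11.2, 12.8, 9.9, 13.4, 12.1, 10.7, 12.5, 11.8, 13.0, 10.4]
summary = describe(trt)
print(f"n = {summary['n']}, mean = {summary['mean']:.2f}, "
      f"median = {summary['median']:.2f}, sd = {summary['sd']:.2f}, "
      f"range = {summary['range']}")
```

A summary like this, interpreted in the subject-matter context, carries all the information a p-value would be computed from, without the dichotomization.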

We should not attempt to find or develop a magic alternative to NHST, because it does not exist [3][4].
What-if analysis (NOT confirmatory analysis) is the very nature of statistical inference for a single (nonrepetitive) set of sample data! [8]
References:
[1] Ronald L. Wasserstein, Allen L. Schirm & Nicole A. Lazar (2019). Moving to a World Beyond “p < 0.05”. The American Statistician, Vol. 73, No. S1, 1-19: Editorial.
[2] Steven N. Goodman (1999). Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Annals of Internal Medicine, Vol. 130(12), 995-1004.
[3] Gerd Gigerenzer (1993). The Superego, the Ego, and the Id in Statistical Reasoning. Print publication date: 2002; DOI: 10.1093/acprof:oso/9780195153729.001.0001.
[4] Sander Greenland (2017). Invited Commentary: The Need for Cognitive Science in Methodology. American Journal of Epidemiology, Vol. 186, No. 6.
[5] R.E. Kirk (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
[6] Marks R. Nester (1996). An Applied Statistician’s Creed. Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 45, No. 4, 401-410.
[7] Valentin Amrhein, Sander Greenland & Blake McShane (2019). Retire statistical significance. Nature, Vol. 567, 305: Comment.
[8] Valentin Amrhein, David Trafimow & Sander Greenland (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, Vol. 73, No. S1, 262-270. DOI: 10.1080/00031305.2018.1543137.

Contact details: Dr. John Xie (Statistics Support Officer, Quantitative Consulting Unit)
Phone: +61 2 69332229
Email: gxie@csu.edu.au
Website: https://www.csu.edu.au/qcu

