The Problem With "Magnitude-Based Inference"
KRISTIN L. SAINANI
Division of Epidemiology, Department of Health Research and Policy, Stanford University, Stanford, CA
ABSTRACT
SAINANI, K. L. The Problem with "Magnitude-based Inference." Med. Sci. Sports Exerc., Vol. 50, No. 10, pp. 2166–2176, 2018.

Purpose: A statistical method called "magnitude-based inference" (MBI) has gained a following in the sports science literature, despite concerns voiced by statisticians. Its proponents have claimed that MBI exhibits superior type I and type II error rates compared with standard null hypothesis testing for most cases. I have performed a reanalysis to evaluate this claim. Methods: Using simulation code provided by MBI's proponents, I estimated type I and type II error rates for clinical and nonclinical MBI for a range of effect sizes, sample sizes, and smallest important effects. I plotted these results in a way that makes transparent the empirical behavior of MBI. I also reran the simulations after correcting mistakes in the definitions of type I and type II error provided by MBI's proponents. Finally, I confirmed the findings mathematically; and I provide general equations for calculating MBI's error rates without the need for simulation. Results: Contrary to what MBI's proponents have claimed, MBI does not exhibit "superior" type I and type II error rates to standard null hypothesis testing. As expected, there is a tradeoff between type I and type II error. At precisely the small-to-moderate sample sizes that MBI's proponents deem "optimal," MBI reduces the type II error rate at the cost of greatly inflating the type I error rate—to two to six times that of standard hypothesis testing. Conclusions: Magnitude-based inference exhibits worrisome empirical behavior. In contrast to standard null hypothesis testing, which has predictable type I error rates, the type I error rates for MBI vary widely depending on the sample size and choice of smallest important effect, and are often unacceptably high. Magnitude-based inference should not be used. Key Words: TYPE I ERROR, TYPE II ERROR, HYPOTHESIS TESTING, CONFIDENCE INTERVALS, STATISTICS
I was recently asked to weigh in on a statistical debate that has been brewing in the sports science literature. Some researchers have advocated the use of a new statistical method they are calling "magnitude-based inference" (MBI) as an alternative to standard hypothesis testing (1–5). The method is being used in practice in the sports science literature (4,6), which makes it imperative to resolve this debate.

Several statisticians have criticized MBI due to its lack of a sound theoretical framework (7–9). In a 2015 article in Medicine & Science in Sports & Exercise, Welsh and Knight provided a statistical review of MBI in which they identified theoretical problems with the method, including that it creates unacceptably high false-positive rates (8). In response, MBI's proponents, Hopkins and Batterham (4), published a rebuttal in Sports Medicine 2016 in which they claim that MBI "outperforms" standard null hypothesis testing in terms of both type I (false-positive) and type II (false-negative) error rates for most cases. At face value, this conclusion is dubious. There is a tradeoff between type I and type II error: when you improve one, you sacrifice the other. Thus, you do not need to be a statistician to immediately be skeptical of their paper. Indeed, their article is flawed in both its methods and conclusions.

First, Hopkins and Batterham (4) have obscured the systematic behavior of MBI in the way they presented their results. I have reproduced the exact numbers they report in their article,

Address for correspondence: Kristin L. Sainani, Ph.D., Department of Health Research and Policy, 150 Governor's Lane, HRP Redwood Bldg,

SPECIAL COMMUNICATIONS
MAGNITUDE-BASED INFERENCE: A BRIEF SYNOPSIS

The motivation behind MBI is a good one. Hopkins and Batterham encourage researchers to pay more attention to confidence intervals and effect sizes. By doing so, researchers can avoid many common statistical errors, such as mistakenly concluding that a significant but trivially small effect is clinically important (10).

In MBI, researchers start by defining a trivial range, in which effect sizes are too small to care about. For example, researchers might declare that changes in resting heart rate within 1 bpm are trivial. Effects outside of this range are either beneficial (when resting heart rate is lowered) or harmful (when resting heart rate is increased). Researchers then interpret their confidence intervals relative to these ranges. For example, if a supplement reduces resting heart rate a statistically significant amount but the 95% confidence interval is −0.9 to −0.1 bpm, one should conclude that the supplement has only a trivial biologic effect. Conversely, if the reduction in resting heart rate is statistically nonsignificant, but the 95% confidence interval is −10 to +0.1 bpm—which predominantly spans the beneficial range—one should not conclude that the supplement is ineffective. This is a good approach.

Where Hopkins and Batterham's method breaks down is when they go beyond simply making qualitative judgments like this and advocate translating confidence intervals into probabilistic statements such as: the effect of the supplement is "very likely trivial" or "likely beneficial." This requires interpreting confidence intervals incorrectly, as if they were Bayesian credible intervals. For example, they incorrectly interpret a 95% confidence interval that falls completely within the trivial range as meaning that there is a 95% chance that the effect is trivial. Others have pointed out the problems with this misinterpretation (7–9); I will avoid a lengthy discussion of this issue here, because my primary goal is to demonstrate the empirical behavior of MBI when implemented as Hopkins and Batterham propose.

Magnitude-based inference provides probabilities that the effect is beneficial, trivial, and harmful. Then these probabilities are interpreted using the following scale: <0.5% = most unlikely; 0.5% to 5% = very unlikely; 5% to 25% = unlikely; 25% to 75% = possibly; 75% to 95% = likely; 95% to 99.5% = very likely; >99.5% = most likely (2).

For example, suppose our supplement study yields a 90% confidence interval of −8 bpm to −1 bpm. This is translated to: "There is a 90% probability that the true effect lies between −8 bpm and −1 bpm." This leaves a 5% chance that the true effect is > −1 bpm and thus not beneficial. So, MBI concludes: "There is a 95% chance that the supplement is beneficial" or the supplement is "very likely beneficial."

Hopkins and Batterham describe two versions of MBI: clinical and nonclinical (4). I will consider each of these cases separately. I believe that the difference is that clinical MBI entails a one-sided test whereas nonclinical MBI entails a two-tailed test; I will delve into this distinction later.

Clinical MBI. When you are interested in testing whether a clinical intervention is beneficial or not, Hopkins and Batterham call this "clinical MBI." In clinical MBI, users define a trivial range by setting thresholds for harm and benefit; these are usually assigned the same value, but they don't have to be. Hopkins and Batterham contend that an intervention is implementable if it is at least "possibly" beneficial (≥25% chance of benefit) and "most unlikely" harmful (<0.5% risk of harm). Equivalently, the upper limit of the 50% confidence interval (UCL50) must equal or exceed the threshold for benefit and the lower limit of the 99% confidence interval (LCL99) must be above the threshold for harm. For example, if the thresholds for benefit and harm are +0.2 standard deviations and −0.2 standard deviations, respectively, the intervention would be implementable when UCL50 ≥ +0.2 and LCL99 > −0.2.

It is possible to change the "minimum chance of benefit" and "maximum risk of harm" from their defaults of 25% and 0.5%. For example, if you want to require a 75% minimum chance of benefit (i.e., "likely" beneficial), then the intervention would be implementable when LCL50 ≥ +0.2 and LCL99 > −0.2.

Nonclinical MBI. When you are simply trying to determine whether an effect exists (either positive or negative), Hopkins and Batterham call this nonclinical MBI and refer to effects as positive or negative rather than beneficial and harmful. Researchers can claim various degrees of certainty for a positive effect when the 90% confidence interval excludes the negative range (LCL90 > −0.2, for example, corresponding to <5% chance of a negative effect) but overlaps the positive range: "unlikely" positive is when just the upper limit of the 90% confidence interval (UCL90) makes it into the positive range; "possibly" positive is when the upper limit of the 50% confidence interval (UCL50) makes it into the positive range; "likely" positive is when the lower limit of the 50% confidence interval (LCL50) makes it into the positive range; and "very likely" positive is when the lower limit of the 90% confidence interval (LCL90) makes it into the positive range. Negative effects follow the same pattern.

Magnitude-based inference's proponents advocate reporting the probabilistic statements and allowing individuals to judge how much uncertainty they can tolerate for a given decision. For example, a laboratory scientist might choose to move a drug to clinical tests if it shows at least a "likely" effect in laboratory experiments.

Mathematical summary. In summary, MBI imposes two constraints: a constraint on harm (or negative effects) and a constraint on benefit (or positive effects). Each constraint is determined by two parameters: the threshold for harm/benefit, and the maximum risk of harm/minimum chance of benefit. Following Welsh and Knight (8), I will use the following symbols for these parameters:

G_h = maximum risk of harm
G_b = minimum chance of benefit
C_h = threshold for harm
C_b = threshold for benefit
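The probability scale and decision rules just described can be made concrete in code. The sketch below is my own illustration, not the proponents' spreadsheet: it treats the sampling distribution as if it were a posterior (the misinterpretation noted above), uses normal quantiles in place of t quantiles, and all function and parameter names are invented for the example.

```python
from statistics import NormalDist

Z = NormalDist()

def mbi_probabilities(obs, se, benefit=0.2, harm=-0.2):
    """(Mis)read the sampling distribution N(obs, se) as a posterior,
    as MBI does, and return the chances of benefit, triviality, harm."""
    p_benefit = 1 - Z.cdf((benefit - obs) / se)
    p_harm = Z.cdf((harm - obs) / se)
    return p_benefit, 1 - p_benefit - p_harm, p_harm

def label(p):
    """Map a probability onto MBI's qualitative scale."""
    for cut, name in [(0.005, "most unlikely"), (0.05, "very unlikely"),
                      (0.25, "unlikely"), (0.75, "possibly"),
                      (0.95, "likely"), (0.995, "very likely")]:
        if p < cut:
            return name
    return "most likely"

def clinical_mbi_implementable(obs, se, benefit=0.2, harm=-0.2,
                               min_benefit=0.25, max_harm=0.005):
    """Default clinical rule: at least 'possibly' beneficial and 'most
    unlikely' harmful; equivalent to UCL50 >= benefit and LCL99 > harm."""
    ucl50 = obs + Z.inv_cdf(1 - min_benefit) * se  # upper limit, 50% CI
    lcl99 = obs - Z.inv_cdf(1 - max_harm) * se     # lower limit, 99% CI
    return ucl50 >= benefit and lcl99 > harm
```

With obs = 0.3 and se = 0.1, for instance, the chance of benefit comes out near 0.84 ("likely"), and the default clinical rule declares the effect implementable; with obs = 0 and se = 0.5 the interval spans both thresholds and the rule refuses.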
It is possible to write general mathematical formulas for the constraints. I will focus on clinical MBI here; nonclinical MBI is similar but considers both directions. For the specific case of a two-group comparison of means with equal variances and equal group sizes of n per group, an effect is implementable if the following conditions are met (for full derivation, see Document, Supplemental Digital Content 1, derivation of the constraints, http://links.lww.com/MSS/B270):

1. Constraint on benefit:

   observed value + T_{(1−G_b), 2n−2} · √(2s²/n) ≥ C_b  ⟹  observed value ≥ C_b − T_{(1−G_b), 2n−2} · √(2s²/n)

2. Constraint on harm:

   observed value − T_{(1−G_h), 2n−2} · √(2s²/n) > −C_h  ⟹  observed value > −C_h + T_{(1−G_h), 2n−2} · √(2s²/n)

Note that we can simplify the above to:

   observed value > max( −C_h + T_{(1−G_h), 2n−2} · √(2s²/n), C_b − T_{(1−G_b), 2n−2} · √(2s²/n) )

In other words, the observed value must be greater than whichever constraint is bigger for a given example.

Magnitude-based inference is based on the confidence intervals used in standard hypothesis testing. Thus, not surprisingly, the two methods converge. If you set the thresholds for harm/benefit to 0, clinical MBI just reverts to a one-sided null hypothesis test with a significance level of G_h (because, presumably, the minimum chance of benefit will always be set higher than the maximum risk of harm).

METHODS

Type I errors are false positives and can occur only when the true effect is trivial. Type II errors are false negatives and can occur only when the true effect is nontrivial. (Note: For a one-sided test, type I errors occur when the effect is trivial or in the direction you do not care about, and type II errors occur when the true effect is real and in the direction you care about.)

To estimate type I and type II error rates, Hopkins and Batterham (4) ran simulations for a particular scenario: comparing a continuous outcome, measured in standard deviation units, between two groups of athletes in a pre–post design (4). Simulations used the defaults for minimum chance of benefit (25% for clinical MBI, varying for nonclinical MBI) and maximum risk of harm (0.5% for clinical MBI, 5% for nonclinical MBI). I used 100,000 or 30,000 repetitions in my simulations, depending on the size of the simulation.

I then corrected mistakes I found in Hopkins and Batterham's definitions of type I and type II error. Table 1 shows my corrected definitions.

First, Hopkins and Batterham treat an "unclear" result—when the confidence intervals are so wide that they span from harmful to beneficial—as error-free. However, when a study misses a real effect because the sample size is too small, this is clearly a type II error.

To illustrate this point further, consider the case when you have effect sizes that straddle the border of trivial—for example, 0.199 and 0.2 when the threshold for benefit is 0.2. The errors associated with these two effect sizes must be mirror images of one another. A correct call at 0.199 must be an incorrect call at 0.2, and vice versa. For example, if you correctly dismiss an effect at 0.199, you would have wrongly dismissed it at 0.2 (a type II error). My plots of type I and type II errors for these "border cases" are indeed mirror images of each other, whereas Hopkins and Batterham's plots are not (see Figure, Supplemental Digital Content 2, simulation results for effect sizes 0.199 and 0.2, http://links.lww.com/MSS/B271). Correctly counting unclear cases as type II errors fixes their plots.

Second, it appears that Hopkins and Batterham intend clinical MBI as a one-sided test. The goal is to determine whether an intervention is beneficial or not beneficial. There is no distinction between an inference of harmful and an inference of trivial—in both cases, you will not implement the intervention. This is consistent with a one-sided test for benefit.

Moreover, Hopkins and Batterham are confused about what to call cases in which there is a true nontrivial effect, but an inference is made in the wrong direction (i.e., inferring that a beneficial effect is harmful or that a harmful effect is beneficial). In the text, they switch between calling these type I and type II errors; and, in their calculations, they treat […] Hopkins and Batterham's definitions for type I and type II error produce inaccurate values for these known cases.

Finally, I explored MBI mathematically: I worked out the derivation of Hopkins and Batterham's sample size formula; and I derived general mathematical equations for the type I and type II error rates for clinical MBI for the problem of comparing two means.

TABLE 1. My corrections to Hopkins and Batterham's definitions of type I and type II error for nonclinical MBI and standard hypothesis testing (their definitions appear in Figures 2a and 1a of (4)).

Inference | True effect: Positive | True effect: Trivial | True effect: Negative
--- | --- | --- | ---
Nonclinical MBI: | | |
Positive | None | Type I | Type III
Trivial-to-positive (a) | Partial type II (b) | Partial type I | Partial type III
Trivial | Type II | None | Type II
Trivial-to-negative | Partial type III | Partial type I | Partial type II
Negative | Type III | Type I | None
Unclear | Type II | None | Type II
Standard hypothesis testing: | | |
Significant, positive | None | Type I | Type III
Nonsignificant | Type II | None | Type II
Significant, negative | Type III | Type I | None

(a) The trivial-to-positive category includes inferences of "unlikely," "possibly," and "likely" positive. The trivial-to-negative category is similar (4).
(b) Partial errors are given weights of 0.15 for an "unlikely" inference (5%–25% chance), 0.50 for a "possibly" inference (25%–75% chance), and 0.85 for a "likely" inference (75%–95% chance), as appropriate.
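These corrected definitions are mechanical enough to encode directly. The lookup below is my own sketch of the nonclinical rows in the table above (the weighted "partial" rows are omitted for brevity, and the names are invented for the example):

```python
# Corrected error classification for nonclinical MBI (sketch of the
# table above). Keys: (inference, true-effect category); values: the
# error committed, or None for a correct call.
CORRECTED_ERRORS = {
    ("positive", "positive"): None,
    ("positive", "trivial"): "type I",
    ("positive", "negative"): "type III",
    ("trivial", "positive"): "type II",
    ("trivial", "trivial"): None,
    ("trivial", "negative"): "type II",
    ("negative", "positive"): "type III",
    ("negative", "trivial"): "type I",
    ("negative", "negative"): None,
    # The key correction: an "unclear" result counts as a type II error
    # whenever a real (nontrivial) effect exists.
    ("unclear", "positive"): "type II",
    ("unclear", "trivial"): None,
    ("unclear", "negative"): "type II",
}

def error_type(inference, truth):
    """Return the error for a given inference/truth pair (None = correct)."""
    return CORRECTED_ERRORS[(inference, truth)]
```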
RESULTS

Clinical MBI

Figure 1 shows the type I error for clinical MBI and standard hypothesis testing when the true effect is null-to-trivial in the beneficial direction (0, 0.1, 0.199). Figure 1 reveals an important characteristic of MBI that Hopkins and Batterham's paper obscures: MBI causes peaks of false positives. For specific sample sizes, the false positive rate spikes to double or triple that of standard hypothesis testing. Note that this pattern is the same whether I use Hopkins and Batterham's incorrect definitions (panels A, C, and E) or the correct definitions (panels B, D, and F), as their definitional errors did not impact this range of true effects.

FIGURE 1—Type I error rates as a function of sample size for clinical MBI (red) and standard hypothesis testing (blue) when the true effect is zero or trivial in the beneficial direction. Panels A to F use a threshold for harm/benefit of 0.2 standard deviations. Panels G and H show that the type I error rate for clinical MBI depends on the value chosen for the threshold for harm/benefit (horizontal reference line at 5% is the type I error rate for standard hypothesis testing at an effect size of 0). Results are similar whether using Hopkins and Batterham's definitions (A, C, E, G) or my corrected definitions (B, D, F, H). Simulations used 100,000 repetitions for A–F, and 30,000 for G–H.

It is easy to explain why these peaks occur. When the true effect is 0 to 0.199, observed effects often end up in or near this range. At small sample sizes, confidence intervals around these observed effects are so wide that they cross into the harmful range (LCL99 ≤ −0.2). As sample size increases, however, the confidence intervals narrow out of the harmful range (LCL99 > −0.2), while still overlapping the beneficial range (UCL50 ≥ +0.2)—resulting in a sharp increase in type I error […]
[…] these peaks depends on the value chosen for the threshold for harm/benefit. 3) Nonclinical MBI has lower type II error rates than standard hypothesis testing at the same sample sizes at which the type I error rates peak. 4) The difference in type II error between nonclinical MBI and standard hypothesis testing is most pronounced when the true effect is close to the trivial threshold and less pronounced when the true effect is larger.

FIGURE 2—Representative case where clinical MBI incurs a false positive but standard hypothesis testing does not (true effect size is 0). If the threshold for harm/benefit is 0.2, clinical MBI declares an intervention "implementable" when LCL99 > −0.2 and UCL50 ≥ 0.2; standard hypothesis testing (one-sided) declares an effect significant if LCL90 > 0. At small-to-moderate sample sizes, clinical MBI incurs more false positives because LCL99 > −0.2 is a less stringent constraint than LCL90 > 0.

Where MBI and Standard Hypothesis Testing Converge

If you reduce the harm/benefit threshold to 0, then MBI reverts to a one-sided null hypothesis test with a significance level equal to the maximum risk of harm. Thus, in our simulations, clinical
MBI should revert to a one-sided null hypothesis test with a significance level of 0.005; and nonclinical MBI should revert to a two-sided null hypothesis test with a significance level of 0.10. Indeed, simulations show that the type I and type II errors for clinical and nonclinical MBI are as expected for a one-sided null hypothesis test with a significance level of 0.005 and a […]

FIGURE 3—Type I and type II error rates as a function of sample size for clinical MBI (red) and standard hypothesis testing (blue). The threshold for benefit is 0.2 standard deviations, so a true effect of 0.199 is trivial (A) and true effects of 0.2 and 0.3 are beneficial (B, C). Note that the plots for 0.199 and 0.2 are mirror images of one another, as expected for effect sizes that straddle the trivial threshold. Plots D–F show the type I error when the true effect is in the harmful direction, whether by a trivial (D) or nontrivial (E, F) amount. Simulations used 100,000 repetitions and my corrected definitions for the errors.

FIGURE 4—Type I and type II error rates as a function of sample size for nonclinical MBI (green) and standard hypothesis testing (blue). Panels A–E use a threshold for benefit/harm of 0.2 standard deviations. Panel F shows that the type I error rates for nonclinical MBI depend on the value chosen for the threshold for harm/benefit (horizontal reference line at 5% is the type I error rate for standard hypothesis testing at an effect size of 0). Simulations used 100,000 repetitions for A–E and 30,000 for F, and my corrected definitions for the errors.

[…] type I error—these peaks occur precisely at the point at which the constraints on harm and benefit are equally important.

General equations for MBI's error rates. I derived general mathematical equations for the type I and type II error rates of clinical MBI for the case of comparing two means. These equations assume equal group sizes but can be generalized to unequal group sizes as described above (for full derivation see: Document, Supplemental Digital Content 7, math derivation, http://links.lww.com/MSS/B276):

G_h = maximum risk of harm
G_b = minimum chance of benefit
C_h = threshold for harm
C_b = threshold for benefit

[…] making it easy to calculate the type I and type II error rates for many combinations of parameters, and allowing further exploration of MBI's behavior. For example, I changed the minimum chance of benefit from 25% ("possibly" beneficial) to 75% ("likely" beneficial). For an effect size of 0, the type I error is low—peaking at 4.9% when n = 19 per group. However, for an effect size of 0.2, the type II error is at least 75% for all sample sizes; and for an effect size of 0.3, MBI requires 165 participants per group to achieve a type II error of 20%, whereas standard hypothesis testing requires only 50 per group (see Figure, Supplemental Digital Content 8, simulation for minimum chance of benefit of 75%, http://links.lww.com/MSS/B277).

I have provided a SAS program that calculates type I and type II error for clinical MBI (see Document, Supplemental Digital Content 9, SAS implementation of error rate formulas, http://links.lww.com/MSS/B278).

FIGURE 5—Comparison of mathematically predicted values (green) to simulated values (red) for the type I and type II error rates of clinical MBI for various true effect sizes. Simulations used 100,000 repetitions and my corrected definitions for the errors.
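For readers without SAS, the same error rates can be approximated in a few lines. This is my own sketch, not the supplemental program: it plugs the two constraints derived earlier into a normal approximation (known variance, z in place of t quantiles), so values will differ slightly from the exact formulas.

```python
from statistics import NormalDist

Z = NormalDist()

def p_implementable(mu, n, sd=1.0, Ch=0.2, Cb=0.2, Gh=0.005, Gb=0.25):
    """Probability that clinical MBI declares an effect implementable
    when the true effect is mu (two groups of n, common SD sd)."""
    se = (2 * sd ** 2 / n) ** 0.5
    # the observed value must exceed the larger of the two constraints
    cutoff = max(-Ch + Z.inv_cdf(1 - Gh) * se,
                 Cb - Z.inv_cdf(1 - Gb) * se)
    return 1 - Z.cdf((cutoff - mu) / se)

def type1(n, **kw):
    """Type I error rate when the true effect is null (trivial)."""
    return p_implementable(0.0, n, **kw)

def type2(n, mu=0.3, **kw):
    """Type II error rate when the true effect mu is beneficial."""
    return 1 - p_implementable(mu, n, **kw)
```

Under this approximation, type1(100) comes out roughly 0.12 while type1(10) and type1(400) stay under 0.02, reproducing the peak; type2(200) is far below type2(20).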
DISCUSSION

Though Hopkins and Batterham claim that MBI has "superior" error rates compared with standard hypothesis testing for most cases (4), this is false. MBI exhibits very specific tradeoffs between type I and type II error. Magnitude-based inference creates peaks in the false positive rate and corresponding dips in the false negative rate at specific sample sizes. Sample size calculators provided by MBI's creators are tuned to find these peaks—which typically occur at small-to-moderate sample sizes. At these peaks, the type I error rates are two to six times that of standard hypothesis testing. The use of MBI may therefore result in a proliferation of false positive results.

In their Sports Medicine paper, Hopkins and Batterham do acknowledge inflated type I error rates in one specific case—clinical MBI when the effect is marginally trivial in the beneficial direction (4). However, due to incorrect definitions of type I and type II error, they failed to recognize that type I error is actually inflated in all cases at their "optimal" sample sizes (for null-to-trivial effects in both directions and for nonclinical MBI). The increase in false positives occurs because MBI's constraint on harm—which dominates at small-to-moderate sample sizes—is less stringent than the corresponding constraint in standard hypothesis testing (as illustrated in Fig. 2). Incorrect definitions also led Hopkins and Batterham to dramatically underestimate type II error in several of their simulations.

Hopkins and Batterham might argue that one can simply change the minimum chance of benefit, for example from a "possibly" inference (25%) to a "likely" inference (75%), to reduce the type I error. This indeed controls the type I error rate, but greatly increases the type II error rate, meaning that clinical MBI will require much larger sample sizes than standard hypothesis testing to achieve comparable statistical power (and for true effects on the border of trivial, the statistical power will never be higher than 25%).

Whereas standard hypothesis testing has predictable type I error rates, MBI has type I error rates that vary greatly depending on the sample size; choice of thresholds for harm/benefit; and choice of maximum risk of harm/minimum chance of benefit. This is problematic because, unless researchers calculate and report the type I error for every application, this will always be hidden to readers. Furthermore, the dependence on the thresholds for harm/benefit as well as the maximum risk of harm/minimum chance of benefit makes it easy to game the system. A researcher could tweak these values until they get an inference they like. Hopkins and Batterham dismiss this issue by saying: "Researchers should justify a value within a published protocol in advance of data collection, to show they have not simply chosen a value that gives a clear outcome with the data. Users of NHST [Null Hypothesis Significance Testing] are not divested of this responsibility, as the smallest important effect informs sample size" (4). But there's an obvious difference: you cannot change the sample size once the study is done, but it is easy to fiddle with MBI's harm and benefit parameters. And, though it would be ideal for researchers to publish protocols ahead of time, in reality they rarely do (14).

There is no doubt that standard hypothesis testing has pitfalls. Too often, researchers place undue emphasis on […]

[…] equal the minimum chance of benefit only if the true effect is equal to the threshold for benefit (C_b). However, because of the conflation of terms, the users may mistakenly infer that they are setting the type I and type II errors at fixed values. Moreover, users may be left with the false impression that they are controlling the overall type I error when, in fact, they are controlling only the type I error from inferring a harmful effect is beneficial, which accounts for only a tiny fraction of type I errors; they are failing to constrain the predominant source of type I error, from inferring that a trivial effect is beneficial. Users (or potential users) of MBI are encouraged to use my general equations to explore the true type I and type II error rates for MBI for different scenarios.

[…] undesirable systematic behavior and should not be used. As Welsh and Knight (8) have already pointed out, MBI should be replaced with a fully Bayesian approach or should simply be scrapped in favor of making qualitative statements about confidence intervals. In addition, a one-sided null hypothesis test for benefit—interpreted alongside the corresponding confidence interval—would achieve most of the objectives of clinical MBI while properly controlling type I error.

The author did not receive financial support and has no conflicts of interest to disclose related to this work. Thanks to Matthew Sigurdson for his thoughtful comments on the paper. The results of the study are presented clearly, honestly, and without fabrication, falsification, or inappropriate data manipulation. The results of the present study do not constitute endorsement by ACSM.
REFERENCES

1. Hopkins WG. Estimating sample size for magnitude-based inferences. Sportscience. 2006;10:63–70.
2. Hopkins WG. A spreadsheet for deriving a confidence interval, mechanistic inference and clinical inference from a P value. Sportscience. 2007;11:16–20.
3. Hopkins WG, Marshall SW, Batterham AM, Hanin J. Progressive statistics for studies in sports medicine and exercise science. Med Sci Sports Exerc. 2009;41(1):3–13.
4. Hopkins WG, Batterham AM. Error rates, decisive outcomes and publication bias with several inferential methods. Sports Med. 2016;46(10):1563–73.
5. Buchheit M. The numbers will love you back in return—I promise. Int J Sports Physiol Perform. 2016;11(4):551–4.
6. Claus GM, Redkva PE, Brisola GMP, et al. Beta-alanine supplementation improves throwing velocities in repeated sprint ability and 200-m swimming performance in young water polo players. Pediatr Exerc Sci. 2017;29(2):203–12.
7. Barker RJ, Schofield MR. Inference about magnitudes of effects. Int J Sports Physiol Perform. 2008;3(4):547–57.
8. Welsh AH, Knight EJ. "Magnitude-based inference": a statistical review. Med Sci Sports Exerc. 2015;47(4):874–84.
9. Butson M. Will the numbers really love you back: re-examining magnitude-based inference. [Internet] 2018 [cited March 17, 2018]. Available from: https://osf.io/yvj5r/.
10. Sainani KL. Clinical versus statistical significance. PM&R. 2012;4(6):442–5.
11. Leventhal L, Huynh CL. Directional decisions for two-tailed tests: power, error rates, and sample size. Psychol Methods. 1996;1(3):278–92.
12. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale (NJ): Lawrence Erlbaum Associates; 1988.
13. Hopkins WG. Sample Size Estimation for Research Grants and Institutional Review Boards [PowerPoint slides]. 2008 [cited March 17, 2018]. Available from: https://view.officeapps.live.com/op/view.aspx?src=http://www.sportsci.org/2006/SampleSizeEstimation.ppt.
14. Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JP. Reproducible research practices and transparency across the biomedical literature. PLoS Biol. 2016;14(1):e1002333.
15. Nuzzo R. Scientific method: statistical errors. Nature. 2014;506(7487):150–2.
16. Sainani KL. Getting the right answer: four statistical principles. PM&R. 2017;9(9):933–7.
17. Gurrin LC, Kurinczuk JJ, Burton PR. Bayesian statistics in medical research: an intuitive alternative to conventional data analysis. J Eval Clin Pract. 2000;6(2):193–204.
18. Shakespeare TP, Gebski VJ, Veness MJ, Simes J. Improving interpretation of clinical studies by use of confidence levels, clinical significance curves, and risk-benefit contours. Lancet. 2001;357(9265):1349–53.
19. Hopkins WG, Batterham AM. An imaginary Bayesian monster. Int J Sports Physiol Perform. 2008;3(4):411–2.
20. Curran-Everett D. Explorations in statistics: confidence intervals. Adv Physiol Educ. 2009;33(2):87–90.