
The Problem with "Magnitude-based Inference"

KRISTIN L. SAINANI

Division of Epidemiology, Department of Health Research and Policy, Stanford University, Stanford, CA

ABSTRACT
SAINANI, K. L. The Problem with "Magnitude-based Inference." Med. Sci. Sports Exerc., Vol. 50, No. 10, pp. 2166–2176, 2018.
Purpose: A statistical method called "magnitude-based inference" (MBI) has gained a following in the sports science literature, despite concerns
voiced by statisticians. Its proponents have claimed that MBI exhibits superior type I and type II error rates compared with standard null
hypothesis testing for most cases. I have performed a reanalysis to evaluate this claim. Methods: Using simulation code provided by MBI's
proponents, I estimated type I and type II error rates for clinical and nonclinical MBI for a range of effect sizes, sample sizes, and smallest
important effects. I plotted these results in a way that makes transparent the empirical behavior of MBI. I also reran the simulations after
correcting mistakes in the definitions of type I and type II error provided by MBI's proponents. Finally, I confirmed the findings mathematically,
and I provide general equations for calculating MBI's error rates without the need for simulation. Results: Contrary to what MBI's proponents
have claimed, MBI does not exhibit "superior" type I and type II error rates to standard null hypothesis testing. As expected, there is a tradeoff between
type I and type II error. At precisely the small-to-moderate sample sizes that MBI's proponents deem "optimal," MBI reduces the type II error rate at
the cost of greatly inflating the type I error rate—to two to six times that of standard hypothesis testing. Conclusions: Magnitude-based inference
exhibits worrisome empirical behavior. In contrast to standard null hypothesis testing, which has predictable type I error rates, the type I error rates for
MBI vary widely depending on the sample size and choice of smallest important effect, and are often unacceptably high. Magnitude-based inference
should not be used. Key Words: TYPE I ERROR, TYPE II ERROR, HYPOTHESIS TESTING, CONFIDENCE INTERVALS, STATISTICS

I was recently asked to weigh in on a statistical debate that has been brewing in the sports science literature. Some researchers have advocated the use of a new statistical method they are calling "magnitude-based inference" (MBI) as an alternative to standard hypothesis testing (1–5). The method is being used in practice in the sports science literature (4,6), which makes it imperative to resolve this debate.

Several statisticians have criticized MBI due to its lack of a sound theoretical framework (7–9). In a 2015 article in Medicine & Science in Sports & Exercise, Welsh and Knight provided a statistical review of MBI in which they identified theoretical problems with the method, including that it creates unacceptably high false-positive rates (8). In response, MBI's proponents, Hopkins and Batterham (4), published a rebuttal in Sports Medicine in 2016 in which they claim that MBI "outperforms" standard null hypothesis testing in terms of both type I (false-positive) and type II (false-negative) error rates for most cases. At face value, this conclusion is dubious. There is a tradeoff between type I and type II error: when you improve one, you sacrifice the other. Thus, you do not need to be a statistician to be immediately skeptical of their paper. Indeed, their article is flawed in both its methods and conclusions.

First, Hopkins and Batterham (4) have obscured the systematic behavior of MBI in the way they presented their results. I have reproduced the exact numbers they report in their article, but have regraphed them in a more transparent and informative way. This single change reveals the fundamental problem with MBI. Second, Hopkins and Batterham have incorrectly defined type I and type II error. When I correct these mistakes, I show that the problem with MBI holds for all cases. Finally, I derive general mathematical equations for the type I and type II error rates for MBI; these equations confirm the findings from the simulations.

The problem boils down to this: MBI creates peaks of false positives at specific sample sizes, and MBI's creators provide sample size calculators (1) that specifically find these peaks. For example, for a particular statistical comparison, Hopkins and Batterham (4) conclude that 50 participants per group is the "optimal" sample size when using MBI. It turns out that 50 per group is precisely where the false-positive rate peaks for that case.

Address for correspondence: Kristin L. Sainani, Ph.D., Department of Health Research and Policy, 150 Governor's Lane, HRP Redwood Bldg, Stanford, CA 94305; E-mail: kcobb@stanford.edu. Submitted for publication December 2017. Accepted for publication April 2018. Supplemental digital content is available for this article; direct URL citations appear in the printed text and are provided in the HTML and PDF versions of this article on the journal's Web site (www.acsm-msse.org). DOI: 10.1249/MSS.0000000000001645

MAGNITUDE-BASED INFERENCE: A BRIEF SYNOPSIS

The motivation behind MBI is a good one. Hopkins and Batterham encourage researchers to pay more attention to confidence intervals and effect sizes. By doing so, researchers can avoid many common statistical errors, such as mistakenly concluding that a significant but trivially small effect is clinically important (10).

In MBI, researchers start by defining a trivial range, in which effect sizes are too small to care about. For example, researchers might declare that changes in resting heart rate within 1 bpm are trivial. Effects outside of this range are either beneficial (when resting heart rate is lowered) or harmful (when resting heart rate is increased). Researchers then interpret their confidence intervals relative to these ranges. For example, if a supplement reduces resting heart rate by a statistically significant amount but the 95% confidence interval is −0.9 to −0.1 bpm, one should conclude that the supplement has only a trivial biologic effect. Conversely, if the reduction in resting heart rate is statistically nonsignificant, but the 95% confidence interval is −10 to +0.1 bpm—which predominantly spans the beneficial range—one should not conclude that the supplement is ineffective. This is a good approach.

Where Hopkins and Batterham's method breaks down is when they go beyond simply making qualitative judgments like this and advocate translating confidence intervals into probabilistic statements such as: the effect of the supplement is "very likely trivial" or "likely beneficial." This requires interpreting confidence intervals incorrectly, as if they were Bayesian credible intervals. For example, they incorrectly interpret a 95% confidence interval that falls completely within the trivial range as meaning that there is a 95% chance that the effect is trivial. Others have pointed out the problems with this misinterpretation (7–9); I will avoid a lengthy discussion of this issue here, because my primary goal is to demonstrate the empirical behavior of MBI when implemented as Hopkins and Batterham propose.

Magnitude-based inference provides probabilities that the effect is beneficial, trivial, and harmful. Then these probabilities are interpreted using the following scale: <0.5% = most unlikely; 0.5% to 5% = very unlikely; 5% to 25% = unlikely; 25% to 75% = possibly; 75% to 95% = likely; 95% to 99.5% = very likely; >99.5% = most likely (2).

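This scale is straightforward to encode. The following Python sketch is my own illustration, not part of MBI's published tools; the function name is mine, and because the published scale gives ranges only, the handling of values falling exactly on a cutoff is an assumption.

```python
# A sketch of MBI's qualitative probability scale (2). The function name and
# the handling of boundary values are my own; the source gives ranges only.
def mbi_descriptor(p: float) -> str:
    """Map a probability of benefit/triviality/harm (0-1) to MBI's label."""
    cutoffs = [(0.005, "most unlikely"), (0.05, "very unlikely"),
               (0.25, "unlikely"), (0.75, "possibly"),
               (0.95, "likely"), (0.995, "very likely")]
    for cutoff, label in cutoffs:
        if p < cutoff:
            return label
    return "most likely"

print(mbi_descriptor(0.96))  # "very likely"
```
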
For example, suppose our supplement study yields a 90% confidence interval of −8 bpm to −1 bpm. This is translated to: "There is a 90% probability that the true effect lies between −8 bpm and −1 bpm." This leaves a 5% chance that the true effect is > −1 bpm and thus not beneficial. So, MBI concludes: "There is a 95% chance that the supplement is beneficial," or the supplement is "very likely beneficial."

Hopkins and Batterham describe two versions of MBI: clinical and nonclinical (4). I will consider each of these cases separately. I believe that the difference is that clinical MBI entails a one-sided test whereas nonclinical MBI entails a two-tailed test; I will delve into this distinction later.

Clinical MBI. When you are interested in testing whether a clinical intervention is beneficial or not, Hopkins and Batterham call this "clinical MBI." In clinical MBI, users define a trivial range by setting thresholds for harm and benefit; these are usually assigned the same value, but they don't have to be. Hopkins and Batterham contend that an intervention is implementable if it is at least "possibly" beneficial (≥25% chance of benefit) and "most unlikely" harmful (<0.5% risk of harm). Equivalently, the upper limit of the 50% confidence interval (UCL50) must equal or exceed the threshold for benefit, and the lower limit of the 99% confidence interval (LCL99) must be above the threshold for harm. For example, if the thresholds for benefit and harm are +0.2 and −0.2 standard deviations, respectively, the intervention would be implementable when UCL50 ≥ +0.2 and LCL99 > −0.2.

It is possible to change the "minimum chance of benefit" and "maximum risk of harm" from their defaults of 25% and 0.5%. For example, if you want to require a 75% minimum chance of benefit (i.e., "likely" beneficial), then the intervention would be implementable when LCL50 ≥ +0.2 and LCL99 > −0.2.

Nonclinical MBI. When you are simply trying to determine whether an effect exists (either positive or negative), Hopkins and Batterham call this nonclinical MBI and refer to effects as positive or negative rather than beneficial and harmful. Researchers can claim various degrees of certainty for a positive effect when the 90% confidence interval excludes the negative range (LCL90 > −0.2, for example, corresponding to <5% chance of a negative effect) but overlaps the positive range: "unlikely" positive is when just the upper limit of the 90% confidence interval (UCL90) makes it into the positive range; "possibly" positive is when the upper limit of the 50% confidence interval (UCL50) makes it into the positive range; "likely" positive is when the lower limit of the 50% confidence interval (LCL50) makes it into the positive range; and "very likely" positive is when the lower limit of the 90% confidence interval (LCL90) makes it into the positive range. Negative effects follow the same pattern.

Magnitude-based inference's proponents advocate reporting the probabilistic statements and allowing individuals to judge how much uncertainty they can tolerate for a given decision. For example, a laboratory scientist might choose to move a drug to clinical tests if it shows at least a "likely" effect in laboratory experiments.

Mathematical summary. In summary, MBI imposes two constraints: a constraint on harm (or negative effects) and a constraint on benefit (or positive effects). Each constraint is determined by two parameters: the threshold for harm/benefit, and the maximum risk of harm/minimum chance of benefit. Following Welsh and Knight (8), I will use the following symbols for these parameters:

$\eta_h$ = maximum risk of harm
$\eta_b$ = minimum chance of benefit
$\delta_h$ = threshold for harm
$\delta_b$ = threshold for benefit

It is possible to write general mathematical formulas for the constraints. I will focus on clinical MBI here; nonclinical MBI is similar but considers both directions. For the specific case of a two-group comparison of means with equal variances and equal group sizes of n per group, an effect is implementable if the following conditions are met (for full derivation, see Document, Supplemental Digital Content 1, derivation of the constraints, http://links.lww.com/MSS/B270):

1. Constraint on benefit:

$$\text{observed value} + t_{(1-\eta_b),\,2n-2}\sqrt{\frac{2s^2}{n}} \geq \delta_b \;\Rightarrow\; \text{observed value} \geq \delta_b - t_{(1-\eta_b),\,2n-2}\sqrt{\frac{2s^2}{n}}$$

2. Constraint on harm:

$$\text{observed value} - t_{(1-\eta_h),\,2n-2}\sqrt{\frac{2s^2}{n}} > -\delta_h \;\Rightarrow\; \text{observed value} > -\delta_h + t_{(1-\eta_h),\,2n-2}\sqrt{\frac{2s^2}{n}}$$

Note that we can simplify the above to:

$$\text{observed value} > \max\left(-\delta_h + t_{(1-\eta_h),\,2n-2}\sqrt{\frac{2s^2}{n}},\;\; \delta_b - t_{(1-\eta_b),\,2n-2}\sqrt{\frac{2s^2}{n}}\right)$$

In other words, the observed value must be greater than whichever constraint is bigger for a given example.

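To make the decision rule concrete, here is a minimal Python sketch of these two constraints for a two-group comparison of means with equal variances; the function and argument names are mine and are not part of MBI's published spreadsheets.

```python
# A sketch of the clinical MBI decision rule above. Defaults are MBI's own
# (eta_b = 25%, eta_h = 0.5%, thresholds of 0.2 SD); names are mine.
from scipy.stats import t

def clinical_mbi_implementable(observed, s2, n, delta_b=0.2, delta_h=0.2,
                               eta_b=0.25, eta_h=0.005):
    """True if the observed effect clears both the benefit and harm constraints."""
    df = 2 * n - 2
    se = (2 * s2 / n) ** 0.5
    benefit_bound = delta_b - t.ppf(1 - eta_b, df) * se   # from UCL50 >= +delta_b
    harm_bound = -delta_h + t.ppf(1 - eta_h, df) * se     # from LCL99 > -delta_h
    return observed > max(harm_bound, benefit_bound)

# At n = 50 with s^2 = 1, the harm constraint dominates (roughly 0.33 SD):
print(clinical_mbi_implementable(observed=0.35, s2=1.0, n=50))  # True
```
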
Magnitude-based inference is based on the confidence intervals used in standard hypothesis testing. Thus, not surprisingly, the two methods converge. If you set the thresholds for harm/benefit to 0, clinical MBI just reverts to a one-sided null hypothesis test with a significance level of $\eta_h$ (because, presumably, the minimum chance of benefit will always be set higher than the maximum risk of harm).

METHODS

Type I errors are false positives and can occur only when the true effect is trivial. Type II errors are false negatives and can occur only when the true effect is nontrivial. (Note: for a one-sided test, type I errors occur when the effect is trivial or in the direction you do not care about, and type II errors occur when the true effect is real and in the direction you care about.)

To estimate type I and type II error rates, Hopkins and Batterham (4) ran simulations for a particular scenario: comparing a continuous outcome, measured in standard deviation units, between two groups of athletes in a pre–post design (4). Simulations used the defaults for minimum chance of benefit (25% for clinical MBI, varying for nonclinical MBI) and maximum risk of harm (0.5% for clinical MBI, 5% for nonclinical MBI); a threshold for harm/benefit of 0.2 standard deviations; and sample sizes of 10, 50, and 144 per group. They generated type I errors for trivial effect sizes of 0, ±0.1, and ±0.199, and type II errors for nontrivial effect sizes of ±0.2 to ±0.6.

I commend Hopkins and Batterham for providing their simulation code as a supplement to their 2016 paper (4). With their code, I was able to figure out how their method works, and to reproduce their results. I re-ran their simulations with 100,000 repetitions, and was able to match the numbers they report in Figure 3 of their 2016 Sports Medicine paper (4). I then systematically varied both the sample size (from 10 to 150 per group) and the threshold for harm/benefit (from 0.1 to 0.3, for sample sizes of 5 to 300 per group), and plotted type I and type II error against sample size. I used either 100,000 or 30,000 repetitions in my simulations, depending on the size of the simulation.

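The published simulation code is written in SAS (see Supplemental Digital Content 4). As a rough illustration of the procedure, the sketch below is my own Python analogue of a single cell of the grid. It assumes a plain two-mean comparison with a true standard deviation of 1, so the sample size at which MBI's false positives peak (about 130 per group here) differs from the pre–post example in the paper.

```python
# My own Python analogue of one simulation cell: true effect = 0, so any
# declaration of benefit is a false positive. Not the authors' SAS code.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(1)
n, reps, delta = 130, 100_000, 0.2
df = 2 * n - 2
t_harm, t_benefit, t_nhst = t.ppf(0.995, df), t.ppf(0.75, df), t.ppf(0.95, df)

fp_mbi = fp_nhst = 0
for _ in range(reps):
    g1 = rng.normal(0.0, 1.0, n)
    g2 = rng.normal(0.0, 1.0, n)
    diff = g1.mean() - g2.mean()
    se = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / n)  # pooled SE, equal n
    fp_mbi += (diff - t_harm * se > -delta) and (diff + t_benefit * se >= delta)
    fp_nhst += diff - t_nhst * se > 0  # one-sided test for benefit, alpha = 0.05

print(fp_mbi / reps, fp_nhst / reps)  # expect roughly 0.17 vs 0.05 near the peak
```
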
I then corrected mistakes I found in Hopkins and Batterham's definitions of type I and type II error. Table 1a shows my corrected definitions.

TABLE 1a. My corrections to Hopkins and Batterham's definitions of type I and type II error for clinical MBI and standard hypothesis testing (from Figures 2b and 1a of (4)).

| Inference | True Effect: Beneficial | True Effect: Trivial | True Effect: Harmful |
|---|---|---|---|
| Clinical MBI: | | | |
| Beneficial ("very likely" or more) | None | Type I | Type I |
| Beneficial ("possibly" or "likely")ᵃ | None | Type I | Type I |
| Trivial (includes "unlikely" beneficial) | Type II | None | None |
| Trivial-to-harmful | Type II | None | None |
| Harmful | Type II | None | None |
| Unclear | Type II | None | None |
| Standard hypothesis testing: | | | |
| Significant, beneficial | None | Type I | Type I |
| Nonsignificant | Type II | None | None |
| Significant, harmful | Type II | None | None |

Standard hypothesis testing is treated as a one-sided test for benefit, for comparability with clinical MBI.
ᵃNote that what Hopkins and Batterham label as the trivial-to-beneficial category only includes inferences of "possibly" and "likely" beneficial; the trivial category includes inferences of "unlikely" beneficial. This is how these errors are assigned by Hopkins and Batterham, though their Figure 2b does not make this clear. "Possibly" beneficial is when the lower limit of the 99% confidence interval is above the harmful range and the upper limit of the 50% confidence interval is within the beneficial range (4).

First, Hopkins and Batterham treat an "unclear" result—when the confidence intervals are so wide that they span from harmful to beneficial—as error-free. However, when a study misses a real effect because the sample size is too small, this is clearly a type II error.

To illustrate this point further, consider the case when you have effect sizes that straddle the border of trivial—for example, 0.199 and 0.2 when the threshold for benefit is 0.2. The errors associated with these two effect sizes must be mirror images of one another. A correct call at 0.199 must be an incorrect call at 0.2, and vice versa. For example, if you correctly dismiss an effect at 0.199, you would have wrongly dismissed it at 0.2 (a type II error). My plots of type I and type II errors for these "border cases" are indeed mirror images of each other, whereas Hopkins and Batterham's plots are not (see Figure, Supplemental Digital Content 2, simulation results for effect sizes 0.199 and 0.2, http://links.lww.com/MSS/B271). Correctly counting unclear cases as type II errors fixes their plots.

Second, it appears that Hopkins and Batterham intend clinical MBI as a one-sided test. The goal is to determine whether an intervention is beneficial or not beneficial. There is no distinction between an inference of harmful and an inference of trivial—in both cases, you will not implement the intervention. This is consistent with a one-sided test for benefit.

Moreover, Hopkins and Batterham are confused about what to call cases in which there is a true nontrivial effect, but an inference is made in the wrong direction (i.e., inferring that a beneficial effect is harmful or that a harmful effect is beneficial). In the text, they switch between calling these type I and type II errors; and, in their calculations, they treat them both as type II errors (Table 1a). However, they cannot both be type II errors at the same time. Inferring that a beneficial effect is harmful constitutes a type II error only for a one-sided test for benefit, and inferring that a harmful effect is beneficial constitutes a type II error only for a one-sided test for harm. Recognizing that clinical MBI is in fact a one-sided test for benefit clears up this confusion: inferring that a beneficial effect is harmful is a type II error, whereas inferring that a harmful effect is beneficial is a type I error.

In addition to these corrections, I also corrected Hopkins and Batterham's definitions of type I and type II error for standard hypothesis testing to reflect a one-sided test (Table 1a). Using my corrected definitions, I reran my simulations to estimate type I and type II error for clinical MBI and standard hypothesis testing for true effect sizes of 0, ±0.1, ±0.199, ±0.2, and ±0.3.

Hopkins and Batterham also used incorrect definitions of type I and type II error for nonclinical MBI. Table 1b shows my corrected definitions.

TABLE 1b. My corrections to Hopkins and Batterham's definitions of type I and type II error for nonclinical MBI and standard two-sided null hypothesis testing.

| Inference | True Effect: Positive | True Effect: Trivial | True Effect: Negative |
|---|---|---|---|
| Nonclinical MBI: | | | |
| Positive | None | Type I | Type III |
| Trivial-to-positiveᵃ | Partialᵇ type II | Partial type I | Partial type III |
| Trivial | Type II | None | Type II |
| Trivial-to-negative | Partial type III | Partial type I | Partial type II |
| Negative | Type III | Type I | None |
| Unclear | Type II | None | Type II |
| Standard hypothesis testing: | | | |
| Significant, positive | None | Type I | Type III |
| Nonsignificant | Type II | None | Type II |
| Significant, negative | Type III | Type I | None |

Their definitions appear in Figures 2a and 1a of (4).
ᵃThe trivial-to-positive category includes inferences of "unlikely," "possibly," and "likely" positive. The trivial-to-negative category is similar (4).
ᵇPartial errors are given weights of 0.15 for an "unlikely" inference (5%–25% chance), 0.50 for a "possibly" inference (25%–75% chance), and 0.85 for a "likely" inference (75%–95% chance), as appropriate.

As before, "unclear" cases are considered type II errors when there is a real effect (either positive or negative).

Hopkins and Batterham are again confused about how to treat cases in which researchers correctly infer a nontrivial effect, but in the wrong direction. Hopkins and Batterham call these type II errors, but—as previously discussed—these are type II errors only for one-sided tests. Clearly, Hopkins and Batterham intend nonclinical MBI as a two-sided test, since positive and negative are treated equivalently. For a two-sided test that incorporates direction, inferences in the wrong direction are actually neither type I nor type II errors—these are type III errors (11).

For nonclinical MBI, there is a third issue. Hopkins and Batterham tabulate type I and type II error such that most inferential choices are guaranteed to be error-free. If you infer that the effect is "unlikely," "possibly," or "likely" positive or negative, you can never be wrong. These inferences are not counted as type I errors when the true effect is trivial, and are not counted as type II errors when the true effect is nontrivial. Hopkins and Batterham's logic is that as long as you acknowledge even a small chance (5%–25%) that the effect might be trivial when it is, then you have not made a type I error; and as long as you acknowledge even a small chance (5%–25%) that the effect might be positive (or negative) when it is, then you have not made a type II error.

However, this seems specious. Is concluding that an effect is "likely" positive really an error-free conclusion when the effect is in fact trivial? Again, the errors for the border cases—0.199 and 0.2, when the threshold for harm/benefit is 0.2—should be mirror images of one another; but they are clearly not under this logic.

The solution is to allow degrees of error. If you conclude that an effect is "likely" positive when it is trivial, you are mostly wrong. And if you conclude that an effect is "unlikely" positive when it is trivial, you are mostly right. Thus, I introduced partial errors in my calculations. For scoring the partial errors, I used values of 0.15, 0.50, and 0.85, corresponding to the midpoints of the probability ranges assigned by Hopkins and Batterham: 0.05 to 0.25 (unlikely), 0.25 to 0.75 (possibly), and 0.75 to 0.95 (likely). For example, declaring an effect "unlikely" positive when it is trivial is 0.15 of a type I error, whereas the same inference is 0.85 of a type II error when the effect is positive. Note that it is possible to change the weighting of these partial errors—this will simply alter the balance between type I and type II errors.

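As an illustration, the following Python sketch encodes this partial-error scoring; the dictionary layout and function name are mine, and the weights are the midpoints quoted above.

```python
# A sketch of the partial-error scoring used in my calculations (see Table 1b).
# The encoding and function name are mine; weights come from the text above.
PARTIAL_WEIGHT = {"unlikely": 0.15, "possibly": 0.50, "likely": 0.85}

def partial_error(inference: str, truth: str) -> dict:
    """Score e.g. inference='likely positive' against truth in
    {'positive', 'trivial', 'negative'}."""
    qualifier, direction = inference.split()
    w = PARTIAL_WEIGHT[qualifier]
    if truth == "trivial":
        return {"type I": w}            # mostly an error if "likely"
    if truth == direction:
        return {"type II": 1.0 - w}     # mostly correct if "likely"
    return {"type III": w}              # inference points the wrong way

print(partial_error("unlikely positive", "trivial"))   # {'type I': 0.15}
print(partial_error("unlikely positive", "positive"))  # {'type II': 0.85}
```
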
Using my corrected definitions, I re-ran my simulations to estimate type I and type II error for nonclinical MBI and standard hypothesis testing for true effect sizes of 0, ±0.1, ±0.199, ±0.2, and ±0.3. Note that in my corrected plots, the border cases of 0.199 and 0.2 are nearly mirror images (Fig. 4), as they should be. The slight asymmetry is due to the presence of type III errors, which occur more frequently in nonclinical MBI than standard hypothesis testing (see Figure, Supplemental Digital Content 3, type III errors, http://links.lww.com/MSS/B272). The code for my simulations is available as a supplement to this paper (see Document, Supplemental Digital Content 4, SAS code for simulations, http://links.lww.com/MSS/B273).

I also ran simulations setting the trivial threshold to 0.00001 (effectively 0) to demonstrate that: (1) MBI and standard hypothesis testing converge as expected, and (2) Hopkins and Batterham's definitions for type I and type II error produce inaccurate values for these known cases.

Finally, I explored MBI mathematically: I worked out the derivation of Hopkins and Batterham's sample size formula, and I derived general mathematical equations for the type I and type II error rates of clinical MBI for the problem of comparing two means.

RESULTS

Clinical MBI

Figure 1 shows the type I error for clinical MBI and standard hypothesis testing when the true effect is null-to-trivial in the beneficial direction (0, 0.1, 0.199).
Figure 1 reveals an important characteristic of MBI that Hopkins and Batterham's paper obscures: MBI causes peaks of false positives. For specific sample sizes, the false-positive rate spikes to double or triple that of standard hypothesis testing. Note that this pattern is the same whether I use Hopkins and Batterham's incorrect definitions (panels A, C, and E) or the correct definitions (panels B, D, and F), as their definitional errors did not impact this range of true effects.

FIGURE 1—Type I error rates as a function of sample size for clinical MBI (red) and standard hypothesis testing (blue) when the true effect is zero or trivial in the beneficial direction. Panels A to F use a threshold for harm/benefit of 0.2 standard deviations. Panels G and H show that the type I error rate for clinical MBI depends on the value chosen for the threshold for harm/benefit (horizontal reference line at 5% is the type I error rate for standard hypothesis testing at an effect size of 0). Results are similar whether using Hopkins and Batterham's definitions (A, C, E, G) or my corrected definitions (B, D, F, H). Simulations used 100,000 repetitions for A–F, and 30,000 for G–H.

It is easy to explain why these peaks occur. When the true effect is 0 to 0.199, observed effects often end up in or near this range. At small sample sizes, confidence intervals around these observed effects are so wide that they cross into the harmful range (LCL99 ≤ −0.2). As sample size increases, however, the confidence intervals narrow out of the harmful range (LCL99 > −0.2), while still overlapping the beneficial range (UCL50 ≥ +0.2)—resulting in a sharp increase in type I error rates.
Beyond a certain sample size, the confidence intervals become sufficiently narrow to drop out of the beneficial range (UCL50 < +0.2)—and thus type I error rates drop.

In mathematical terms, recall that the bigger of the constraints on harm and benefit decides the inference. At small-to-moderate sample sizes, the constraint on harm tends to be bigger; at larger sample sizes, the constraint on benefit tends to be bigger. The peak occurs at the point at which the two constraints are equally likely to decide the inference (50% chance that the harm constraint is bigger, 50% chance that the benefit constraint is bigger).

This pattern also explains why MBI has higher false-positive rates than standard hypothesis testing at small-to-moderate sample sizes, where the harm constraint dominates. Clinical MBI's constraint on harm (LCL99 > −0.2) is almost always less stringent than the corresponding constraint for a standard one-sided hypothesis test (LCL90 > 0). Figure 2 illustrates the typical case in which clinical MBI incurs a false positive whereas standard hypothesis testing does not.

FIGURE 2—Representative case where clinical MBI incurs a false positive but standard hypothesis testing does not (true effect size is 0). If the threshold for harm/benefit is 0.2, clinical MBI declares an intervention "implementable" when LCL99 > −0.2 and UCL50 ≥ 0.2; standard hypothesis testing (one-sided) declares an effect significant if LCL90 > 0. At small-to-moderate sample sizes, clinical MBI incurs more false positives because LCL99 > −0.2 is a less stringent constraint than LCL90 > 0.

The false-positive rate peaks at a sample size of 50 per group. Interestingly, this is exactly the sample size that Hopkins and Batterham proclaim as "optimal" for MBI for this example (4). In fact, their sample size formula explicitly finds these peaks, as I will later show mathematically.

Changing the threshold for harm/benefit changes the location of the peaks (Fig. 1, panels G and H). For example, the peak is at about 20 per group when the threshold is 0.3 standard deviations rather than 0.2. This is because you can rule out harmful effects faster (since the threshold for harm is lower, −0.3 rather than −0.2) but can also rule out beneficial effects faster (since the threshold for benefit is higher, 0.3 rather than 0.2). Smaller trivial thresholds have the opposite effect. For example, the peak in type I error occurs at a sample size of about 200 per group when the threshold is 0.1.

In contrast to MBI, standard hypothesis testing has predictable type I error rates (Fig. 1). When the true effect is 0, the type I error rate is constant at 5%. When the true effect is trivial, but nonzero, the type I error rate climbs steadily with increasing sample sizes. This is because, as precision improves, it becomes clear that there is a true nonzero effect.

At sample sizes above about 100 per group, clinical MBI has lower type I error rates than standard hypothesis testing. However, the false positives prevented by MBI are all cases where researchers need only look at the narrow confidence intervals to realize that the effect—though statistically significant—is small or trivial in size.

Figure 3 shows the type II errors when the true effect is beneficial (0.2, 0.3). Clinical MBI reduces the type II error rate at the same sample sizes where it increases the type I error rate. This is expected, because loosening the constraints on false positives makes it easier to avoid false negatives. At Hopkins and Batterham's "optimal" sample size of 50, clinical MBI cuts the type II error rate in half compared with standard hypothesis testing for a true effect of 0.2. Clinical MBI has the biggest advantage for type II error when the true effect is near the trivial threshold. With larger true effects, the advantage of clinical MBI shrinks. In fact, when the true effect is 0.5 standard deviations—what Cohen calls a medium effect size (12)—clinical MBI has almost no advantage over standard hypothesis testing (see Figure, Supplemental Digital Content 5, simulation for effect size of 0.5, http://links.lww.com/MSS/B274).

Figure 3 shows the type I errors when the true effect is harmful, whether by a trivial (−0.1) or nontrivial (−0.2, −0.3) amount. The type I error rates are low overall (<3%), but generally higher for clinical MBI than standard hypothesis testing. Clinical MBI also exhibits peaks in the type I error rate around the "optimal" sample size of 50, as before. The explanation for the higher false-positive rates is the same as before—false positives occur for MBI but not standard hypothesis testing when LCL99 > −0.2 but LCL90 < 0.

FIGURE 3—Type I and type II error rates as a function of sample size for clinical MBI (red) and standard hypothesis testing (blue). The threshold for benefit is 0.2 standard deviations, so a true effect of 0.199 is trivial (A) and true effects of 0.2 and 0.3 are beneficial (B, C). Note that the plots for 0.199 and 0.2 are mirror images of one another, as expected for effect sizes that straddle the trivial threshold. Plots D–F show the type I error when the true effect is in the harmful direction, whether by a trivial (D) or nontrivial (E, F) amount. Simulations used 100,000 repetitions and my corrected definitions for the errors.

Nonclinical MBI

Figure 4 shows the results for type I and type II error for nonclinical MBI. The patterns are similar to those seen in clinical MBI, namely: 1) nonclinical MBI creates peaks of false positives at specific sample sizes (here, ~30 per group); within these peaks, the type I error rate is two to six times higher than for standard hypothesis testing; 2) the location of these peaks depends on the value chosen for the threshold for harm/benefit; 3) nonclinical MBI has lower type II error rates than standard hypothesis testing at the same sample sizes at which the type I error rates peak; and 4) the difference in type II error between nonclinical MBI and standard hypothesis testing is most pronounced when the true effect is close to the trivial threshold and less pronounced when the true effect is larger.

FIGURE 4—Type I and type II error rates as a function of sample size for nonclinical MBI (green) and standard hypothesis testing (blue). Panels A–E use a threshold for benefit/harm of 0.2 standard deviations. Panel F shows that the type I error rates for nonclinical MBI depend on the value chosen for the threshold for harm/benefit (horizontal reference line at 5% is the type I error rate for standard hypothesis testing at an effect size of 0). Simulations used 100,000 repetitions for A–E and 30,000 for F, and my corrected definitions for the errors.

Where MBI and Standard Hypothesis Testing Converge

If you reduce the harm/benefit threshold to 0, then MBI reverts to a one-sided null hypothesis test with a significance level equal to the maximum risk of harm. Thus, in our simulations, clinical MBI should revert to a one-sided null hypothesis test with a significance level of 0.005, and nonclinical MBI should revert to a two-sided null hypothesis test with a significance level of 0.10.
Indeed, simulations show that the type I and type II errors for clinical and nonclinical MBI are as expected for a one-sided null hypothesis test with a significance level of 0.005 and a two-sided null hypothesis test with a significance level of 0.10, respectively (see Figure, Supplemental Digital Content 6, simulations for threshold for harm/benefit of 0, http://links.lww.com/MSS/B275). As further proof that Hopkins and Batterham's definitions for the errors are incorrect, their definitions wildly underestimate the type II errors for these known cases (see Figure, Supplemental Digital Content 6, simulations for threshold for harm/benefit of 0, http://links.lww.com/MSS/B275).

Mathematical Confirmation (Clinical MBI)

Sample size formula. Hopkins and Batterham provide sample size calculators in Excel spreadsheets for clinical MBI (1), though they do not appear to have published the formulas that underlie these calculators. Welsh and Knight (8) were able to work out the formula for comparing two means from the algorithms in the spreadsheets but were unsure of the derivation. Based on an online presentation by Hopkins (13), I believe that I have figured out the derivation. For simplicity, I will treat the group sizes as equal, but one can generalize to unequal group sizes by substituting $(r+1)\sigma^2/r$ for $2\sigma^2$, and changing the degrees of freedom to $(r+1)n-2$, where $r$ is the ratio of the larger to the smaller group.

Hopkins and Batterham start by assuming that the true effect is a random variable t-distributed around a fixed observed effect (this mirrors their misinterpretation of frequentist confidence intervals as Bayesian credible intervals). This distribution will have a variance of $2\sigma^2/n$, assuming a pooled variance and equal group sizes. They then solve for the sample size that will make the following true: $P(\text{true effect} < -\delta_h) = \eta_h$ and $P(\text{true effect} \geq \delta_b) = \eta_b$. This occurs when MBI's constraints on harm and benefit are equal:

$$-\delta_h + t_{(1-\eta_h),\,2n-2}\sqrt{\frac{2\sigma^2}{n}} \;=\; \delta_b - t_{(1-\eta_b),\,2n-2}\sqrt{\frac{2\sigma^2}{n}}$$

Solving this equation leads to an n per group of:

$$n = \frac{2\sigma^2\left(t_{(1-\eta_b),\,2n-2} + t_{(1-\eta_h),\,2n-2}\right)^2}{(\delta_b + \delta_h)^2}$$

This formula indeed matches the formula that Welsh and Knight (8) report. Furthermore, it explains why Hopkins and Batterham's "optimal" sample sizes correspond to the peaks in type I error—these peaks occur precisely at the point at which the constraints on harm and benefit are equally important.

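Note that n appears on both sides of this formula through the degrees of freedom of the t-quantiles, so in practice it must be solved iteratively. The sketch below is my own fixed-point iteration under the generic two-mean setup; the paper's pre–post example uses different variance inputs, so its "optimal" n of 50 is not reproduced here.

```python
# A sketch solving the sample-size formula by fixed-point iteration; the
# function name, defaults, and starting guess are my own.
from scipy.stats import t

def mbi_sample_size(sigma2=1.0, delta_b=0.2, delta_h=0.2,
                    eta_b=0.25, eta_h=0.005, n0=50.0, tol=1e-6):
    n = n0
    for _ in range(100):  # converges in a handful of iterations here
        df = 2 * n - 2
        quantile_sum = t.ppf(1 - eta_b, df) + t.ppf(1 - eta_h, df)
        n_new = 2 * sigma2 * quantile_sum**2 / (delta_b + delta_h)**2
        if abs(n_new - n) < tol:
            return n_new
        n = n_new
    return n

print(mbi_sample_size())  # ~134 per group for sigma^2 = 1 and thresholds of 0.2
```
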
General equations for MBI's error rates. I derived general mathematical equations for the type I and type II error rates of clinical MBI for the case of comparing two means. These equations assume equal group sizes but can be generalized to unequal group sizes as described above (for full derivation see: Document, Supplemental Digital Content 7, math derivation, http://links.lww.com/MSS/B276):

$\eta_h$ = maximum risk of harm
$\eta_b$ = minimum chance of benefit
$\delta_h$ = threshold for harm
$\delta_b$ = threshold for benefit
$ES$ = true effect size (difference in means)
$n$ = per-group sample size
$\sigma^2$ = true variance
$s^2$ = pooled sample variance
$T_{\mathrm{harm}} = t_{(1-\eta_h),\,2n-2}$
$T_{\mathrm{benefit}} = t_{(1-\eta_b),\,2n-2}$

In the equations below, $\chi^2_{2n-2}$ denotes a chi-square-distributed random variable and $T_{2n-2}$ a t-distributed random variable, each with $2n-2$ degrees of freedom; the chi-square term gives the probability that the harm constraint is the bigger of the two.

Type I error (type I error can only occur when $ES < \delta_b$):

$$
\begin{aligned}
P(\text{type I error}) = {} & P\left(\chi^2_{2n-2} > \frac{n(2n-2)(\delta_h+\delta_b)^2}{2\sigma^2\left(T_{\mathrm{harm}}+T_{\mathrm{benefit}}\right)^2}\right) \cdot P\left(T_{2n-2} > \frac{-\delta_h + T_{\mathrm{harm}}\sqrt{2\sigma^2/n} - ES}{\sqrt{2\sigma^2/n}}\right) \\
& + \left[1 - P\left(\chi^2_{2n-2} > \frac{n(2n-2)(\delta_h+\delta_b)^2}{2\sigma^2\left(T_{\mathrm{harm}}+T_{\mathrm{benefit}}\right)^2}\right)\right] \cdot P\left(T_{2n-2} > \frac{\delta_b - T_{\mathrm{benefit}}\sqrt{2\sigma^2/n} - ES}{\sqrt{2\sigma^2/n}}\right)
\end{aligned}
$$

Type II error (type II error can only occur when $ES \geq \delta_b$):

$$
\begin{aligned}
P(\text{type II error}) = {} & P\left(\chi^2_{2n-2} > \frac{n(2n-2)(\delta_h+\delta_b)^2}{2\sigma^2\left(T_{\mathrm{harm}}+T_{\mathrm{benefit}}\right)^2}\right) \cdot P\left(T_{2n-2} < \frac{-\delta_h + T_{\mathrm{harm}}\sqrt{2\sigma^2/n} - ES}{\sqrt{2\sigma^2/n}}\right) \\
& + \left[1 - P\left(\chi^2_{2n-2} > \frac{n(2n-2)(\delta_h+\delta_b)^2}{2\sigma^2\left(T_{\mathrm{harm}}+T_{\mathrm{benefit}}\right)^2}\right)\right] \cdot P\left(T_{2n-2} < \frac{\delta_b - T_{\mathrm{benefit}}\sqrt{2\sigma^2/n} - ES}{\sqrt{2\sigma^2/n}}\right)
\end{aligned}
$$

Values generated from these equations closely match the results of my simulations (Fig. 5). My equations may slightly overestimate the type I error rates at the peaks (Fig. 5) due to a simplification made in the math for tractability (for details see: text, Supplemental Digital Content 7, math derivation, http://links.lww.com/MSS/B276). In contrast, the mathematically predicted results do not match the simulations that use Hopkins and Batterham's definitions of type I and type II error, again demonstrating that their definitions are incorrect.

FIGURE 5—Comparison of mathematically predicted values (green) to simulated values (red) for the type I and type II error rates of clinical MBI for various true effect sizes. Simulations used 100,000 repetitions and my corrected definitions for the errors.

These equations eliminate the need for simulation, making it easy to calculate the type I and type II error rates for many combinations of parameters, and allowing further exploration of MBI's behavior. For example, I changed the minimum chance of benefit from 25% ("possibly" beneficial) to 75% ("likely" beneficial). For an effect size of 0, the type I error is low—peaking at 4.9% when n = 19 per group. However, for an effect size of 0.2, the type II error is at least 75% for all sample sizes; and for an effect size of 0.3, MBI requires 165 participants per group to achieve a type II error of 20%, whereas standard hypothesis testing requires only 50 per group (see Figure, Supplemental Digital Content 8, simulation for minimum chance of benefit of 75%, http://links.lww.com/MSS/B277).

I have provided a SAS program that calculates type I and type II error for clinical MBI (see Document, Supplemental Digital Content 9, SAS implementation of error rate formulas, http://links.lww.com/MSS/B278).

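For readers who do not use SAS, the sketch below is my own Python transcription of the two equations, using scipy's chi-square and t distributions; the function name and defaults (clinical MBI's standard parameters) are assumptions of the sketch.

```python
# My own Python transcription of the error-rate equations above; the paper's
# implementation is the SAS program in Supplemental Digital Content 9.
import numpy as np
from scipy.stats import chi2, t

def clinical_mbi_error_rate(ES, n, sigma2=1.0, delta_b=0.2, delta_h=0.2,
                            eta_b=0.25, eta_h=0.005):
    """Type I error rate if ES < delta_b, otherwise type II error rate."""
    df = 2 * n - 2
    T_harm, T_ben = t.ppf(1 - eta_h, df), t.ppf(1 - eta_b, df)
    se = np.sqrt(2 * sigma2 / n)
    # probability that the harm constraint is the bigger (binding) one
    cut = n * df * (delta_h + delta_b) ** 2 / (2 * sigma2 * (T_harm + T_ben) ** 2)
    p_harm = chi2.sf(cut, df)
    z_harm = (-delta_h + T_harm * se - ES) / se
    z_ben = (delta_b - T_ben * se - ES) / se
    if ES < delta_b:  # type I error: wrongly declaring benefit
        return p_harm * t.sf(z_harm, df) + (1 - p_harm) * t.sf(z_ben, df)
    return p_harm * t.cdf(z_harm, df) + (1 - p_harm) * t.cdf(z_ben, df)

print(clinical_mbi_error_rate(ES=0.0, n=134))  # ~0.17, near the peak for sigma^2 = 1
```
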
DISCUSSION

Though Hopkins and Batterham claim that MBI has "superior" error rates compared with standard hypothesis testing for most cases (4), this is false. MBI exhibits very specific tradeoffs between type I and type II error. Magnitude-based inference creates peaks in the false-positive rate and corresponding dips in the false-negative rate at specific sample sizes. Sample size calculators provided by MBI's creators are tuned to find these peaks—which typically occur at small-to-moderate sample sizes. At these peaks, the type I error rates are two to six times that of standard hypothesis testing. The use of MBI may therefore result in a proliferation of false-positive results.

In their Sports Medicine paper, Hopkins and Batterham do acknowledge inflated type I error rates in one specific case—clinical MBI when the effect is marginally trivial in the beneficial direction (4). However, due to incorrect definitions of type I and type II error, they failed to recognize that type I error is actually inflated in all cases at their "optimal" sample sizes (for null-to-trivial effects in both directions and for nonclinical MBI). The increase in false positives occurs because MBI's constraint on harm—which dominates at small-to-moderate sample sizes—is less stringent than the corresponding constraint in standard hypothesis testing (as illustrated in Fig. 2). Incorrect definitions also led Hopkins and Batterham to dramatically underestimate type II error in several of their simulations.

Hopkins and Batterham might argue that one can simply change the minimum chance of benefit, for example from a "possibly" inference (25%) to a "likely" inference (75%), to reduce the type I error. This indeed controls the type I error rate, but greatly increases the type II error rate, meaning that clinical MBI will require much larger sample sizes than standard hypothesis testing to achieve comparable statistical power (and for true effects on the border of trivial, the statistical power will never be higher than 25%).

Whereas standard hypothesis testing has predictable type I error rates, MBI has type I error rates that vary greatly depending on the sample size, the choice of thresholds for harm/benefit, and the choice of maximum risk of harm/minimum chance of benefit. This is problematic because, unless researchers calculate and report the type I error for every application, it will always be hidden from readers. Furthermore, the dependence on the thresholds for harm/benefit as well as the maximum risk of harm/minimum chance of benefit makes it easy to game the system. A researcher could tweak these values until they get an inference they like. Hopkins and Batterham dismiss this issue by saying: "Researchers should justify a value within a published protocol in advance of data collection, to show they have not simply chosen a value that gives a clear outcome with the data. Users of NHST [Null Hypothesis Significance Testing] are not divested of this responsibility, as the smallest important effect informs sample size" (4). But there's an obvious difference: you cannot change the sample size once the study is done, but it is easy to fiddle with MBI's harm and benefit parameters. And, though it would be ideal for researchers to publish protocols ahead of time, in reality they rarely do (14).

There is no doubt that standard hypothesis testing has pitfalls. Too often, researchers place undue emphasis on p-values (15) and fail to consider other key pieces of statistical information—including graphical displays, effect sizes, confidence intervals, and consistency across multiple analyses (16).
I appreciate Hopkins and Batterham's attempt to educate researchers about the importance of magnitude and precision. However, although their intentions may have been good, they introduced a statistical method into the sports science literature without adequately understanding its systematic behavior. Also, they continue to defend it with demonstrably false arguments. For example, I have established that Hopkins and Batterham defined type I and type II error incorrectly in their paper in Sports Medicine (4), and consequently made false claims about MBI. This should give users of MBI pause.

Additionally, in their 2016 Sports Medicine article (4), Hopkins and Batterham claim that, "We also provide published evidence of the sound theoretical basis of MBI [10,11,16]." However, the three references they cite do not provide such evidence. Gurrin et al. (17) suggest that it may sometimes be reasonable to interpret frequentist confidence intervals as Bayesian credible intervals by assuming a uniform prior probability, but they explicitly state: "Although the use of a uniform prior probability distribution provides a neat introduction to the Bayesian process, there are a number of reasons why the uniform prior distribution does not provide the foundation on which to base a bold new theory of statistical analysis!" Shakespeare et al. (18) just provide general information on confidence intervals and do not address anything directly related to MBI. Finally, the last reference, by Hopkins and Batterham (19), is a short letter in which they point to empirical evidence from a simulation that is only a preliminary version of the simulations reported in Sports Medicine (4).

Hopkins and Batterham's sample size spreadsheets may also mislead users. They ask users to enter values for the minimum chance of benefit ($\eta_b$) and maximum risk of harm ($\eta_h$), but they label these the "type I clinical error" and "type II clinical error." In fact, the minimum chance of benefit and maximum risk of harm are not interchangeable with type I and type II error, as previously noted by Welsh and Knight (8). At the specified sample size, the type I error rate will equal the maximum risk of harm only if the true effect equals the threshold for harm ($-\delta_h$). Also, the type II error rate will equal the minimum chance of benefit only if the true effect is equal to the threshold for benefit ($\delta_b$). However, because of the conflation of terms, users may mistakenly infer that they are setting the type I and type II errors at fixed values. Moreover, users may be left with the false impression that they are controlling the overall type I error when, in fact, they are controlling only the type I error from inferring that a harmful effect is beneficial, which accounts for only a tiny fraction of type I errors; they are failing to constrain the predominant source of type I error, from inferring that a trivial effect is beneficial. Users (or potential users) of MBI are encouraged to use my general equations to explore the true type I and type II error rates for MBI for different scenarios.

In this article, I have explored type I and type II errors only for a specific statistical test: comparing two means. However, my error equations could easily be adapted to other statistical tests, and the patterns will be the same. Also, I have not addressed the veracity of MBI's probabilistic statements: for example, when MBI states that there is a 75% to 95% chance that the effect is beneficial, is this accurate? One can largely ignore these statements by simply viewing MBI as a method that evaluates the location of confidence intervals relative to certain ranges. However, I will note that these probabilistic statements are unlikely to be accurate given that Hopkins and Batterham are interpreting confidence intervals as Bayesian credible intervals without doing a proper Bayesian analysis. Arguably, many interventions tested in sports science will have a low prior probability of effectiveness, and failing to account for this will lead to overly optimistic conclusions. As Welsh and Knight argue (8), if Hopkins and Batterham want to provide probabilistic statements, they should adopt a fully Bayesian analysis.

Hopkins and Batterham may be correct in recognizing that when it comes to testing interventions to enhance athletic performance, our tolerance for type I error may be higher than in other situations (for example, testing a cancer drug). Coaches and athletes may be willing to occasionally adopt ineffective interventions so long as they are not detrimental to athletic performance. However, the cost of this is not nothing—ineffective interventions waste time and money, and may cause side effects. What is needed is an approach that's less conservative than a two-sided null hypothesis test but still adequately reins in the type I error. The answer is simple: use a one-sided null hypothesis test for benefit. Compared with clinical MBI, this approach 1) guards more strictly against inferring benefit when the effect is harmful, 2) has lower type I error rates for small-to-moderate sample sizes, and 3) has higher type II error rates only when the effect sizes are close to trivial. Researchers would, of course, need to consider the accompanying effect size and confidence interval. If the confidence interval contains only trivial effects, this should be interpreted accordingly. For an excellent reference on how to interpret confidence intervals, see the study by Curran-Everett (20).

threshold for harm (jCh). Also, the type II error rate will In conclusion, ‘‘magnitude-based inference’’ exhibits

SPECIAL COMMUNICATIONS
equal the minimum chance of benefit only if the true effect undesirable systematic behavior and should not be used. As
is equal to the threshold for benefit (Cb). However, because Welsh and Knight (8) have already pointed out, MBI should
of the conflation of terms, the users may mistakenly infer be replaced with a fully Bayesian approach or should simply
that they are setting the type I and type II errors at fixed be scrapped in favor of making qualitative statements about
values. Moreover, users may be left with the false impres- confidence intervals. In addition, a one-sided null hypothesis
sion that they are controlling the overall type I error when, test for benefit—interpreted alongside the corresponding
in fact, they are controlling only the type I error from in- confidence interval—would achieve most of the objectives of
ferring a harmful effect is beneficial, which accounts for clinical MBI while properly controlling type I error.
only a tiny fraction of type I errors; they are failing to
constrain the predominant source of type I error, from in- The author did not receive financial support and has no conflicts
ferring that a trivial effect is beneficial. Users (or potential of interest to disclose related to this work. Thanks to Matthew
users) of MBI are encouraged to use my general equations Sigurdson for his thoughtful comments on the paper. The results of the
study are presented clearly, honestly, and without fabrication, falsifi-
to explore the true type I and type II error rates for MBI for cation, or inappropriate data manipulation. The results of the present
different scenarios. study do not constitute endorsement by ACSM.

REFERENCES

1. Hopkins WG. Estimating sample size for magnitude-based inferences. Sportscience. 2006;10:63–70.
2. Hopkins WG. A spreadsheet for deriving a confidence interval, mechanistic inference and clinical inference from a P value. Sportscience. 2007;11:16–20.
3. Hopkins WG, Marshall SW, Batterham AM, Hanin J. Progressive statistics for studies in sports medicine and exercise science. Med Sci Sports Exerc. 2009;41(1):3–13.
4. Hopkins WG, Batterham AM. Error rates, decisive outcomes and publication bias with several inferential methods. Sports Med. 2016;46(10):1563–73.
5. Buchheit M. The numbers will love you back in return—I promise. Int J Sports Physiol Perform. 2016;11(4):551–4.
6. Claus GM, Redkva PE, Brisola GMP, et al. Beta-alanine supplementation improves throwing velocities in repeated sprint ability and 200-m swimming performance in young water polo players. Pediatr Exerc Sci. 2017;29(2):203–12.
7. Barker RJ, Schofield MR. Inference about magnitudes of effects. Int J Sports Physiol Perform. 2008;3(4):547–57.
8. Welsh AH, Knight EJ. "Magnitude-based inference": a statistical review. Med Sci Sports Exerc. 2015;47(4):874–84.
9. Butson M. Will the numbers really love you back: re-examining magnitude-based inference [Internet]. 2018 [cited March 17, 2018]. Available from: https://osf.io/yvj5r/.
10. Sainani KL. Clinical versus statistical significance. PM&R. 2012;4(6):442–5.
11. Leventhal L, Huynh CL. Directional decisions for two-tailed tests: power, error rates, and sample size. Psychol Methods. 1996;1(3):278–92.
12. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale (NJ): Lawrence Erlbaum Associates; 1988. 26 p.
13. Hopkins WG. Sample size estimation for research grants and institutional review boards [PowerPoint slides]. 2008 [cited March 17, 2018]. Available from: https://view.officeapps.live.com/op/view.aspx?src=http://www.sportsci.org/2006/SampleSizeEstimation.ppt.
14. Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JP. Reproducible research practices and transparency across the biomedical literature. PLoS Biol. 2016;14(1):e1002333.
15. Nuzzo R. Scientific method: statistical errors. Nature. 2014;506(7487):150–2.
16. Sainani KL. Getting the right answer: four statistical principles. PM&R. 2017;9(9):933–7.
17. Gurrin LC, Kurinczuk JJ, Burton PR. Bayesian statistics in medical research: an intuitive alternative to conventional data analysis. J Eval Clin Pract. 2000;6(2):193–204.
18. Shakespeare TP, Gebski VJ, Veness MJ, Simes J. Improving interpretation of clinical studies by use of confidence levels, clinical significance curves, and risk-benefit contours. Lancet. 2001;357(9265):1349–53.
19. Hopkins WG, Batterham AM. An imaginary Bayesian monster. Int J Sports Physiol Perform. 2008;3(4):411–2.
20. Curran-Everett D. Explorations in statistics: confidence intervals. Adv Physiol Educ. 2009;33(2):87–90.