
Hypothesis Testing & Statistical Significance, Including Common Mistakes to Avoid

"Statistically significant" is a widely used yet commonly misunderstood phrase. Before analyzing data and presenting statistical results, make sure you understand what statistical "significance" means and, equally important, what it doesn't mean.

How to use statistics for hypothesis testing

Statistics is a powerful tool that can help scientists decide whether or not a hypothesis is valid. But with great power comes great responsibility, and it's important to use statistics wisely. Here we provide a brief overview of how to use statistics for hypothesis testing as well as guidance on the much-used and often misunderstood term, "significant."

ADVICE: Avoid the word "significant" so you can precisely communicate what you mean

The word "significant" has two meanings in research:

• A P value is less than a preset threshold (often 0.05), so you can reject the null hypothesis and state that the results are deemed statistically significant.

• A difference (or ratio or correlation...) is large enough that you think the result has biological (or scientific or practical) relevance.

To avoid this ambiguity, don't use the "S" word. If the P value is less than a threshold, say so. If an observed effect is large enough to be relevant, say so. The word "significant" is never needed.

Accept or reject?

Much of the statistical reasoning we use today was originally developed in the context of quality control, where you need a definite yes or no answer from every analysis. Should you accept or reject the batch? The logic used to obtain the answer is called hypothesis testing.

ADVICE: Avoid the concept of "statistical significance" (it's only relevant in limited situations)

The term "significant" is seductive and easy to misinterpret because the use of the word in statistics has a meaning entirely distinct from its use in everyday conversation. Just because a difference is statistically significant does not mean that it is biologically or clinically important or interesting. Moreover, a result that is not statistically significant (in the first experiment) may turn out to be very important.

The use of "statistically significant" in hypothesis testing makes sense in situations where you must make a firm decision based on the results of one P value. While this situation occurs in quality control, where you need to decide whether to accept or reject a specific batch, it doesn't really happen in other situations. For example, in clinical trials, decisions are made based on several kinds of evidence, and in basic research, decisions and conclusions are typically made based on multiple experiments.
Thus, if you do not need to make a decision based on one P value, then there is no need to declare a result "statistically significant" or not. Simply report the P value as a number, without using the term "statistically significant". Better yet, simply report the confidence interval, without a P value.

Can you reject the null hypothesis and declare a "statistically significant" difference between sample and control?

Probably the easiest part of your statistical analysis will be the following three steps to determine whether or not you can reject the null hypothesis and, if the context of your analysis merits it (see the advice above on avoiding the concept of "statistical significance"), declare that you've found a statistically significant difference between the sample and the control.

1. Before you even begin your experiment, define a threshold P value.

Ideally, you should set this threshold based on the relative consequences of missing a true difference or falsely finding a difference. In practice, the threshold value (called alpha) is almost always set to 0.05, an arbitrary value that has been widely adopted.

2. Next, define the null hypothesis.

When analyzing an experiment, the null hypothesis is usually the opposite of the experimental hypothesis. Your experimental hypothesis (the reason you did the experiment) is that the treatment changes the mean. The null hypothesis is that the two populations have the same mean, i.e. that the treatment has no effect.

3. Now, perform the appropriate statistical test to compute the P value.

If the P value is less than the threshold, state that you "reject the null hypothesis" and that the difference is "statistically significant."

If the P value is greater than the threshold, state that you "do not reject the null hypothesis" and that the difference is "not statistically significant". You cannot conclude that the null hypothesis is true. All you can do is conclude that you don't have sufficient evidence to reject the null hypothesis.

But even if you find a "statistically significant" result, what is the likelihood that this difference is actually true? We discuss this topic in the next section.

Defining P value and alpha

The P value is a probability, with a value ranging from zero to one, that answers the following question (which you probably never thought to ask): For an experiment of this size, if the populations really have the same mean, what is the probability of observing at least as large a difference between sample means as was, in fact, observed?

Alpha is a threshold P value below which a difference in means is judged "significant." For an alpha set to its typical value of 0.05, a result is said to be statistically significant when a difference that large (or larger) would occur less than 5% of the time if the populations were, in fact, identical.

Figure: Population A and Population B have the same mean, but if a scientist only sees the data indicated in purple (Population A) or blue (Population B), the observed means will be very different from the true mean of each population. The P value provides a value for the probability that you will observe the purple and blue data points, given a total population indicated by the gray data points.
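To make the three steps and the definitions above concrete, here is a minimal sketch in Python (the measurement values and the choice of an unpaired t test are illustrative assumptions, not part of the recipe itself):

    # Minimal sketch of the three-step recipe using an unpaired t test.
    # The measurements below are made up for illustration.
    from scipy import stats

    # Step 1: choose the threshold (alpha) before collecting any data.
    alpha = 0.05

    # Step 2: the null hypothesis is that the control and treated populations
    # have the same mean, i.e. the treatment has no effect.
    control = [38.2, 41.5, 39.9, 42.1, 40.3, 37.8]
    treated = [44.0, 46.3, 42.8, 45.1, 43.7, 47.2]

    # Step 3: perform the appropriate test to compute the P value.
    result = stats.ttest_ind(control, treated)   # unpaired (two-sample) t test
    p_value = result.pvalue

    # Report the P value itself; the comparison with alpha is only needed if
    # you truly must make a yes/no decision from this one experiment.
    print(f"P = {p_value:.4f}")
    if p_value < alpha:
        print("Reject the null hypothesis (the difference would be called statistically significant).")
    else:
        print("Do not reject the null hypothesis; this is not evidence that the effect is zero.")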
Significant vs. not significant: a legal analogy to guilty or not guilty

The statistical concept of "significant" vs. "not significant" can be understood by comparing it to the legal concept of "guilty" vs. "not guilty."

In the American legal system (and much of the world) a criminal defendant is presumed innocent until proven guilty. If the evidence proves that the defendant is guilty beyond a reasonable doubt, the verdict is "guilty," otherwise the verdict is "not guilty". In some countries, this verdict is "not proven", which is a better description. A "not guilty" verdict does not mean that the judge or jury concluded that the defendant is innocent; it just means that the evidence was not strong enough to persuade the judge or jury that the defendant was guilty.

The concepts of "significant" and "not significant" in statistical hypothesis testing are similar. In statistical hypothesis testing you start with the null hypothesis, which is usually that there is no difference between groups. If the evidence produces a small enough P value, you reject that null hypothesis and conclude that the difference is real. If the P value is higher than your threshold (usually 0.05), you don't reject the null hypothesis. This doesn't mean the evidence convinced you that the treatment had no effect, only that the evidence was not persuasive enough to convince you that there is an effect.

Interpreting low P values, the false discovery rate, and how much faith to put into a "statistically significant" result

The three steps discussed in the previous section sound straightforward, but it's important to note that you can't interpret statistical significance or a P value in a vacuum: your interpretation depends on the context of the experiment, which affects the false discovery rate (also called the false positive rate). Depending on the context of your study, the false discovery rate can be much higher than the value of alpha, which means that even if your result is statistically significant, it might still not be true. Thus, interpreting results requires common sense, intuition, and judgment. Here's an example.

Imagine that you are screening a drug to see if it lowers blood pressure. You use the usual threshold of P<0.05 as defining statistical significance. Based on the amount of scatter you expect to see and the minimum change you would care about, you've chosen the sample size for each experiment to have 80% power to detect the difference you are looking for with a P value less than 0.05.

If you do get a P value less than 0.05, what is the chance that the drug truly works?

The answer: It depends on the context of your experiment.

Let's start with the scenario where, based on the context of the work, you estimate there is a 10% chance that the drug actually has an effect. What happens when you test 1,000 drugs, each with a 10% chance of having an effect (Table 1)?

Given your 10% estimate, you would expect 100 drugs that work and 900 drugs that don't work. But this is only what you expect. Given these expectations, what would you observe?

To understand what you would observe, let's focus on the column that covers the 10% probability that a drug actually works. Since the power of your study is 80%, you expect only 80% of the 100 drugs that actually work to yield a P value less than 0.05, so the upper left cell is 80 drugs (i.e. 100 * 80%).

Now let's look at the column that covers the 90% probability that a drug does not actually work. Since you set the definition of statistical significance to 0.05, you expect 5% of the 900 drugs that don't actually work to yield a P value less than 0.05, so the upper right cell is 45 drugs (i.e. 900 * 5%).
Table 1. Reality vs. observation – how often does the P value deliver false positive and false negative results?*
(Expected numbers of drugs)

OBSERVED                                        REALITY: Drugs work    REALITY: Drugs don't work    TOTAL NUMBER OF DRUGS
P<0.05 (a "significant" effect is observed)              80                       45                         125
P>0.05 (no "significant" effect is observed)             20                      855                         875
TOTAL NUMBER OF DRUGS                                    100                      900                       1,000

*In this scenario, we evaluate 1,000 drugs, each estimated to have a 10% chance of actually having an effect, and the sample size is chosen to have 80% power of observing the expected difference with a P value less than 0.05.

Adding up the first row, you find that 125 drugs yield a "statistically significant" result; however, only 80 of these drugs actually work. The other 45 drugs yield a "statistically significant" result when the reality is that they don't actually work – these results are false positives or false discoveries. The false discovery rate (abbreviated FDR) is 45/125 or 36%. Not 5%, but 36%. This is also called the False Positive Rate (FPR).
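The arithmetic behind Table 1 can be checked in a few lines of Python; a minimal sketch, using the 10% prior probability, 80% power, 0.05 alpha, and 1,000 drugs from the scenario above:

    # Expected counts from Table 1 and the resulting false discovery rate.
    n_drugs = 1000
    prior = 0.10    # assumed chance that a drug really works
    power = 0.80    # chance that a real effect yields P < 0.05
    alpha = 0.05    # chance that a drug with no effect yields P < 0.05

    true_positives = n_drugs * prior * power          # 100 * 80% = 80
    false_positives = n_drugs * (1 - prior) * alpha   # 900 * 5%  = 45
    significant = true_positives + false_positives    # 125

    fdr = false_positives / significant
    print(f"'Significant' results: {significant:.0f}, FDR = {fdr:.0%}")   # FDR = 36%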

Table 2, from chapter 12 of Essential Biostatistics1, shows the FDR for this and three other scenarios.

Table 2. FDR for four different scenarios.

SCENARIO                                                                    PRIOR PROBABILITY   FDR FOR P<0.05   FDR FOR 0.045 < P < 0.050
Comparing randomly assigned groups in a clinical trial prior to treatment           0%                100%                 100%
Testing a drug that might possibly work                                             10%                36%                  78%
Testing a drug with a 50:50 chance of working                                       50%                 6%                  27%
Positive controls                                                                  100%                 0%                   0%

Each row in the table above is for a different scenario defined (before collecting data) by a different prior probability of there being a real effect. The middle column shows the expected FDR (also called FPR) as calculated above. This column answers the question: "If the P value is less than 0.05, what is the chance that there really is no effect and the result is just a matter of random sampling?" Note this answer is not 5%. The FDR is quite different than alpha, the threshold P value used to define statistical significance.
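The middle column of Table 2 follows from the same arithmetic applied to different prior probabilities; a short sketch, assuming the same 80% power and 0.05 alpha in every scenario:

    # Expected FDR as a function of the prior probability of a real effect
    # (reproduces the middle column of Table 2; 80% power, alpha = 0.05).
    power, alpha = 0.80, 0.05

    for prior in (0.0, 0.10, 0.50, 1.0):
        false_pos = (1 - prior) * alpha   # fraction of tests that are false positives
        true_pos = prior * power          # fraction of tests that are true positives
        fdr = false_pos / (false_pos + true_pos)
        print(f"prior = {prior:4.0%}   expected FDR = {fdr:.0%}")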
The right-most column, determined by simulations, asks a slightly different question based on work by Colquhoun2,3: "If the P value is just a little bit less than 0.05 (between 0.045 and 0.050), what is the chance that there really is no effect and the result is just a matter of random sampling?" These numbers are much higher. Focus on the third row, where the prior probability is 50%. In this case, if the P value is just barely under 0.05 there is a 27% chance that the effect is due to chance. Note: 27%, not 5%! And in a more exploratory situation where you think the prior probability is 10%, the false discovery rate for P values just barely below 0.05 is 78%. In this situation, a statistically significant result (defined conventionally) means almost nothing if you're trying to minimize false positives.

Which brings us back to the question of how to interpret a low P value and what a statistically significant finding means.

The bottom line is:

Even if you have a low P value and, thus, a statistically significant difference between your control and your sample, there are many situations where a high false discovery rate/false positive rate can render that result close to meaningless. You have to use judgment, intuition, common sense, and your scientific acumen to interpret your results.

More Mistakes to Avoid

ADVICE: Don't P-Hack, it can introduce bias into your results, leading to incorrect conclusions

Statistical results can only be interpreted at face value when every choice in data analysis was performed exactly as planned and documented as part of the experimental design. If you adjust and re-analyze your data after your initial analysis, you can introduce bias, leading to incorrect conclusions.

Here's an example of a problematic data analysis workflow: you collect and analyze data, are unhappy with the result and then add additional data and re-analyze; remove a few outliers; transform to logarithms; try a nonparametric test; redefine the outcome by normalizing (say, dividing by each animal's weight); use a method to compare one variable while adjusting for differences in another; the list of possibilities is endless. Keep trying until you obtain a statistically significant result or until you run out of money, time, or curiosity.

The problem with the approach above is that if there really is no difference or no effect, the chance of finding a "statistically significant" result will still exceed 5%, sometimes by a large amount (a small simulation of this effect is sketched after the list below). Therefore, if you only collect more data or analyze the data differently when the P value is greater than 0.05, your results will be biased. If the P value was less than 0.05 in the first analysis, it might be larger than 0.05 after collecting more data or using an alternative analysis. However, you'd never see this outcome because you only collected more data or tried different data analysis strategies when the first P value was greater than 0.05 (see the next section for a more detailed discussion).

The term P-hacking was coined by Simmons et al.4, who also use the phrase, "too many investigator degrees of freedom." This is a general term that encompasses dynamic sample size collection, HARKing, and more. There are three kinds of P-hacking:

• The first kind of P-hacking involves changing the actual values analyzed. Examples include ad hoc sample size selection, switching to an alternate control group (if you don't like the first results and your experiment involved two or more control groups), trying various combinations of independent variables to include in a multiple regression (whether the selection is manual or automatic), trying analyses with and without outliers, and analyzing various subgroups of the data.

• The second kind of P-hacking is reanalyzing a single data set with different statistical tests. Examples: trying parametric and nonparametric tests; analyzing the raw data, then analyzing the logarithms of the data.

• The third kind of P-hacking is the garden of forking paths5. This happens when researchers perform a reasonable analysis given their assumptions and their data, but would have done other analyses that were just as reasonable had the data turned out differently.
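To illustrate how this kind of flexibility pushes the false positive rate above 5%, here is a small simulation sketch; the particular rescues tried (switching to a second test, then doubling the sample size) and all of the numbers are arbitrary choices for illustration, not a recipe from the text:

    # Simulate experiments where both groups come from the same population,
    # then "rescue" non-significant results by trying a second test and,
    # failing that, collecting more data. The false positive rate exceeds 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_experiments, n_per_group, alpha = 10_000, 10, 0.05
    planned_hits = 0
    hacked_hits = 0

    for _ in range(n_experiments):
        a = rng.normal(40, 15, n_per_group)
        b = rng.normal(40, 15, n_per_group)     # same population: no real effect
        p = stats.ttest_ind(a, b).pvalue
        planned_hits += p < alpha               # the pre-planned analysis only

        if p >= alpha:                          # hack 1: switch to another test
            p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        if p >= alpha:                          # hack 2: add more subjects and re-test
            a = np.concatenate([a, rng.normal(40, 15, n_per_group)])
            b = np.concatenate([b, rng.normal(40, 15, n_per_group)])
            p = stats.ttest_ind(a, b).pvalue
        hacked_hits += p < alpha

    print(f"planned analysis only: {planned_hits / n_experiments:.1%} false positives")
    print(f"after P-hacking:       {hacked_hits / n_experiments:.1%} false positives")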
Exploring your data can be a very useful way to generate hypotheses and make preliminary conclusions. But all such analyses need to be clearly labeled, and then retested with new data.

ADVICE: Don't HARK (Hypothesizing after the result is known)

Hypothesizing After the Result is Known (HARKing, Kerr 1998) is when you analyze the data many different ways (say different subgroups), discover an intriguing relationship, and then publish the data so it appears that the hypothesis was stated before the data were collected.

ADVICE: Don't keep adding subjects until you hit significance, you'll get misleading results

The problem with continuing to add subjects until you hit significance is the same as the previous point about P-hacking: you'll introduce bias into your results, leading you to come to incorrect conclusions. The best approach is to calculate what your sample size should be to see the difference between means that you expect at the alpha that you define as significant, and then stick with that sample size.

Here's a simulation that demonstrates the problem.

Figure: Three different simulations that demonstrate the hazards of adding subjects until data reach statistical significance (indicated by the green area adjacent to the x-axis).

Here we simulated data by drawing values from a Gaussian distribution (mean=40, SD=15, but these values are arbitrary). Both groups were simulated using exactly the same distribution. We picked N=5 in each group and computed an unpaired t test and recorded the P value. Then we added one subject to each group (so N=6) and recomputed the t test and P value. We repeated this until N=100 in each group. Then we repeated the entire simulation three times. These simulations were done by comparing two groups with identical population means, so any "statistically significant" result we obtain must be a coincidence, a Type I error.

Experiment 1 (green) reached a P value of less than 0.05 when N=7, but the P value is higher than 0.05 for all other sample sizes. Experiment 2 (red) reached a P value of less than 0.05 when N=61 and also when N=88 or 89. In Experiment 3 (blue) the curve hit a P value of less than 0.05 when N=92 to N=100.

If we followed the sequential approach, we would have declared the results in all three experiments to be "statistically significant". We would have stopped when N=7 in the first (green) experiment, so would never have seen the dotted parts of its curve. We would have stopped the second (red) experiment when N=61, and the third (blue) experiment when N=92. In all three cases, we would have declared the results to be "statistically significant".

Since these simulations were created for values where the true mean in both groups was identical, any declaration of "statistical significance" is a Type I error (rejection of a true null hypothesis). If the null hypothesis is true, i.e. the two population means are identical, we would expect to see this kind of Type I error in 5% of experiments (if we use the traditional definition of alpha=0.05, so P values less than 0.05 are declared to be significant). But with the sequential approach used in the simulations, all three of our experiments resulted in a Type I error. If you extended the experiment long enough (infinite N) all experiments would eventually reach statistical significance. Of course, in some cases you would eventually give up even without "statistical significance". But this sequential approach will produce "significant" results in far more than 5% of experiments, even if the null hypothesis were true, rendering this approach invalid.
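The simulation described above can be reproduced in a few lines; a sketch, assuming the same Gaussian population (mean=40, SD=15), group sizes growing from N=5 to N=100, and many repeated experiments so the overall Type I error rate of the sequential rule can be estimated:

    # Reproduce the sequential-sampling simulation: both groups are drawn from
    # the same population, and we stop as soon as P < 0.05. The fraction of
    # experiments that ever reach "significance" is far larger than 5%.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_experiments, alpha = 1_000, 0.05
    stopped_early = 0

    for _ in range(n_experiments):
        a = list(rng.normal(40, 15, 5))
        b = list(rng.normal(40, 15, 5))
        while len(a) <= 100:
            if stats.ttest_ind(a, b).pvalue < alpha:
                stopped_early += 1              # declared "significant" and stopped
                break
            a.append(rng.normal(40, 15))        # add one subject to each group
            b.append(rng.normal(40, 15))

    print(f"Experiments declared 'significant' at some point: "
          f"{stopped_early / n_experiments:.1%}")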
Bottom line: It is important that you choose a sample size and stick with it. You'll fool yourself if you stop when you like the results but keep going when you don't. The alternative is using specialized sequential or adaptive methods that take into account the fact that you analyze the data as you go. To learn more about these techniques, look up "sequential" or "adaptive" methods in advanced statistics books.

REFERENCES

1. Motulsky HJ. Essential Biostatistics: A nonmathematical approach. Oxford University Press; 1st edition (June 30, 2015). ISBN: 978-0199365067. www.essentialbiostatistics.com.

2. Colquhoun, D. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science. 2014. 1(3):140216. http://doi.org/10.1098/rsos.140216.

3. Colquhoun, D. The False Positive Risk: A Proposal Concerning What to Do About p-Values. The American Statistician. 2019. 73:supplement 1.

4. Simmons, J. P., Nelson, L. D., & Simonsohn, U. False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science. 2011. 22(11):1359–1366.

5. Gelman, A., & Loken, E. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. (2013). Unpublished as of Jan. 2016.
