
Review Article

Laboratory Animals
2019, Vol. 53(1) 28–42
© The Author(s) 2018
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/0023677218783223
journals.sagepub.com/home/lan

Guidelines on statistics for researchers using laboratory animals: the essentials

Romain-Daniel Gosselin1,2

Abstract
There is growing concern that the omnipresence of flawed statistics and the deficient reproducibility that arises therefrom results in an unethical waste of animals in research. The present review aims at providing guidelines in biostatistics for researchers, based on observed frequent mistakes, misuses and misconceptions as well as on the specificities of animal experimentation. Twelve recommendations are formulated that cover sampling, sample size optimisation, choice of statistical tests, understanding p-values and reporting results. The objective is to expose important statistical issues that one should consider for the correct design, execution and reporting of experiments.

Keywords
experimental design, statistics, sample size, reproducibility, guidelines
Date received: 15 October 2017; accepted: 21 May 2018

Statistics is the science of rigorously quantifying uncertainty, and applying it to life sciences (i.e. biostatistics) has become indispensable due to the capriciousness of biological readouts. In animal research, flawless biostatistics is essential for interpreting and reproducing results and thus avoiding the unnecessary and unethical use of animals. Despite this, there is a well-documented universal misunderstanding and misuse of biostatistics, which can have critical consequences on the validity of the research.1,2 Mistakes in experimental design,2–4 interpretation of p-values,5,6 statistical analysis,7–9 and presentation,10,11 can result in devastating ethical and financial costs and low success rates of subsequent clinical trials or technology development. The present article aims to introduce good practices into biostatistics in research employing laboratory animals. I will expose pitfalls and mistakes frequently encountered, followed by viable recommendations.

Scope and limitations

The review is addressed mainly to life scientists who are not familiar with the fundamental concepts of statistics but who need to use biostatistics in their research activity. Nevertheless, life scientists with a more advanced command of statistics could use the present article as a 'refresher course'. The review focuses on statistics and will only touch upon ethical issues. It will begin with a very brief introduction that reviews the important concepts in biostatistics and is then divided into sections that recapitulate the three arms of quantitative experiments, namely design, analysis and reporting. Each section is sub-divided into a series of recommendations (Figure 1). Nevertheless, the author is aware that concepts addressed in distinct sections and subsections are interconnected. Furthermore, although prepared with the objective of guiding researchers involved in animal research, the concepts exposed herein are fully generalisable to other scientific fields such as in-vitro or clinical studies. Finally, the entire world of biostatistics cannot be covered in a single review; consequently, this review will focus on a selection of topics deemed more relevant for a journal about animal research.

1 Biotelligences LLC, Lausanne, Switzerland
2 Precision Medicine Unit, Lausanne University Hospital (CHUV), Switzerland

Corresponding author:
Romain-Daniel Gosselin, Precision Medicine Unit, Lausanne University Hospital (CHUV), Chemin des Roches 1a/1b, CH-1010 Lausanne, Switzerland.
Email: rdg@biotelligences.com

Figure 1. Overview of the recommendations suggested in the present article.

Therefore, some readers from specific domains or with precise expectancies (for example in statistics not based on hypothesis testing), might find this thematic selection and the angle by which these issues are covered hereafter somewhat partial.

Hypothesis testing

The following paragraph serves as a very concise outline on the specific domain of hypothesis testing and gives the backbone of the notions and terminology exposed hereafter.
Life scientists study populations, a term defined as the total set of observations that can be made. However, populations are often (if not always) inaccessible; therefore, scientists rely on small subsets of observations named samples. Experimental units are the smallest entities that can be assigned to a treatment and, therefore, at the same time are the true biological replicates that can be randomised and used to form a sample of independent observations. This notion of independence is defined by the fact that the value of one experimental unit does not affect the value of another. On the other hand, sampling units are the smallest entities that will be measured or observed. When multiple sampling units are taken from each experimental unit, they are called technical replicates. Populations of interest are characterised by their properties, named parameters, whereas the estimates of these parameters in samples are eponymously called statistics. Researchers may want to simply describe the properties of the samples through so-called descriptive statistics or, conversely, try to reach conclusions about population parameters from the samples using inferential statistics.
Adequate experimental designs ensure good internal validity (i.e. validity of the method that enables the experiment to draw inferences) and external validity (i.e. generalisability) of the study, which can both be threatened by many flaws in the research protocol. The important ones are covered in the present review. Investigators often study relationships between so-called independent variables (the values of which are 'controlled' by the experimenter,

such as sex or treatment) and dependent variables (variables the investigator quantifies). To avoid confusion between different uses and meanings of the term 'independence', the words input and output variables are preferred throughout the text. Input variables are called factors when categorical (e.g. groups, such as sex) and the different values they take are called levels (e.g. male or female).
The idea that scientific questions may have clear-cut answers is almost invariably irrelevant in the life sciences because observations are naturally fickle, even when collected from the same individual, but also due to variability that arises from measurements. The purpose of statistical tests is to extract the necessary information by removing the noisy background to attempt to make decisions about populations. The most frequent approach is null hypothesis significance testing (NHST), which is measured using a p-value. Despite their great diversity, statistical tests generally proceed in a similar series of steps: 1. based on the biological question, a so-called null hypothesis (H0) is formulated, which theorises the absence of effect, relationship or difference; 2. data are collected and the effect, difference or relationship observed is converted into a statistic (usually named with a single letter or symbol depending on the test); 3. the p-value is calculated, which corresponds to the probability of obtaining this statistic or a larger one if H0 is true; and, finally, 4. a biological conclusion is drawn, usually based on a comparison between p and an alpha threshold (usually set at 5% in life sciences). This classical approach of NHST is called the Neyman-Pearson approach.
Depending on the question asked and hypothesis formulated, statistical tests belong to categories that differ in the precise way they work. Throughout the article, we will often refer to notions and thinking that belong to so-called location tests (such as Student's t-test), which make inferences about central tendency (i.e. mean, median) of the population. This relative focus is justified by the prevalence of these tests in animal research, which makes their terminology familiar to researchers. However, other families of tests exist, such as variance tests (e.g. Levene's test), proportion tests (e.g. Chi2 test), association tests (e.g. Pearson's and Spearman's correlation) or distribution tests (e.g. Shapiro-Wilk test or Kolmogorov-Smirnov test), and will be mentioned periodically.
Importantly, great care must be given not to confuse terms and notions used in statistics with their counterparts in common language. For example, when comparing blood pressure between mice and rats, the statistical population refers to the values of the blood pressure, not the species, which are obviously different biological populations.

Statistical design

Recommendation 1: Consult a statistician to design and analyse the study, and consider statistics, including planned tests, during the design phase

In 1938, Ronald Fisher, one of the founders of modern statistics, stated that 'to call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of'.12 Following on from this, in their landmark article published in 2004 entitled Guidelines for reporting statistics in journals published by the American Physiological Society, Curran-Everett and Benos introduced nine guidelines for designing and reporting quantitative results in physiology, and suggested that researchers should consult a statistician if in doubt when planning the study.13 The attention given to statistics at a very early stage of the research project is salutary since flaws in design are often irreversible and require a completely new experiment. Among others, early important considerations include the type of variables to be measured, the unpaired or paired (repeated measures) nature of the observations, the number of experimental groups, the conditions of animal housing or the sex to be used. A professional statistician will help define such important considerations since accumulating evidence suggests that most researchers are not sufficiently proficient in experimental design and biostatistics to do it alone.14 These decisions determine the nature of future statistical analyses to be performed and therefore the subsequent power and sample size calculations (see Recommendations 3, 4 and 5). Figure 2 illustrates this point by showing how input and output variables influence the subsequent statistical tests.
For example, the collection of a continuous variable (e.g. tumour size or plasma cytokine concentration) allows the use of a parametric test (see Recommendation 8) for data analysis, but such tests are very sensitive to extreme individual values (outliers). In this example, the investigator must therefore consider very early what policy to implement regarding outlier exclusion and the alternative possibility of collecting categorical variables (e.g. disease score or nature of cancerous organ) instead. These early decisions will constitute the backbone on which the entire design is built. It is recommended to systematically review the literature or perform a pilot study to get valuable information about the variables of interest.
A particular point to note here is the selection of the sex of the animals in the experimental design, since males and females may display divergent biological

Figure 2. Influence of the nature of the variables on the subsequent statistical tests. Grey boxes outlined in bold are
expanded in tables indicated by arrows to illustrate the hidden multiple possibilities and choices that exist within each box.
The list of tests is not exhaustive.
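To make the four NHST steps described above concrete, here is a minimal sketch in Python (using SciPy, which is my own choice of tooling; the article prescribes no software). The data are invented for illustration: a two-group comparison where H0 posits no difference between population means.

```python
from scipy import stats

# Hypothetical tumour-volume readouts (mm^3) from two treatment groups;
# the numbers are illustrative, not taken from the article.
control = [102.0, 98.5, 110.2, 95.7, 104.1]
treated = [84.3, 88.9, 79.5, 91.2, 82.7]

# Step 1: H0 = the two population means are equal (no treatment effect).
# Steps 2-3: the t-test converts the observed difference into a statistic (t)
# and computes p, the probability of a statistic at least this large if H0 is true.
t_stat, p_value = stats.ttest_ind(control, treated)

# Step 4: compare p with a predefined alpha threshold (conventionally 0.05).
alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```

Note that step 4 yields only a decision about H0, not the probability that H0 is true (see Recommendation 7).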

readouts. Experimental designs usually incorporate only one sex, males in the majority of cases, in an attempt to reduce the total variability.15 It should not be ignored that this selection bias may impede the validity of the study.16 This point has been well demonstrated in recent studies on preclinical models of chronic pain, for which years of mechanistic research conducted on male rodents turned out to be untrue in females.17,18 The inclusion of both sexes can be made through specific designs such as factorial designs. This allows the estimation of treatment effect, sex effect, and the effect due to the interaction of both.19
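The factorial logic just described can be sketched numerically. In a 2×2 design (treatment × sex), the interaction is the extent to which the treatment effect differs between the sexes. The cell-mean arithmetic below uses made-up numbers purely for illustration; a real analysis would use a two-way ANOVA as discussed by Festing et al.19

```python
from statistics import mean

# Hypothetical output-variable values per cell of a 2x2 factorial design.
cells = {
    ("vehicle", "male"):   [10.1, 9.8, 10.4],
    ("vehicle", "female"): [10.0, 10.3, 9.9],
    ("drug", "male"):      [12.2, 11.9, 12.5],  # drug raises the readout in males...
    ("drug", "female"):    [10.2, 10.0, 10.5],  # ...but barely in females
}
m = {k: mean(v) for k, v in cells.items()}

# Simple (per-sex) treatment effects: drug minus vehicle.
effect_male = m[("drug", "male")] - m[("vehicle", "male")]
effect_female = m[("drug", "female")] - m[("vehicle", "female")]

# Main effect of treatment: the per-sex effects averaged.
treatment_effect = (effect_male + effect_female) / 2

# Interaction: how much the treatment effect depends on sex.
interaction = effect_male - effect_female

print(f"treatment effect = {treatment_effect:.2f}, interaction = {interaction:.2f}")
```

A large interaction, as in this fabricated example, is precisely the situation where a single-sex design would have produced a misleading overall conclusion.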

Recommendation 2: Take early steps to defeat pseudo-replication and confounding

Pseudo-replication, as coined by Hurlbert, is defined as 'the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent'.20 On the other hand, confounding refers to a situation in which the association or effect between the variables is distorted by the existence of (at least) another variable. Although not synonymous, both issues may arise from insufficiently defining the exact structure of the experiment, i.e. the identification, selection, and definition of variables and levels, as well as the rules by which the levels will be allocated to the experimental units.
Pseudo-replication may occur when the investigator fails to distinguish experimental and sampling units. In view of the definitions presented in the Introduction, it appears that, although experimental and sampling units may be the same objects, in many designs they will be distinct. For example, the collection of a series of biopsies as technical replicates from each animal in an experiment is an example of multiple sampling units for each experimental unit. It implies that the final sample size (basis of the statistical analysis) is the number of animals, not the number of biopsies. Wrongly considering sampling units as independent experimental observations artificially inflates sample sizes, violates the principle of independence and results in an invalid analysis. Greater care must therefore be taken to correctly define and identify experimental units during the design, in other words to identify which objects will be the independent observations in the future statistical analysis.
In the above-mentioned example, the identification of experimental (rats) and sampling (biopsies) units was intuitive, but the task may require more careful attention for some designs. A frequent scenario unfolds when animals housed in the same cage are not independent because the housing cage influences the output variable. Indeed, some biological readouts such as food consumption,21 gut microbiota,22,23 emotionality,24 or eye pathology due to light intensity,25 are influenced by the specific housing conditions of the cage or by animal interactions within the cages. Here, if all animals within the same cages are exposed to the same treatment, animals are not independent observations, since the actual design structure is that cages are the actual experimental units (they are randomised to the treatment) and the animals are the sampling units.
In the above example, 'cage' is not an input variable of primary interest but even so may influence the value of the output variable. It is called a cofactor because this variable is categorical, the term covariate being preferred for a continuous variable. They are considered nuisance variables that may generate spurious relationships between the variables of interest and therefore jeopardise the internal validity of the study. Their effect may be unnoticeable (hidden or lurking variables) or indistinguishable from the effect of the input variable (confounding variables).
Hidden variables are frequent in experimental design and, as such, this issue should be addressed in order to balance their impact and maximise the validity of the study. For example, the cage effect described above may be controlled during the design phase by randomisation that randomly assigns animals to treatment paradigms across the cages. This will generate a mix of treatment paradigms within the different cages. To achieve this, I recommend using free online tools such as the GraphPad QuickCalcs available at www.graphpad.com/quickcalcs/randMenu. In some designs, blocking is another possible strategy (the known source of variation is called a blocking factor) by which the researcher arranges experimental units in groups that are considered homogeneous with respect to the blocking factor. In so-called randomised block designs, experimental units within blocks are subsequently randomised in an attempt to control for other potential extraneous variables. An extended description of the various options in experimental design has been given by Festing et al.19
It is worth mentioning that randomisation is sometimes experimentally impractical. In the above example, animals with similar treatments would be housed in the same cage. To overcome this, specific analyses can be run that depend on the nature of the variables, such as mixed-effects ANOVA (also called split-plot ANOVA), factorial ANOVA, analysis of covariance, Cochran-Mantel-Haenszel Chi2, partial correlation or multiple regression. I recommend consulting a professional statistician to select the appropriate design and analysis (see Recommendation 1).
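As a complement to the online randomisation tools cited above, the random allocation of animals to treatments can also be scripted in a few lines. The sketch below is my own illustration using only Python's standard library, with a fixed seed so that the allocation is reproducible and auditable.

```python
import random

animals = [f"animal_{i:02d}" for i in range(1, 13)]  # 12 experimental units
treatments = ["vehicle", "drug"]

rng = random.Random(2024)  # fixed seed -> the allocation can be reproduced exactly
shuffled = animals[:]
rng.shuffle(shuffled)

# Equal-sized groups: the first half receives one treatment, the second half the other.
half = len(shuffled) // 2
allocation = {a: treatments[0] for a in shuffled[:half]}
allocation.update({a: treatments[1] for a in shuffled[half:]})

for animal in animals:
    print(animal, "->", allocation[animal])
```

The same shuffle-within-group idea extends to randomised block designs: apply it separately inside each block (e.g. each cage or each litter).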

A power of 0.8 (80%) and alpha set at 0.05 (5%) are the usual values defined by convention in life sciences.26–29 These standards should be viewed as the lower bounds in terms of severity and may be largely revised to be more conservative (i.e. more severe) to match the specific requirements of the project. For example, a researcher investigating a possible increase of brucellosis in cattle might want to increase the power of the study up to 0.99 due to the potentially negative consequences of false negative results on public health, while the researcher may keep the alpha threshold at 0.05 as false positives are less of a concern in this instance. Conversely, in confirmatory studies with an aim of identifying infected animals, the power could be revised downwards (e.g. 0.8) and alpha set at a more conservative value (e.g. 0.001) to prevent false positive results. A very clear explanation of this point was recently reviewed by Mogil and Macleod and I therefore invite the reader to refer to this article.30
It must be noted that modifications of alpha will impact the power if the SD, effect size or sample size remain the same (see Recommendation 4). Simply put, all else being equal, the smaller the alpha value the lower the power. This clearly reflects the difficulty of minimising false positives and false negatives simultaneously.

Recommendation 4: Estimate the sample size required to reach the desired power based on all available information

Many parameters influence power (see Recommendation 5) and the suggested approach to calculate the appropriate sample size includes three steps outlined in Figure 3. In the first step, all relevant information is gathered using scientific knowledge of the literature or pilot studies. Expected means or medians, SDs and probabilities of events should all be collected. If the researcher expresses doubt about the validity and reproducibility of data disclosed in the literature, it is advisable to use a pilot study or prior information from his/her own publications.
In the second step, the minimum sample size to reach the predefined power (see Recommendation 3) is calculated, usually using a computer-assisted tool such as G*Power (www.gpower.hhu.de). This calculation utilises the values of the parameters collected during the previous step in accordance with the planned statistical procedure that will be performed.
The third and last step consists of adjusting the estimated sample size upwards by considering the anticipated severe events. For example, if a sample size calculation indicates 10 animals, and it is anticipated that 10% of animals will die because of treatment and therefore not be statistically exploitable, one extra animal should be added into the experimental design.
Justification of sample size by habits, i.e. the sample size used in the laboratory for years, is not advised; many parameters, such as species, strains, providers, variability, machines, or exact protocols change over time. Similarly, whenever possible, it is recommended to avoid sample sizes based only on the type of experimental technique to be done since central values, SDs and event probabilities are a function of the exact variable of interest in addition to the technique itself. For example, deciding that western-blot analysis will always use extracts from five animals in an entire project that investigates different families of proteins may result in unreliable results due to the inconsistent SD of protein content across protein families. If only one sample size must be chosen in the whole research project, it should be computed using the variability of the most inconstant variable, such as a target protein for example; this can be determined through systematic review of the literature or using a pilot study.
Finally, it is worth mentioning that small samples not only considerably reduce power and are consequently a large source of false negatives, but they also inflate the false discovery rate (FDR), which is the expected proportion of false positives in the declared significant results.31 It is more likely to collect a series of similar and non-representative observations by chance alone in a small sample. Similarly, small sample sizes induce a risk of inflated effect size estimates.4 Therefore, small p-values should always be considered possible false positives if small samples are used, even in the presence of very large effect sizes.

Recommendation 5: Consider options to increase power and precision without increasing sample size

Sample size is positively correlated with power, theoretically prompting for larger samples in animal research. However, the '3Rs' (replace, refine and reduce) provide a framework for reasonable and humane animal research and require (alongside financial and logistic restrictions) that the number of animals included should be reduced to the minimum that allows the experiment to meet the objectives of the research. The final sample size is therefore the result of a tradeoff between the statistical need for high power and the practical constraints of research. Although strongly connected, the notions of power and sample size are not strictly interdependent and can be treated separately. In inferential statistics, additional parameters beyond sample size have an important impact

Figure 3. Process to follow to determine the final sample size.
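The step-two calculation is normally done with dedicated software such as G*Power, but the underlying arithmetic for a two-group comparison of means can be sketched with the standard normal approximation. This is a simplification of mine: exact t-based calculations, as used by G*Power, give a slightly larger n. The step-three adjustment for anticipated losses is also shown.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means (effect size = Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_power = z.inv_cdf(power)          # quantile corresponding to desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n = n_per_group(effect_size=0.8)  # a 'large' effect with conventional alpha and power
print("per group:", n)

# Step 3: inflate for anticipated attrition (e.g. 10% of animals expected to die).
attrition = 0.10
adjusted = ceil(n / (1 - attrition))
print("after attrition adjustment:", adjusted)
```

The formula makes the trade-offs in Recommendation 3 visible: a smaller alpha or a higher power enlarges the required n, while a larger effect size (or a smaller SD, which enlarges d) shrinks it.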

on power. These include: the spread (SD) of the variable of interest when the variable is continuous; the probability of successful events for dichotomous variables; and the desired alpha. In observational studies, in which the objective is often to quantify a variable as precisely as possible, the concept of power is irrelevant and sample size calculation answers the question 'what is the minimum number of observations needed to have a confidence interval of a certain width?' Similarly, the SD of the variable of interest or probability of events influence the width of confidence intervals. Therefore, an investigator may legitimately attempt to increase the statistical power or precision of a research design without increasing (even while reducing) the sample size (Table 1).

Recommendation 6: Avoid sequential sampling and optional stopping

An important cautionary note must be issued here regarding the sequential addition of new animals to an apparently underpowered design that seems to give a 'trend close to significance'. So-called sequential sampling occurs when not all animals are sampled at the same experimental time but are rather added sequentially. This approach may be justified for practical reasons when the full sampling cannot be done at one time point, but it is not recommended to perform the statistical analysis before the predefined sample size is reached. p-Values are variable and will sometimes cross the alpha threshold by chance only.27

Table 1. Examples of methods for improving power without increasing sample size.

Reduce variability:
- Use a trained experimenter to collect the data
- Blocking/stratification (example: analyse males and females separately)
- Use techniques with lower variability (example: use body temperature instead of plasma cytokine concentrations as output variable to measure disease severity)

Change the variables:
- Use continuous variables (e.g. the exact length in cm instead of categories such as short and long) to quantify growth
- Study more marked outcomes (e.g. use tumour size instead of survival to study a new anti-mitotic drug)
- Study more frequent outcomes (e.g. use old rats to study inter-animal infectiousness of a new Mycoplasma strain since lung infection is rare in young rats)

Use paired measurements:
- Use repeated measures on the same animals
- Use cross-over designs

Use a less severe alpha threshold:
- Cautionary note: statistically possible but not recommended since it inflates the risk of false positives. Severe alpha thresholds must be used in confirmatory experiments to reduce false positives.
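The danger of optional stopping discussed in Recommendation 6 can be demonstrated by simulation. The sketch below is my own illustration, using only Python's standard library and, for simplicity, a z-test with known unit variance. It compares the false-positive rate of a single fixed-n analysis with that of an analysis repeated after every additional pair of animals, under a true null hypothesis.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(7)                  # seeded so the simulation is reproducible
z_crit = NormalDist().inv_cdf(0.975)    # two-sided alpha = 0.05

def z_reject(a, b):
    # z-test for the difference of two means, with known unit variance
    n = len(a)
    z = (mean(a) - mean(b)) / sqrt(2 / n)
    return abs(z) > z_crit

n_final, n_sims = 30, 2000
fixed_hits = peek_hits = 0
for _ in range(n_sims):
    a = [rng.gauss(0, 1) for _ in range(n_final)]
    b = [rng.gauss(0, 1) for _ in range(n_final)]  # same population: H0 is true
    if z_reject(a, b):                             # single analysis at the planned n
        fixed_hits += 1
    # optional stopping: test after every animal from n = 10 up to n = 30
    if any(z_reject(a[:n], b[:n]) for n in range(10, n_final + 1)):
        peek_hits += 1

print(f"fixed-n false-positive rate: {fixed_hits / n_sims:.3f}")  # close to 0.05
print(f"with repeated peeking:       {peek_hits / n_sims:.3f}")   # clearly inflated
```

Every extra look gives chance another opportunity to cross the threshold, which is exactly why interim analyses require a pre-specified schedule and an alpha correction.33,34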

Thus, performing the statistical analysis only when the predefined total number of animals is reached, corresponding to the estimated sample size, ensures the acquisition of the sought power while preventing researchers from stopping the sampling process when an 'expected' p-value is obtained. Otherwise, sequential sampling results in optional stopping, which is a form of p-hacking,32 and may result in deeply biased conclusions with an inflated chance of false positives (Table 2).

Table 2. Definition and examples of p-hacking.

p-Hacking: collection or selection of data or statistical analyses until 'non-significant' results become 'significant'.
Examples:
- Stop collecting data once p < α
- Perform many tests but report only those with p < α
- Decide whether to include or exclude outliers after the analysis to reach p < α
- Transform or rearrange treatment groups after the analysis to reach p < α

Nevertheless, experimental design may be adaptive in certain situations where the experimenter needs to review and analyse the results before the study reaches the predefined power. This includes the need to monitor remarkable benefits or unacceptable harmful effects, both prompting an early discontinuation of the protocol on ethical grounds. In such cases, interim analyses can be implemented; these are methods initially developed for clinical trials. However, it is vital not only that interim analyses are decided during the design phase, including the number of times the investigator will look at the data, but also that a correction is applied to the alpha threshold to prevent the uncontrolled inflation of false positives.33,34

Statistical analysis

Recommendation 7: Do not use p-values uncritically

In hypothesis testing, statistical analyses almost invariably culminate with p-values, which illustrates the entrenched culture that so-called significant p-values are compulsory and sufficient to reveal a difference or a relationship. The computation of a p-value is often reduced to a binary 'yes-or-no' test based on whether p is below an arbitrary alpha limit (usually 0.05). The exact p is not deemed meaningful, as opposed to the statistical significance. However, a 'scale of evidence' is paradoxically usually chosen in the form of multiple thresholds, typically at 0.05, 0.01 and 0.001. This paradigm, based on a deeply flawed logic, disguises the probabilistic substance of statistics, which only estimates uncertainty, and gives significant p-values a false impression of truth.5,6
The definition of the p-value is the probability of getting an effect (difference or correlation) of at least the observed magnitude if the null hypothesis H0 (that posits no difference or correlation) is true. This definition is implicitly and wrongly confused with the probability of H0, and p-values close to 0 are considered an indicator of the improbability of H0.6 It is very important to keep in mind that the p-value does not indicate the probability of the tested hypothesis. When p = 0.05

the probability of H0 being true is often higher than 5%,35 but the formal demonstration of this is beyond the scope of this article. One way of thinking about this point is to understand that if p-values were a faithful reflection of the truth and of the probability of H0, repeated samplings of compared populations would give similar p-values. However, p-values are extremely variable, especially for small samples, because sampling is inconsistent.27,36 To get a better understanding of this concept, it is possible to imagine the following example. An experimenter makes a mistake after a syringe mislabeling and treats two groups of rats with the exact same treatment, collects the variable of interest (e.g. blood pressure) and decides to compute p-values. The probability of the null hypothesis is p(H0) = 1 since there is no doubt about the impossibility of an effect. Nevertheless, sampling inconsistency generates samples with slightly divergent statistics even though all rats underwent the same treatment. Consequently, the comparisons will give a wide variety of p-values that will only occasionally approach 1 and will even sometimes be below 0.05 by chance. Therefore, none of the computed p-values will reflect the actual probability of H0, which is 1.
Another frequent misconception is the belief that the magnitude of the effect is commensurate with the value of p. p-Values provide no information about effect sizes or the magnitude of the effect, but simply give the strength of the evidence against the null hypothesis. For example, increasing the sample size tends to make p-values shrink mathematically although the magnitude of the biological effect is strictly the same. From the above, the following guidelines are given: 1. use the conditional tense even in the case of small p-values; 2. avoid the word significant alone and prefer the idioms statistically significant, statistically convincing or statistically likely; 3. whenever possible, draw conclusions on effect sizes alongside exact p-values rather than significance thresholds only; 4. if a threshold is used, disclose only one value and avoid the use of a gradual scale of thresholds to indicate an apparent increase in statistical significance.

Recommendation 8: Make an informed choice between parametric and non-parametric tests

The arsenal of tests and their variants is vast and one possible decision is the choice between parametric and non-parametric tests, the full descriptions of which are precluded by obvious space limitations in this article. The most frequent tests of both categories are indicated in Table 3 and the decision algorithm to decide whether to select one type or the other is summarised in Figure 4. Briefly, the power of parametric tests is generally higher but several assumptions must be verified, otherwise the test readouts may be invalid and a reduction of power may be observed.37,38 Parametric

Table 3. The most frequent parametric and non-parametric tests.

| Variables and context | Parametric test | Non-parametric test |
|---|---|---|
| Correlation: 1 continuous output and 1 continuous input variable | Pearson r | Spearman ρ |
| 1 continuous output variable and 1 categorical input variable (1 group) | One-sample Student's t test (one-sample Z test, rare) | One-sample Wilcoxon test (OSRT, requires symmetrical data around median); one-sample sign test (less powerful than OSRT, but not distribution sensitive) |
| 1 continuous output variable and 1 categorical input variable (2 groups) | Unpaired Student's t test (unpaired Z test, rare) | Mann-Whitney test (Wilcoxon rank-sum test) |
| 1 continuous output variable and 1 categorical input variable (2 paired groups, same animals) | Paired Student's t test | Wilcoxon signed-rank test |
| 1 continuous output variable and 1 categorical input variable (>2 unpaired groups) | One-way ANOVA | Kruskal-Wallis test (KW, rank test based on the χ2 distribution); Mood's test (median test, more robust to outliers but less powerful than KW) |
| 1 continuous output variable and 1 categorical input variable (>2 paired groups, same animals over time) | Repeated-measures one-way ANOVA | Friedman's test |

Figure 4. Decision algorithm to help choose between parametric and non-parametric tests.
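To make the decision concrete, here is a hypothetical sketch (not from the article) contrasting the two routes of Figure 4 on a skewed, log-normally distributed readout, with Levene's test used to check homoscedasticity; the sample sizes and distribution parameters are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical skewed (log-normal) readout, two groups of 10 animals.
control = rng.lognormal(mean=0.0, sigma=0.8, size=10)
treated = rng.lognormal(mean=0.9, sigma=0.8, size=10)

# Homoscedasticity can be checked with Levene's test.
print(f"Levene p = {stats.levene(control, treated).pvalue:.4f}")

# Parametric route after a log transformation (data are positive)...
t_log = stats.ttest_ind(np.log(control), np.log(treated))
# ...or the rank-based alternative on the raw values.
mw = stats.mannwhitneyu(control, treated, alternative="two-sided")
print(f"t test on log data: p = {t_log.pvalue:.4f}")
print(f"Mann-Whitney:       p = {mw.pvalue:.4f}")
```

Both routes are defensible here; the parametric route keeps more power, at the price of the transformation step and its assumptions.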

assumptions are the independence of intragroup observations, homoscedasticity (equality of variances), normality of the sampling distribution of the means, and the absence of outliers.

Homoscedasticity can be verified by several statistical methods, such as the F-test, Levene's test or the Goldfeld-Quandt test. Similarly, there is a widespread yet questionable culture of testing normality using specific tests such as the Kolmogorov-Smirnov or Shapiro-Wilk tests. However, the recommendation given here is not to use normality tests, due to the usual poor understanding of the implication of their null hypothesis (that the data are compatible with normality). In the common case of low power caused by small samples, no conclusion should be drawn, yet the resulting high p-values are frequently misinterpreted as evidence of normality. Conversely, in overpowered designs these tests tend to give very small p-values, apparently indicating a departure from normality that dismisses parametric tests, simply because many biological measurements are naturally not perfectly normally distributed (e.g. log-normally distributed). It is therefore advisable to avoid normality tests and to use other approaches, such as Q-Q plots or common knowledge that a known pattern or a natural limit exists, to assess the possible departure from

normality. In addition, it is important that the assumption of normality does not apply to the data or the population but rather to the sampling distribution of the mean, which is the probability distribution of the means upon multiple samplings. Finally, when outliers are present, parametric tests should be avoided due to the influence of extreme values on means, SDs and normality. This is particularly the case when outliers are present in one group or in one direction of the data only.

In case parametric methods seem inappropriate due to non-normal residuals, the researcher may decide to transform the data to meet the parametric requirements. This might be a valuable alternative when high power must be retained or when there is no satisfactory non-parametric solution. Common transformations for skewed data include square root, cube root and log (note that since log is not defined for 0 and negative numbers, a constant must be added to make all values positive before log transformation). Data transformation has been introduced and reviewed by others, and we invite the reader to consult the corresponding references.39–42

If the data do not meet parametric assumptions even after transformation, or if the output variable is not on a continuous scale, non-parametric tests can be used. Non-parametric tests have fewer assumptions but are not assumption-free. For example, they are similar to parametric tests in their need for independence of intragroup observations. Non-parametric tests are rank tests, which means they proceed by ranking the values in the data, and the logic of their null hypothesis is that the observed order of the individual data was obtained by chance only, due to random sampling. Therefore, means and SDs are not informative, and outliers are not considered a nuisance since their rank is not influenced by their distance from the mean. Non-parametric tests are thus more universally applicable, but the cost of not using the above-mentioned information is threefold: 1. non-parametric tests tend to have lower power than their parametric counterparts; 2. p-values are not a function of the exact individual values, and very large differences between groups can give the same p-values as small differences, if the ranks are the same; and 3. p-values are not on a continuous scale but can only take discrete values, since the combinations of ranks are limited in number.

Recommendation 9: Unless you can predict the direction of an effect and are interested in that direction only, use two-tailed tests

Several tests offer a choice between one-tailed and two-tailed options. In summary, one-tailed tests are employed when the possibility of an effect supports the hypothesis in only one direction (e.g. whether treated animals display a convincingly higher value than the control group), whereas two-tailed tests are used when the interest is bidirectional (e.g. whether treated animals display a convincingly different value than the control group, higher or lower). In other words, a researcher can use a one-tailed test if the possible effect is both scientifically interesting in only one direction in the context of the posed biological hypothesis and simultaneously predicted to occur in that direction only. One-tailed tests use the same predefined alpha threshold (usually 5%) but do not split this percentage over the two tails. Instead, they use the 5% area under the curve in one tail only, largely increasing power in that direction.

There is a long-standing debate about the use of one-sided tests, and contradictory recommendations may be found in the literature.43–46 One frequent misuse of these tests is the post-hoc decision to (re)analyse the data with a one-tailed test to increase power, having observed the direction of the effect, and to conclude that findings in the other direction do not occur. This approach is a form of p-hacking and is not acceptable. In the Neyman-Pearson approach of NHST, hypotheses must be generated prior to looking at the data, not after. Since my view is that being more conservative is preferable to inflating the risk of false positive results, I recommend using two-sided tests or calling in a statistician if a one-tailed approach is considered (see Recommendation 1).

Recommendation 10: Correct (correctly) for multiple comparisons

For one comparison (i.e. two groups), the chance of concluding a difference although there is none is the alpha threshold, and it is set by the experimenter/statistician (by arbitrary convention, usually at 0.05 in life sciences). When more than one statistical test is performed to compare more than two groups, some comparisons may give p-values below the alpha threshold by chance only, even if there is no actual difference (i.e. all null hypotheses are true). In this case, the chance of observing at least one significant comparison although none exists is not alpha but is called the family-wise error rate (FWER, the family being the total series of comparisons). The FWER is higher than alpha and keeps inflating rapidly as the number of comparisons increases (Table 4 shows the FWER for alpha = 0.05). The FWER reaches 53.7% when 15 tests are performed (corresponding to 6 groups) and 90.1% for 45 comparisons (corresponding to 10 groups). In other words, if 10 experimental groups are compared pairwise with no correction, the experimenter has a 90.1% chance of finding at least one ''significant'' result by chance only. Solving this problem becomes vital for such approaches as omics or brain imaging, which make it

feasible to perform thousands or millions of statistical tests.

There is no universally applicable approach for dealing with the issue of multiple comparisons, since the choice of method depends on the specificities of the experimental design. However, corrective strategies must be applied to reduce the chances of false positives. Some approaches (e.g. the Bonferroni or Sidak corrections) attempt to control the FWER (at 0.05) by making the alpha threshold for each comparison more conservative as the number of tests increases. Other techniques (e.g. the Benjamini-Hochberg procedure) adjust the p-values, aiming at controlling the false discovery rate (FDR), which is the proportion of false positives among the

Table 4. FWER as a function of the number of comparisons (tests) performed.

| Number of comparisons | FWER |
|---|---|
| 1 | 0.050 |
| 2 | 0.098 |
| 3 | 0.143 |
| 4 | 0.185 |
| 5 | 0.226 |
| 10 | 0.401 |
| 15 | 0.537 |
| 45 | 0.901 |

FWER: family-wise error rate.
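For independent comparisons, the values in Table 4 follow from FWER = 1 − (1 − alpha)^m for m tests. The short check below is illustrative and not part of the original article; it assumes independent comparisons, which real pairwise comparisons often violate:

```python
# FWER for m independent comparisons, each run at a per-test alpha.
alpha = 0.05
for m in (1, 2, 3, 4, 5, 10, 15, 45):
    # Probability of at least one false positive among m true nulls.
    print(m, round(1 - (1 - alpha) ** m, 3))

# A Bonferroni correction keeps the FWER near alpha by testing each
# comparison at alpha/m instead of alpha.
m = 15
print("Bonferroni per-test threshold:", alpha / m)
print("resulting FWER:", round(1 - (1 - alpha / m) ** m, 3))
```

Running the loop reproduces the FWER column of Table 4, including 0.537 for 15 comparisons and 0.901 for 45.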

Table 5. Description of statistical information to disclose in publications and related sections.

| Information to disclose | Remarks | Section (must be adapted to journal guidelines) |
|---|---|---|
| Animal housing conditions and characteristics | Number of animals per cage must be given, alongside the species, strain, weight, age, sex | Methods |
| Strategies to limit bias | Randomisation, blinding | Methods |
| Sample size | Must correspond to the exact number of experimental units in each group. Do not use intervals (e.g. n = 3–10 animals) | Methods; Figure legends; Results |
| Statistical tests used (including post-tests) | Must be unambiguously identifiable in all figures and tables | Methods; Figure legends; Results |
| Software/package | None | Methods |
| Alpha threshold | Only one threshold should be given. Note that disclosure of exact p-values without mention of a threshold is preferred (see infra) | Methods; Figure legends |
| Type of measure of central tendency and error bars | For error, prefer 95% (or 99%) confidence intervals to show the imprecision of mean estimation, and SD to show variability. Avoid the use of the standard error of the mean. If non-parametric tests are used, prefer the median and the 95% confidence interval of the median | Methods; Figure legends |
| Policies about outliers | Define whether outliers were excluded prior to or after analysis, and the criteria used | Methods |
| Policies about missing data | Define the methods used to handle missing values (e.g. listwise or pairwise deletion, imputations) | Methods |
| Data transformation | Indicate whether data were transformed prior to analysis | Methods; Results |
| Test statistics and exact p-values | Give the statistic (e.g. t, z, F, r, r2) alongside degrees of freedom and the exact p-value. Avoid reporting quantitative results using p-values alone, in particular as inequalities (e.g. p < 0.05) | Figures; Results |
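As an illustration of the reporting style in the last row of Table 5 (statistic, degrees of freedom and exact p-value, rather than a bare inequality), here is a small hypothetical helper, not from the article:

```python
def report_t(t_stat: float, df: int, p: float) -> str:
    """Format a t test result as recommended: test statistic with its
    degrees of freedom and the exact p-value (no bare 'p < 0.05')."""
    return f"t({df}) = {t_stat:.2f}, p = {p:.3f}"

# Hypothetical results, for illustration only.
print(report_t(2.31, 14, 0.0362))   # -> t(14) = 2.31, p = 0.036
print(report_t(-1.05, 9, 0.3211))   # -> t(9) = -1.05, p = 0.321
```

The same pattern extends to F, z or r statistics; the point is that the reader can reconstruct the analysis from what is reported.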

significant results. Finally, ANOVA is a global (omnibus) test on all the groups of the input variable: it determines whether, among all these groups, at least one may be considered divergent from the others, and the exact identification of that group requires a complementary procedure.

I will not proceed further into an exhaustive listing of methods that aim to correct for multiple comparisons; I invite the reader to consult a statistician if necessary (see Recommendation 1).

Presentation and reporting

Recommendation 11: Disclose all information needed for the understanding and replication of the study

Scientific reports must enable the reader to critically evaluate and reproduce the study. In animal research, the reference document for reporting is the ARRIVE guidelines,47 whose checklist is available online (https://www.nc3rs.org.uk/arrive-guidelines). All information related to the methods and results must be disclosed. Disclosure about the design includes the sampling methodology and the policy with respect to blinding and randomisation, as well as the species, strain, number, sex, age and housing conditions of the animals (Table 5). The given sample size must refer to the number of independent observations collected (see Recommendation 2), which ultimately are used to compute the variability of the dataset, and should not be given as an interval (e.g. n = 3–15 animals) or an inequality (e.g. n > 3). The number of technical replicates must be unambiguously indicated, if applicable. The full statistical procedures must be disclosed, including the software used, the statistical tests (including the post-tests used after an omnibus procedure), whether a correction for multiple comparisons has been performed, the alpha threshold and the nature of the error bars. The most practical way to gather all the above-listed statistical information is to include a paragraph dedicated to experimental design, biostatistics and data analysis.

Recommendation 12: Choose the correct graphical display for your quantitative results

The choice of a graphical representation must be made judiciously, with the aim of giving the most information about patterns (numerical data can be given using tables). Whenever possible, bar graphs should be avoided since they convey very little information about actual data distributions. Instead, scatter plots showing all individual values should be used for small samples, and more refined representations such as box/whisker plots or violin plots should be preferred for larger samples. SEMs should not be used as indicators of error since they are misleading representations; they should be replaced by the 95% confidence interval of the mean or median, to show the interval in which the mean or median will be located in 95% of future samplings, or by the SD, to show variability.

Conclusion

The present series of recommendations exposes simple and inexpensive measures aiming to improve the quality of biostatistics in animal research. This review focuses exclusively on biostatistics and therefore extends the existing guidelines about experimental design and reporting in animal research.47 It is important to realise that the use of proper statistics is not only a matter of data analysis, but ranges from experimental design to data presentation and should therefore be carefully addressed throughout the research project. Furthermore, it is worth mentioning that many aspects of scientific research may contribute to the lack of reproducibility, with flawed biostatistics being only one of them. The author therefore encourages researchers to use additional means that contribute to better reproducibility, including improved reporting of conflicts of interest as well as open-access and preprint policies. The scientific community should adopt better practices that reward good experimental design and stop creating strong incentives to publish only ''statistically significant'' results. The author therefore encourages journals to be at the forefront of this action and to support statistically relevant initiatives such as confirmatory studies or alternatives to the overconfidence about p-values and NHST.

Acknowledgements
The author would like to thank Dr. Aoife Keohane and Dr. Holly Green for language editing and the three reviewers who helped improve the quality of this article.

Disclosure
The author is CEO of Biotelligences LLC.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD
Romain-Daniel Gosselin http://orcid.org/0000-0003-1716-6210

References
1. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2: e124.
2. Landis SC, Amara SG, Asadullah K, et al. A call for transparent reporting to optimize the predictive value of preclinical research. Nature 2012; 490: 187–191.
3. Vaux DL, Fidler F and Cumming G. Replicates and repeats – what is the difference and is it significant? A brief discussion of statistics and experimental design. EMBO Rep 2012; 13: 291–296.
4. Button KS, Ioannidis JP, Mokrysz C, et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci 2013; 14: 365–376.
5. Nuzzo R. Scientific method: statistical errors. Nature 2014; 506: 150–152.
6. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol 2008; 45: 135–140.
7. MacArthur RD and Jackson GG. An evaluation of the use of statistical methodology in the Journal of Infectious Diseases. J Infect Dis 1984; 149: 349–354.
8. Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 2000; 279: R1–R8.
9. Scales CD Jr, Norris RD, Peterson BL, et al. Clinical research and statistical methods in the urology literature. J Urol 2005; 174: 1374–1379.
10. Cumming G, Fidler F and Vaux DL. Error bars in experimental biology. J Cell Biol 2007; 177: 7–11.
11. Weissgerber TL, Milic NM, Winham SJ, et al. Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol 2015; 13: e1002128.
12. Fisher R. Presidential address to the first Indian Statistical Congress. Sankhya 1938; 4: 14–17.
13. Curran-Everett D and Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol Regul Integr Comp Physiol 2004; 287: R247–R249.
14. Weissgerber TL, Garovic VD, Milin-Lazovic JS, et al. Reinventing biostatistics education for basic scientists. PLoS Biol 2016; 14: e1002430.
15. Beery AK and Zucker I. Sex bias in neuroscience and biomedical research. Neurosci Biobehav Rev 2011; 35: 565–572.
16. Riederer BM. When sex matters in biomedical research. Lab Anim 2015; 49: 265–266.
17. Sorge RE, Mapplebeck JC, Rosen S, et al. Different immune cells mediate mechanical pain hypersensitivity in male and female mice. Nat Neurosci 2015; 18: 1081–1083.
18. Rosen S, Ham B and Mogil JS. Sex differences in neuroimmunity and pain. J Neurosci Res 2017; 95: 500–508.
19. Festing MFW, Overend P, Cortina Borja M, et al. The design of animal experiments, 2nd ed. Los Angeles: Sage Publications, 2016.
20. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984; 54: 187–211.
21. Greenman DL, Bryant P, Kodell RL, et al. Relationship of mouse body weight and food consumption/wastage to cage shelf level. Lab Anim Sci 1983; 33: 555–558.
22. Hildebrand F, Nguyen TL, Brinkman B, et al. Inflammation-associated enterotypes, host genotype, cage and inter-individual effects drive gut microbiota variation in common laboratory mice. Genome Biol 2013; 14: R4.
23. Laukens D, Brinkman BM, Raes J, et al. Heterogeneity of the gut microbiome in mice: guidelines for optimizing experimental design. FEMS Microbiol Rev 2016; 40: 117–132.
24. Ader DN, Johnson SB, Huang SW, et al. Group size, cage shelf level, and emotionality in non-obese diabetic mice: impact on onset and incidence of IDDM. Psychosom Med 1991; 53: 313–321.
25. Greenman DL, Bryant P, Kodell RL, et al. Influence of cage shelf level on retinal atrophy in mice. Lab Anim Sci 1982; 32: 353–356.
26. Adler ID, Bootman J, Favor J, et al. Recommendations for statistical designs of in vivo mutagenicity tests with regard to subsequent statistical analysis. Mutat Res 1998; 417: 19–30.
27. Halsey LG, Curran-Everett D, Vowler SL, et al. The fickle P value generates irreproducible results. Nat Methods 2015; 12: 179–185.
28. Marino MJ. How often should we expect to be wrong? Statistical power, P values, and the expected prevalence of false discoveries. Biochem Pharmacol 2017; 151: 226–233.
29. Norman GR and Streiner DL. Biostatistics: the bare essentials, 3rd ed. Hamilton, ON: B.C. Decker, 2008.
30. Mogil JS and Macleod MR. No publication without confirmation. Nature 2017; 542: 409–411.
31. Pawitan Y, Michiels S, Koscielny S, et al. False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 2005; 21: 3017–3024.
32. Head ML, Holman L, Lanfear R, et al. The extent and consequences of p-hacking in science. PLoS Biol 2015; 13: e1002106.
33. DeMets DL and Lan KK. Interim analysis: the alpha spending function approach. Stat Med 1994; 13: 1341–1352; discussion 53–56.
34. Kumar A and Chakraborty BS. Interim analysis: a rational approach of decision making in clinical trial. J Adv Pharm Technol Res 2016; 7: 118–122.
35. Sham PC and Purcell SM. Statistical power and significance testing in large-scale genetic studies. Nat Rev Genet 2014; 15: 335–346.
36. Cumming G. The new statistics: why and how. Psychol Sci 2014; 25: 7–29.
37. Mumby PJ. Statistical power of non-parametric tests: a quick guide for designing sampling strategies. Mar Pollut Bull 2002; 44: 85–87.
38. Krzywinski M and Altman N. Points of significance: nonparametric tests. Nat Methods 2014; 11: 467–468.
39. Bland JM and Altman DG. The use of transformation when comparing two means. BMJ 1996; 312: 1153.
40. Bland JM and Altman DG. Transformations, means, and confidence intervals. BMJ 1996; 312: 1079.
41. Bland JM and Altman DG. Transforming data. BMJ 1996; 312: 770.
42. Manikandan S. Data transformation. J Pharmacol Pharmacother 2010; 1: 126–127.
43. Bland JM and Altman DG. One and two sided tests of significance. BMJ 1994; 309: 248.
44. Enkin MW. One and two sided tests of significance. One sided tests should be used more often. BMJ 1994; 309: 874.
45. Wolterbeek R. One and two sided tests of significance. Statistical hypothesis should be brought into line with clinical hypothesis. BMJ 1994; 309: 873–874.
46. Freedman LS. An analysis of the controversy over classical one-sided tests. Clin Trials 2008; 5: 635–640.
47. Kilkenny C, Browne WJ, Cuthill IC, et al. Improving bioscience research reporting: the ARRIVE guidelines for reporting animal research. PLoS Biol 2010; 8: e1000412.

