
Invited Paper:

SUGGESTIONS FOR PRESENTING THE RESULTS OF DATA ANALYSES


DAVID R. ANDERSON,1,2 Colorado Cooperative Fish and Wildlife Research Unit, Room 201 Wagar Building, Colorado State University, Fort Collins, CO 80523, USA
WILLIAM A. LINK, U.S. Geological Survey, Patuxent Wildlife Research Center, Laurel, MD 20708, USA
DOUGLAS H. JOHNSON, U.S. Geological Survey, Northern Prairie Wildlife Research Center, Jamestown, ND 58401, USA
KENNETH P. BURNHAM,1 Colorado Cooperative Fish and Wildlife Research Unit, Room 201 Wagar Building, Colorado State University, Fort Collins, CO 80523, USA

1 Employed by U.S. Geological Survey, Biological Resources Division.
2 E-mail: anderson@cnr.colostate.edu

Abstract: We give suggestions for the presentation of research results from frequentist, information-theoretic, and
Bayesian analysis paradigms, followed by several general suggestions. The information-theoretic and Bayesian
methods offer alternative approaches to data analysis and inference compared to traditionally used methods.
Guidance is lacking on the presentation of results under these alternative procedures and on nontesting aspects
of classical frequentist methods of statistical analysis. Null hypothesis testing has come under intense criticism. We
recommend less reporting of the results of statistical tests of null hypotheses in cases where the null is surely false
anyway, or where the null hypothesis is of little interest to science or management.

JOURNAL OF WILDLIFE MANAGEMENT 65(3):373-378

Key words: AIC, Bayesian statistics, frequentist methods, information-theoretic methods, likelihood, publication
guidelines.

For many years, researchers have relied heavily on testing null hypotheses in the analysis of fisheries and wildlife research data. For example, an average of 42 P-values testing the statistical significance of null hypotheses was reported for articles in The Journal of Wildlife Management during 1994-1998 (Anderson et al. 2000). This analysis paradigm has been challenged (Cherry 1998, Johnson 1999), and alternative approaches have been offered (Burnham and Anderson 2001). The simplest alternative is to employ a variety of classical frequentist methods (e.g., analysis of variance or covariance, or regression) that focus on the estimation of effect size and measures of its precision, rather than on statistical tests, P-values, and arbitrary, dichotomous statements about statistical significance or lack thereof. Estimated effect sizes (e.g., the difference between the estimated treatment and control means) are the results useful in future meta-analysis (Hedges and Olkin 1985), while P-values are almost useless in these important syntheses. A second alternative is relatively new and based on criteria that estimate Kullback-Leibler information loss (Kullback and Leibler 1951). These information-theoretic approaches allow a ranking of various research hypotheses (represented by models), and several quantities can be computed to estimate a formal strength of evidence for alternative hypotheses. Finally, methods based on Bayes' theorem have become useful in applied sciences due mostly to advances in computer technology (Gelman et al. 1995).

Over the years, standard methods for presenting results from statistical hypothesis tests have evolved. The Wildlife Society (1995a,b), for example, addressed Type II errors, statistical power, and related issues. However, articles by Cherry (1998), Johnson (1999), and Anderson et al. (2000) provide reason to reflect on how research results are best presented. Anderson et al. (2000) estimated that 47% of the P-values reported recently in The Journal of Wildlife Management were naked (i.e., only the P-value is presented with a statement about its significance or lack of significance, without estimated effect size or even the sign of the difference being provided). Reporting of such results provides no information and is thus without meaning. Perhaps more importantly, there are thousands of null hypotheses tested and reported each year in biological journals that are clearly false on simple a priori grounds (Johnson 1999). These are called "silly nulls" and account for over 90% of the null hypotheses tested in Ecology and The Journal of Wildlife Management (Anderson et al. 2000).

We seem to be failing by addressing so many trivial issues in theoretical and applied ecology. Articles that employ silly nulls and statistical tests of hypotheses known to be false severely retard progress in our understanding of ecological systems and the effects of management programs (O'Connor 2000). The misuse and overuse of P-values is astonishing. Further, there is little analogous guidance for authors to present results of data analysis under the newer information-theoretic or Bayesian methods.

We suggest how to present results of data analysis under each of these 3 statistical paradigms: classical frequentist, information-theoretic, and Bayesian. We make no recommendation on the choice of analysis; instead, we focus on suggestions for the presentation of results of the data analysis. We assume authors are familiar with the analysis paradigm they have used; thus, we will not provide introductory material here.

Frequentist Methods

Frequentist methods, dating back at least a century, are much more than merely test statistics and P-values. P-values resulting from statistical tests of null hypotheses are usually of far less value than estimates of effect size. Authors frequently report the results of null hypothesis tests (test statistics, degrees of freedom, P-values) when, in less space, they could often report more complete, informative material conveyed by estimates of effect size and their standard errors or confidence intervals (e.g., the effect of the neck collars on winter survival probability was -0.036, SE = 0.012).
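For instance, the neck-collar result above could be reported directly as an effect size with its precision. A minimal sketch (Python, with invented survival estimates chosen only to reproduce an effect of -0.036) of how such a summary might be computed:

```python
import math

# Hypothetical survival estimates; the numbers are invented for illustration.
treat_mean, treat_se = 0.610, 0.009   # collared birds
ctrl_mean, ctrl_se = 0.646, 0.008     # uncollared birds

# Effect size: difference between the estimated treatment and control means.
effect = treat_mean - ctrl_mean

# SE of a difference of two independent estimates.
se_effect = math.sqrt(treat_se**2 + ctrl_se**2)

# Approximate 95% confidence interval (large-sample z = 1.96).
lo, hi = effect - 1.96 * se_effect, effect + 1.96 * se_effect
print(f"effect = {effect:.3f}, SE = {se_effect:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```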
The prevalence of testing null hypotheses that are uninteresting (or even silly) is quite high. For example, Anderson et al. (2000) found an average of 6,188 P-values per year (1993-1997) in Ecology and 5,263 per year (1994-1998) in The Journal of Wildlife Management and suggested that these large frequencies represented a misuse and overuse of null hypothesis testing methods. Johnson (1999) and Anderson et al. (2000) give examples of null hypotheses tested that were clearly of little biological interest, or were entirely unsupported before the study was initiated. We strongly recommend a substantial decrease in the reporting of results of null hypothesis tests when the null is trivial or uninteresting.

Naked P-values (i.e., those reported without estimates of effect size, its sign, and a measure of precision) are especially to be avoided. Nonparametric tests (like their parametric counterparts) are based on estimates of effect size, although usually only the direction of the effect is reported (a nearly naked P-value). The problem with naked and nearly naked P-values is that their magnitude is often interpreted as indicative of effect size. It is misleading to interpret small P-values as indicating large effect sizes, because small P-values can also result from low variability or large sample sizes. P-values are not a proper strength of evidence (Royall 1997, Sellke et al. 2001).
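A small simulated illustration of this point may help: below, a trivially small (invented) effect paired with a very large sample produces a smaller P-value than a biologically large effect with a small sample. The numbers are arbitrary; only the contrast matters.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two hypothetical scenarios: a tiny effect with a huge sample,
# and a large effect with a small sample.
tiny = (rng.normal(100.0, 10, 20000), rng.normal(100.3, 10, 20000))
large = (rng.normal(100.0, 10, 15), rng.normal(110.0, 10, 15))

for label, (a, b) in [("tiny effect, n = 20000", tiny),
                      ("large effect, n = 15", large)]:
    t, p = stats.ttest_ind(a, b)
    print(f"{label}: effect = {b.mean() - a.mean():.2f}, P = {p:.4f}")
# The tiny effect can yield the smaller P-value; P does not measure effect size.
```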
We encourage authors to carefully consider whether the information they convey in the language of null hypothesis testing could be greatly improved by instead reporting estimates and measures of precision. Emphasizing estimation over hypothesis testing in the reporting of the results of data analysis helps protect against the pitfalls associated with the failure to distinguish between statistical significance and biological significance (Yoccoz 1991).

We do not recommend reporting test statistics and P-values from observational studies, at least not without appropriate caveats (Sellke et al. 2001). Such results are suggestive rather than conclusive, given the observational nature of the data. In strict experiments, these quantities can be useful, but we still recommend a focus on the estimation of effect size rather than on P-values and their supposed statistical significance.

The computer output of many canned statistical packages contains numerous test statistics and P-values, many of which are of little interest; reporting these values may create an aura of scientific objectivity when both the objectivity and substance are often lacking. We encourage authors to resist the temptation to report dozens of P-values only because these appear on computer output.

Do not claim to have proven the null hypothesis; this is a basic tenet of science. If a test yields a nonsignificant P-value, it may not be unreasonable to state that "the test failed to reject the null hypothesis" or that "the results seem consistent with the null hypothesis" and then discuss Type I and II errors. However, these classical issues are not necessary when discussing the estimated effect size (e.g., "The estimated effect of the treatment was small," and then give the estimate and a measure of precision).

Do not report estimated test power after a statistical test has been conducted and found to be nonsignificant, as such post hoc power is not meaningful (Goodman and Berlin 1994). A priori power and sample size considerations are important in planning an experimental design, but estimates of post hoc power should not be reported (Gerard et al. 1998, Hoenig and Heisey 2001).

Information-Theoretic Methods

These methods date back only to the mid-1970s. They are based on theory published in the early 1950s and are just beginning to see use in theoretical and applied ecology. A synthesis of this general approach is given by Burnham and Anderson (1998). Much of classical frequentist statistics (except the null hypothesis testing methods) underlies and is part of the information-theoretic approach; however, the philosophy of the 2 paradigms is substantially different.

As part of the Methods section of a paper, describe and justify the a priori hypotheses and models in the set and how these relate specifically to the study objectives. Avoid routinely including a trivial null hypothesis or model in the model set; all models considered should have some reasonable level of interest and scientific support (Chamberlin's [1965] concept of "multiple working hypotheses"). The number of models (R) should be small in most cases. If the study is only exploratory, then the number of models might be larger, but this situation can lead to inferential problems (e.g., inferred effects that are actually spurious; Anderson et al. 2001). Situations with more models than samples (i.e., R > n) should be avoided, except in the earliest phases of an exploratory investigation.
Models with many parameters (e.g., K = 30-200) often find little support, unless sample size or effect sizes are large or the residual variance is quite small.

A common mistake is the use of Akaike's Information Criterion (AIC) rather than the second-order criterion, AICc. Use AICc (a generally appropriate small-sample version of AIC) unless the number of observations is at least 40 times the number of explanatory variables (i.e., n/K > 40 for the biggest K over all R models). If using count data, provide some detail on how goodness of fit was assessed and, if necessary, an estimate of the variance inflation factor (ĉ) and its degrees of freedom. If evidence of overdispersion (Liang and McCullagh 1993) is found, the log-likelihood must be computed as loge(L)/ĉ and used in QAICc, a selection criterion based on quasi-likelihood theory (Anderson et al. 1994). When the appropriate criterion has been identified (AIC, AICc, or QAICc), it should be used for all the models in the set.
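The criteria themselves are simple to compute. A minimal sketch in Python, assuming the standard forms AICc = -2log(L) + 2K + 2K(K + 1)/(n - K - 1) and the quasi-likelihood analogue with log(L)/ĉ; following the convention in Burnham and Anderson (1998), ĉ is counted as one additional estimated parameter. All numbers are invented:

```python
def aicc(loglik, K, n):
    """Second-order Akaike criterion: AIC plus a small-sample correction."""
    aic = -2.0 * loglik + 2.0 * K
    return aic + (2.0 * K * (K + 1)) / (n - K - 1)

def qaicc(loglik, K, n, c_hat):
    """Quasi-likelihood version for overdispersed count data: the
    log-likelihood is divided by the variance inflation factor c-hat,
    and one parameter is added to K for estimating c-hat."""
    K = K + 1
    q = -2.0 * loglik / c_hat + 2.0 * K
    return q + (2.0 * K * (K + 1)) / (n - K - 1)

# Hypothetical fit: log-likelihood -145.2, 6 parameters, 58 observations.
print(aicc(-145.2, 6, 58))        # n/K < 40, so AICc rather than AIC
print(qaicc(-145.2, 6, 58, 1.8))  # with estimated overdispersion c-hat = 1.8
```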
Discuss or reference the use of other aspects of the information-theoretic approach, such as model averaging, a confidence set on models, and examination of the relative importance of variables. Define or reference the notation used (e.g., K, Δi, and wi). Ideally, the variance component due to model selection uncertainty should be included in estimates of precision (i.e., unconditional vs. conditional standard errors) unless there is strong evidence favoring the best model, such as an Akaike weight (wi) > about 0.9.

For well-designed, true experiments in which the number of effects or factors is small and factors are orthogonal, use of the full model will often suffice (rather than considering more parsimonious models). If an objective is to assess the relative importance of variables, inference can be based on the sum of the Akaike weights for each variable, across models that include that variable, and these sums should be reported (Burnham and Anderson 1998:140-141). Avoid the implication that variables not in the selected (estimated best) model are unimportant.

The results should be easy to report if the Methods section outlines convincingly the science hypotheses and associated models of interest. Show a table of the value of the maximized log-likelihood function (log(L)), the number of estimated parameters (K), the appropriate selection criterion (AIC, AICc, or QAICc), the simple differences (Δi), and the Akaike weights (wi) for models in the set (or at least the models with some reasonable level of support, such as where Δi < 10). Interpret and report the evidence for the various science hypotheses by ranking the models from best to worst, based on the differences (Δi) and on the Akaike weights (wi). Provide quantities of interest from the best model or others in the set (e.g., σ̂², coefficients of determination, estimates of model parameters and their standard errors). Those using the Bayesian Information Criterion (BIC; Schwarz 1978) for model selection should justify the existence of a true model in the set of candidate models (Methods section).
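Such a table is straightforward to construct once the criterion values are in hand. A sketch with a hypothetical 4-model set, computing Δi as the difference from the minimum AICc, the Akaike weights wi = exp(-Δi/2)/Σ exp(-Δj/2), and the variable-importance sums described above; model names, scores, and variables are all invented:

```python
import math

# Hypothetical candidate set: name, AICc value, and variables in the model.
models = [
    ("density + rainfall", 410.1, {"density", "rainfall"}),
    ("density",            411.4, {"density"}),
    ("rainfall",           415.9, {"rainfall"}),
    ("global",             412.3, {"density", "rainfall", "cover"}),
]

best = min(score for _, score, _ in models)
deltas = [score - best for _, score, _ in models]
rel_lik = [math.exp(-0.5 * d) for d in deltas]
weights = [r / sum(rel_lik) for r in rel_lik]

# Ranked model-selection table: delta_i and Akaike weight w_i for each model.
for (name, _, _), d, w in sorted(zip(models, deltas, weights), key=lambda t: t[1]):
    print(f"{name:20s} delta = {d:5.2f}  w = {w:.3f}")

# Relative importance of a variable: sum of w_i over models containing it.
for var in ("density", "rainfall", "cover"):
    imp = sum(w for (_, _, vs), w in zip(models, weights) if var in vs)
    print(f"importance({var}) = {imp:.3f}")
```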

Do not include test statistics and P-values when using the information-theoretic approach, since this inappropriately mixes differing analysis paradigms. For example, do not use AICc to rank models in the set and then test whether the best model is significantly better than the second-best model (no such test is valid). Do not imply that the information-theoretic approaches are a test in any sense. Avoid the use of terms such as significant and not significant, or rejected and not rejected; instead view the results in a strength of evidence context (Royall 1997).

If some analysis and modeling were done after the a priori effort (often called data dredging), then make sure this procedure is clearly explained when such results are mentioned in the Discussion section. Give estimates of the important parameters (e.g., effect size) and measures of precision (preferably a confidence interval).

Bayesian Methods

Although Bayesian methods date back over 2 centuries, they are not familiar to most biologists. Bayesian analysis allows inference from a posterior distribution that incorporates information from the observed data, the model, and the prior distribution (Schmitt 1969, Ellison 1996, Barnett 1999, Wade 2000). Bayesian methods often require substantial computation and have been increasingly applied since computers have become widely available (Lee 1997).
In reporting results, authors should consider readers' lack of familiarity with Bayesian summaries such as odds ratios, Bayes factors, and credible intervals. The clarity of a presentation can be greatly enhanced by a simple explanation after the first references to such quantities: "A Bayes factor of 4.0 indicates that the ratio of probabilities for Model 1 and Model 2 is 4 times larger when computed using the posterior rather than the prior."
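A small numerical sketch of the arithmetic behind such a statement, using an invented binomial example with two point hypotheses (for composite models the likelihoods would instead be marginal likelihoods):

```python
from scipy import stats

# Hypothetical data: 14 successes in 20 trials.
y, n = 14, 20

# Two simple point hypotheses about the success probability.
lik_m1 = stats.binom.pmf(y, n, 0.7)   # Model 1: p = 0.7
lik_m2 = stats.binom.pmf(y, n, 0.5)   # Model 2: p = 0.5

bf = lik_m1 / lik_m2                  # Bayes factor for Model 1 vs. Model 2
prior_odds = 1.0                      # equal prior model probabilities
post_odds = bf * prior_odds           # posterior odds = BF x prior odds
print(f"Bayes factor = {bf:.1f}; posterior odds = {post_odds:.1f}")
```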
Presentations of Bayesian analyses should report the sensitivity of conclusions to the choice of the prior distribution. This portrayal of sensitivity can be accomplished by including overlaid graphs of the posterior distributions for a variety of reasonable priors or by tabular presentations of credible intervals, posterior means, and medians. An analysis based on flat priors representing limited or vague prior knowledge should be included in the model set. When the data seem to contradict prevailing thought, the strength of the contradiction can be assessed by reporting analyses based on priors reflecting prevailing thought.
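For a simple conjugate example, this sensitivity analysis takes only a few lines. A sketch assuming hypothetical binomial data and three Beta priors (flat, Jeffreys, and a deliberately skeptical prior); in conjugate form the posterior is Beta(a + y, b + n - y), so posterior summaries can be tabulated directly:

```python
from scipy import stats

# Hypothetical data: 14 successes in 20 trials.
y, n = 14, 20
priors = {"flat Beta(1,1)": (1.0, 1.0),
          "Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
          "skeptical Beta(10,10)": (10.0, 10.0)}

for label, (a, b) in priors.items():
    post = stats.beta(a + y, b + n - y)      # conjugate Beta posterior
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"{label:24s} mean = {post.mean():.3f}, "
          f"95% credible interval = ({lo:.3f}, {hi:.3f})")
```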
Generally, Bayesian model-checking should be reported. Model-checks vary among applications, and there are a variety of approaches even for a given application (Carlin and Louis 1996, Gelman and Meng 1996, Gelman et al. 1995). One particularly simple and easily implemented check is a posterior predictive check (Rubin 1984). The credibility of the results will be enhanced by a brief description of model-checks performed, especially as these relate to questionable aspects of the model. A lengthy report of model-checks will usually not be appropriate, but the credibility of the published paper will often be enhanced by reporting the results of model-checks.
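A posterior predictive check can be sketched in a few lines. The example below assumes hypothetical count data, a Poisson model with a conjugate Gamma prior, and the sample variance as the discrepancy measure; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical counts and a Poisson model with a Gamma(2, 1) prior on the
# mean; the Gamma prior is conjugate, so posterior draws are easy to make.
y = np.array([0, 2, 1, 7, 0, 3, 8, 1, 0, 2])
a, b = 2.0 + y.sum(), 1.0 + len(y)           # posterior Gamma(shape, rate)

# Posterior predictive check (Rubin 1984): simulate replicate data sets from
# the posterior and compare a test statistic (here the variance) to the data.
reps = 4000
lam = rng.gamma(a, 1.0 / b, size=reps)       # numpy gamma takes a scale
y_rep = rng.poisson(lam[:, None], size=(reps, len(y)))
p_ppc = np.mean(y_rep.var(axis=1) >= y.var())
print(f"posterior predictive P (variance) = {p_ppc:.2f}")
# A value near 0 or 1 flags misfit; here, overdispersion relative to Poisson.
```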
The implementation of sophisticated methods for fitting models, such as Markov chain Monte Carlo (MCMC; Geyer 1992), should be reported in sufficient detail. In particular, MCMC requires diagnostics to indicate that the posterior distribution has been adequately estimated.
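The paper does not prescribe particular diagnostics, but one widely used convergence summary that could accompany an MCMC-based analysis is the Gelman-Rubin potential scale reduction factor, sketched below with the standard multi-chain formula; the chains here are simulated placeholders:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction R-hat from m parallel MCMC chains (rows)
    of length n; values near 1 suggest adequate mixing."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

# Hypothetical example: 4 chains that mix well versus 4 that disagree.
rng = np.random.default_rng(3)
good = rng.normal(0.0, 1.0, size=(4, 1000))
bad = good + np.arange(4)[:, None]              # shifted chain means
print(gelman_rubin(good), gelman_rubin(bad))    # ~1.0 versus >> 1
```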
General Considerations Concerning the Presentation of Results

The standard deviation (SD) is a descriptive statistic, and the standard error (SE) is an inferential statistic. Accordingly, the SD can be used to portray the variation observed in a sample: x̄ = 100, SD = 25 suggests a much more variable population than does x̄ = 100, SD = 5. The expected value (i.e., an average over a large number of replicate samples) of the SD² equals σ² and depends very little on sample size (n). The SE is useful to assess the precision (repeatability) of an estimator. For example, in a comparison of males (m) and females (f), x̄m = 100, SE = 2, and x̄f = 120, SE = 1.5 would allow an inference that the population mean value μ is greater among females than among males. Such an inference rests on some assumptions, such as random sampling of a defined population. Unlike the SD, the SE decreases with increasing sample size.
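The contrast between SD and SE is easy to demonstrate by simulation. A small sketch using invented normal samples with μ = 100 and σ = 25: the SD stays near 25 regardless of n, while the SE of the mean shrinks roughly as 1/√n:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative simulation: SD describes variation in the sample and is
# roughly constant in n; the SE of the mean shrinks as 1/sqrt(n).
for n in (10, 100, 1000):
    x = rng.normal(100.0, 25.0, size=n)   # hypothetical sample, sigma = 25
    sd = x.std(ddof=1)
    se = sd / np.sqrt(n)
    print(f"n = {n:4d}: mean = {x.mean():6.1f}, SD = {sd:5.1f}, SE = {se:5.2f}")
```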

When presenting results such as a ± b, always indicate whether b is a SD, a SE, or t × SE (indicating a confidence limit), where t is from the t distribution (e.g., 1.96 if the degrees of freedom are large). If a confidence interval is to be used, give the lower and upper limits, as these are often asymmetric about the estimate. Authors should be clear concerning the distinction between precision (measured by variances, standard errors, coefficients of variation, and confidence intervals) and bias (an average tendency to estimate values either smaller or larger than the parameter; see White et al. 1982:22-23).

The Methods section should indicate the (1 − α)100% confidence level used (e.g., 90, 95, or 99%). Information in tables should be arranged so that numbers to be compared are close to each other. Excellent advice on the visual display of quantitative information is given in Tufte (1983). Provide references for any statistical software and specific options used (e.g., equal or unequal variances in t-tests, procedure TTEST in SAS, or a particular Bayesian procedure in BUGS). The Methods section should always provide sufficient detail so that the reader can understand what was done.

In regression, discriminant function analysis, and similar procedures, one should avoid the term independent variables because the variables are rarely independent among themselves or with the response variable. Better terms include explanatory or predictor variables (see McCullagh and Nelder 1989:8).

Avoid confusing low frequencies with small sample sizes. If one finds only 4 birds on 230 plots, the proportion of plots with birds can be precisely estimated. Alternatively, if the birds are the object of study, the 230 plots are irrelevant, and the sample size (4) is very small.
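The plot-based proportion in this example is indeed precise, as a quick calculation (assuming a simple binomial sampling model) shows:

```python
import math

# The 4-birds-on-230-plots example: the proportion of occupied plots is
# estimated precisely even though only 4 birds were seen.
x, n = 4, 230
p_hat = x / n
se = math.sqrt(p_hat * (1 - p_hat) / n)   # binomial SE of a proportion
print(f"p-hat = {p_hat:.3f}, SE = {se:.3f}")   # ~0.017, SE ~0.009
```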
It is important to separate analysis of results based on questions and hypotheses formed before examining the data from results found after sequentially examining the results of data analyses. The first approach tends to be more confirmatory, while the second approach tends to be more exploratory. In particular, if the data analysis suggests a particular pattern leading to an interesting hypothesis then, at this midway point, few statistical tests or measures of precision remain valid (Lindsey 1999a,b; White 2000). That is, an inference that such patterns or hypotheses are an actual feature of the population or process of interest is not well supported (e.g., the patterns are likely to be spurious). Conclusions reached after repeated examination of the results of prior analyses, while interesting, cannot be taken with the same degree of confidence as those from the more confirmatory analysis. However, these post hoc results often represent intriguing hypotheses to be readdressed with a new, independent set of data. Thus, as part of the Introduction, authors should note the degree to which the study was exploratory versus confirmatory. Provide information concerning any post hoc analyses in the Discussion section.

Statistical approaches are increasingly important in many areas of applied science. The field of statistics is a science, with new discoveries leading to changing paradigms. New methods sometimes require new ways of effectively reporting results. We should be able to evolve as progress is made and changes are necessary. We encourage wildlife researchers and managers to capitalize on modern methods and to suggest how the results from such methods might be best presented. We hope our suggestions will be viewed as constructive.
LITERATURE CITED

ANDERSON, D. R., K. P. BURNHAM, W. R. GOULD, AND S. CHERRY. 2001. Concerns about finding effects that are actually spurious. Wildlife Society Bulletin 29:311-316.
———, ———, AND W. L. THOMPSON. 2000. Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management 64:912-923.
———, ———, AND G. C. WHITE. 1994. AIC model selection in overdispersed capture-recapture data. Ecology 75:1780-1793.
BARNETT, V. 1999. Comparative statistical inference. John Wiley & Sons, New York, USA.
BURNHAM, K. P., AND D. R. ANDERSON. 1998. Model selection and inference: a practical information-theoretic approach. Springer-Verlag, New York, USA.
———, AND ———. 2001. Kullback-Leibler information as a basis for strong inference in ecological studies. Wildlife Research 28: in press.
CARLIN, B. P., AND T. A. LOUIS. 1996. Bayes and empirical Bayes methods for data analysis. Chapman & Hall, New York, USA.
CHAMBERLIN, T. C. 1965 (1890). The method of multiple working hypotheses. Science 148:754-759. (Reprint of 1890 paper in Science.)
CHERRY, S. 1998. Statistical tests in publications of The Wildlife Society. Wildlife Society Bulletin 26:947-953.
ELLISON, A. M. 1996. An introduction to Bayesian inference for ecological research and environmental decision-making. Ecological Applications 6:1036-1046.
GELMAN, A., J. B. CARLIN, H. S. STERN, AND D. B. RUBIN. 1995. Bayesian data analysis. Chapman & Hall, New York, USA.
———, AND X.-L. MENG. 1996. Model checking and model improvement. Pages 189-201 in W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov chain Monte Carlo in practice. Chapman & Hall, New York, USA.
GERARD, P. D., D. R. SMITH, AND G. WEERAKKODY. 1998. Limits of retrospective power analysis. Journal of Wildlife Management 62:801-807.
GEYER, C. J. 1992. Practical Markov chain Monte Carlo (with discussion). Statistical Science 7:473-503.
GOODMAN, S. N., AND J. A. BERLIN. 1994. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Annals of Internal Medicine 121:200-206.
HEDGES, L. V., AND I. OLKIN. 1985. Statistical methods for meta-analysis. Academic Press, London, United Kingdom.
HOENIG, J. M., AND D. M. HEISEY. 2001. The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 55:19-24.
JOHNSON, D. H. 1999. The insignificance of statistical significance testing. Journal of Wildlife Management 63:763-772.
KULLBACK, S., AND R. A. LEIBLER. 1951. On information and sufficiency. Annals of Mathematical Statistics 22:79-86.
LEE, P. M. 1997. Bayesian statistics, an introduction. Second edition. John Wiley & Sons, New York, USA.
LIANG, K.-Y., AND P. McCULLAGH. 1993. Case studies in binary dispersion. Biometrics 49:623-630.
LINDSEY, J. K. 1999a. Some statistical heresies. The Statistician 48:1-40.
———. 1999b. Revealing statistical principles. Oxford University Press, New York, USA.
McCULLAGH, P., AND J. A. NELDER. 1989. Generalized linear models. Chapman & Hall, London, United Kingdom.
O'CONNOR, R. J. 2000. Why ecology lags behind biology. The Scientist 14(20):35.
ROYALL, R. M. 1997. Statistical evidence: a likelihood paradigm. Chapman & Hall, London, United Kingdom.
RUBIN, D. B. 1984. Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics 12:1151-1172.
SCHMITT, S. A. 1969. Measuring uncertainty: an elementary introduction to Bayesian statistics. Addison-Wesley, Reading, Massachusetts, USA.
SCHWARZ, G. 1978. Estimating the dimension of a model. Annals of Statistics 6:461-464.
SELLKE, T., M. J. BAYARRI, AND J. O. BERGER. 2001. Calibration of p values for testing precise null hypotheses. The American Statistician 55:62-71.
TUFTE, E. R. 1983. The visual display of quantitative information. Graphics Press, Cheshire, Connecticut, USA.
WADE, P. R. 2000. Bayesian methods in conservation biology. Conservation Biology 14:1308-1316.
WHITE, G. C., D. R. ANDERSON, K. P. BURNHAM, AND D. L. OTIS. 1982. Capture-recapture and removal methods for sampling closed populations. Los Alamos National Laboratory Report LA-8787-NERP, Los Alamos, New Mexico, USA.
WHITE, H. 2000. A reality check for data snooping. Econometrica 68:1097-1126.
THE WILDLIFE SOCIETY. 1995a. Journal news. Journal of Wildlife Management 59:196-198.
———. 1995b. Journal news. Journal of Wildlife Management 59:630.
YOCCOZ, N. G. 1991. Use, overuse, and misuse of significance tests in evolutionary biology and ecology. Bulletin of the Ecological Society of America 72:106-111.
