THREE YEARS AGO, American Physiological Society (APS) journals published an editorial by Curran-Everett and Benos with suggested guidelines for reporting statistics in research articles (1). The authors then tracked whether their guidelines were implemented and prepared a followup report (2). Their request to have the sequel published in all APS journals met with unexpected resistance, however, and engendered considerable discussion, both formal and informal, at the March 2007 Editors' Meeting. Some editors fully supported the guidelines, but others had reservations about the recommendations, feeling that they might be viewed as a mandate and thereby drive authors to less prescriptive journals. The outcome of the discussion was the decision to publish the followup article, along with invited commentaries, in Advances in Physiology Education, in the hope of further educating authors about the rationale for appropriate statistical reporting. The sequel article, commentaries, and closing comments by Curran-Everett and Benos appear in this issue of Advances in Physiology Education.

REFERENCES

1. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol Heart Circ Physiol 287: H447–H449, 2004.
2. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel. Adv Physiol Educ (this issue).
WE SCIENTISTS rely on statistics. In part, this is because we use statistics to report our own science and to interpret the published science of others. For some of us, reporting and interpreting statistics can be like reading an unfamiliar language: it is awkward to do, and it is easy to misinterpret meaning. To facilitate these tasks, in 2004, we wrote an editorial (8) in which we proposed specific guidelines to help investigators […]

[…] If statisticians groused about the guidelines, they never groused to us.

    Good guidelines . . . Quite reasonable without being overly fussy. The interpretation of P values is much more sensible than one often gets with medical journals, where P < 0.05 is all they care about.
    Statistician
Table 1. American Physiological Society journal manuscripts in 1996, 2003, and 2006: reporting of statistics

                               Manuscripts         Standard           Standard Error      Confidence         Precise
                               Reviewed (n)        Deviation, %       of the Mean, %      Interval, %        P Value, %
Journal                        1996 2003 2006      1996 2003 2006     1996 2003 2006      1996 2003 2006     1996 2003 2006
Am J Physiol
  Cell Physiol                   43   30  322        21   20   19       88   73   78        0    0    1        7   13   13
  Endocrinol Metab               28   28  302        18    7   13       86   89   87        0    4    2        4   39*  30
  Gastrointest Liver Physiol     26   28  272         8   25   24       92   79   77        0    0    2        4   14   17
  Heart Circ Physiol             60   62  627        17   23   22       87   76   77        0    5    1       10   19   20
  Lung Cell Mol Physiol          25   26  261        20   19   22       84   88   81*       0    0    2        4   19   18
  Regul Integr Comp Physiol      41   29  384        17   10   12       88   90   90        0    0    1       15   41*  27
  Renal Physiol                  27   25  289        15   12   15       93   80   79        0    4    1        7    4   17*
J Appl Physiol                   62   57  519        24   39   35       79   67   65        0    7*   4        6   26*  34
J Neurophysiol                   58   61  699        36   23   30       69   64   57        2    5    6        5   30*  38

Values are the number of manuscripts reviewed (n) and the percentages of reviewed manuscripts that reported a standard deviation, a standard error of the mean, a confidence interval, or a precise P value; for example, P = 0.03 (rather than P < 0.05) or P = 0.12 (rather than P > 0.05). In 1996, these journals published a total of 3693 original articles; the number of articles reviewed represents a 10% sample (systematic random sampling, fixed start) of the original articles published by each journal (9). From August 2003 through July 2004, these journals published a total of roughly 3500 original articles; the number of articles reviewed represents a 10% sample (systematic random sampling, fixed start) of the original articles published by each journal. From August 2005 through July 2006, these journals published a total of 3675 original articles, all of which were reviewed.
We reviewed a 10% sample of the original articles each journal published from August 2003 through July 2004, the year before the guidelines, and we reviewed all original articles published from August 2005 through July 2006, the second year after the guidelines. If the guidelines affected the reporting of statistics, we expected the incidence of standard errors to decrease and the incidence of standard deviations, confidence intervals, and precise P values to increase.

What did our literature review reveal? That the guidelines had virtually no impact on the occurrence of standard errors, standard deviations, confidence intervals, and precise P values (Table 1). There were two exceptions: in one journal, the use of standard errors decreased from 88% to 81%; in another journal, the use of precise P values increased from 4% to 17%.

The 2004 Guidelines Revisited

These guidelines addressed the reporting of statistics in the RESULTS section of a manuscript:

Guideline 5. Report variability using a standard deviation.
Guideline 6. Report uncertainty about scientific importance using a confidence interval.
Guideline 7. Report a precise P value.

Our 2004 editorial (8) summarized the theoretic rationale for each of these guidelines. The subsequent publication of Scientific Style and Format (6) reinforced their application. Because they offer the greatest benefit to authors and readers of RESULTS sections, we revisit each of these guidelines.
Guideline 5. Report variability using a standard deviation. (This guideline is mirrored in Sections 12.5.2.3 and 12.5.2.4 of Scientific Style and Format; see Ref. 6.) The distinction between standard deviation and standard error of the mean is far more than cosmetic: it is an essential one. These statistics estimate different things: a standard deviation estimates the variability among individual observations in a sample, but a standard error of the mean estimates the theoretical variability among sample means (8, 9).

Individual observations in a sample differ because the population from which they were drawn is distributed over a range of possible values. The study of this intrinsic variability is important: it may reveal something novel about underlying scientific processes (12). The standard deviation describes the variability among the observations we investigators measure; it characterizes the dispersion of sample observations about the sample mean.

In contrast, the standard error of the mean provides an estimate of the theoretical variability among sample means. The sample standard deviation s and the standard error of the mean SE{ȳ} are calculated as

$$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} \quad\text{and}\quad \mathrm{SE}\{\bar{y}\} = \frac{s}{\sqrt{n}}, \qquad (1)$$

where n is the number of observations in the sample, y_i is an individual observation, and ȳ is the sample mean. Because it incorporates information about sample size, the standard error of the mean is a misguided estimate of variability among observations (Fig. 1 and Ref. 8). By itself, the standard error of the mean has no particular value (9). Even with a large sample size (n = 35), the interval

$$[\,\bar{y} - \mathrm{SE}\{\bar{y}\},\ \bar{y} + \mathrm{SE}\{\bar{y}\}\,] \qquad (2)$$

is just a 68% confidence interval. In other words, we can declare, with modest 68% confidence, that the population mean is included in the interval [ȳ − SE{ȳ}, ȳ + SE{ȳ}].

To summarize, in most experiments, it is essential that an investigator reports on the variability among the actual individual measurements. A sample standard deviation does this: it describes the variability among the actual experimental measurements. On the other hand, if an investigator were to repeat an experiment many times, and each time calculate a sample mean, the average of those sample means will be the population mean; the standard deviation of those sample means will be the standard error of the mean (9). A standard error is simply the standard deviation of a statistic: here, the sample mean. In nearly all experiments, however, a single sample mean is computed. Therefore, it is inappropriate to report a standard error of the mean (a theoretical estimate of the variability of possible values of a sample mean about a population mean) as an estimate of the variability among actual experimental measurements.
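To make the distinction concrete, here is a small simulation (ours, not the authors'; the population parameters, seed, and NumPy dependency are assumptions) that checks Eq. 1 numerically and illustrates the 68% claim of Eq. 2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 100.0, 15.0, 35      # hypothetical population mean and SD; n from the text

# One sample: the SD describes spread among the n observations (Eq. 1, left)
y = rng.normal(mu, sigma, n)
s = y.std(ddof=1)                   # sample standard deviation
sem = s / np.sqrt(n)                # standard error of the mean (Eq. 1, right)

# Many repeated experiments: the SD of the sample means matches the SEM
means = rng.normal(mu, sigma, (10_000, n)).mean(axis=1)
print(f"sample SD s = {s:.1f} (estimates sigma = {sigma})")
print(f"SEM = {sem:.1f}; SD of 10,000 sample means = {means.std(ddof=1):.1f}")

# Eq. 2: the interval [ybar - SE, ybar + SE] covers mu only ~68% of the time,
# i.e., the fraction of sample means landing within 1 theoretical SE of mu
coverage = np.mean(np.abs(means - mu) <= sigma / np.sqrt(n))
print(f"coverage of the +/-1 SE interval: {coverage:.3f}")   # ~0.683
```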
Guideline 6. Report uncertainty about scientific importance using a confidence interval. (This guideline is mirrored in Section 12.5.2.2 of Scientific Style and Format; see Ref. 6.) A confidence interval focuses attention on the magnitude and uncertainty of an experimental result. In essence, a confidence interval helps answer the question, is the experimental effect big enough to be relevant? A confidence interval is a strong tool for inference: it provides the same statistical information as the P value from a hypothesis test, and it circumvents the drawbacks inherent to a hypothesis test. […]
Guideline 7. Report a precise P value. (This guideline is mirrored in Section 12.5.1.2 of Scientific Style and Format; see Ref. 6.) In 2004 we wrote that a precise P value communicates more information with the same amount of ink, and it permits readers to assess a statistical result according to their own criteria (8). You would think most authors would report a precise P value. Most do not (see Table 1).

On occasion, an author who reported a precise P value did so with unnecessary precision: P = 0.030711 or P = 10⁻³¹. The guidelines for rounding P values to sufficient precision are listed in Table 2.

Table 2. Guidelines for rounding P values to sufficient precision

P Value Range          Rounding Precision
0.01 ≤ P ≤ 1.00        Round to 2 decimal places: round P = 0.031 to P = 0.03
0.001 ≤ P ≤ 0.009      Round to 3 decimal places: round P = 0.0066 to P = 0.007
P < 0.001              Report as P < 0.001; more precision is unnecessary
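The rounding rules in Table 2 are simple to encode. The helper below is our sketch of those rules, not part of the guidelines themselves:

```python
def round_p(p: float) -> str:
    """Format a P value with the precision suggested in Table 2."""
    if p < 0.001:
        return "P < 0.001"       # more precision is unnecessary
    if p < 0.01:
        return f"P = {p:.3f}"    # e.g., 0.0066 -> P = 0.007
    return f"P = {p:.2f}"        # e.g., 0.031  -> P = 0.03

for p in (0.031, 0.0066, 0.030711, 1e-31):
    print(round_p(p))            # P = 0.03, P = 0.007, P = 0.03, P < 0.001
```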
The Next Step

All of us (authors, reviewers, editors) gravitate toward the familiar. For most of us, reporting standard deviations, confidence intervals, and precise P values is quite unfamiliar. To make matters worse, these reporting practices have been unfamiliar for decades. There is considerable inertia to overcome before standard deviations, confidence intervals, and precise P values become commonplace in journals published by the APS.

Reform is difficult (5, 7, 10, 11, 14, 16). The question is, how can we help it along? In part, the answer is that all of us (authors, reviewers, editors, Publications Committee) must make a concerted effort to use and report statistics in ways that are consistent with best practices. With these guidelines (8), we summarized the best practices of statistics.

In 2004, we hoped that the guidelines would improve and standardize the caliber of statistical information reported throughout journals published by the APS. It is clear that the mere publication of the guidelines failed to impact reporting practices. We still have an opportunity.
ACKNOWLEDGMENTS

We thank Margaret Reich (Director of Publications and Executive Editor, American Physiological Society) for a 1-yr print subscription to APS journals, and we thank Matthew Strand (National Jewish Medical and Research Center, Denver, CO) and the Editors of the APS journals for comments and suggestions.

REFERENCES

1. Altman DG. Statistics in medical journals: developments in the 1980s. Stat Med 10: 1897–1913, 1991.
2. Altman DG. Statistics in medical journals: some recent trends. Stat Med 19: 3275–3289, 2000.
3. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. Bristol: BMJ Books, 2000.
4. American Physiological Society. Instructions for Preparing Your Manuscript. Manuscript Sections (online). http://www.the-aps.org/publications/i4a/prep_manuscript.htm#manuscript_sections [16 April 2007].
6. Council of Science Editors, Style Manual Subcommittee. Scientific Style and Format: the CSE Manual for Authors, Editors, and Publishers (7th ed.). Reston, VA: Rockefeller Univ. Press, 2006.
8. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol Heart Circ Physiol 287: H447–H449, 2004. http://ajpheart.physiology.org/cgi/content/full/287/2/H447.
9. Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol 85: 775–786, 1998. http://jap.physiology.org/cgi/content/full/85/3/775.
10. Fidler F, Cumming G, Thomason N, Pannuzzo D, Smith J, Fyffe P, Edmonds H, Harrington C, Schmitt R. Toward improved statistical reporting in the Journal of Consulting and Clinical Psychology. J Consult Clin Psychol 73: 136–143, 2005.
11. Fidler F, Thomason N, Cumming G, Finch S, Leeman J. Editors can lead researchers to confidence intervals, but can't make them think: statistical reform lessons from medicine. Psychol Sci 15: 119–126, 2004.
12. Fisher RA. Statistical Methods for Research Workers. New York: Hafner, 1954, p 3.
13. International Committee of Medical Journal Editors. Uniform Requirements for Manuscripts Submitted to Biomedical Journals. IV. Manuscript Preparation and Submission (online). http://www.icmje.org/index.html#manuscript [14 April 2007].
14. Kendall PC. Editorial. J Consult Clin Psychol 65: 3–5, 1997.
I THINK IT OBVIOUS that authors should be required to report their statistical analyses accurately and completely and that the standard error of the mean (SEM) should not be used when other statistics are more appropriate. To argue otherwise is to promote a lower standard of research. Although an outsider to your field, I have been asked to participate in this debate, so I will do so, first by addressing the misuse of SEM and then the […]

When the SEM is used incorrectly as a measure of dispersion for a set of measurements, readers may well assume that the measurements have less variability than they actually do: the SEM is always smaller than the SD. (Indeed, an often-cited explanation for this inappropriate use of the SEM in the literature is specifically to make measurements appear to be more precise than they really are.) When the SEM is used […]
RECENTLY, Curran-Everett and Benos (5) developed guidelines for reporting statistics in journals published by the American Physiological Society. In this brief commentary, I will comment on those guidelines as well as present my own personal view of this discipline to make this meaningful for readers of Advances in Physiology Education.

[…] misunderstanding exists as to what statistics can and cannot do, and the authors deal with some very important elements in their short review. There is an unwarranted obsession with P values, in particular, and this may lead to either exaggerated or understated claims. This point is made more trenchantly by Goodman (8), when he refers to the P value fallacy. […]

Curran-Everett and Benos (5) also cautioned authors against presenting data with variability given as standard errors of the mean rather than standard deviations. This is a reasonable position but could, again, be more rigid than necessary. I would argue that what we are really trying to do in most studies is to gauge what the population parameters are based on our sample characteristics. Thus, giving the standard error of the mean allows the reader to have some idea as to how closely the mean values reported approximate the population mean. Furthermore, giving standard errors of the mean and n values helps readers gauge the confidence limits of the mean.

Another point on which authors should be given leeway is to choose not to do any statistical tests at all. The authors point out, in an earlier report (3), the problem with tests of significance. This particular issue has been debated at length (1, 2, 6). Daniel (6) made much the same point when he suggested that editors should "require authors to avoid using SSTs (statistical significance tests) where not appropriate." Carver (1), in an […]

[…] test" (9). Motulsky (10) made this point quite explicit when he said that "If your data speak for themselves, don't interrupt." Although I am not disputing the need for clear guidelines and that the authors have succeeded admirably in their aims, I feel a bit uncomfortable that these may be interpreted in a narrow sense to exclude good information from being published in this journal.

REFERENCES

1. Carver RP. The case against statistical significance testing. Harvard Educ Rev 48: 378–399, 1978.
2. Cohen J. The Earth is round (p < 0.05). Am Psychol 49: 997–1003, 1994.
3. Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol 85: 775–786, 1998.
4. Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 279: R1–R8, 2000.
5. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Adv Physiol Educ 28: 85–87, 2004.
6. Daniel LG. Statistical significance testing: a historical overview of misuse […]
Murray K. Clayton
Departments of Statistics and Plant Pathology, University of Wisconsin, Madison, Wisconsin
SOME DISTURBING NEWS was reported in the March 2005 issue of Significance, a publication of the Royal Statistical Society (7). Based on a variety of audits, it was found that 38% of the articles published in 2001 in Nature contained some statistical "incongruence": a disparity between reported test statistics (t-tests, F-tests, etc.) and their corresponding P values. The British Medical Journal (BMJ) fared a little better, with 25% […]

[…] falls below this particular value, then you should reject the null hypothesis; if you have two groups, it's a t-test; three groups and it's ANOVA, etc. We bypass the complexities and leave students believing that, upon the completion of the semester, they are now qualified to "do statistics." That's hopelessly naïve; most experiments yield data far more complex than can be handled by a semester or two of statistics.
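An audit like the one described above can be partly automated: given a reported t statistic, its degrees of freedom, and the reported two-sided P value, one can recompute the P value and flag any disparity. The function below is a hypothetical sketch of such a check (SciPy assumed), not the procedure the auditors actually used:

```python
from scipy import stats

def incongruent(t_reported, df, p_reported, tol=0.005):
    """Flag a disparity between a reported t statistic and its two-sided P value."""
    p_recomputed = 2 * stats.t.sf(abs(t_reported), df)
    return abs(p_recomputed - p_reported) > tol

# A reported "t(24) = 2.80, P = 0.30" would be flagged: that t implies P ~ 0.01
print(incongruent(2.80, 24, 0.30))   # True: incongruence
print(incongruent(2.80, 24, 0.01))   # False: consistent
```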
[…] Bayesian school but is a patently incorrect interpretation of a P value from either a Neyman-Pearson or Fisherian view. Guideline 10 (Interpret each main result with a confidence interval and precise P value) repeats the error of interpretation and compounds the problem with poor advice: "If either bound of the confidence interval is important from a scientific perspective, then the experimental effect may be large enough to be relevant." The "may be" leaves the authors an out, but the critical issue is that, if the sample is too small, then the confidence interval will be wide, thus increasing the chance that one of the bounds of the confidence interval is, apparently, "important from a scientific perspective." This confusion underlies guideline 6 (Report uncertainty about scientific importance using a confidence interval) as well. Although I appreciate the technical accuracy of the definition of a confidence interval, the authors oversimplify. It is correct to say that a confidence interval characterizes uncertainty about the true value of a parameter but incorrect to say that it provides an […]

Unfortunately for the nonspecialist, the field of statistics is deep, complex, and evolving. In each of the last several years, for example, the journal Statistics in Medicine has been publishing about 4,000 pages of articles focused principally on the development of new methods for analyzing data. Over the last 20 years or so, the manual for the statistical analysis component of SAS has expanded from about 1,000 pages to roughly 5,000 pages. Aspects of apparently routine methods such as regression and ANOVA are under continuous refinement, and methods employed today are often quite different from those used even a few decades ago. It is difficult to keep up.

So what to do? I have two suggestions. First, I think the APS editorial board should indicate in their instructions that authors are responsible for ensuring that the statistics in articles are correct, appropriate, follow modern practices, and are well presented, and that they are subject to review. And mean it. This is effectively what the Food and Drug Administration does in […]
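The small-sample concern can be demonstrated with a simulation: even when the true effect is zero, small samples yield wide confidence intervals whose bounds often reach past any threshold of scientific importance. The sketch below (hypothetical numbers; NumPy and SciPy assumed) is ours, not the commentator's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def bound_passes_threshold(n, threshold=10.0, sigma=15.0, sims=20_000):
    """Under a true effect of 0, how often does a 95% CI bound
    reach past a scientifically 'important' threshold?"""
    y = rng.normal(0.0, sigma, (sims, n))
    sem = y.std(axis=1, ddof=1) / np.sqrt(n)
    half = stats.t.ppf(0.975, df=n - 1) * sem        # CI half-width
    m = y.mean(axis=1)
    return np.mean((m - half < -threshold) | (m + half > threshold))

for n in (5, 10, 50):
    print(f"n = {n:2d}: P(a bound looks 'important') = {bound_passes_threshold(n):.2f}")
# Small n -> wide intervals -> bounds often look 'important' despite no effect
```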
[…] work countless times, and I have had the good fortune to participate in it. Despite my quibbling with the details of the suggestions by Curran-Everett and Benos, I have complete sympathy with their efforts to improve the situation. However, rather than promulgating a handful of guidelines, I believe it will take support from APS members as a whole, and not just a few advocates, to make positive change.

REFERENCES

1. Andrews HP, Snee RD, Sarner MH. Graphical display of means. Am Statistician 34: 195–199, 1980.
2. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals: amplifications and explanations. Ann Intern Med 108: 266–273, 1988.
3. Carmer SG, Swanson MR. An evaluation of ten pairwise multiple comparison procedures by Monte Carlo methods. J Am Statist Assoc 68: 66–74, 1973.
4. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. Adv Physiol Educ 28: 85–87, 2004.
5. Fisher RA. Statistical Methods and Scientific Inference (3rd ed.). New York: Hafner, 1973, p. 79–82.
6. Neyman J. "Inductive behavior" as a basic concept of philosophy of science. Rev Int Stat Inst 25: 7–22, 1957.
7. Royal Statistical Society. More scrutiny for scientific statistics. Significance 2: 2, 2005.
DOUGLAS CURRAN-EVERETT, DALE BENOS, and the editors of the American Physiological Society journals are to be commended for their efforts to raise the standards of statistics reporting in physiology journals. I read with interest the followup (in this issue) to the 2004 guidelines (2), and I am glad to see those efforts sustained. Curran-Everett and Benos reported that "the mere publication of the guidelines failed to impact reporting practices." […]

A common reporting deficiency not discussed in the guidelines is the omission of the absolute value of the 100% or control value (and a measure of its variability) in the presentation of normalized data. Adding it to the figure legend is no difficult task, and its inclusion is necessary for other investigators in planning future studies (a minimal sketch follows below).

While some may disagree with specific recommendations in […]
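As a sketch of the suggested practice (hypothetical numbers; NumPy assumed), the legend line is one computation away from the raw control data:

```python
import numpy as np

control = np.array([98.0, 112.0, 105.0, 101.0, 94.0])   # hypothetical raw control data
treated = np.array([141.0, 156.0, 133.0, 150.0, 148.0])

# Report the absolute control value (and its variability) alongside normalized data
ctrl_mean, ctrl_sd = control.mean(), control.std(ddof=1)
normalized = 100 * treated / ctrl_mean                   # treated values as % of control
print(f"Control = {ctrl_mean:.0f} +/- {ctrl_sd:.0f} (SD, n = {control.size}); "
      f"treated mean = {normalized.mean():.0f}% of control")
```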
WE APPRECIATE ALL COMMENTS, past (13, 15) and present (2, 14, 16, 17), on our guidelines (6) for reporting statistics in journals published by the American Physiological Society (APS). In 2004 we wrote that the guidelines embodied fundamental concepts in statistics (8) and that they were consistent with the Uniform Requirements (12), used by roughly 650 biomedical journals, and with Scientific Style and Format (4), used by APS […]

[…] science, the statistical philosophies about α and P have been melded together.

Guideline 4. Control for multiple comparisons. Clayton (2) is concerned that we mentioned the Newman-Keuls procedure in a footnote to the exposition of this guideline (6). In that footnote, we also listed the Bonferroni and least significant difference procedures as examples of common multiple comparison procedures. […]
[…] the sample mean ȳ, sample standard deviation s, P value, and 99% confidence interval:

          Sample Mean ȳ   Sample Standard Deviation s   P Value   99% Confidence Interval
Drug A        -20                    18                  <0.001        -30 to -10
Drug B        -0.2                   18                  <0.001       -0.3 to -0.1
Drug C        -20                    18                   0.07         -50 to +10
How do you interpret these results?

Drug A decreased blood pressure by 20 mmHg, a change that differed convincingly from 0 (P < 0.001). The confidence interval suggests the true mean impact of drug A is likely to be between a 10- and 30-mmHg decrease in blood pressure, a change that is scientifically meaningful and reasonably precise. Drug A produced a convincing change of scientific importance.
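For illustration, a summary line like those in the table can be computed from raw data; the sample below and its size (n = 25) are hypothetical, since the table reports only summary statistics (NumPy and SciPy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(-20.0, 18.0, 25)      # hypothetical blood-pressure changes, mmHg

n, ybar, s = len(y), y.mean(), y.std(ddof=1)
sem = s / np.sqrt(n)

t = ybar / sem                                  # one-sample t test against 0
p = 2 * stats.t.sf(abs(t), df=n - 1)            # precise two-sided P value
tcrit = stats.t.ppf(0.995, df=n - 1)            # for a 99% confidence interval
print(f"mean = {ybar:.0f}, SD = {s:.0f}, P = {p:.3g}, "
      f"99% CI [{ybar - tcrit * sem:.0f}, {ybar + tcrit * sem:.0f}]")
```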
[…] changed. The guidelines we published in 2004 (6) embody those fundamental concepts. We continue to believe the guidelines offer a concise, accurate framework that we hope will help improve the caliber of statistical information reported in articles published by the American Physiological Society.

REFERENCES

1. Bailar JC III, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med 108: 266–273, 1988.
2. Clayton MK. How should we achieve high-quality reporting of statistics in scientific journals? A commentary on "Guidelines for reporting statistics in journals published by the American Physiological Society: the sequel". Adv Physiol Educ; doi:10.1152/advan.00084.2007.
3. Conover WJ. Practical Nonparametric Statistics (2nd ed.). New York: Wiley, 1980.
4. Council of Science Editors, Style Manual Subcommittee. Scientific Style and Format: the CSE Manual for Authors, Editors, and Publishers (7th ed.). Reston, VA: Rockefeller Univ. Press, 2006.
5. Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 279: R1–R8, 2000.