
STATISTICS IN MEDICINE, VOL. 1, 59-71 (1982)

STATISTICS IN MEDICAL JOURNALS*

DOUGLAS G. ALTMAN
MRC Clinical Research Centre, Watford Road, Harrow, Middlesex, U.K.

SUMMARY
The general standard of statistics in medical journals is poor. This paper considers the reasons for this with
illustrations of the types of error that are common. The consequences of incorrect statistics in published
papers are discussed; these involve scientific and ethical issues. Suggestions are made about ways in which the
standard of statistics may be improved. Particular emphasis is given to the necessity for medical journals to
have proper statistical refereeing of submitted papers.
KEY WORDS Medical journals Statistical errors Ethics Statistical refereeing Statistical guidelines

1. INTRODUCTION
It is widely recognized, especially by statisticians, that the general standard of statistics in journals
in the medical, biological and related fields is poor, although it is perhaps improving gradually. This
paper will consider, with reference to medical journals:
(i) What sorts of error are made?
(ii) Why is the overall standard of statistics low?
(iii) What are the consequences of poor statistics?
(iv) What can be done to improve the standard?
Although this paper will concentrate on statistics in medical journals, many of the points apply
equally well to journals in other fields. When discussing the low standard of statistics in medical
journals I will, of necessity, be talking about the general situation-there are, of course, notable
exceptions.

2. WHAT SORTS OF ERROR?


Despite the simple nature of the statistics in most medical papers very many errors occur. These
errors are not confined to data analysis but can occur at every stage of research.1 Other areas that
seem to cause particular difficulty are study design, and the presentation and interpretation of
results.
In recent years there have been quite a few reviews of the quality of statistics in published
papers,2-6 and in each case about half of the papers were found to contain statistical errors. This
observation must, however, be qualified in several ways. Firstly, the surveys were not comparable in
that they covered various journals over different periods and looked for different types of error.
For example, one review2 related only to clinical trials, whereas another6 concentrated on the use

* Based upon a talk given to the Medical Section of the Royal Statistical Society on 19 May 1981.

0277-6715/82/010059-13$01.30
© 1982 by John Wiley & Sons, Ltd.

Received 28 October 1981
Revised 21 December 1981

(or misuse) of the t test. Secondly, there are considerable problems in deciding what is an error-
this will often be subjective. Thirdly, a large number of errors are relatively minor, in that they
would have no material bearing on the conclusions of the study. It must, however, be remembered
that a review of published papers is likely to lead to an underestimate of the prevalence of statistical
mistakes-many errors will be undetectable in the finished paper. For example, certain
observations might have been excluded from analysis because they were incompatible with the rest
of the data.
It is worth considering in some detail the sorts of error often made in the four areas of a research
paper already referred to: design, analysis, interpretation and presentation. For each a brief list of
common types of error will be followed by one or more specific examples from published papers.

2.1. Design
Perhaps the most common design error is to have too small a sample to get reliable and/or useful
results. A possible serious consequence of having too small a sample (and thus a lack of statistical
‘power’) is the inability to detect an important effect. Yet it is common to find undue emphasis
placed on ‘negative’findings from small studies (although if the results are properly presented it is
not necessary for the reader to rely upon the authors’ own interpretation of their results).
The importance of sample size is illustrated by a recent study of the use of Debendox in early
pregnancy by Fleming et al.7 Although this was a large study (over 20,000 women) the number of
women in the study who had been prescribed the drug was small (620), especially as abnormal
births are relatively rare. The study thus had a high chance of not detecting a quite large adverse
drug effect.8 In the event there were fewer abnormal outcomes in the Debendox group (5.0 per cent
compared with 5.4 per cent) with a 95 per cent confidence interval for the relative risk of 0.64 to 1.33.
The authors commented that 'these results support the hypothesis that Debendox is not
teratogenic',7 but some caution is needed in interpreting their results, especially in the absence of
information about drug compliance.
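To see concretely what the reported interval implies, the following sketch computes the relative risk and its approximate 95 per cent confidence interval by the usual log-transformation method. The counts are hypothetical, chosen only to match the published percentages (31/620 exposed; the unexposed group size is an assumption for illustration):

```python
from math import exp, log, sqrt

def relative_risk_ci(a, n1, b, n2, z=1.96):
    """Relative risk with an approximate 95 per cent CI (log method)."""
    rr = (a / n1) / (b / n2)
    se = sqrt(1/a - 1/n1 + 1/b - 1/n2)  # standard error of log(RR)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Hypothetical counts matching the reported rates: 31/620 (5.0 per cent)
# abnormal outcomes among exposed women, 1047/19380 (5.4 per cent) among
# the rest (the unexposed total is an assumption for illustration).
rr, low, high = relative_risk_ci(31, 620, 1047, 19380)
print(f"RR = {rr:.2f}, 95% CI {low:.2f} to {high:.2f}")
# RR = 0.93, 95% CI 0.65 to 1.31 - close to the published 0.64 to 1.33
```

An interval stretching from well below 1 to well above 1 is compatible both with a protective effect and with a substantial increase in risk, which is why a 'negative' result of this kind needs cautious interpretation.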
Other errors are perhaps more serious in that they may lead to incorrect results. In controlled
trials the use of an inappropriate control group, for example having non-concurrent or historical
controls, will severely weaken the credibility of the results, as will the failure to randomize the
subjects to the alternative treatments. In matched studies, for example case-control studies,
published papers often show unequal numbers in the groups with no comment from the authors-
this might indicate poor design. Sometimes studies are designed in such a way that the effect being
studied is inseparable from other factors.
A good example of this last point is given by a study to compare two methods of assessing the
gestational age of newborn babies.9 All the assessments using one method were made by a single
observer, whereas all the assessments using the second method were made by a different observer.
The effect of interest-the difference between the methods-was clearly confounded with possible
between-observer differences. Observer variability is very common, especially when, as in this case,
the observations involve subjective judgments. The authors’ response when this criticism was
voiced10 illustrated their failure to appreciate this point: 'It was felt that . . . two different
observers . . . would result in greater objectivity'.11
An example of a more common error involves the selection of study groups for the purpose of
estimating prevalence. This problem is well-recognized in surveys, but would appear to be less well

appreciated in clinical research. Ellenberg and Nelson12 reviewed 23 follow-up studies of children
who had had febrile seizures. Seventeen of these studies involved children attending special clinics.
The other six obtained their sample from an initial population study to identify such children. One
purpose of these studies was to assess the risk of such children subsequently having non-febrile
seizures. This is an important question as the degree of risk would determine whether or not
prophylactic treatment should be prescribed routinely.

Figure 1. Percentage of children who experienced non-febrile seizures after one or more febrile seizures
in 23 studies (from Ellenberg and Nelson12)
The results from these 23 studies are shown in Figure 1. It is clear that the six population-based
studies yielded much lower estimates of the risk; they were also much more consistent than were the
results from the other 17 studies. This is not really surprising. We would expect the children
attending special clinics to include more severe cases-variability in the degree of this bias would
explain the wide range of observed prevalences of further seizures. Ellenberg and Nelson observed:
‘Only 5 of the reports of the clinic-based studies acknowledged the potential for sample selection
bias, and of those five, only one carried such acknowledgement into the conclusions’. They also
pointed out that, despite the results of the population-based studies, many authors recommend
that all such children should receive therapeutic levels of phenobarbital daily for as long as six
years. Such advice must have affected very many children, yet it is based on the results of studies
that did not have an appropriate design to answer questions about the prevalence of recurrence.

2.2. Analysis
Errors in analysis are probably quite familiar. It is particularly common to find that the
assumptions underlying a statistical technique have been violated (although often it is impossible
to tell). Regression, correlation, t tests and analysis of variance all make specific assumptions about
the distribution of the data. Another problem occurs when related observations such as repeated
measurements on the same subject are treated as independent observations. Finally, an incorrect or
inappropriate analysis is often the result of using a statistical method that is too simple for the
problem in hand. Examples include the use of t tests instead of analysis of variance or covariance
(the latter being surprisingly rarely used), and the investigation of multivariate data by considering
variables two at a time (such as using regression, correlation or two-way contingency tables) rather
than by multiple regression or multivariate techniques. Thus often an outcome measure is
separately regressed on (or correlated with) two or more explanatory variables without any
investigation of the inter-relationships between these variables.
Apart from the relatively obvious misuse of t and chi-squared tests,2-6 there are numerous other
mistakes that can be and are made in data analysis. These include the combination, in a clinical trial,

of the results for all non-responders or drop-outs with those on placebo therapy, the exclusion of
subjects from analysis on the basis of their results, and several misapplications of correlation,
notably when comparing alternative methods of measurement.13 All of these errors may have a
profound effect on the results. The following two examples are interesting but not necessarily
typical illustrations of poor statistical analysis.

Figure 2. Fatty acid content of infant liver by age with fitted cubic curve (from Figure 3H in Clandinin et al.14)

Figure 2 shows some data relating fatty acids in infant liver to age.14 To these data the authors
fitted a cubic curve (as shown) and pronounced this 'significant' with r = 0.47, P < 0.05. The
correlation quoted is, however, a multiple correlation coefficient. The linear correlation is only
about 0.1 and is not nearly significant (P > 0.9). Simple inspection suggests that the data are much
too variable to draw any useful conclusions. The astonishing aspect of this analysis is that it is one
of no less than 83 cubic curves fitted to different (but similar) sets of data (in four papers including
the reference given). In no case was any justification, either statistical or scientific, given for fitting
cubic curves; there clearly was none. Nor in any case was the cubic curve shown to be a better fit to
the data than a quadratic curve (let alone a straight line). Yet the authors have drawn conclusions
from the shapes of these curves; none of their conclusions can have any credibility.
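The check the authors omitted is straightforward: fit nested polynomials of increasing degree and test whether each additional term gives a significantly better fit. A minimal sketch of such a comparison, using made-up data in place of the published points:

```python
import numpy as np
from scipy import stats

def poly_f_test(x, y, d_small, d_big):
    """F test of whether a degree d_big polynomial fits significantly
    better than a nested degree d_small polynomial."""
    def rss(d):
        coef = np.polyfit(x, y, d)
        return np.sum((y - np.polyval(coef, x)) ** 2)
    rss_s, rss_b = rss(d_small), rss(d_big)
    df1 = d_big - d_small              # extra parameters in the bigger model
    df2 = len(x) - (d_big + 1)         # residual degrees of freedom
    f = ((rss_s - rss_b) / df1) / (rss_b / df2)
    return f, stats.f.sf(f, df1, df2)  # F statistic and P value

# Made-up noisy observations over 0-16 weeks, standing in for Figure 2.
rng = np.random.default_rng(1)
x = rng.uniform(0, 16, 30)
y = 5 + 0.1 * x + rng.normal(0, 2, 30)
for d in (2, 3):
    f, p = poly_f_test(x, y, d - 1, d)
    print(f"degree {d - 1} vs degree {d}: F = {f:.2f}, P = {p:.2f}")
```

Unless the cubic term passes such a test, or has some scientific rationale, the simpler model should be preferred.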
Robinson et al.15 looked at the effect of oral vitamin D supplements on plasma 25-
hydroxyvitamin D (25-OHD) concentrations in pre-term infants. Their results are shown in Figure
3, where the plasma 25-OHD level at day 14 is plotted against the percentage rise from day 14 to 36
weeks. The authors described the results of their analysis thus: ‘There was a significant correlation
between the plasma 25-OHD level at day 14 and the percentage rise at 36 weeks (r = -0.68,
P < 0.0025)’. This led them to say that: ‘The negative correlation . . . indicates that there may be
constraints on 25-hydroxylation in the pre-term infant . . .’
There were several statistical errors in this study, the most important of which was the use of
correlation to relate the change in a quantity to its initial value. It is well-known that this
procedure results in a spurious negative correlation even when the initial and subsequent levels are
unrelated.16 Other problems include the dubious use of percentages to assess the individual rises,
the consequent presence of an outlier that further inflates the correlation, and the combination of
different treatment groups in a single analysis. Any conclusions based upon the authors' analysis
are clearly unjustified.

Figure 3. Plasma 25-OHD response to vitamin D supplementation related to initial level
(from Robinson et al.15)
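The artefact is easily demonstrated by simulation: even when the initial and follow-up values are generated independently, correlating the change with the initial value yields a strongly negative coefficient (about -1/√2 = -0.71 when the two variances are equal). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(100, 15, 1000)  # initial level
followup = rng.normal(100, 15, 1000)  # generated independently of baseline
change = followup - baseline

# The change contains a -baseline term, so the correlation is negative
# even though the two measurements have no real association.
r = np.corrcoef(baseline, change)[0, 1]
print(round(r, 2))  # about -0.71, i.e. -1/sqrt(2)
```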

2.3. Interpretation
The majority of errors in interpretation relate to tests of significance. Although such tests are very
widely used, appearing in almost every paper presenting research findings, the general degree of
understanding of their true meaning is very low. It is common to find that the meaning of the significance
level is misunderstood. P is not the probability of making a mistake when taking the observed effect
as a true one.
Although it may be tempting to accept either the clinical importance or statistical significance of
a result as indicating an important finding, ideally both are required. Often, however, statistical
significance is taken to indicate an important finding regardless of clinical importance, or a
marginally non-significant result is interpreted as showing no effect even when the results appear
clinically important. Such circumstances require a more flexible attitude to interpretation than is
often adopted. This is, though, an area which highlights a weakness of the significance test
approach.
Another aspect of interpretation-the over-confidence frequently placed on negative (non-
significant) results from small studies-has been superbly illustrated by the study of Freiman and
colleagues.17 They found that of 71 supposedly negative studies (i.e. P > 0.1) 50 had a greater than
10 per cent risk of missing a true improvement of 50 per cent. They observed that many of the
therapies investigated in these studies had not received a fair test, and that ‘In most studies the lack
of a (significant) difference . . . was taken to mean that no clinically meaningful difference existed.'
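The scale of the problem is easy to reproduce with an approximate power calculation for comparing two proportions; the control success rate and group size below are purely illustrative assumptions, not figures from Freiman et al.:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p0, p1, n, alpha=0.05):
    """Approximate power of a two-sided z test comparing two
    independent proportions with n subjects per group."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p0 * (1 - p0) / n + p1 * (1 - p1) / n)
    crit = z * se                      # critical difference at level alpha
    dist = NormalDist(p1 - p0, se)     # sampling distribution of the difference
    return (1 - dist.cdf(crit)) + dist.cdf(-crit)

# Illustrative trial: control success rate 40 per cent; a 50 per cent
# improvement means 60 per cent; 40 patients per arm.
print(round(power_two_proportions(0.40, 0.60, 40), 2))
# about 0.45: a 55 per cent chance of missing a genuinely large effect
```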
Not all errors of interpretation are related to significance tests, however. Problems are common
in the evaluation of the usefulness of a diagnostic test. Here, over-reliance on sensitivity and
specificity can be very misleading.
Bravo and colleagues18 found that plasma catecholamines were elevated in 22 of 23 subjects with
pheochromocytomas, but in only 12 of 40 normal subjects. The sensitivity and specificity were thus
96 per cent (22/23) and 70 per cent (28/40), respectively. It is very common to see sensitivity
(especially) and specificity used to assess the value of a diagnostic test, but they only tell us what
proportions of subjects known either to have the disease or not were correctly identified. What we
really want to know is the usefulness of a positive test in predicting the disease (and vice versa). If we
know the prevalence of the disease we can calculate the 'positive predictive value' as

$$\frac{\text{sensitivity} \times \text{prevalence}}{\text{sensitivity} \times \text{prevalence} + (1 - \text{specificity})(1 - \text{prevalence})}$$

and the 'negative predictive value' by interchanging the sensitivity and specificity.19 In the present
case it has been suggested that: 'If one were to screen all hypertensive patients for pheochromo-
cytomas, the prevalence would be about 0.5 per cent.'20 For these data the predictive power
of positive and negative results work out at 1.6 per cent and 99.97 per cent. McCarthy20 has
calculated comparative figures for an alternative test (urinary VMA) as 35 per cent and 99.98 per
cent, which are clearly better. These are the main considerations necessary when evaluating the
usefulness of diagnostic tests-they are very often missing and arguments are incorrectly based on
the sensitivity and specificity alone.
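The arithmetic is easily reproduced. A short sketch applying the formula to the figures quoted (sensitivity 22/23, specificity 28/40, assumed prevalence 0.5 per cent):

```python
def predictive_values(sens, spec, prev):
    """Positive and negative predictive values from sensitivity,
    specificity and disease prevalence (an application of Bayes' theorem)."""
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

ppv, npv = predictive_values(22 / 23, 28 / 40, 0.005)
print(f"PPV = {ppv:.1%}, NPV = {npv:.2%}")  # PPV = 1.6%, NPV = 99.97%
```

The low positive predictive value arises because, at a prevalence of 0.5 per cent, false positives among the healthy majority swamp the true positives.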

2.4. Presentation of results


The presentation of statistical results in medical journals leaves much to be desired. Although
arguably less important than the aspects already discussed, good presentation is still highly
desirable, especially as poor presentation may unintentionally mislead or confuse the reader. The
following examples are all taken from published papers.
Spurious precision is very common:
(i) Y = 0.608X + 12.97643
(ii) χ² = 0.7264
(iii) P =
(iv) '. . . 86.95 per cent of cases . . .' (n = 92).
Such presentation is very often an indication of a failure to grasp the ideas behind the statistical
methods used and may indicate more serious problems.
Ambiguity, too, may hinder the reader. The most common example is the failure to specify
whether standard errors or standard deviations are quoted. Other examples include the expression
P < 0 . 1 > 0.05 instead of the correct 0.05 < P < 0.10. In the following regression equation
+
TBN (g) = (28.8* FFM (kg) 228) f 8.5 per cent
the final term was undefined, and its meaning not deducible. (The use of * rather than x suggests
that this result was taken straight from computer print-out.)
Regression analysis is especially prone to poor presentation. Two particularly bad practices are
the showing of the regression line without the data, and the extension of the regression line beyond
the range of the observations.
Other misuses of graphical techniques include the failure to show coincident points in scatter
diagrams, changing the scale in the middle of an axis, and the inclusion of so many different lines
and symbols that interpretation is very difficult.
Finally, two further examples where the authors displayed a lack of understanding of statistics:
(i) ‘Three hundred and twenty one consecutive patients satisfying these inclusion criteria were
more or less randomly allocated to groups.’21
(ii) ‘The difference between groups 2 and 3 was probably significant’22

2.5. Omission of information


All of the reviews already cited2-6 commented on the common fault of omission of vital
information. Mosteller and colleagues23 looked at the information provided in 132 published
randomized trials relating to myeloma, leukaemia, and gastro-intestinal and breast cancer. They
found that the method of randomization was only specified in 33 per cent of the studies, and that
only 23 per cent reported their studies as being double blind (although, of course, probably not all
could have been double blind for practical reasons). Only 35 per cent specified the statistical
methods used to derive their significance level, and (not surprisingly in view of the study by
Freiman et al.17) only 2 per cent reported that power had been taken into account when designing
their studies. The omission of information was not confined to statistical aspects-only 10 per cent
of the studies reported that they had obtained informed consent from the subjects in the study.
With the frequent absence of vital information, it is very hard if not impossible to evaluate
research papers. Most of the omissions may be unintentional, but it is very unsatisfactory in such
circumstances to have to assume that correct or reasonable procedures were used.

3. WHY ARE STANDARDS SO LOW?


There are many contributory factors to the generally low standard of statistics in medical journals.
It will help to consider first the reasons for the poor quality in papers submitted for publication and
then the reasons in published papers.

3.1. Submitted papers


In the majority of medical research papers no statistician is involved. The quality of the statistics in
such papers thus depends largely upon the statistical education of medical researchers. It is now
customary in Great Britain for medical students to receive formal statistics teaching, usually pre-
clinically. Perhaps because of the need to examine the students, these courses have tended to be
fairly strongly slanted towards the teaching of methods of analysis (such as significance tests),
although current attitudes amongst teachers of such courses appear rather more enlightened.
If such a course is the only training received (and similar courses are given to non-clinical medical
researchers) then it is inadequate for a research scientist. The ground covered would not be
extensive enough, and a large proportion of what was learnt would have been forgotten by the time
it was needed, perhaps ten years later. Many medical (and other) students will not enter research,
but all would benefit from some insight into statistical principles. It makes sense to consider
undergraduate teaching as an introduction to thinking statistically, and not just as a means of
learning a few simple tests to apply to data. Those entering research need a further course, with time
to consider design, analysis and interpretation in more depth. It is particularly important that such
courses give more emphasis to problem-based rather than method-based teaching. Often statistical
methods are misused because of an attempt to fit a different problem into a familiar framework. An
example is the widespread but incorrect use of correlation to compare two alternative measuring
instruments.13 Textbooks are very largely method-based too-very few consider this particular
problem at all and one24 even uses a comparison of two methods of blood pressure measurement as
an example of the use of the correlation coefficient.
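The pitfall is simple to demonstrate: two instruments can correlate almost perfectly while disagreeing substantially on every reading, because correlation measures linear association rather than agreement. A small simulated sketch (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(2)
true_bp = rng.normal(120, 15, 200)                     # underlying blood pressure
method_a = true_bp + rng.normal(0, 3, 200)             # first instrument
method_b = 1.1 * true_bp + 10 + rng.normal(0, 3, 200)  # second, systematically biased

r = np.corrcoef(method_a, method_b)[0, 1]
diff = method_b - method_a
print(f"r = {r:.2f}")                                  # about 0.97
print(f"mean disagreement = {diff.mean():.0f} mmHg")   # yet about 22 mmHg apart
```

A high r here says only that both instruments track the same underlying quantity, not that one can replace the other.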
The attitude of medical researchers also contributes to the low standard of statistics in their
papers. There is an overwhelming preference for tests of significance rather than confidence
intervals or other estimation procedures, and the results of such tests tend to be given undue
importance, with insufficient regard to the clinical implications (although this emphasis on
significance tests largely reflects the nature of most introductory courses and textbooks). It is even

rumoured that some journals will not publish studies unless a significant result (i.e. P < 0.05 or
even P < 0.01) has been achieved. If true, this is a travesty of the role that statistics should have in
medicine. A further problem is the requirement, for career purposes, to publish as much as possible,
which results in much sub-standard and unnecessarily repetitious research and numerous poor
papers.
It is also relevant that the training that statisticians receive is not sufficiently practical, and is
usually too general to contain a large component of techniques specific to medical statistics. Thus,
much of the medical statistician’s expertise has to be obtained in the job, and this will take several
years. Partly for this reason and partly because there are not enough posts for medical statisticians,
there is a shortage of statistical advice available to the medical researcher. Many scientists would be
only too pleased to get some expert assistance, but they have nobody to provide it.
Lastly, there is more that ethical committees could do to control the quality of research. There
are important ethical consequences of carrying out inferior research-these will be discussed in
detail later. Thus there should be more to ethical approval than just ensuring that the research is not
unethical in the usual sense. Ethical committees should not approve research that is scientifically
unsound, and this ought to include consideration of the statistical quality. It is, therefore,
particularly desirable that a statistician should be involved in ethical committee appraisal of
research proposals, but this is very rare.

3.2. Published papers


The previous section discussed areas where there is not likely to be any notable improvement in the
short term. Improving the standard of published papers is, however, a much more realistic
possibility. The poor standard of statistics in submitted papers need not necessarily affect the
standard of published papers, but the general standard of statistics in published papers is also low.
There is certainly plenty of room for improvement as so many published papers contain statistical
errors. Although their reputation rests largely on the quality of the papers they publish, the medical
journals have mostly been slow to appreciate the importance of good statistical methodology and
the consequences of publishing misleading results.
A past editor of the New England Journal of Medicine wrote: ' . . . publication in a reputable
journal automatically implies that the editor and his reviewers condone the experimentation'.25 He
was referring to the ethical nature of research in the accustomed sense, but surely this statement
should apply equally to the statistical aspects of a paper.
It is obvious that to improve the standard of published statistics it is necessary to stop publishing
poor papers. Several of those cited here, as well as many seen in the various reviews of medical
journals, clearly should not have been published, at least not without modification. At present, very
few journals have any system of statistical refereeing that would enable them to avoid publishing
such papers.

4. CONSEQUENCES OF THE MISUSE OF STATISTICS


There are many implications of carrying out and publishing research that do not normally need to
be considered. When a paper containing incorrect results (not necessarily through statistical
mistakes) is published there may be serious consequences, although surprisingly this does not seem
to be generally appreciated:
(i) Subjects used in the research have been put at risk or inconvenienced for no benefit.
This applies equally to the use of experimental animals.

(ii) Other resources have been diverted from more worthwhile use.
This refers notably to time and money, but also to hospital beds, laboratory facilities, etc.
(iii) Other patients may subsequently receive an inferior treatment either as a direct consequence of
the findings of the study or possibly by delaying the introduction of a better treatment.
Ellenberg and Nelson's study12 illustrated the first possibility, and the retrolental
fibroplasia story26 is an example of the latter.
(iv) Other scientists' research may be affected.
It may prove impossible to get ethical committee approval to investigate a treatment if a
previous study has found the treatment beneficial, even if that study was statistically
unreliable. On the other hand, if the results of some studies are accepted uncritically,
other scientists may be misled into abandoning useful treatments or investigating poor
ones.
(v) If the results go unchallenged the researcher(s) involved may use the same substandard
statistical methods again in subsequent work, and others may copy them.
The wide misuse of correlation coefficients for comparing methods is a good example of
this.
All of the above points make the misuse of statistics very much an ethical as well as a scientific
issue. Points (i) and (iii) are particularly relevant to medical research, but (ii) and (iv) apply more
widely and these should be considered to be ethical issues too. Point (v) applies even if the paper in
question does not actually produce misleading conclusions, so that any paper that misuses statistics
could be potentially harmful.

5. WHAT CAN BE DONE TO RAISE STANDARDS?


The above discussion has made it clear that it is highly desirable for scientific, ethical and moral
reasons to prevent the misuse of statistical methods in medical research. In this section I will
consider ways in which this might be achieved.
The vast majority of statistical errors are unintentional, usually resulting from an inadequate
understanding of statistical principles and methods. Ignorance is not really an acceptable defence,
since using statistical techniques (like using laboratory equipment or driving a car) requires
adequate training (or supervision).
Both long and short-term possibilities exist for the raising of the general standard of statistics;
these will be considered in turn.
5.1. Long term
In the long term, improvement may be achieved by changes in statistical education, as discussed in
Section 3.1. Statistics courses could give more emphasis to problem-solving rather than just
teaching various ‘methods’. It is especially important to emphasize to students the limitations in
what they have been taught-it is impossible to be comprehensive in a short statistics course. There
is a need for more postgraduate courses in statistics since most of what is learnt as an
undergraduate will be forgotten by the time it is needed.
Another area where we may hope for improvement is in the composition of ethical committees. It
is obvious that many studies have inferior statistical designs, especially with respect to sample size.
More involvement of statisticians at this stage would be beneficial.
Statisticians can help by making their views known on these issues, notably on teaching
syllabuses and membership of ethical committees. When they see misuses of statistics in published
papers, they should write letters to the journals to explain the mistakes and attempt to make
journal editors aware of the importance of good statistics. Statistics does not always get a very good

press in the medical world-it is in our interest to help to improve standards and increase
confidence in the value of correct statistics. Statisticians should try to convince other scientists that
collaboration in research is in their mutual interest; certainly medical research offers many
challenging problems for the statistician. Also statisticians should make more effort to understand
the problems of their medical colleagues, perhaps especially where there are direct clinical
implications of the research. Poor communication between statisticians and medical researchers is
certainly a problem-it is to be hoped that Statistics in Medicine will be able to make a valuable
contribution here.

5.2. Short-term
The above points all relate to the long term, and some may feel that not very much improvement in
the quality of statistics would result from the degree of change that is likely to occur. There are,
however, important possible short-term changes which could have a marked effect on statistical
standards, and these largely involve the medical journals.
As already discussed, the journals do not in general place much importance on the statistical
content of research papers. The preceding discussion should have shown that this is an area to
which they really need to give much more attention. The most important improvement would be
for more journals to institute a formal policy of statistical refereeing of papers, following the lead of
the relatively few enlightened journals.
Another useful step would be the development of comprehensive statistical guidelines for
prospective contributors to medical journals.
It is widely felt (by statisticians) that editors are not very amenable to suggestions such as these,
but I experienced a considerable amount of sympathy for these ideas when I participated in a
workshop for medical journal editors. They did appreciate the importance of good statistics, they
were keen to increase the amount of statistical refereeing, but did not know how to do this, and they
were enthusiastic about having guidelines for contributors. These last two topics will be considered
in more detail below.
Before considering how a statistical refereeing system would work, it is important to explain why
the correspondence pages of a journal do not provide an adequate safeguard.
It is not possible to ‘unpublish’ a paper, so that all of the previously outlined consequences of
publishing misleading results or conclusions will apply irrespective of any subsequent correspon-
dence. Additionally, any media publicity given to the original publication will rarely extend to any
ensuing critical comments. It is also important to realize that the majority of published papers will
not be read by anybody likely to detect the types of error illustrated above. Lastly, writing letters to
journals often leads to apparently unresolved disputes (which each side will believe itself to have
'won') which will baffle many readers. It is much better to have improved quality control at the
refereeing stage. Journals should, however, be duty bound to publish valid criticisms of papers they
have published; this does, of course, require judgment of what is ‘valid’.

5.3. Statistical refereeing


It is only action by the medical journals that will lead to an improvement in the standard of
statistics in medical journals in the short term. A comprehensive statistical refereeing system would
ideally require journals to:
(i) recruit statistical referees
(ii) send all papers involving statistics (including ‘short reports’) to a statistician (otherwise
there will be selection problems)
STATISTICS IN MEDICAL JOURNALS 69

(iii) send revised papers back to the same referee


(iv) state their policy on statistical refereeing
(v) give priority to well-executed and well-documented studies
(vi) encourage authors to send extra material for the referee’s benefit (such as computer output,
further details of methods used, and papers in press, and even the raw data if necessary)
(vii) employ editorial staff with some statistical knowledge.
These suggestions have been discussed in more detail elsewhere.27 It must be emphasized that
the above points constitute the ideal. It may be impractical to implement all of them-indeed the
first requirement may be none too easy to meet-but any action along these lines is to be welcomed,
and more journals should move in this direction. If those journals adopting statistical refereeing
schemes make their policy clear, as suggested in (iv) above, this may help to put further pressure on
those which do not have such a system.
It is probable that there are not nearly enough statisticians to cope with the large amount of extra
work that the above involvement would necessitate. This is, of course, no reason for not describing
how things might be improved. It is, however, unrealistic to expect that it would be possible to
persuade many journals to alter their refereeing systems overnight; it is much more likely that there
will be a continuation of the gradual move in this direction, but perhaps more quickly than at
present. To persuade more statisticians to referee for medical journals it is necessary to appreciate
that it may be unrealistic to expect statisticians to do this work without recompense-such
refereeing is rather different from the usual peer review where one referees papers by others in one’s
own subject.
Although it seems obvious that improved refereeing would lead to fewer published papers with
statistical errors, there is little direct evidence to support this. The only study to report on this28 was
very favourable, but is some fifteen years old. More recently the editors of another journal have
observed (in a footnote in a paper29) that since they implemented such a scheme 'there has been a
marked improvement in the selection and use of statistical methods to test the significance of
results in papers published in the Journal’. Just as journals seem quite keen to publish reviews of
the quality of statistics in the papers they have published, they should also consider assessing the
effectiveness of their statistical refereeing.

5.4. Guidelines for contributors


Very few medical journals offer much assistance to prospective contributors concerning statistics.
(This is in notable contrast to the position regarding the format of references, which is, in
comparison, a trivial problem.) Indeed, most journals do not mention statistical matters at all in
their advice to contributors. Some do have a few general comments, such as a requirement that the
statistical methods used should be specified, or statements like: ‘Significance should be given as
values of probability’.
A few journals give rather more recommendations, but even these are quite general and could
hardly be considered to be comprehensive. Textbooks could help here, but introductory statistics
books primarily concentrate on methods of analysing data (although the recent book by Glantz30
is unusual in focusing closely on errors in the medical literature).
There seems to be a need for rather more comprehensive statistical guidelines. (See O'Fallon et
al.31 for a discussion of this issue.) Such guidelines would certainly be useful for contributors and
journals, and might help the task of referees also. They would include advice on several areas of
statistics in research, notably design, descriptive statistics, analysis, presentation and interpretation.

6. CONCLUSIONS
Positive action is necessary to raise the standard of statistics in medical journals. In the short term,
considerable improvement is possible if the journals adopt tighter quality control. In the long term,
we may also hope for improvements in education, and the greater direct involvement of
statisticians in research.
If higher standards of publication were established and adhered to, it would become apparent to
researchers that there was little chance of papers with poor statistics being published in a reputable
journal.
It is surely in the interests of everyone, especially current and future patients, to try to achieve this
as soon as possible.

REFERENCES
1. Altman, D. G. 'Statistics and ethics in medical research. Misuse of statistics is unethical', British Medical Journal, 281, 1182-1184 (1980).
2. Lionel, N. D. W. and Herxheimer, A. 'Assessing reports of therapeutic trials', British Medical Journal, iii, 637-640 (1970).
3. Schor, S. and Karten, I. 'Statistical evaluation of medical journal manuscripts', Journal of the American Medical Association, 195, 1123-1128 (1966).
4. Gore, S. M., Jones, I. G. and Rytter, E. C. 'Misuse of statistical methods: critical assessment of articles in BMJ from January to March 1976', British Medical Journal, i, 85-87 (1977).
5. White, S. J. 'Statistical errors in papers in the British Journal of Psychiatry', British Journal of Psychiatry, 135, 336-342 (1979).
6. Glantz, S. 'Biostatistics: how to detect, correct and prevent errors in the medical literature', Circulation, 61, 1-7 (1980).
7. Fleming, D. M., Knox, J. D. E. and Crombie, D. L. 'Debendox in early pregnancy and fetal malformations', British Medical Journal, 283, 99-101 (1981).
8. Vivian, S. P. and Golding, J. 'Debendox in early pregnancy and fetal malformation', British Medical Journal, 283, 725 (1981).
9. Serfontein, G. L. and Jaroszewicz, A. M. 'Estimation of gestational age at birth', Archives of Disease in Childhood, 53, 509-511 (1978).
10. Altman, D. G. 'Estimation of gestational age at birth', Archives of Disease in Childhood, 54, 242-243 (1979).
11. Serfontein, G. L. and Jaroszewicz, A. M. 'Estimation of gestational age at birth', Archives of Disease in Childhood, 54, 243 (1979).
12. Ellenberg, J. H. and Nelson, K. B. 'Sample selection and the natural history of disease', Journal of the American Medical Association, 243, 1337-1340 (1980).
13. Altman, D. G. and Bland, J. M. 'Measurement in medicine: the analysis of method comparison studies', submitted for publication.
14. Clandinin, M. T., Chappell, J. E., Heim, T., Swyer, P. R. and Chance, G. W. 'Fatty acid accretion in fetal and neonatal liver: implications for fatty acid requirements', Early Human Development, 5, 7-14 (1981).
15. Robinson, M. J., Merrett, A. L., Tetlow, V. A. and Compston, J. E. 'Plasma 25-hydroxyvitamin D concentrations in preterm infants receiving oral vitamin D supplements', Archives of Disease in Childhood, 56, 144-145 (1981).
16. Oldham, P. D. Measurement in Medicine: The Interpretation of Numerical Data, English Universities Press, London, 1968, p. 148.
17. Freiman, J. A., Chalmers, T. C., Smith, H. and Kuebler, R. R. 'The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial', New England Journal of Medicine, 299, 690-694 (1978).
18. Bravo, E. L., Tarazi, R. C., Gifford, R. W. and Stewart, B. H. 'Circulating and urinary catecholamines in pheochromocytoma. Diagnostic and pathophysiologic implications', New England Journal of Medicine, 301, 682-686 (1979).
19. Habicht, J.-P. 'Some characteristics of indicators of nutritional status for use in screening and surveillance', American Journal of Clinical Nutrition, 33, 531-535 (1980).
20. McCarthy, D. 'Value of predictive values', New England Journal of Medicine, 302, 1479-1480 (1980).
21. Clarke, B. F. and Campbell, I. W. 'Long-term comparative trial of glibenclamide and chlorpropamide in diet-failed maturity onset diabetics', Lancet, i, 246-248 (1975).
22. Newton, J., Illingworth, R., Elias, J. and McEwan, J. 'Continuous intrauterine copper contraception for three years: comparison of replacement at two years with continuation of use', British Medical Journal, i, 197-199 (1977).
23. Mosteller, F., Gilbert, J. P. and McPeek, B. 'Reporting standards and research strategies for controlled trials. Agenda for the editor', Controlled Clinical Trials, 1, 37-58 (1980).
24. Daniel, W. W. Biostatistics: a Foundation for Analysis in the Health Sciences, (2nd edition), Wiley, New York (1978).
25. Woodford, F. P. 'Ethical experimentation and the editor', New England Journal of Medicine, 286, 892 (1972).
26. Silverman, W. A. 'The lesson of retrolental fibroplasia', Scientific American, 236(6), 100-107 (1977).
27. Altman, D. G. 'Statistics and ethics in medical research. VIII-Improving the quality of statistics in medical journals', British Medical Journal, 282, 44-47 (1981).
28. Schor, S. 'Statistical reviewing program for medical manuscripts', American Statistician, 21, 28-31 (1967).
29. Wallenstein, S., Zucker, C. L. and Fleiss, J. L. 'Some statistical methods useful in Circulation Research', Circulation Research, 47, 1-9 (1980).
30. Glantz, S. Primer of Biostatistics, McGraw-Hill, New York, 1981.
31. O'Fallon, J. R., Dubey, S. B., Salsburg, D. S., Edmonson, J. H., Soffer, A. and Colton, T. 'Should there be statistical guidelines for medical research papers?' Biometrics, 34, 687-695 (1978).
