STATISTICS IN MEDICAL JOURNALS*
Statistics in Medicine, 1, 59-71 (1982)
DOUGLAS G. ALTMAN
MRC Clinical Research Centre, Watford Road, Harrow, Middlesex, U.K.
SUMMARY
The general standard of statistics in medical journals is poor. This paper considers the reasons for this with
illustrations of the types of error that are common. The consequences of incorrect statistics in published
papers are discussed; these involve scientific and ethical issues. Suggestions are made about ways in which the
standard of statistics may be improved. Particular emphasis is given to the necessity for medical journals to
have proper statistical refereeing of submitted papers.
KEY WORDS: Medical journals; Statistical errors; Ethics; Statistical refereeing; Statistical guidelines
1. INTRODUCTION
It is widely recognized, especially by statisticians, that the general standard of statistics in journals
in the medical, biological and related fields is poor, although it is perhaps improving gradually. This
paper will consider, with reference to medical journals:
(i) What sorts of error are made?
(ii) Why is the overall standard of statistics low?
(iii) What are the consequences of poor statistics?
(iv) What can be done to improve the standard?
Although this paper will concentrate on statistics in medical journals, many of the points apply
equally well to journals in other fields. When discussing the low standard of statistics in medical
journals I will, of necessity, be talking about the general situation; there are, of course, notable
exceptions.
* Based upon a talk given to the Medical Section of the Royal Statistical Society on 19 May 1981.
(or misuse) of the t test. Secondly, there are considerable problems in deciding what is an error;
this will often be subjective. Thirdly, a large number of errors are relatively minor, in that they
would have no material bearing on the conclusions of the study. It must, however, be remembered
that a review of published papers is likely to lead to an underestimate of the prevalence of statistical
mistakes; many errors will be undetectable in the finished paper. For example, certain
observations might have been excluded from analysis because they were incompatible with the rest
of the data.
It is worth considering in some detail the sorts of error often made in the four areas of a research
paper already referred to: design, analysis, interpretation and presentation. For each a brief list of
common types of error will be followed by one or more specific examples from published papers.
2.1. Design
Perhaps the most common design error is to have too small a sample to get reliable and/or useful
results. A possibly serious consequence of having too small a sample (and thus a lack of statistical
'power') is the inability to detect an important effect. Yet it is common to find undue emphasis
placed on 'negative' findings from small studies (although if the results are properly presented it is
not necessary for the reader to rely upon the authors’ own interpretation of their results).
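The power problem can be made concrete with a rough calculation. The sketch below uses the familiar normal approximation for comparing two proportions; the group sizes and response rates are invented purely for illustration and are not taken from any study discussed here.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power_two_proportions(p1, p2, n1, n2, z_alpha=1.96):
    """Approximate power of a two-sided 5% test comparing two proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return normal_cdf(abs(p1 - p2) / se - z_alpha)

# A hypothetical trial with 25 patients per group and true response
# rates of 30 and 50 per cent: a clinically large difference.
power = power_two_proportions(0.3, 0.5, 25, 25)
print(round(power, 2))   # roughly 0.3: the study will usually miss the effect
```

Even for an effect this large, a trial of this size has only about a one-in-three chance of reaching significance, which is exactly why a 'negative' result from such a study says very little.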
The importance of sample size is illustrated by a recent study of the use of Debendox in early
pregnancy by Fleming et al.7 Although this was a large study (over 20,000 women) the number of
women in the study who had been prescribed the drug was small (620), especially as abnormal
births are relatively rare. The study thus had a high chance of not detecting a quite large adverse
drug effect.8 In the event there were fewer abnormal outcomes in the Debendox group (5.0 per cent
compared with 5.4 per cent) with a 95 per cent confidence interval for the relative risk of 0.64 to 1.33.
The authors commented that 'these results support the hypothesis that Debendox is not
teratogenic',7 but some caution is needed in interpreting their results, especially in the absence of
information about drug compliance.
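The quoted interval can be approximately reconstructed. The counts below are inferred from the percentages in the text (620 exposed women and an assumed 19,400 unexposed), so this is a sketch of the standard log relative-risk method, not the authors' own computation.

```python
from math import exp, log, sqrt

# Counts inferred from the text: 5.0% of 620 exposed, 5.4% of ~19,400 unexposed
a, n1 = 31, 620        # abnormal outcomes among the Debendox-exposed
c, n2 = 1048, 19400    # abnormal outcomes among the unexposed

rr = (a / n1) / (c / n2)                         # relative risk
se = sqrt(1/a - 1/n1 + 1/c - 1/n2)               # standard error of log(rr)
lo, hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)
print(round(rr, 2), round(lo, 2), round(hi, 2))  # near the published 0.64 to 1.33
```

The wide interval, spanning a 35 per cent reduction to a 30 per cent increase in risk, is the point: such data cannot rule out a clinically important adverse effect.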
Other errors are perhaps more serious in that they may lead to incorrect results. In controlled
trials the use of an inappropriate control group, for example having non-concurrent or historical
controls, will severely weaken the credibility of the results, as will the failure to randomize the
subjects to the alternative treatments. In matched studies, for example case-control studies,
published papers often show unequal numbers in the groups with no comment from the authors;
this might indicate poor design. Sometimes studies are designed in such a way that the effect being
studied is inseparable from other factors.
A good example of this last point is given by a study to compare two methods of assessing the
gestational age of newborn babies.9 All assessments using one method were made by a single
observer, whereas all the assessments using the second method were made by a different observer.
The effect of interest, the difference between the methods, was clearly confounded with possible
between-observer differences. Observer variability is very common, especially when, as in this case,
the observations involve subjective judgments. The authors' response when this criticism was
voiced10 illustrated their failure to appreciate this point: 'It was felt that . . . two different
observers . . . would result in greater objectivity'.11
An example of a more common error involves the selection of study groups for the purpose of
estimating prevalence. This problem is well-recognized in surveys, but would appear to be less well
appreciated in clinical research. Ellenberg and Nelson12 reviewed 23 follow-up studies of children
who had had febrile seizures. Seventeen of these studies involved children attending special clinics.
The other six obtained their sample from an initial population study to identify such children. One
purpose of these studies was to assess the risk of such children subsequently having non-febrile
Figure 1. Percentage of children who experienced non-febrile seizures after one or more febrile seizures
in 23 studies (from Ellenberg and Nelson12)
seizures. This is an important question as the degree of risk would determine whether or not
prophylactic treatment should be prescribed routinely.
The results from these 23 studies are shown in Figure 1. It is clear that the six population-based
studies yielded much lower estimates of the risk; they were also much more consistent than were the
results from the other 17 studies. This is not really surprising. We would expect the children
attending special clinics to include more severe cases; variability in the degree of this bias would
explain the wide range of observed prevalences of further seizures. Ellenberg and Nelson observed:
‘Only 5 of the reports of the clinic-based studies acknowledged the potential for sample selection
bias, and of those five, only one carried such acknowledgement into the conclusions’. They also
pointed out that, despite the results of the population-based studies, many authors recommend
that all such children should receive therapeutic levels of phenobarbital daily for as long as six
years. Such advice must have affected very many children, yet it is based on the results of studies
that did not have an appropriate design to answer questions about the prevalence of recurrence.
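The selection effect described above is easy to demonstrate by simulation. All the probabilities below are invented for illustration; they merely encode the assumption that severe cases are both more likely to be referred to a special clinic and more likely to have further seizures.

```python
import random

random.seed(2)
N = 100_000
pop_recur = pop_n = clinic_recur = clinic_n = 0
for _ in range(N):
    severe = random.random() < 0.2                          # assumed fraction of severe cases
    recurs = random.random() < (0.25 if severe else 0.03)   # assumed recurrence risks
    pop_n += 1
    pop_recur += recurs
    if random.random() < (0.8 if severe else 0.1):          # severe cases referred more often
        clinic_n += 1
        clinic_recur += recurs
print(f"population-based: {pop_recur / pop_n:.1%}")         # close to the true 7.4%
print(f"clinic-based:     {clinic_recur / clinic_n:.1%}")   # biased well above it
```

The clinic-based sample over-represents severe cases, so its estimate of the recurrence risk is roughly doubled, mirroring the gap between the two groups of studies in Figure 1.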
2.2. Analysis
Errors in analysis are probably quite familiar. It is particularly common to find that the
assumptions underlying a statistical technique have been violated (although often it is impossible
to tell). Regression, correlation, t tests and analysis of variance all make specific assumptions about
the distribution of the data. Another problem occurs when related observations such as repeated
measurements on the same subject are treated as independent observations. Finally, an incorrect or
inappropriate analysis is often the result of using a statistical method that is too simple for the
problem in hand. Examples include the use of t tests instead of analysis of variance or covariance
(the latter being surprisingly rarely used), and the investigation of multivariate data by considering
variables two at a time (such as using regression, correlation or two-way contingency tables) rather
than by multiple regression or multivariate techniques. Thus often an outcome measure is
separately regressed on (or correlated with) two or more explanatory variables without any
investigation of the inter-relationships between these variables.
Apart from the relatively obvious misuse of t and chi-squared tests,2-6 there are numerous other
mistakes that can be and are made in data analysis. These include the combination, in a clinical trial,
of the results for all non-responders or drop-outs with those on placebo therapy, the exclusion of
subjects from analysis on the basis of their results, and several misapplications of correlation,
notably when comparing alternative methods of measurement.13 All of these errors may have a
profound effect on the results. The following two examples are interesting but not necessarily
typical illustrations of poor statistical analysis.
Figure 2. Fatty acid content of infant liver by age with fitted cubic curve (from Figure 3H in Clandinin et al.14)
Figure 2 shows some data relating fatty acids in infant liver to age.14 To these data the authors
fitted a cubic curve (as shown) and pronounced this ‘significant’ with r = 0.47, P < 0.05. The
correlation quoted is, however, a multiple correlation coefficient. The linear correlation is only
about 0.1 and is not nearly significant ( P > 0.9). Simple inspection suggests that the data are much
too variable to draw any useful conclusions. The astonishing aspect of this analysis is that it is one
of no less than 83 cubic curves fitted to different (but similar) sets of data (in four papers including
the reference given). In no case was any justification, either statistical or scientific, given for fitting
cubic curves; there clearly was none. Nor in any case was the cubic curve shown to be a better fit to
the data than a quadratic curve (let alone a straight line). Yet the authors have drawn conclusions
from the shapes of these curves; none of their conclusions can have any credibility.
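The comparison the authors never made, whether a cubic fits any better than a straight line, is a standard nested-model F-test. The sketch below uses simulated noise-only data (not the published measurements) and a small pure-Python least-squares fit; a multiple correlation of 0.47 on its own proves nothing, because a cubic always fits at least as well as a line.

```python
import random
from math import fsum

def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for low degree)."""
    m = degree + 1
    A = [[fsum(x**(i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [fsum((x**i) * y for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):                      # Gaussian elimination with partial pivoting
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * m
    for r in reversed(range(m)):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, m))) / A[r][r]
    return coef

def rss(xs, ys, coef):
    """Residual sum of squares of a fitted polynomial."""
    return fsum((y - sum(c * x**i for i, c in enumerate(coef)))**2
                for x, y in zip(xs, ys))

random.seed(1)
xs = [random.uniform(0, 16) for _ in range(30)]   # ages in weeks (simulated)
ys = [20 + random.gauss(0, 5) for _ in xs]        # pure noise: no real trend
rss_line = rss(xs, ys, polyfit(xs, ys, 1))
rss_cubic = rss(xs, ys, polyfit(xs, ys, 3))
# F-statistic for the two extra cubic terms, to be compared with F(2, 26)
F = ((rss_line - rss_cubic) / 2) / (rss_cubic / (len(xs) - 4))
```

Only when F exceeds the appropriate F(2, n - 4) critical value is there any evidence that the cubic terms earn their keep; the cubic will always have the smaller residual sum of squares simply because it has more parameters.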
Robinson et al.15 looked at the effect of oral vitamin D supplements on plasma
25-hydroxyvitamin D (25-OHD) concentrations in pre-term infants. Their results are shown in Figure
3, where the plasma 25-OHD level at 14 weeks is plotted against the percentage rise from 14 to 36
weeks. The authors described the results of their analysis thus: ‘There was a significant correlation
between the plasma 25-OHD level at day 14 and the percentage rise at 36 weeks (r = -0.68,
P < 0.0025)’. This led them to say that: ‘The negative correlation . . . indicates that there may be
constraints on 25-hydroxylation in the pre-term infant . . .’
There were several statistical errors in this study, the most important of which was the use of
correlation to relate the change in a quantity to its initial value. It is well-known that this
procedure results in a spurious negative correlation even when the initial and subsequent levels are
unrelated.16 Other problems include the dubious use of percentages to assess the individual rises,
Figure 3. Plasma 25-OHD level at day 14 plotted against the percentage rise in 25-OHD at 36 weeks
(from Robinson et al.15)
2.3. Interpretation
The majority of errors in interpretation relate to tests of significance. Although very widely familiar
and used in almost every paper presenting research findings, the general degree of understanding of
the true meaning of such tests is very low. It is common to find that the meaning of the significance
level is misunderstood. P is not the probability of making a mistake when taking the observed effect
as a true one.
Although it may be tempting to accept either the clinical importance or statistical significance of
a result as indicating an important finding, ideally both are required. Often, however, statistical
significance is taken to indicate an important finding regardless of clinical importance, or a
marginally non-significant result is interpreted as showing no effect even when the results appear
clinically important. Such circumstances require a more flexible attitude to interpretation than is
often adopted. This is, though, an area which highlights a weakness of the significance test
approach.
Another aspect of interpretation, the over-confidence frequently placed on negative (non-significant)
results from small studies, has been superbly illustrated by the study of Freiman and
colleagues.17 They found that of 71 supposedly negative studies (i.e. P > 0.1) 50 had a greater than
10 per cent risk of missing a true improvement of 50 per cent. They observed that many of the
therapies investigated in these studies had not received a fair test, and that 'In most studies the lack
of a (significant) difference . . . was taken to mean that no clinically meaningful difference existed.'
Not all errors of interpretation are related to significance tests, however. Problems are common
in the evaluation of the usefulness of a diagnostic test. Here, over-reliance on sensitivity and
specificity can be very misleading.
Bravo and colleagues18 found that plasma catecholamines were elevated in 22 of 23 subjects with
pheochromocytomas, but in only 12 of 40 normal subjects. The sensitivity and specificity were thus
96 per cent (22/23) and 70 per cent (28/40), respectively. It is very common to see sensitivity
(especially) and specificity used to assess the value of a diagnostic test, but they only tell us what
proportions of subjects known either to have the disease or not were correctly identified. What we
really want to know is the usefulness of a positive test in predicting the disease (and vice versa). If we
know the prevalence of the disease we can calculate the ‘positive predictive value’ as
sensitivity x prevalence / [sensitivity x prevalence + (1 - specificity) x (1 - prevalence)]
and the 'negative predictive value' by interchanging the sensitivity and specificity.19 In the present
case it has been suggested that: 'If one were to screen all hypertensive patients for
pheochromocytomas, the prevalence would be about 0.5 per cent.'20 For these data the predictive
power of positive and negative results work out at 1.6 per cent and 99.97 per cent. McCarthy20 has
calculated comparative figures for an alternative test (urinary VMA) as 35 per cent and 99.98 per
cent, which are clearly better. These are the main considerations necessary when evaluating the
usefulness of diagnostic tests-they are very often missing and arguments are incorrectly based on
the sensitivity and specificity alone.
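The calculation above is easy to mechanize. The function below is a direct transcription of the formula, applied to the figures quoted in the text (sensitivity 22/23, specificity 28/40, and the suggested screening prevalence of 0.5 per cent).

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive values from sensitivity, specificity and prevalence."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

# Figures from the plasma catecholamine example, with prevalence 0.5%
ppv, npv = predictive_values(22 / 23, 28 / 40, 0.005)
print(f"PPV {ppv:.1%}, NPV {npv:.2%}")   # PPV 1.6%, NPV 99.97%
```

Despite the impressive-looking sensitivity, fewer than 2 in 100 positive screening results would be true cases at this prevalence, which is exactly why sensitivity and specificity alone are not enough to judge a test.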
It is rumoured that some journals will not publish studies unless a significant result (i.e. P < 0.05 or
even P < 0.01) has been achieved. If true, this is a travesty of the role that statistics should have in
medicine. A further problem is the requirement, for career purposes, to publish as much as possible,
which results in much sub-standard and unnecessarily repetitious research and numerous poor
papers.
It is also relevant that the training that statisticians receive is not sufficiently practical, and is
usually too general to contain a large component of techniques specific to medical statistics. Thus,
much of the medical statistician’s expertise has to be obtained in the job, and this will take several
years. Partly for this reason and partly because there are not enough posts for medical statisticians,
there is a shortage of statistical advice available to the medical researcher. Many scientists would be
only too pleased to get some expert assistance, but they have nobody to provide it.
Lastly, there is more that ethical committees could do to control the quality of research. There
are important ethical consequences of carrying out inferior research-these will be discussed in
detail later. Thus there should be more to ethical approval than just ensuring that the research is not
unethical in the usual sense. Ethical committees should not approve research that is scientifically
unsound, and this ought to include consideration of the statistical quality. It is, therefore,
particularly desirable that a statistician should be involved in ethical committee appraisal of
research proposals, but this is very rare.
(ii) Other resources have been diverted from more worthwhile use.
This refers notably to time and money, but also to hospital beds, laboratory facilities, etc.
(iii) Other patients may subsequently receive an inferior treatment either as a direct consequence of
the findings of the study or possibly by delaying the introduction of a better treatment.
Ellenberg and Nelson's study12 illustrated the first possibility, and the retrolental
fibroplasia story26 is an example of the latter.
(iv) Other scientists' research may be affected.
It may prove impossible to get ethical committee approval to investigate a treatment if a
previous study has found the treatment beneficial, even if that study was statistically
unreliable. On the other hand, if the results of some studies are accepted uncritically,
other scientists may be misled into abandoning useful treatments or investigating poor
ones.
(v) If the results go unchallenged the researcher(s) involved may use the same substandard
statistical methods again in subsequent work, and others may copy them.
The wide misuse of correlation coefficients for comparing methods is a good example of
this.
All of the above points make the misuse of statistics very much an ethical as well as a scientific
issue. Points (i) and (iii) are particularly relevant to medical research, but (ii) and (iv) apply more
widely and these should be considered to be ethical issues too. Point (v) applies even if the paper in
question does not actually produce misleading conclusions, so that any paper that misuses statistics
could be potentially harmful.
press in the medical world; it is in our interest to help to improve standards and increase
confidence in the value of correct statistics. Statisticians should try to convince other scientists that
collaboration in research is in their mutual interest; certainly medical research offers many
challenging problems for the statistician. Also statisticians should make more effort to understand
the problems of their medical colleagues, perhaps especially where there are direct clinical
implications of the research. Poor communication between statisticians and medical researchers is
certainly a problem; it is to be hoped that Statistics in Medicine will be able to make a valuable
contribution here.
5.2. Short-term
The above points all relate to the long term, and some may feel that not very much improvement in
the quality of statistics would result from the degree of change that is likely to occur. There are,
however, important possible short-term changes which could have a marked effect on statistical
standards, and these largely involve the medical journals.
As already discussed, the journals do not in general place much importance on the statistical
content of research papers. The preceding discussion should have shown that this is an area to
which they really need to give much more attention. The most important improvement would be
for more journals to institute a formal policy of statistical refereeing of papers, following the lead of
the relatively few enlightened journals.
Another useful step would be the development of comprehensive statistical guidelines for
prospective contributors to medical journals.
It is widely felt (by statisticians) that editors are not very amenable to suggestions such as these,
but I experienced a considerable amount of sympathy for these ideas when I participated in a
workshop for medical journal editors. They did appreciate the importance of good statistics; they
were keen to increase the amount of statistical refereeing, but did not know how to do this; and they
were enthusiastic about having guidelines for contributors. These last two topics will be considered
in more detail below.
Before considering how a statistical refereeing system would work, it is important to explain why
the correspondence pages of a journal do not provide an adequate safeguard.
It is not possible to ‘unpublish’ a paper, so that all of the previously outlined consequences of
publishing misleading results or conclusions will apply irrespective of any subsequent correspon-
dence. Additionally, any media publicity given to the original publication will rarely extend to any
ensuing critical comments. It is also important to realize that the majority of published papers will
not be read by anybody likely to detect the types of error illustrated above. Lastly, writing letters to
journals often leads to apparently unresolved disputes (which each side will believe itself to have
'won') which will baffle many readers. It is much better to have improved quality control at the
refereeing stage. Journals should, however, be duty bound to publish valid criticisms of papers they
have published; this does, of course, require judgment of what is ‘valid’.
6. CONCLUSIONS
Positive action is necessary to raise the standard of statistics in medical journals. In the short term,
considerable improvement is possible if the journals adopt tighter quality control. In the long term,
we may also hope for improvements in education, and the greater direct involvement of
statisticians in research.
If higher standards of publication were established and adhered to, it would become apparent to
researchers that there was little chance of papers with poor statistics being published in a reputable
journal.
It is surely in the interests of everyone, especially current and future patients, to try to achieve this
as soon as possible.
REFERENCES
1. Altman, D. G. 'Statistics and ethics in medical research. Misuse of statistics is unethical', British Medical
Journal, 281, 1182-1184 (1980).
2. Lionel, N. D. W. and Herxheimer, A. 'Assessing reports of therapeutic trials', British Medical Journal, iii,
637-640 (1970).
3. Schor, S. and Karten, I. 'Statistical evaluation of medical journal manuscripts', Journal of the American
Medical Association, 195, 1123-1128 (1966).
4. Gore, S. M., Jones, I. G. and Rytter, E. C. 'Misuse of statistical methods: critical assessment of articles in
BMJ from January to March 1976', British Medical Journal, i, 85-87 (1977).
5. White, S. J. 'Statistical errors in papers in the British Journal of Psychiatry', British Journal of Psychiatry,
135, 336-342 (1979).
6. Glantz, S. 'Biostatistics: how to detect, correct and prevent errors in the medical literature', Circulation, 61,
1-7 (1980).
7. Fleming, D. M., Knox, J. D. E. and Crombie, D. L. 'Debendox in early pregnancy and fetal
malformations', British Medical Journal, 283, 99-101 (1981).
8. Vivian, S. P. and Golding, J. ‘Debendox in early pregnancy and fetal malformation’, British Medical
Journal, 283, 725 (1981).
9. Serfontein, G. L. and Jaroszewicz, A. M. 'Estimation of gestational age at birth', Archives of Disease in
Childhood, 53, 509-511 (1978).
10. Altman, D. G. 'Estimation of gestational age at birth', Archives of Disease in Childhood, 54, 242-243 (1979).
11. Serfontein, G. L. and Jaroszewicz, A. M. 'Estimation of gestational age at birth', Archives of Disease in
Childhood, 54, 243 (1979).
12. Ellenberg, J. H. and Nelson, K. B. 'Sample selection and the natural history of disease', Journal of the
American Medical Association, 243, 1337-1340 (1980).
13. Altman, D. G. and Bland, J. M. ‘Measurement in medicine: the analysis of method comparison studies’,
Submitted for publication.
14. Clandinin, M. T., Chappell, J. E., Heim, T., Swyer, P. R. and Chance, G. W. 'Fatty acid accretion in fetal
and neonatal liver: implications for fatty acid requirements’, Early Human Development, 5, 7-14 (1981).
15. Robinson, M. J., Merrett, A. L., Tetlow, V. A. and Compston, J. E. 'Plasma 25-hydroxyvitamin D
concentrations in preterm infants receiving oral vitamin D supplements', Archives of Disease in Childhood,
56, 144-145 (1981).
16. Oldham, P. D. Measurement in Medicine: The Interpretation of Numerical Data, English Universities Press,
London, 1968, p. 148.
17. Freiman, J. A., Chalmers, T. C., Smith, H. and Kuebler, R. R. 'The importance of beta, the type II error and
sample size in the design and interpretation of the randomized control trial', New England Journal of
Medicine, 299, 690-694 (1978).
18. Bravo, E. L., Tarazi, R. C., Gifford, R. W. and Stewart, B. H. ‘Circulating and urinary catecholamines in
pheochromocytoma. Diagnostic and pathophysiologic implications’, New England Journal of Medicine,
301,682-686 (1979).
19. Habicht, J.-P. 'Some characteristics of indicators of nutritional status for use in screening and
surveillance', American Journal of Clinical Nutrition, 33, 531-535 (1980).
20. McCarthy, D. ‘Value of predictive values’, New England Journal of Medicine, 302, 1479-1480 (1980).
21. Clarke, B. F. and Campbell, I. W. 'Long-term comparative trial of glibenclamide and chlorpropamide in
diet-failed maturity onset diabetics', Lancet, i, 246-248 (1975).
22. Newton, J., Illingworth, R., Elias, J. and McEwan, J. ‘Continuous intrauterine copper contraception for
three years: comparison of replacement at two years with continuation of use’, British Medical Journal, i,
197-199 (1977).
23. Mosteller, F., Gilbert, J. P. and McPeek, B. ‘Reporting standards and research strategies for controlled
trials. Agenda for the editor’, Controlled Clinical Trials, 1, 37-58 (1980).
24. Daniel, W. W. Biostatistics: a Foundation for Analysis in the Health Sciences, (2nd edition), Wiley, New
York (1978).
25. Woodford, F. P. ‘Ethical experimentation and the editor’, New England Journal of Medicine, 286, 892
(1972).
26. Silverman, W. A. 'The lesson of retrolental fibroplasia', Scientific American, 236(6), 100-107 (1977).
27. Altman, D. G. 'Statistics and ethics in medical research. VIII-Improving the quality of statistics in
medical journals', British Medical Journal, 282, 44-47 (1981).
28. Schor, S. 'Statistical reviewing program for medical manuscripts', American Statistician, 21, 28-31 (1967).
29. Wallenstein, S., Zucker, C. L. and Fleiss, J. L. 'Some statistical methods useful in circulation research',
Circulation Research, 47, 1-9 (1980).
30. Glantz, S. Primer of Biostatistics, McGraw-Hill, New York, 1981.
31. O'Fallon, J. R., Dubey, S. B., Salsburg, D. S., Edmonson, J. H., Soffer, A. and Colton, T. 'Should there be
statistical guidelines for medical research papers?' Biometrics, 34, 687-695 (1978).