ORIGINAL PAPER
Systematic reviews and meta-analyses have emerged as an important branch of biomedical research. They are considered to provide the best scientific evidence to support or to reject the use of certain interventions and to identify prognostic, aetiological and risk factors of a particular condition or disease. Whereas the terms “systematic review” and “meta-analysis” are often used synonymously, the latter should mainly be regarded as a statistical technique to quantify and to summarise the information provided by a thorough systematic review of published and unpublished studies.

The major goals of meta-analysis are to display the range of effect sizes of individual investigations and to compile these data into some measure of overall efficacy that can be uniformly interpreted and communicated (which means that audiences with varying background knowledge are able to gain similar information on the topic of interest). These basic requirements have been met by the established methods of meta-analysis of therapeutic studies.

Results are usually presented graphically by means of a forest plot, which gives an excellent impression of the dispersion of effect sizes. The common point estimates, such as odds ratios (OR), relative risks (RR) and risk differences (RD), in combination with their related 95% or 99% confidence intervals, allow clear inferences to be drawn from the available data: the experimental arm fares better, fares equivocally (there is no evidence of a difference) or fares worse than the control arm.

The problem that occurs in a meta-analysis of diagnostic studies is the multidirectional performance of the diagnostic instrument regarding its ability to detect (specificity) or to exclude (sensitivity) the characteristic of interest. Multidimensional outcomes cannot be summarised well by a single estimate. The results gained from a diagnostic test will increase or decrease the probability of there being disease in the presence of certain signs, symptoms and other test results; this is often referred to as Bayes’ theorem.[1]

In diagnostics, the ability of a test to revise the prior probability of disease is summarised by the likelihood ratio. The likelihood ratio is the likelihood that a given test result will be observed in a patient with the target disorder compared with the likelihood that the same result will be observed in a patient without the target disorder.[2] The positive likelihood ratio is the ratio between the chance of a positive test result in the presence of the characteristic under investigation and the chance of a positive result in the absence of this attribute. For example, a positive likelihood ratio of 4.0 means that a positive test result is four times more likely in a diseased subject than in a healthy person. Likewise, the negative likelihood ratio represents the ratio between the chance of a negative test result in the presence of the characteristic under investigation and the chance of a negative result in the absence of this attribute. Given the positive and negative likelihood ratio of a test procedure and a certain prevalence, or prior disease probability, the related probability of disease can easily be obtained from nomograms.[2,3]

By convention, marked changes in prior disease probability can be assumed in positive likelihood ratios exceeding 10.0 and negative likelihood ratios below 0.1.[4,5] The scientific background for these somewhat arbitrary thresholds can be found in Bayesian decision theory: “classical” significance levels often correspond to still substantial posterior probabilities of the null hypothesis of “no change in prior belief”. For instance, a p value of 0.05 (which would normally support rejection of the assumption of a null effect) can still correspond to a 52% posterior probability of the null hypothesis.[6]
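As a rough illustration of how a likelihood ratio revises a prior disease probability (the calculation that a nomogram performs graphically), the following minimal sketch applies the odds form of Bayes' theorem. The function name `posterior_probability` and the prior of 0.20 are illustrative assumptions, not values taken from the study.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Revise a prior disease probability with a likelihood ratio.

    Uses the odds form of Bayes' theorem:
    posterior odds = prior odds * likelihood ratio.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood ratio must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    prior = 0.20  # hypothetical prior probability, chosen for illustration only
    # 10.0 and 0.1 are the conventional thresholds mentioned in the text;
    # 4.0 is the example value used in the text; 1.0 leaves the prior unchanged.
    for lr in (10.0, 4.0, 1.0, 0.1):
        post = posterior_probability(prior, lr)
        print(f"prior {prior:.2f}, LR {lr:5.1f} -> posterior {post:.3f}")
```

With a likelihood ratio of 1.0 the posterior equals the prior, which is why values near 1 carry little diagnostic information.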
For the purpose of this study we stress only that an investigation of a certain medical intervention must provide a large amount of additional information to change prior knowledge. Since no threshold values of sensitivity or specificity are available that would allow either the adoption or the rejection of the routine application of a diagnostic procedure, likelihood ratios appear as preferable indices of test performance, at least in the setting of clinical decision-making.

In this investigation we were interested in how likelihood ratios can be used to conduct a diagnostic meta-analysis. Our objectives were to develop a clearly arranged graphical presentation of the results from individual diagnostic studies, and to obtain a summary measure of diagnostic test efficacy by reasonable computational efforts.

METHODS
Basic considerations
The results of a diagnostic test with a dichotomous outcome can be easily summarised in a 2 × 2 table. Sensitivity (SN, the number of true positives divided by all subjects with the condition) and specificity (SP, the number of true negatives divided by all subjects without the condition) are well known characteristics of a diagnostic intervention. Knowledge of both indices is required to appraise test precision fully. However, one might think of clinical situations in which only one of these characteristics is of real interest. For example, the Ottawa ankle rules provide 99% SN to exclude ankle fractures, whereas a 99% SP of transcranial Doppler sonography allows detection of vasospasm in the middle cerebral artery.[7,8]

Many attempts have been made to calculate a single summary statistic from a set of 2 × 2 tables; this statistic is often referred to as the “diagnostic odds ratio”. In fact, the application of a diagnostic test always yields two odds ratios: the prior odds will be increased with positive test findings (leading to an odds ratio ranging from 1 to infinity) and will be decreased if the test turns out negative (with an odds ratio between 0 and 1).

Multidimensional forest plots
The efficacy of a diagnostic test is precisely characterised by its positive and its negative likelihood ratio in both qualitative (the vector of test performance) and quantitative (the size of this vector) terms.

A test set of four different meta-analytic scenarios was constructed, each comprising 10 fictitious studies of identical size (n = 2485). The studies included in these meta-analyses showed the possible combinations of test characteristics: a test of high SN and SP; a test of low SN and high SP; a test of high SN and low SP; and a test of low SN and SP.

The individual positive likelihood ratio (LR+) can be expressed as:

LR+ = SN/(1 - SP)

The negative likelihood ratio is calculated as:

LR- = (1 - SN)/SP.

Positive and negative likelihood ratios were then arranged in a “multidimensional forest plot” together with their 95% confidence intervals. We hypothesised that this method of graphical presentation could be easily interpreted, especially by readers already used to the “classic” forest plots of therapeutic meta-analyses. The discriminating features of a test will be reflected by the localisation of the data cloud within the quadrants of the matrix space, and this in turn will allow a rapid assessment of the potential usefulness of a diagnostic procedure, even by eyeballing.

The basic steps to obtain a summary measure in a meta-analysis are to weight individual effect measures according to their variance and then to divide the sum of the weighted quantities by the sum of weights.

We adopted general inverse-variance weighted fixed-effects models to summarise likelihood ratios.[9–13] In case of statistical heterogeneity, the DerSimonian–Laird random-effects method was applied to account for both within-study and between-study variance.

In analogy to therapeutic meta-analyses, summary estimates within the matrix space were given prominence over individual results by the use of filled symbols.

Worked example
Focused abdominal sonography for trauma (FAST) is an emergency ultrasound protocol to screen for free intra-abdominal fluid after blunt injuries. In general, the examination is performed in three or four standard planes to detect fluid collections in the perihepatic space (Morison’s pouch), surrounding the spleen, and in the pouch of Douglas. Approximately 20% of all intra-abdominal traumatic lacerations are not accompanied by significant haemoperitoneum. In comparisons with other reference standards, such as diagnostic peritoneal lavage (DPL) or helical computed tomography (HCT), the ability of FAST examination to detect or to exclude the underlying visceral lesion remains unclear.

In the meta-analysis, 11 trials were eligible for analysis, with a total of 2819 subjects; they employed ultrasonography to detect organ injury and made comparisons with either DPL or HCT.[14] It was obvious that ultrasound provided extremely high specificity (range 0.84–1.00) but sensitivity was below 90% in nine of the 11 trials. Table 1 summarises
Table 1 Major characteristics and descriptive statistics of studies included in the meta-analysis of emergency ultrasound (FAST) [14]

Reference      n     TP   FN   TN    FP    Prior probability   Sensitivity (95% CI)   Specificity (95% CI)
Akgür F        208   42   8    157   1     0.24                0.84 (0.79, 0.89)      0.99 (0.98, 1.00)
Froelich JW    26    13   2    10    1     0.58                0.87 (0.74, 1.00)      0.91 (0.80, 1.00)
Förster R      140   28   1    107   4     0.21                0.97 (0.94, 1.00)      0.96 (0.93, 0.99)
Goletti O      73    26   4    42    1     0.41                0.87 (0.79, 0.94)      0.98 (0.94, 1.00)
Healey MA      796   45   6    728   17    0.06                0.88 (0.86, 0.90)      0.98 (0.97, 0.99)
Katz S         121   10   1    92    18    0.09                0.91 (0.86, 0.96)      0.84 (0.77, 0.90)
Krupnick AS    64    20   12   32    0*    0.52                0.62 (0.50, 0.74)      0.98 (0.95, 1.00)
McGahan JP     121   24   14   79    4     0.31                0.63 (0.55, 0.72)      0.95 (0.91, 0.99)
McKenney KL    884   95   15   768   6     0.12                0.86 (0.84, 0.89)      0.99 (0.99, 1.00)
Röthlin MA     313   24   31   257   1     0.18                0.44 (0.38, 0.49)      1.00 (0.99, 1.00)
Singh G        73    26   9    33    5     0.48                0.74 (0.64, 0.84)      0.87 (0.79, 0.95)

* In case of zero values, all cells were corrected for continuity by adding 0.5.
The number of subjects enrolled into the trials is denoted by n. TP, FN, TN and FP are the numbers of true-positive, false-negative, true-negative and false-positive findings, respectively. Values in parentheses are 95% confidence intervals.
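The following sketch shows how the counts in Table 1 could be turned into study-level likelihood ratios and an inverse-variance weighted fixed-effects summary. It rests on several assumptions that the text does not state: pooling is done on the logarithmic scale, the variance of each log likelihood ratio uses the usual large-sample formula for a ratio of two proportions, and the 95% interval uses a normal quantile of 1.96. The function names are hypothetical, and the output is not claimed to reproduce the published summary estimates.

```python
import math

# (reference, TP, FN, TN, FP), copied from Table 1
STUDIES = [
    ("Akgür F", 42, 8, 157, 1),
    ("Froelich JW", 13, 2, 10, 1),
    ("Förster R", 28, 1, 107, 4),
    ("Goletti O", 26, 4, 42, 1),
    ("Healey MA", 45, 6, 728, 17),
    ("Katz S", 10, 1, 92, 18),
    ("Krupnick AS", 20, 12, 32, 0),
    ("McGahan JP", 24, 14, 79, 4),
    ("McKenney KL", 95, 15, 768, 6),
    ("Röthlin MA", 24, 31, 257, 1),
    ("Singh G", 26, 9, 33, 5),
]


def likelihood_ratios(tp, fn, tn, fp):
    """Return (LR+, LR-, var(log LR+), var(log LR-)) for one 2 x 2 table."""
    if 0 in (tp, fn, tn, fp):
        # Continuity correction from the Table 1 footnote: add 0.5 to all cells.
        tp, fn, tn, fp = (cell + 0.5 for cell in (tp, fn, tn, fp))
    diseased = tp + fn
    healthy = tn + fp
    lr_pos = (tp / diseased) / (fp / healthy)   # SN / (1 - SP)
    lr_neg = (fn / diseased) / (tn / healthy)   # (1 - SN) / SP
    # Assumed large-sample variances of the log ratios (not given in the text).
    var_log_pos = 1 / tp - 1 / diseased + 1 / fp - 1 / healthy
    var_log_neg = 1 / fn - 1 / diseased + 1 / tn - 1 / healthy
    return lr_pos, lr_neg, var_log_pos, var_log_neg


def pooled_fixed(ratios, log_variances):
    """Inverse-variance fixed-effects summary: (estimate, lower, upper)."""
    weights = [1 / v for v in log_variances]
    mean_log = sum(w * math.log(r) for w, r in zip(weights, ratios)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return tuple(math.exp(mean_log + z * se) for z in (0.0, -1.96, 1.96))


if __name__ == "__main__":
    results = [likelihood_ratios(*row[1:]) for row in STUDIES]
    for (name, *_), (lr_pos, lr_neg, _, _) in zip(STUDIES, results):
        print(f"{name:<12} LR+ {lr_pos:8.2f}   LR- {lr_neg:5.2f}")
    pos = pooled_fixed([r[0] for r in results], [r[2] for r in results])
    neg = pooled_fixed([r[1] for r in results], [r[3] for r in results])
    print("summary LR+ %.2f (95%% CI %.2f, %.2f)" % pos)
    print("summary LR- %.2f (95%% CI %.2f, %.2f)" % neg)
```

Weighting by the reciprocal of the variance gives larger, more precise studies more influence on the summary. A random-effects variant would add an estimate of between-study variance to each study's variance before the weights are formed.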
RESULTS
Fictitious studies
Figure 1 shows the distribution of the fictitious data on a

Figure 1 Likelihood ratio (LR) scatterplot of the four fictitious meta-analyses of 10 studies. Unfilled symbols represent individual studies; filled symbols are weighted summary likelihood ratios (fixed-effects model) with related 95% confidence intervals. Circles = studies with high SN and high SP; diamonds = studies with low SN and low SP; triangles = studies with low SN and high SP; squares = studies with high SN and low SP. In analogy to the “classic” forest plot, test characteristics can be assessed by eyeballing. The threshold values of 10.0 and 0.1 are a useful procedure by which to divide the matrix into four quadrants. [Plot not reproduced; horizontal axis: LR neg (log scale).]

DISCUSSION
The usefulness of a diagnostic test can be examined in three ways:[15]
(1) its efficiency in discriminating between a diseased and a healthy population (which is expressed in terms of sensitivity and specificity);
(2) its consequences for clinical decision-making;
(3) the outcome related to the medical actions taken on the basis of test findings.

The proven discriminatory power of a diagnostic test is a prerequisite to investigate patient outcome, and a systematic work-up of all the scientific information is needed to assure health-care professionals of the properties of tests used in daily practice. Interestingly, there are far more systematic reviews of therapeutic interventions than of diagnostic tests.[16] This discrepancy might relate to difficulties in interpreting diagnostic meta-analyses.
One might argue against combining data from studies of diagnostic tests. Obviously, the findings from such studies will be influenced by the characteristics and the size of the patient sample, the prevalence of the condition under investigation and the test threshold used; this provides an argument for a description only of the range of data.

The same limitations also apply to meta-analyses of therapeutic studies. By its nature, meta-analysis provides the most probable approximation of a common effect size based on the available data. However, with proper weighting methods and sensitivity analyses, there is no reason why the central estimate gained from individual diagnostic studies should provide different or even less information than the summary measure calculated from a set of therapeutic trials.

The problem of combining different diagnostic studies in a meta-analysis was first addressed by summary receiver operating characteristics (SROC) curves, proposed by Moses, Shapiro and Littenberg.[17,18] Regressing the true-positive rates (sensitivity) on their false-positive rates (1 – specificity) visualises the trade-off between sensitivity and specificity, that is, the price in terms of false-positive findings that must be paid for a reasonable true-positive rate. The SROC approach is the recommended reference standard for diagnostic meta-analysis.[19] As a univariate summary measure, the Q* value was proposed as the point of intersection where sensitivity equals specificity. Q* shows the desirable characteristics of a univariate measure of the overall discriminatory features of a diagnostic test. However, there are some situations in which Q* leads to some misinterpretation.

Using the test set of fictitious studies, it can be shown that Q* hardly distinguishes highly sensitive but unspecific tests (Q* = 0.70) from worthless procedures (Q* = 0.58). Moreover, tests with poor sensitivity might yield virtually similar Q* values, regardless of their specificity (0.52 and 0.58, respectively) (see Figure 3).

If one is concerned about true-positive rates, the SROC sufficiently depicts the range of effect sizes. The associated one-point estimate, Q*, provides some global evidence of test validity but it fails to discriminate tests with singular strengths (or weaknesses) on one aspect of test performance. If one generally doubts the clinical value of a diagnostic test when SN and SP decrease, Q* perfectly meets the minimal requirements of a univariate measure of effectiveness. If not, relying on a single index of test efficacy risks missing procedures that perform well in one direction but poorly in the other.

Likelihood ratios are helpful in showing both the bipolar and the unipolar weaknesses of a test (as well as its unipolar strengths). In analogy to forest plots, the likelihood ratio scatterplot matrix illustrates the distribution and the centre of effect sizes from a pool of individual studies and allows identification of outliers, as well as of studies relevant for sensitivity analyses.

By using established fixed-effects and random-effects techniques, likelihood ratios from individual trials can be condensed to a point estimate from diagnostic meta-analysis. Summary likelihood ratios are convenient numerical descriptors that cover both qualitative and quantitative characteristics of test performance.

Figure 3 Summary receiver operating characteristics (ROC) as the standard approach to a meta-analysis of diagnostic tests, performed on data of the four fictitious meta-analyses displayed in Figure 1. Q* values (where sensitivity equals specificity) have been proposed as the one-point estimate of test efficacy. Q* hardly differs between tests with low SN and high SP and tests with both low SN and SP. Calculations were made by generalised estimating equations (GEE) models. [Plot not reproduced. Horizontal axis: False Positive Rate (1 – Specificity); vertical axis: True Positive Rate (Sensitivity). Legend: High SN/High SP, Q* = 0.96 (95% CI 0.73, 1.18); Low SN/High SP, Q* = 0.52 (95% CI 0.36, 0.68); High SN/Low SP, Q* = 0.70 (95% CI 0.37, 1.04); Low SN/Low SP, Q* = 0.58 (95% CI 0.49, 0.67).]

ACKNOWLEDGEMENTS
The authors are indebted to Dr Barbara Herzberger for editorial assistance. We also wish to thank the referee team of the Journal of Medical Screening for helpful remarks.

Authors’ affiliations
Dirk Stengel, Senior Surgeon, Department of Trauma Surgery, Unfallkrankenhaus Berlin, and Lecturer in Evidence-Based Medicine, Clinical Epidemiology and Biostatistics, Ernst-Moritz-Arndt-University of Greifswald, Germany
Kai Bauwens, Senior Surgeon, Department of Trauma Surgery, Unfallkrankenhaus Berlin, and Lecturer in Evidence-Based Medicine, Clinical Epidemiology and Biostatistics, Ernst-Moritz-Arndt-University of Greifswald, Germany
Jalid Sehouli, Senior Oncological Surgeon, Department of Gynaecology and Gynaecological Oncology, and Lecturer in Evidence-Based Medicine, Charité Virchow University Hospital, Berlin, Germany
Axel Ekkernkamp, Professor of Surgery and Head, Department of Trauma Surgery, Ernst-Moritz-Arndt-University of Greifswald, and Department of Trauma Surgery, Unfallkrankenhaus Berlin Trauma Centre, Germany
Franz Porzsolt, Professor of Haematology, Oncology and Evidence-Based Health Care, Human Studies Centre, Ludwig-Maximilians

REFERENCES
1 Berger JO. Statistical decision theory and Bayesian analysis. New York: Springer, 1985.
2 Fagan TJ. Nomogram for Bayes’ theorem. N Engl J Med 1975;293:257.
3 Centre for Evidence-Based Medicine. Toolbox: Likelihood ratios. Available at: http://minerva.minervation.com/cebm (last accessed November 2002).
4 Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994;271:389–91.
5 Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703–7.
6 Goodman SN. Toward evidence-based medical statistics. 2. The Bayes factor. Ann Intern Med 1999;130:1005–13.
7 Markert RJ, Walley ME, Guttman TG, Mehta R. A pooled analysis of the Ottawa ankle rules used on adults in the ED. Am J Emerg Med 1998;16:564–7.
8 Lysakowski C, Walder B, Costanza MC, Tramer MR. Transcranial Doppler versus angiography in patients with vasospasm due to a ruptured cerebral aneurysm: a systematic review. Stroke 2001;32:2292–8.
9 Pettiti DB. Meta-analysis, decision analysis, and cost-effectiveness analysis. Methods for quantitative synthesis in medicine. Oxford: Oxford University Press, 2000.
10 Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for meta-analysis in medical research. Chichester: Wiley, 2000.
11 Deeks J, on behalf of the Statistical Methods Working Group of the Cochrane Collaboration. Statistical Methods Programmed in Meta View, Version 4. 1999. Available at http://www.cochrane.org (last accessed January 2002).
12 Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959;22:719–48.
13 DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7:177–88.
14 Stengel D, Bauwens K, Sehouli J, et al. Systematic review and meta-analysis of emergency ultrasonography for blunt abdominal trauma. Br J Surg 2001;88:901–12.
15 Fineberg HV, Bauman R, Sosman M. Computerized cranial tomography. Effect on diagnostic and therapeutic plans. JAMA 1977;238:224–7.
16 Knottnerus JA, van Weel C, Muris JWM. Evidence base of clinical diagnosis: evaluation of diagnostic procedures. BMJ 2002;324:477–80.
17 Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making 1993;13:313–21.
18 Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.
19 Cochrane Methods Group on Systematic Review of Screening and Diagnostic Tests. Recommended Methods. 1996. Available at http://www.cochrane.org/cochrane/sadtdoc1.htm (last accessed April 2002).