
ORIGINAL PAPER

A likelihood ratio approach to meta-analysis of diagnostic studies
D Stengel, K Bauwens, J Sehouli, A Ekkernkamp, F Porzsolt

J Med Screen 2003;10:47–51

Objective: To develop a clinically and methodologically sound approach to diagnostic meta-analysis.

Methods: A two-step model was used involving four fictitious sets of 10 studies each with varying sensitivity and specificity; this was followed by the application of the method to data from a published systematic review of emergency ultrasound. Multidimensional test characteristics (relating to the detection or exclusion of the condition of interest) were described by likelihood ratio scatterplots and pooled likelihood ratios. Likelihood ratios summarise the ability of a test to revise the prior probability of disease. They can be summarised by established fixed-effects and random-effects methods.

Results: Likelihood ratios precisely describe both directions of test performance. By plotting positive against negative likelihood ratios, together with their 95% confidence intervals, a multidimensional forest plot is obtained that can be interpreted in analogy to therapeutic meta-analyses. There are accepted threshold values of positive and negative likelihood ratios (i.e. 10.0 and 0.1) to recommend a test for clinical use. In the matrix space, distinct test characteristics can even be assessed by eyeballing. With regard to data from the real meta-analysis, the suggested high discriminatory power of ultrasound was only partially qualified by likelihood ratios. The positive value confirms the reliability of a positive scan, whereas the negative value questions a normal sonogram.

Conclusions: A full characterisation of test performance requires multidimensional effect measures. Likelihood ratios are recommended descriptors of the two dimensions of diagnostic research evidence and provide a convenient means to visualise and to communicate results as weighted summary estimates of a diagnostic meta-analysis.

See end of article for authors’ affiliations

Correspondence to: Dr D Stengel, Department of Trauma Surgery, Unfallkrankenhaus Berlin Trauma Centre, Warener Str 7, 12683 Berlin, Germany; dirk.stengel@ukb.de

Accepted for publication 11 December 2002

Systematic reviews and meta-analyses have emerged as an important branch of biomedical research. They are considered to provide the best scientific evidence to support or to reject the use of certain interventions and to acknowledge prognostic, aetiological and risk factors of a particular condition or disease. Whereas the terms “systematic review” and “meta-analysis” are often used synonymously, the latter should mainly be regarded as a statistical technique to quantify and to summarise the information provided by a thorough systematic review of published and unpublished studies.

The major goals of meta-analysis are to display the range of effect sizes of individual investigations and to compile these data into some measure of overall efficacy that can be uniformly interpreted and communicated (which means that audiences with varying background knowledge are able to gain similar information on the topic of interest). These basic requirements have been met by the established methods of meta-analysis of therapeutic studies.

Results are usually presented graphically by means of a forest plot, which gives an excellent impression of the dispersion of effect sizes. The common point estimates, such as odds ratios (OR), relative risks (RR) and risk differences (RD), in combination with their related 95% or 99% confidence intervals, allow clear inferences to be drawn from the available data: the experimental arm fares better, fares equivocally (there is no evidence of a difference) or fares worse than the control arm.

The problem that occurs in a meta-analysis of diagnostic studies is the multidirectional performance of the diagnostic instrument regarding its ability to detect (specificity) or to exclude (sensitivity) the characteristic of interest. Multidimensional outcomes cannot be summarised well by a single estimate. The results gained from a diagnostic test will increase or decrease the probability of there being disease in the presence of certain signs, symptoms and other test results; this is often referred to as Bayes’ theorem [1].

In diagnostics, the ability of a test to revise the prior probability of disease is summarised by the likelihood ratio. The likelihood ratio is the likelihood that a given test result will be observed in a patient with the target disorder compared with the likelihood that the same result will be observed in a patient without the target disorder [2]. The positive likelihood ratio is the ratio between the chance of a positive test result in the presence of the characteristic under investigation and the chance of a positive result in the absence of this attribute. For example, a positive likelihood ratio of 4.0 means that a positive test result is four times more likely in a diseased subject than in a healthy person. Likewise, the negative likelihood ratio represents the ratio between the chance of a negative test result in the presence of the characteristic under investigation and the chance of a negative result in the absence of this attribute, so that values well below 1 indicate that a negative result is far more likely in a healthy person. Given the positive and the negative likelihood ratio of a test procedure and a certain prevalence, or prior disease probability, the related probability of disease can easily be obtained from nomograms [2,3].

By convention, marked changes in prior disease probability can be assumed in positive likelihood ratios exceeding 10.0 and negative likelihood ratios below 0.1 [4,5]. The scientific background for these somewhat arbitrary thresholds can be found in Bayesian decision theory: “classical” significance levels often correspond to still substantial posterior probabilities of the null hypothesis of “no change in prior belief”. For instance, a p value of 0.05 (which would normally support a rejection of the assumption of a null effect) can still correspond to a 52% posterior probability of the null hypothesis [6].

www.jmedscreen.com Journal of Medical Screening 2003 Volume 10 Number 1



For the purpose of this study we stress only that an investigation of a certain medical intervention must provide a large amount of additional information to change prior knowledge. Since no threshold values of sensitivity or specificity are available that would allow either the adoption or the rejection of the routine application of a diagnostic procedure, likelihood ratios appear as preferable indices of test performance, at least in the setting of clinical decision-making.

In this investigation we were interested in how likelihood ratios can be used to conduct a diagnostic meta-analysis. Our objectives were to develop a clearly arranged graphical presentation of the results from individual diagnostic studies, and to obtain a summary measure of diagnostic test efficacy by reasonable computational efforts.

METHODS
Basic considerations
The results of a diagnostic test with a dichotomous outcome can be easily summarised in a 2 × 2 table. Sensitivity (SN, the number of true positives divided by the number of all subjects with the condition) and specificity (SP, the number of true negatives divided by the number of all subjects without the condition) are well known characteristics of a diagnostic intervention. Knowledge of both indices is required to appraise test precision fully. However, one might think of clinical situations in which only one of these characteristics is of real interest. For example, the Ottawa ankle rules provide 99% SN to exclude ankle fractures, whereas a 99% SP of transcranial Doppler sonography allows detection of vasospasm in the middle cerebral artery [7,8].

Many attempts have been made to calculate a single summary statistic from a set of 2 × 2 tables; this statistic is often referred to as the “diagnostic odds ratio”. In fact, the application of a diagnostic test always yields two odds ratios: the prior odds will be increased with positive test findings (leading to an odds ratio ranging from 1 to infinity) and will be decreased if the test turns out negative (with an odds ratio between 0 and 1).

Multidimensional forest plots
The efficacy of a diagnostic test is precisely characterised by its positive and its negative likelihood ratio in both qualitative (the vector of test performance) and quantitative (the size of this vector) terms.

A test set of four different meta-analytic scenarios was constructed, comprising 10 fictitious studies of identical size (n = 2485). The studies included in these meta-analyses showed the possible combinations of test characteristics: a test of high SN and SP; a test of low SN and high SP; a test of high SN and low SP; and a test of low SN and SP.

The individual positive likelihood ratio (LR+) can be expressed as:

LR+ = SN/(1 − SP)

The negative likelihood ratio (LR−) is calculated as:

LR− = (1 − SN)/SP

Positive and negative likelihood ratios were then arranged in a “multidimensional forest plot” together with their 95% confidence intervals. We hypothesised that this method of graphical presentation could be easily interpreted, especially by readers already used to the “classic” forest plots of therapeutic meta-analyses. The discriminating features of a test will be reflected by the localisation of the data cloud within the quadrants of the matrix space, and this in turn will allow a rapid assessment of the potential usefulness of a diagnostic procedure, even by eyeballing.

The basic steps to obtain a summary measure in a meta-analysis are to weight individual effect measures according to their variance and then to divide the sum of the weighted quantities by the sum of weights.

We adopted general inverse-variance weighted fixed-effects models to summarise likelihood ratios [9–13]. In case of statistical heterogeneity, the DerSimonian–Laird random-effects method was applied to account for both within-study and between-study variance.

In analogy to therapeutic meta-analyses, summary estimates within the matrix space were given prominence over individual results by the use of filled symbols.

Worked example
Focused abdominal sonography for trauma (FAST) is an emergency ultrasound protocol to screen for free intra-abdominal fluid after blunt injuries. In general, the examination is performed in three or four standard planes to detect fluid collections in the perihepatic space (Morrison’s pouch), surrounding the spleen, and in Douglas’ pouch. Approximately 20% of all intra-abdominal traumatic lacerations are not accompanied by significant haemoperitoneum. In comparisons with other reference standards, such as diagnostic peritoneal lavage (DPL) or helical computed tomography (HCT), the ability of FAST examination to detect or to exclude the underlying visceral lesion remains unclear.

In the meta-analysis, 11 trials were eligible for analysis, with a total of 2819 subjects; they employed ultrasonography to detect organ injury and made comparisons with either DPL or HCT [14]. It was obvious that ultrasound provided extremely high specificity (range 0.84–1.00) but sensitivity was below 90% in nine of the 11 trials. Table 1 summarises all the relevant descriptive statistics of the studies included in the final model.

Table 1 Major characteristics and descriptive statistics of studies included in the meta-analysis of emergency ultrasound (FAST) [14]

Reference       n     TP   FN   TN    FP   Prior probability   Sensitivity (95% CI)   Specificity (95% CI)
Akgür F         208   42   8    157   1    0.24                0.84 (0.79, 0.89)      0.99 (0.98, 1.00)
Froelich JW     26    13   2    10    1    0.58                0.87 (0.74, 1.00)      0.91 (0.80, 1.00)
Förster R       140   28   1    107   4    0.21                0.97 (0.94, 1.00)      0.96 (0.93, 0.99)
Goletti O       73    26   4    42    1    0.41                0.87 (0.79, 0.94)      0.98 (0.94, 1.00)
Healey MA       796   45   6    728   17   0.06                0.88 (0.86, 0.90)      0.98 (0.97, 0.99)
Katz S          121   10   1    92    18   0.09                0.91 (0.86, 0.96)      0.84 (0.77, 0.90)
Krupnick AS     64    20   12   32    0*   0.52                0.62 (0.50, 0.74)      0.98 (0.95, 1.00)
McGahan JP      121   24   14   79    4    0.31                0.63 (0.55, 0.72)      0.95 (0.91, 0.99)
McKenney KL     884   95   15   768   6    0.12                0.86 (0.84, 0.89)      0.99 (0.99, 1.00)
Röthlin MA      313   24   31   257   1    0.18                0.44 (0.38, 0.49)      1.00 (0.99, 1.00)
Singh G         73    26   9    33    5    0.48                0.74 (0.64, 0.84)      0.87 (0.79, 0.95)

* In case of zero values, all cells were corrected for continuity by adding 0.5.
The number of subjects enrolled into the trials is denoted by n. TP, FN, TN and FP are the number of true-positive, false-negative, true-negative, and false-positive findings, respectively. Values in parentheses are 95% confidence intervals.
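The likelihood ratio formulas from the Methods can be applied directly to the cell counts in Table 1. The sketch below computes LR+ and LR− for a single 2 × 2 table and adds 95% confidence intervals on the log scale. The standard-error formula is the usual large-sample approximation for the logarithm of a ratio of two proportions; the article does not state which interval method was used, so that choice, the function name `likelihood_ratios` and the returned dictionary keys are assumptions.

```python
import math


def likelihood_ratios(tp, fn, tn, fp, z=1.96):
    """Return LR+ and LR- with approximate 95% CIs for one 2 x 2 table."""
    # Continuity correction as described in the Table 1 footnote:
    # if any cell is zero, add 0.5 to all four cells.
    if 0 in (tp, fn, tn, fp):
        tp, fn, tn, fp = (x + 0.5 for x in (tp, fn, tn, fp))

    diseased = tp + fn      # subjects with the condition
    healthy = tn + fp       # subjects without the condition

    # LR+ = SN / (1 - SP) and LR- = (1 - SN) / SP, written with counts.
    lr_pos = (tp / diseased) / (fp / healthy)
    lr_neg = (fn / diseased) / (tn / healthy)

    # Assumed interval method: standard error of ln(LR) for a ratio
    # of two independent proportions.
    se_pos = math.sqrt(1 / tp - 1 / diseased + 1 / fp - 1 / healthy)
    se_neg = math.sqrt(1 / fn - 1 / diseased + 1 / tn - 1 / healthy)

    def interval(lr, se):
        return (lr * math.exp(-z * se), lr * math.exp(z * se))

    return {
        "lr_pos": lr_pos, "se_log_pos": se_pos, "ci_pos": interval(lr_pos, se_pos),
        "lr_neg": lr_neg, "se_log_neg": se_neg, "ci_neg": interval(lr_neg, se_neg),
    }


if __name__ == "__main__":
    # Cell counts of the first row of Table 1 (Akgür F).
    result = likelihood_ratios(tp=42, fn=8, tn=157, fp=1)
    print(f"LR+ = {result['lr_pos']:.2f}, 95% CI = "
          f"({result['ci_pos'][0]:.2f}, {result['ci_pos'][1]:.2f})")
    print(f"LR- = {result['lr_neg']:.2f}, 95% CI = "
          f"({result['ci_neg'][0]:.2f}, {result['ci_neg'][1]:.2f})")
```

The standard errors of the log likelihood ratios are returned alongside the intervals because an inverse-variance pooling step would need them as weights.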

RESULTS
Fictitious studies
Figure 1 shows the distribution of the fictitious data on a scatterplot diagram. The matrix presentation enables a quick visual impression of the strengths and the weaknesses of a diagnostic test in either direction. Tests with excellent discriminatory properties will be plotted to the upper right corner of the matrix space (i.e. exceeding the 10.0 and 0.1 threshold levels), whereas almost useless tests can be found in the lower left quadrant (in other words, in the south-western corner of the diagram).

With regard to the fictitious sample of trials with high SN and poor SP, their expression as likelihood ratios indicates that the majority of studies support the validity of negative test results, whereas uncertainty reigns for positive findings. In this example, high SN and poor SP translate to a positive summary likelihood ratio of 1.44 (which means virtually no change in the prior disease probability in the presence of positive results) and a negative likelihood ratio of 0.07 (indicating sufficient power to exclude the presence of disease in negative findings).

Worked example
Figure 2 illustrates the dispersion of effect sizes observed in the FAST meta-analysis in the scatterplot matrix. The discriminatory power of FAST can be divided into its two components by plotting the positive against the negative likelihood ratios. By including the 95% confidence intervals, we are able to localise the weight of the data cloud, as well as its outliers. Studies with large variance and wide confidence intervals will contribute less to the overall weight than studies with more precise estimates. Obviously, FAST has complex characteristics. As already indicated by the scope of the data cloud, the relaxed combined estimate of the positive likelihood ratio (as derived from the random-effects model) of this meta-analysis is 25.03, with its associated 95% confidence interval ranging from 11.74 to 53.35. This considerably exceeds the above-mentioned threshold level, and positive ultrasound findings should substantially influence the prior probability of the presence of organ lacerations.

The corresponding negative summary likelihood ratio is estimated at 0.22 (95% confidence interval 0.15–0.33). This translates to a roughly fourfold increased chance of normal ultrasound findings in the absence of intra-abdominal lesions (compared with the chance of a normal scan in the presence of organ injuries) or a moderate to minor expectation of a negative FAST examination.

Figure 2 Likelihood ratio (LR) scatterplot matrix of the FAST meta-analysis (LR pos against LR neg, both on log scales). Unfilled circles represent individual studies. The filled circle shows the weighted summary likelihood ratios (random-effects model). Error bars represent 95% confidence intervals. FAST provides a reasonable shift in prior probability for positive findings, but has only weak power to exclude the presence of organ injury.
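The pooling step behind such summary likelihood ratios (inverse-variance weighting, with a DerSimonian–Laird between-study variance for the random-effects model) can be sketched as follows. This is a generic illustration on the log scale, not the authors' implementation: the function name `pool_ratios` is hypothetical, the input values in the usage example are invented, and the sketch is not claimed to reproduce the estimates reported above.

```python
import math


def pool_ratios(ratios, log_standard_errors, z=1.96):
    """Pool ratio measures on the log scale.

    Returns fixed-effect and DerSimonian-Laird random-effects estimates,
    each as (pooled ratio, lower 95% limit, upper 95% limit), plus tau^2.
    """
    y = [math.log(r) for r in ratios]
    w = [1.0 / se ** 2 for se in log_standard_errors]   # inverse-variance weights
    sum_w = sum(w)

    # Fixed-effect estimate: weighted sum divided by the sum of weights.
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    se_fixed = math.sqrt(1.0 / sum_w)

    # Cochran's Q and the DerSimonian-Laird estimate of between-study variance.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (se ** 2 + tau2) for se in log_standard_errors]
    sum_w_re = sum(w_re)
    random = sum(wi * yi for wi, yi in zip(w_re, y)) / sum_w_re
    se_random = math.sqrt(1.0 / sum_w_re)

    def back_transform(estimate, se):
        return (math.exp(estimate),
                math.exp(estimate - z * se),
                math.exp(estimate + z * se))

    return {"fixed": back_transform(fixed, se_fixed),
            "random": back_transform(random, se_random),
            "tau2": tau2}


if __name__ == "__main__":
    # Invented positive likelihood ratios and standard errors of ln(LR+).
    lr_pos = [12.0, 30.0, 8.5, 55.0]
    se_log = [0.40, 0.75, 0.30, 0.95]
    pooled = pool_ratios(lr_pos, se_log)
    for model in ("fixed", "random"):
        estimate, lower, upper = pooled[model]
        print(f"{model}: {estimate:.2f} ({lower:.2f}, {upper:.2f})")
```

Pooling on the log scale keeps the confidence limits positive and treats ratios above and below 1 symmetrically, which is why the back-transformed interval is asymmetric around the point estimate.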

Figure 1 Likelihood ratio (LR) scatterplot of the four fictitious meta-analyses of 10 studies (LR pos against LR neg, both on log scales). Unfilled symbols represent individual studies; filled symbols are weighted summary likelihood ratios (fixed-effects model) with related 95% confidence intervals. Circles = studies with high SN and high SP; diamonds = studies with low SN and low SP; triangles = studies with low SN and high SP; squares = studies with high SN and low SP. In analogy to the “classic” forest plot, test characteristics can be assessed by eyeballing. The threshold values of 10.0 and 0.1 provide a useful way to divide the matrix into four quadrants.

DISCUSSION
The usefulness of a diagnostic test can be examined in three ways [15]:
(1) its efficiency in discriminating between a diseased and a healthy population (which is expressed in terms of sensitivity and specificity);
(2) its consequences for clinical decision-making;
(3) the outcome related to the medical actions taken on the basis of test findings.

The proven discriminatory power of a diagnostic test is a prerequisite to investigate patient outcome, and a systematic work-up of all the scientific information is needed to assure health-care professionals of the properties of tests used in daily practice. Interestingly, there are far more systematic reviews of therapeutic interventions than of diagnostic tests [16]. This discrepancy might relate to difficulties in interpreting diagnostic meta-analyses.


One might argue against combining data from studies of diagnostic tests. Obviously, the findings from such studies will be influenced by the characteristics and the size of the patient sample, the prevalence of the condition under investigation and the test threshold used; this provides an argument for a description only of the range of data.

The same limitations also apply to meta-analyses of therapeutic studies. By its nature, meta-analysis provides the most probable approximation of a common effect size based on the available data. However, with proper weighting methods and sensitivity analyses, there is no reason why the central estimate gained from individual diagnostic studies should provide different or even less information than the summary measure calculated from a set of therapeutic trials.

The problem of combining different diagnostic studies in a meta-analysis was first addressed by summary receiver operating characteristics (SROC) curves, proposed by Moses, Shapiro and Littenberg [17,18]. Regressing the true-positive rates (sensitivity) on their false-positive rates (1 – specificity) visualises the trade-off between sensitivity and specificity, that is, the price in terms of false-positive findings that must be paid for a reasonable true-positive rate. The SROC approach is the recommended reference standard for diagnostic meta-analysis [19]. As a univariate summary measure, the Q* value was proposed as the point of intersection where sensitivity equals specificity. Q* shows the desirable characteristics of a univariate measure of the overall discriminatory features of a diagnostic test. However, there are some situations in which Q* leads to some misinterpretation.

Using the test set of fictitious studies, it can be shown that Q* hardly distinguishes highly sensitive but unspecific tests (Q* = 0.70) from worthless procedures (Q* = 0.58). Moreover, tests with poor sensitivity might yield virtually similar Q* values, regardless of their specificity (0.52 and 0.58, respectively) (see Figure 3).

If one is concerned about true-positive rates, the SROC sufficiently depicts the range of effect sizes. The associated one-point estimate, Q*, provides some global evidence of test validity but it fails to discriminate tests with singular strengths (or weaknesses) on one aspect of test performance. If one generally doubts the clinical value of a diagnostic test when SN and SP decrease, Q* perfectly meets the minimal requirements of a univariate measure of effectiveness. If not, relying on a single index of test efficacy risks missing procedures that perform well in one direction but poorly in the other.

Likelihood ratios are helpful in showing both the bipolar and the unipolar weaknesses of a test (as well as its unipolar strengths). In analogy to forest plots, the likelihood ratio scatterplot matrix illustrates the distribution and the centre of effect sizes from a pool of individual studies and allows identification of outliers, as well as of studies relevant for sensitivity analyses.

By using established fixed-effects and random-effects techniques, likelihood ratios from individual trials can be condensed to a point estimate from diagnostic meta-analysis. Summary likelihood ratios are convenient numerical descriptors that cover both qualitative and quantitative characteristics of test performance.

ACKNOWLEDGEMENTS
The authors are indebted to Dr Barbara Herzberger for editorial assistance. We also wish to thank the referee team of the Journal of Medical Screening for helpful remarks.

Authors’ affiliations
Dirk Stengel, Senior Surgeon, Department of Trauma Surgery, Unfallkrankenhaus Berlin, and Lecturer in Evidence-Based Medicine, Clinical Epidemiology and Biostatistics, Ernst-Moritz-Arndt-University of Greifswald, Germany
Kai Bauwens, Senior Surgeon, Department of Trauma Surgery, Unfallkrankenhaus Berlin, and Lecturer in Evidence-Based Medicine, Clinical Epidemiology and Biostatistics, Ernst-Moritz-Arndt-University of Greifswald, Germany
Jalid Sehouli, Senior Oncological Surgeon, Department of Gynaecology and Gynaecological Oncology, and Lecturer in Evidence-Based Medicine, Charité Virchow University Hospital, Berlin, Germany
Axel Ekkernkamp, Professor of Surgery and Head, Department of Trauma Surgery, Ernst-Moritz-Arndt-University of Greifswald, and Department of Trauma Surgery, Unfallkrankenhaus Berlin Trauma Centre, Germany
Franz Porzsolt, Professor of Haematology, Oncology and Evidence-Based Health Care, Human Studies Centre, Ludwig-Maximilians University, Munich, Germany


Figure 3 Summary receiver operating characteristics (ROC) as the standard approach to a meta-analysis of diagnostic tests, performed on data of the four fictitious meta-analyses displayed in Figure 1 (true-positive rate (sensitivity) plotted against false-positive rate (1 – specificity)). Q* values (where sensitivity equals specificity) have been proposed as the one-point estimate of test efficacy. Q* hardly differs between tests with low SN and high SP and tests with both low SN and SP. Calculations were made by generalised estimating equations (GEE) models.
High SN/High SP: Q* = 0.96 (95% CI 0.73, 1.18)
Low SN/High SP: Q* = 0.52 (95% CI 0.36, 0.68)
High SN/Low SP: Q* = 0.70 (95% CI 0.37, 1.04)
Low SN/Low SP: Q* = 0.58 (95% CI 0.49, 0.67)

REFERENCES
1 Berger JO. Statistical decision theory and Bayesian analysis. New York: Springer, 1985.
2 Fagan TJ. Nomogram for Bayes’ theorem. N Engl J Med 1975;293:257.
3 Centre for Evidence-Based Medicine. Toolbox: Likelihood ratios. Available at: http://minerva.minervation.com/cebm (last accessed November 2002).
4 Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994;271:389–91.
5 Jaeschke R, Guyatt GH, Sackett DL, for the Evidence-Based Medicine Working Group. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703–7.
6 Goodman SN. Toward evidence-based medical statistics. 2. The Bayes factor. Ann Intern Med 1999;30:1005–13.
7 Markert RJ, Walley ME, Guttman TG, Mehta R. A pooled analysis of the Ottawa ankle rules used on adults in the ED. Am J Emerg Med 1998;16:564–7.
8 Lysakowski C, Walder B, Costanza MC, Tramer MR. Transcranial Doppler versus angiography in patients with vasospasm due to a ruptured cerebral aneurysm: a systematic review. Stroke 2001;32:2292–8.
9 Pettiti DB. Meta-analysis, decision analysis, and cost-effectiveness analysis. Methods for quantitative synthesis in medicine. Oxford: Oxford University Press, 2000.


10 Sutton AJ, Abrams KR, Jones DR, Sheldon TA, Song F. Methods for meta-analysis in medical research. Chichester: Wiley, 2000.
11 Deeks J, on behalf of the Statistical Methods Working Group of the Cochrane Collaboration. Statistical Methods Programmed in Meta View, Version 4. 1999. Available at http://www.cochrane.org (last accessed January 2002).
12 Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 1959;22:719–48.
13 DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986;7:177–88.
14 Stengel D, Bauwens K, Sehouli J, et al. Systematic review and meta-analysis of emergency ultrasonography for blunt abdominal trauma. Br J Surg 2001;88:901–12.
15 Fineberg HV, Bauman R, Sosman M. Computerized cranial tomography. Effect on diagnostic and therapeutic plans. JAMA 1977;238:224–7.
16 Knottnerus JA, van Weel C, Muris JWM. Evidence base of clinical diagnosis: evaluation of diagnostic procedures. BMJ 2002;324:477–80.
17 Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making 1993;13:313–21.
18 Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data analytic approaches and some additional considerations. Stat Med 1993;12:1293–316.
19 Cochrane Methods Group on Systematic Review of Screening and Diagnostic Tests. Recommended Methods. 1996. Available at http://www.cochrane.org/cochrane/sadtdoc1.htm (last accessed April 2002).
