A systematic review is a comprehensive search, critical evaluation, and synthesis of all the relevant studies on a specific (clinical) topic
that can be applied to the evaluation of diagnostic and screening imaging studies. It can be a qualitative or a quantitative (meta-
analysis) review of available literature. A meta-analysis uses statistical methods to combine and summarize the results of several studies.
In this review, a 12-step approach to performing a systematic review (and meta-analysis) is outlined under the four domains: (1) Problem
Formulation and Data Acquisition, (2) Quality Appraisal of Eligible Studies, (3) Statistical Analysis of Quantitative Data, and (4) Clinical
Interpretation of the Evidence. This review is specifically geared toward the performance of a systematic review and meta-analysis of
diagnostic test accuracy (imaging) studies.
Key Words: Diagnostic accuracy; evidence-based medicine; evidence-based radiology; heterogeneity; literature search; meta-
analysis; meta-regression; publication bias; receiver operating characteristic analysis; ROC analysis; sensitivity analyses; systematic
review; subgroup analysis; threshold effect.
© 2018 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.
Systematic reviews and meta-analyses have become popular in medicine and are very commonly applied to treatment trials. However, they are still less common for diagnostic imaging studies. Systematic reviews and meta-analyses aim to provide summaries of the average result. In the case of imaging tests, this is diagnostic performance such as sensitivity or specificity, and the uncertainty of this average. In radiology, the smaller patient size and limited methodological quality of the primary studies can limit the quality of the review and meta-analysis. However, systematic reviews and meta-analyses may be the best assessment of the published literature available at any point in time, especially in the absence of large, definitive trials. They may provide important information to guide patient care and direct future clinical research. Performing and interpreting systematic reviews in radiology can be challenging given the paucity of available clinical studies. However, if investigators adhere to proper methodology, systematic reviews may provide useful

In this review, a 12-step framework for performing systematic reviews (and meta-analyses) is outlined under the four domains: (1) Problem Formulation and Data Acquisition, (2) Quality Appraisal of Eligible Studies, (3) Statistical Analysis of Quantitative Data, and (4) Clinical Interpretation of the Evidence (Table 1). We will subsequently use “systematic review” and “meta-analysis” to represent the whole process of evidence synthesis. The steps in “problem formulation and data acquisition” are “define the question and objective of the review,” “establish criteria for including studies in the review,” and “conduct a literature search to retrieve the relevant literature.” The steps in “quality appraisal of eligible studies” are “extract data on variables of interest,” “assess study quality and applicability to the clinical problem at hand,” and “summarize the evidence qualitatively and, if appropriate, quantitatively (meta-analysis).” The steps in “statistical analysis of quantitative data” are “estimate summary diagnostic test performance metrics and display the data,” “assess heterogeneity,” “investigate data for publication bias,” “assess the robustness of estimates of diagnostic accuracy using sensitivity analyses,” and “explore and explain heterogeneity in test accuracy using subgroup analysis (if applicable).” The steps in “clinical interpretation of the evidence” are “graphically display how the evidence alters the posttest probability using a Fagan plot (Bayes nomogram), likelihood ratio scatter graph, or probability-modifying plot.” This review is tailored for radiologists who are new to the process of performing a systematic review and meta-analysis. However, we hope that those with

Acad Radiol 2018; 25:573–593
From the Department of Radiology, University of Michigan, B1 132G Taubman Center/5302, 1500 East Medical Center Drive, Ann Arbor, MI 48109 (P.C., A.M.K., B.F., M.P., B.A.D.); Department of Neurology, University of Michigan (D.A.); Nuclear Medicine Service, VA Ann Arbor Health Care System, Ann Arbor, Michigan (B.A.D.). Received August 10, 2017; revised November 21, 2017; accepted December 6, 2017. Address correspondence to: P.C. e-mail: pcronin@med.umich.edu
https://doi.org/10.1016/j.acra.2017.12.007
CRONIN ET AL Academic Radiology, Vol 25, No 5, May 2018
TABLE 1. An Outline of the Main Steps in Doing a Meta-analysis of Diagnostic Test Accuracy
1. Problem formulation and data acquisition
   Step 1. Define the question and objective of the review
   Step 2. Establish criteria for including studies in the review
   Step 3. Conduct a literature search to retrieve the relevant literature
2. Quality appraisal of eligible studies
   Step 4. Extract data on variables of interest
   Step 5. Assess study quality and applicability to the clinical problem at hand
   Step 6. Summarizing the evidence qualitatively and if appropriate, quantitatively (meta-analysis)
3. Statistical analysis of quantitative data
   Step 7. Estimate diagnostic accuracy and display the data
   Step 8. Assess heterogeneity
   Step 9. Assess for publication bias
   Step 10. Assess the robustness of estimates of diagnostic accuracy using sensitivity analyses (if applicable)
   Step 11. Explore and explain heterogeneity in test accuracy using subgroup analysis (if applicable)
4. Clinical interpretation of the evidence
   Step 12. Graphically display how the evidence alters the posttest probability

graphic angiography (CTA) compare with magnetic resonance angiography (MRA) for the detection and quantification of carotid stenosis?” or “In patients with known or suspected coronary artery disease, how does CT coronary angiography compare with invasive catheter coronary angiography for identifying one (or more) potentially or probably hemodynamically significant (≥50% coronary artery luminal diameter) stenosis in terms of sensitivity, specificity and diagnostic accuracy?” or “In patients with a solitary pulmonary nodule, how well does dynamic contrast material–enhanced CT, dynamic contrast material–enhanced MR imaging, FDG PET, and 99mTc-depreotide SPECT compare for the diagnosis of malignancy (diagnostic accuracy)?” or “In patients with known or suspected rotator cuff tears, how does ultrasound compare to MRI for diagnosis?” or “Is low-dose CT colonography equivalent to optical colonoscopy in identifying clinically meaningful colonic polyps?” It should be remembered that evidence synthesis can be derailed by not asking a focused question. It is also important to have a focused research question as this is used to direct the search.

Step 2. Establish Criteria for Including Studies in the Review
studies published. Bias can occur in the way that the systematic review or meta-analysis results are summarized and presented, or in the level of importance attributed to the study results. To avoid selection bias, an objective, systematic, and comprehensive search strategy using several electronic databases should be used. A search protocol that clearly shows the methods used to search the literature should be used. Searching for and identifying imaging studies are more difficult than searching for and identifying therapeutic studies. Imaging accuracy studies are not limited to a single study design (5). The most useful search terms for identifying diagnostic imaging studies are the diagnostic tests of interest and the clinical disorder. An iterative process is often required to maximize the search yield. It may be worthwhile to discuss your research question with a librarian or information specialist and to have them perform or assist with the literature search. Existing systematic reviews and health technology assessments can be used to expand or refine the search.

The literature search should encompass published and unpublished materials. A search of a single electronic database is not considered adequate for a systematic review as it may miss studies and lead to bias. Table A2 outlines examples of search sources. The search should involve several electronic databases such as MEDLINE, EMBASE, and Cochrane Central Register of Controlled Trials. PubMed is a free search engine. It primarily accesses the MEDLINE database of references and abstracts. The database is maintained as part of the Entrez system of information retrieval by the United States National Library of Medicine at the National Institutes of Health. Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. EMBASE is a biomedical database of published literature produced by the medical or scientific publisher, Elsevier. It contains over 28 million records from over 8400 published journals, from 90 countries, and with daily updates. The Cochrane Central Register of Controlled Trials (CENTRAL) is an excellent source of reports of randomized and quasi-randomized controlled trials. The majority of CENTRAL records are taken from MEDLINE and EMBASE, but records are also derived from other published and unpublished sources. CENTRAL records often include an abstract or summary of the article but do not contain the full text of the article. The search should also cover relevant journals, conference proceedings, the reference lists of papers found in the searches, books of recently published abstracts presented at scientific meetings, and sources that summarize doctoral theses. It may involve personal contact with experts in the area. Two important reasons to do this are to identify published studies that might have been missed because they are in press or not yet indexed and to identify unpublished or gray literature. Controversy remains about including gray literature. Not including unpublished studies in the review increases the chances of publication bias, that is, the overrepresentation of studies with positive results. This can result in a systematic overestimation of test performance, a possible threat to the validity of the meta-analysis (6). Quality assessment of included studies may take precedence over attempts to locate unpublished research (7).

The articles are first screened (screening phase) at the abstract level for specified exclusions, and the potentially eligible articles based on the abstracts alone are retrieved and then reviewed in their entirety (eligibility phase), with further subsequent exclusions applied at the full-text level. The remaining articles are included in the review (included phase) (Fig 1). What distinguishes a systematic review from a narrative review is documenting the literature search with sufficient detail so that it could be reproduced. A systematic search should minimize bias, producing more reliable estimates of diagnostic accuracy. Therefore, describe the search and provide an illustrative flow chart such as Figure 1. The description and chart should include the number of articles initially identified through database and other searching, and the number of articles after removal of duplicates. Then, describe and chart the number of articles screened and excluded, and the number of full-text articles assessed for eligibility and number of excluded studies (with reasons for exclusion).

QUALITY APPRAISAL OF ELIGIBLE STUDIES

Step 4. Extract Data on Variables of Interest

Extracted data usually include details of participants or patients, index test, comparator test, target disorder, study design, results, publications, and investigators. Review authors should plan a priori what data will be required for their systematic review, and develop a strategy for obtaining data. Information about patients should include the spectrum of patients who received the test, demographic data such as age and gender, co-morbid conditions, and information about the index and comparator test such as scan parameters and generation of technology. The reference standard should be capable of classifying the target condition correctly. Information about a disorder may be about a lesion (size, imaging characteristics, classification [benign vs malignant]) or about a disease. Basic study characteristics would include design (randomized control trial, prospective or retrospective cohort) and the duration of the study. Geographical regions may have important differences that could affect test accuracy or delivery. Technology differences or trends over time may also be important. Generic information that may be extracted from each study includes study citation, study’s first author name, year and journal of publication, and country or region where the study was performed. Other data, such as whether ethical approval was obtained, whether a sample size calculation was performed, funding sources, or potential conflicts of interest of the study authors, can indicate the quality of the study conducted.

Relevant test outcomes may be dichotomous (commonly), ordinal or categorical, or continuous. For dichotomous outcomes, disease positive or disease negative and test positive or test negative, there are four groups within the 2 × 2 table. These are true positive, false positive, false negative, and
true negative. Extracting these data allows calculation of study-specific sensitivity, specificity, positive- and negative-predictive values and accuracy, and likelihood ratios and diagnostic odds ratio (DOR) for the meta-analysis. Having sensitivities and specificities as proportions without numerators and denominators is not sufficient. This lack of raw data results in exclusion of otherwise eligible articles. However, if study-specific sample size is known, it may be possible to derive the raw 2 × 2 data using the proportions provided in the article. Further challenges to data extraction particularly relevant to radiology studies involve studies reporting results from multiple readers, multiple sessions (eg, using different combinations of pulse sequences), or multiple interpretative approaches. The case of multiple readers is a particularly difficult issue commonly encountered for imaging systematic reviews and meta-analyses. Different strategies may be used to handle data from more than one reader, such as (1) using an average of the diagnostic accuracy results across readers within a study, (2) selecting data from the reader with the highest accuracy within a study, (3) selecting data from the reader with the most years of clinical experience, (4) treating each reader’s data within a study as an individual study, and (5) randomly selecting one reader within a study (8). Averaging the results of multiple readers minimizes heterogeneity from interobserver variability (8). Using data from the best or most experienced readers may overestimate diagnostic accuracy and not reflect daily practice (8). Using each reader’s data and treating each as an individual study will overrepresent the results of that single study in the sample, and the biases inherent to that study will be magnified in the pooled results (8). In addition, there are the statistical challenges due to the paired nature of the data. McGrath et al. evaluated the handling of multiple readers in systematic reviews and meta-analyses of imaging diagnostic accuracy (8). The authors found that most meta-analysts do not report how they handled the issue. In 25% of meta-analyses, investigators averaged the results from multiple readers within the study, whereas in 50%, the results from each individual reader within a study were treated as a separate dataset (8). Optimal methods for handling multiple reader data are not available, but multilevel hierarchical models accounting for between-observer variability within studies, and between-study variability, provided multiple reader data are reported consistently at the primary study level, are needed (8). Using such models, all readers would be included in the meta-analysis, interobserver variability at the primary study level would not be lost, and a single study would not be overrepresented (8).

There are no easily implemented statistically rigorous methods for dealing with continuous data or ordinal data. It is recommended that ordinal or continuous test results be
dichotomized by selecting a threshold using cut points based on criteria such as Youden’s index or Euclidean distance. Therefore, it is best to extract raw data and investigate the impact of choice of cut point in a sensitivity analysis.

Extracting data from the primary studies is an important but time-consuming part of a systematic review. A data collection form should be used and designed with data extraction in mind. Each included study should be read in its entirety. To minimize errors and reduce potential biases, it is strongly recommended that data are extracted from every study by more than one person. This is particularly important where there is subjective interpretation or information extraction is critical to the interpretation of results. Research has shown that independent data extraction by two or more authors resulted in fewer errors than data extraction by a single author (9). One study observed a high prevalence of data extraction errors, whereas another study found that at least 7 of 27 reviews had substantial errors (10,11). Those involved in the data extraction should practice using the data extraction form. Pilot testing the form using a representative sample is necessary so as to identify important but missing extraction items. This should also minimize revising the form after data extraction has begun. There is potential for disagreement when more than one person is extracting data. There should be a procedure for resolving disagreements such as arbitration by a third party.

guidance is developed. Third, the published flow diagram for the primary study is reviewed. If none is reported, a flow diagram is constructed. Fourth, a judgment of bias and applicability is made (13). The investigator uses signaling questions for scoring each article as high, low, or unclear risk of bias and applicability (13). It is essential to tailor QUADAS-2 to each review by adding or omitting signaling questions. The QUADAS-2 tool is applied in four phases: summarize the review question, tailor the tool and produce review-specific guidance, construct a flow diagram for the primary study, and judge bias and applicability (13). Figure 2 shows an example of QUADAS-2 as a graphic illustration of the percentage of studies meeting each criterion. The QUADAS-2 tool is available at http://www.bristol.ac.uk/media-library/sites/quadas/migrated/documents/quadas2.pdf. At the website http://www.bristol.ac.uk/population-health-sciences/projects/quadas/quadas-2/, there is a table that summarizes QUADAS-2 and lists all signaling, risk of bias, and applicability rating questions. In the top row of the table are the domains of patient selection, index test, reference standard, and flow and timing. In the first column are description, signaling questions (yes/no/unclear), risk of bias (high/low/unclear), and concerns regarding applicability (high/low/unclear).

Step 6. Summarize the Evidence Qualitatively and, if Appropriate, Quantitatively
Figure 2. (a) Methodological quality, study validity, and risk of bias summary for each study showing authors’ judgments about each domain
for each included study using the QUADAS tool. (b) Study quality scores. Graph illustrates study quality based on QUADAS criteria, ex-
pressed as a percentage of studies meeting each criterion. (c) Risk of bias and applicability summary for each study showing authors’ judgments
about each domain for each included study using the QUADAS 2 tool. (d) Study quality scores. Graph illustrates study quality based on
QUADAS 2 criteria, expressed as a percentage of studies meeting each criterion.
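
The percentage summary described in panel (d) can be produced from per-study judgments with a few lines of code. The following is a minimal sketch, not the authors' method: the dictionary layout, the function name `percent_by_rating`, and the four example studies are hypothetical, and only risk-of-bias ratings are tallied (applicability concerns could be handled the same way). The domain and rating names follow the QUADAS-2 summary table described in the text.

```python
from collections import Counter

# Domain and rating names follow the QUADAS-2 summary table described in the
# text; the dictionary layout itself is a hypothetical data format.
DOMAINS = ("patient_selection", "index_test", "reference_standard", "flow_and_timing")
RATINGS = ("low", "high", "unclear")


def percent_by_rating(judgments):
    """Return {domain: {rating: percentage of studies}} for risk-of-bias ratings."""
    n_studies = len(judgments)
    summary = {}
    for domain in DOMAINS:
        counts = Counter(study[domain] for study in judgments.values())
        summary[domain] = {
            rating: 100.0 * counts[rating] / n_studies for rating in RATINGS
        }
    return summary


# Made-up judgments for four hypothetical studies (not data from the article).
example_judgments = {
    "Study A": {"patient_selection": "low", "index_test": "low",
                "reference_standard": "low", "flow_and_timing": "unclear"},
    "Study B": {"patient_selection": "high", "index_test": "low",
                "reference_standard": "unclear", "flow_and_timing": "low"},
    "Study C": {"patient_selection": "low", "index_test": "unclear",
                "reference_standard": "low", "flow_and_timing": "low"},
    "Study D": {"patient_selection": "low", "index_test": "low",
                "reference_standard": "low", "flow_and_timing": "high"},
}

if __name__ == "__main__":
    for domain, shares in percent_by_rating(example_judgments).items():
        print(domain, shares)
```

Each domain's three percentages sum to 100, so the result maps directly onto a stacked bar chart of the kind shown in Figure 2.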
for both sensitivity and specificity, the relationship between them, and the heterogeneity in imaging test accuracy requires fitting hierarchical random-effects models such as summary receiver operating characteristic (SROC) curves and may require biostatistical expertise to do this.

Analyzing the Data
The summary statistics for imaging test accuracy commonly used are sensitivity and specificity, positive- and negative-predictive values, and accuracy as determined by the area under the receiver operating characteristic (ROC) curve (Table A3). Other summary statistics include likelihood ratios (positive, negative, or multiple level) and DOR. Meta-analysis can also address how imaging test accuracy varies with clinical and methodological characteristics.

Model Fitting and Statistical Methods for Pooling Data
Moses-Littenberg SROC curves. The Moses-Littenberg method provides a simple model for deriving an SROC (14,15). It was one of the earliest models to be proposed and has been used extensively in meta-analyses of diagnostic test accuracy. Its inability to provide estimates of the heterogeneity between studies is a limitation. Therefore, more complex hierarchical models that properly allow random effects in the diagnostic test accuracy have superseded it.

ty, investigators continue to use univariate methods. The use of univariate methods can potentially lead to overoptimistic conclusions of meta-analyses. Recently, McGrath et al. assessed whether authors of systematic reviews of diagnostic accuracy studies published in imaging journals used the recommended methods for meta-analysis. The authors also evaluated the effect of traditional methods on summary estimates of sensitivity and specificity (23). The authors found that bivariate methods are used a minority of the time, that this issue is not improving with time, and that univariate methods can lead to overestimation of diagnostic accuracy with narrower confidence intervals (CIs) than the recommended (HSROC or bivariate) methods (23). Of the 300 reviews from January 2005 to May 2015 that met the authors’ inclusion criteria, only 39% used recommended meta-analysis methods (24). No change in the method used was observed over time. However, there was geographic, subspecialty, and journal heterogeneity (25). Fifty-one meta-analyses using univariate random-effects methods were reanalyzed with the bivariate model. The average change in the summary estimate for sensitivity was 1.4% and for specificity was 2.5%. Both changes were statistically significant. The average change in width of CI was 7.7% for sensitivity and 9.9% for specificity. Similarly, both changes were statistically significant (24).
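
To make the Moses-Littenberg approach named under Model Fitting more concrete, the sketch below shows its usual textbook form, which the article itself does not spell out: for each study, D = logit(TPR) − logit(FPR) is regressed on S = logit(TPR) + logit(FPR), and the fitted line D = a + bS is converted back into ROC space. The function names, the unweighted least-squares fit, and the 0.5 continuity correction applied to every cell are assumptions made for illustration. As the text notes, this model does not estimate between-study heterogeneity, so it should be read as a teaching aid rather than a recommended analysis.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def moses_littenberg(studies):
    """Fit D = a + b*S by unweighted least squares.

    studies: list of (tp, fp, fn, tn) counts, one tuple per primary study.
    A 0.5 continuity correction is added to every cell (an assumption here)
    so that the logits stay defined when a cell is zero.
    """
    d_vals, s_vals = [], []
    for tp, fp, fn, tn in studies:
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
        tpr = tp / (tp + fn)
        fpr = fp / (fp + tn)
        d_vals.append(logit(tpr) - logit(fpr))  # log diagnostic odds ratio
        s_vals.append(logit(tpr) + logit(fpr))  # proxy for the threshold
    n = len(studies)
    s_mean = sum(s_vals) / n
    d_mean = sum(d_vals) / n
    sxx = sum((s - s_mean) ** 2 for s in s_vals)
    sxy = sum((s - s_mean) * (d - d_mean) for s, d in zip(s_vals, d_vals))
    b = sxy / sxx
    a = d_mean - b * s_mean
    return a, b


def sroc_sensitivity(a, b, fpr):
    """Sensitivity on the fitted SROC curve at a given false-positive rate."""
    logit_tpr = a / (1.0 - b) + (1.0 + b) / (1.0 - b) * logit(fpr)
    return 1.0 / (1.0 + math.exp(-logit_tpr))


# Made-up 2 x 2 counts for four hypothetical studies.
example_studies = [(90, 10, 10, 90), (80, 5, 20, 95), (95, 20, 5, 80), (70, 3, 30, 97)]

if __name__ == "__main__":
    a, b = moses_littenberg(example_studies)
    for fpr in (0.05, 0.10, 0.20):
        print(fpr, round(sroc_sensitivity(a, b, fpr), 3))
```

When b is close to zero the curve is symmetric and exp(a) is a common diagnostic odds ratio; a slope far from zero indicates that the DOR changes with the threshold.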
Figure 3. Forest plot showing study-specific and mean sensitivity and specificity. Each black square is a study-specific sensitivity and
specificity. The size of the black square reflects the weight of the study in the meta-analysis, and the horizontal line reflects the 95% con-
fidence interval (CI). The vertical broken line represents the pooled sensitivity or specificity and the boundaries of the hollow diamond displayed
at the bottom represent the 95% CI of the pooled results.
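
The study-specific values in a forest plot such as Figure 3 come from each study's 2 × 2 counts. The sketch below, offered under stated assumptions, computes the accuracy measures named in the text for one study and a simple sample-size-weighted mean across studies. The function names are hypothetical, the Wilson score interval is one common choice of 95% CI (the article does not say which interval method the figure uses), and the weights are taken to be the number of diseased patients for sensitivity and non-diseased patients for specificity. The pooled values are a naive univariate summary, not the hierarchical models the article recommends.

```python
import math


def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half


def study_accuracy(tp, fp, fn, tn):
    """Study-specific accuracy measures from one 2 x 2 table.

    Assumes that no cell appearing in a denominator is zero.
    """
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "sensitivity_ci": wilson_ci(tp, tp + fn),
        "specificity": spec,
        "specificity_ci": wilson_ci(tn, tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sens / (1.0 - spec),
        "lr_negative": (1.0 - sens) / spec,
        "dor": (tp * tn) / (fp * fn),
    }


def naive_pooled(studies):
    """Sample-size-weighted mean sensitivity and specificity (univariate).

    Weighting each study's sensitivity by its number of diseased patients is
    the same as dividing total true positives by total diseased patients;
    specificity is handled analogously with non-diseased patients.
    """
    tp = sum(s[0] for s in studies)
    fp = sum(s[1] for s in studies)
    fn = sum(s[2] for s in studies)
    tn = sum(s[3] for s in studies)
    return tp / (tp + fn), tn / (tn + fp)


# Made-up counts (tp, fp, fn, tn) for two hypothetical studies.
example_studies = [(90, 10, 10, 90), (40, 5, 10, 45)]

if __name__ == "__main__":
    for counts in example_studies:
        print(study_accuracy(*counts))
    print(naive_pooled(example_studies))
```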
Although constructed from sensitivity and specificity, ROC curves do not depend on the decision threshold. In an ROC curve, each point represents the sensitivity and false-positive rate (FPR) at a different decision threshold. The area under the ROC curve is an overall measure of the test’s accuracy. A perfect test has a value of 1, whereas a value of 0.5 is obtained if the test does no better than chance (27).

Summary ROC Plots
Summary ROC plots display the results of individual studies in ROC space; each study is plotted as a single sensitivity-specificity point. As discussed previously, the size of points depicts the precision of the estimate, which is the inverse of the standard error of the logit of sensitivity and logit of specificity.

Linked ROC Plots
Where two tests are evaluated in each study, linked ROC plots are used in analyses of pairs of tests. Points are plotted as in a normal summary ROC plot, but the two estimates, one for each test, from each study are joined by a line (Fig 4).

Test Results Are Available Only as a Dichotomy
If primary studies dichotomize data (disease present or absent) and therefore only provide sufficient information to estimate sensitivity and specificity, the mean sensitivity and the mean specificity can be estimated, possibly weighted by the sample size of each study.

Test Results Are Available in More Than Two Categories
If test results are measured as a continuum such as standard uptake value, or as responses on an ordinal categorical scale such as with ventilation/perfusion scanning, normal to high probability of pulmonary embolism, other analytic techniques can be used. If no threshold or scaling differences between primary studies exist and test comparison is not an objective, then result-specific likelihood ratios can be obtained from logistic modeling procedures.

Step 8. Assess Heterogeneity

Meta-analyses should only include studies that exactly match the question. However, studies can differ by patients studied,
disease severity, co-morbidity, test methods, study design, and other factors. These systematic differences between studies can lead to heterogeneity between studies. In meta-analyses of diagnostic imaging studies, heterogeneity is the rule rather than the exception because of nonrandomized design of most included studies and natural variation in sensitivity and specificity across positivity thresholds. Therefore, heterogeneity should be tested. Random-effects meta-analysis methods are recommended when data are heterogeneous, as in diagnostic imaging meta-analyses, and should be fitted by default. Random-effects (hierarchical) models provide an estimate of the average accuracy of the test and the variability in this effect. Where there are too few studies to estimate between-study variability, fixed-effect models can be used. Cochran Q is a commonly used test. It is a statistic based on the chi-square test (28). However, this test has low power and may fail to detect heterogeneity when it is present. Therefore, the Higgins’ I² statistic was developed to overcome this limitation (29). The I² test scores heterogeneity between 0% and 100%, with 25% corresponding to low heterogeneity, 50% to moderate, and 75% to high. A major advantage is that I² does not inherently depend on the number of studies in the meta-analysis (30). However, this test may also have insufficient power to detect heterogeneity if present and should be interpreted with some caution (28). The I² statistic was originally developed for univariate meta-analyses. However, diagnostic test analyses have two variables: sensitivity and specificity. The I² statistic can be estimated separately for sensitivity and specificity but this is not ideal. Zhou et al. derived an improved I² statistic measuring heterogeneity for dichotomous outcomes, such as with diagnostic test meta-analyses (31). For bivariate diagnostic meta-analyses, the authors derived a bivariate version of I² that is able to account for the correlation between sensitivity and specificity (31) and dependence of within-study variance on the value of binomial proportions.

Some have argued that heterogeneity is not quantified in systematic reviews of diagnostic test accuracy because one, it is expected, and two, tests for heterogeneity in sensitivity and specificity and estimates of the I² statistic do not account for heterogeneity explained by phenomena such as positivity threshold effects. Instead, heterogeneity should be assumed, and sources for heterogeneity explored. Therefore, another approach is to use subgroup analyses and multiple univariate meta-regression analyses. Meta-regression can also determine whether the heterogeneity is attributable to the covariates used. The modeling strategy should specify the criterion to decide whether or not a covariate is included, and the adding or removing of covariates.

If there is sufficient heterogeneity, it may not be appropriate to calculate overall summary measures such as sensitivity or specificity. It is important to reiterate that if too much heterogeneity is encountered or there is a lack of high-quality
studies, it may be more appropriate to solely perform the systematic review and to refrain from a further meta-analysis. For example, in the following study, the authors performed a systematic review of the topic and then concluded that meta-analysis was not appropriate in light of the identified literature (32): In this study, a forest plot of sensitivity and specificity for the included studies in the diagnostic accuracy portion of the analysis was generated. However, pooling was not performed because of the relatively small number of studies, the relatively high risk of bias, and the inherent heterogeneity secondary to the varied study design among the included studies.

Heterogeneity Due to Threshold Effect
It should be noted that some of the variability in test performance between studies might relate to the selection of a different diagnostic threshold rather than true differences in test performance. Threshold effect is one of the primary causes of heterogeneity in meta-analyses of test accuracy studies. It occurs when different cutoffs or thresholds are used to define a positive (or negative) test result in the different studies included in the meta-analysis, producing differences in sensitivities and specificities. When threshold effect exists, there is a negative correlation between sensitivity and specificity or a positive correlation between sensitivity and 1 − specificity. It should be noted that correlation between sensitivity and specificity could arise due to a number of reasons other than threshold effect. These include partial verification bias, different spectra of patients, or different settings.

There are a number of ways to assess threshold effect in meta-analysis. The first option is to produce “paired” forest plots of sensitivity and specificity. If this forest plot is in ascending order of sensitivity along with the corresponding specificity, and as sensitivity increases, there is decreasing specificity, this could be explained by a threshold effect or vice versa. The same inverse relationship will be seen with positive and negative likelihood ratio. Similarly, one can assess the correlation of the logits of sensitivity and specificity. If there is a negative correlation, this suggests threshold effect. Alternatively, using the logit of sensitivity and 1 − specificity, a positive correlation suggests the presence of a threshold

region. This suggests threshold effect. If a negative correlation between sensitivity and specificity or a positive correlation between sensitivity and 1 − specificity is found and a correlation coefficient is computed, it has been suggested that the square of the correlation coefficient is approximately equal to the amount of heterogeneity that can be attributed to threshold effect. For instance, if the correlation coefficient was 0.5, this squared is 0.25. Therefore, approximately 25% or a quarter of the heterogeneity observed could be attributed to threshold effect. When evidence of a threshold effect between studies in the systematic review and meta-analysis is observed, summary points alone should be avoided. Summary points such as summary sensitivity, specificity, or DOR may not correctly reflect the variability between studies and may miss important information regarding heterogeneity between studies (33). It is more appropriate to construct an SROC curve to show how the different sensitivities and specificities of primary studies are related to each other (34).

Heterogeneity Due to Non-threshold Effect
As stated previously, heterogeneity may be due to factors other than threshold effect. These include variations in study population such as severity of disease and co-morbidities, index test factors such as differences in technology and generations of technology, reference standard differences, and differences in the way a study was designed and conducted (35).

Heterogeneity among included studies in a meta-analysis can be assessed in three different ways. The first option is a visual inspection of “paired” forest plots of sensitivity and specificity. If studies are reasonably homogeneous, the sensitivity and specificity estimates from individual studies will lie along the line corresponding to the summary sensitivity and specificity estimate. However, if there are large deviations from this line, this indicates possible heterogeneity. A second option is based on statistical testing such as the chi-square test, Cochran Q, and the inconsistency index or I². The pros and cons of these were discussed previously. A third option is to assume heterogeneity is present and explore sources of heterogeneity with meta-regression.
effect. If there is a positive correlation of sensitivity plotted Meta-regression
against 1 − specificity in logit space, this suggests a threshold Univariate or multivariate regression analysis can be used in
effect. The second option is a representation of accuracy es- primary studies to assess the relationship between one or more
timates from each study in ROC space. If a plot of sensitivity covariates and a dependent variable. Essentially the same ap-
against 1 − specificity results in a typical ROC pattern some- proach can be used with meta-analysis. This is called meta-
times referred to as a “shoulder arm” plot, this suggests threshold regression. The difference here is that the covariates are at
effect. A third option is a computation of Spearman corre- the level of the study rather than the level of the subject. The
lation coefficient between the logit of sensitivity and logit of causes of heterogeneity should be investigated when de-
1 − specificity. Threshold effect is suggested if there is a strong tected. Patient characteristics, definitions of the test and reference
positive correlation. A fourth option is to create a chi plot. standards, and operating characteristics of the test can result
This is used to judge whether or not sensitivity and speci- in heterogeneity in sensitivity and specificity. Meta-regression
ficity are independent by augmenting the scatter plot with allows the exploration of which types of patient-specific factors
an auxiliary display. In the case of independence, the points or study design factors contribute to the heterogeneity. Meta-
will be concentrated in the central region, in the horizontal regression uses summary data from each trial, such as the
band indicated on the plot. In the case of interdependence, accuracy. Covariates may be introduced into a regression with
the points will be scattered and not concentrated in the central any test performance measure as the dependent variable. The
Academic Radiology, Vol 25, No 5, May 2018 SYSTEMATIC REVIEW AND META-ANALYSIS
sample size corresponds to the number of studies in the analysis. A small sample size limits the power to detect significant effects. The accuracy measure that is frequently used is DOR. It is a useful measure of diagnostic performance, as it is a single measure that encompasses both sensitivity and specificity and likelihood ratios. It can compare the overall diagnostic accuracy of different tests but is limited because it cannot be used directly in clinical practice (35). In primary studies, a minimum of 10 subjects is required for each covariate assessed. Similarly, in meta-analysis, a minimum of 10 studies is required for each covariate assessed. A lower number of included studies in the meta-analysis limits the number of covariates that can be included and the ability to perform meta-regression. Often, there are too few studies included to perform multivariate meta-regression and instead one or several univariate meta-regression analyses are performed. In a meta-analysis with fewer than 10 included studies, meta-regression may not be feasible. This means that causes of heterogeneity cannot be investigated. This may be a significant limitation to a meta-analysis of diagnostic test accuracy as heterogeneity is to be expected.
Another approach, using individual patient data, allows greater flexibility for the analysis and exploration of issues not covered in the published trials. However, obtaining the original patient data from the trials can be challenging.
Step 9. Assess Publication Bias
When authors or journals are more likely to publish research with positive results, this distorts the available evidence and is known as publication bias. However, null and negative results are just as valid, and meta-analyses should try to include unpublished studies. Publication bias can be assessed with a funnel plot in which effect size is plotted against the sample size (24,36,37). An inverted symmetrical funnel of dots is consistent with the absence of publication bias (38). An asymmetric plot suggests that some studies may have been missed by the meta-analysis. Asymmetry can also occur if small studies have larger effects (25,39). However, it can be difficult to detect asymmetry visually (40). Therefore, formal statistical methods, such as Egger’s regression, have been developed (36). Egger’s regression tests whether small studies have larger effect sizes than would be expected by chance, and whether small studies with small effects have not been published. For meta-analyses of diagnostic imaging accuracy studies, the best method for investigating publication bias has not yet been decided, and there is a paucity of research. Statistical tests detect funnel plot asymmetry rather than publication bias, and tests designed for meta-analyses of randomized trials are probably not applicable to diagnostic studies. Other regression tests are being developed to overcome the problem of small numbers of small studies with weakly positive effects (Fig 5) (24,41,42). Currently, there is no clear consensus on which test to use and when. Whatever test is used, findings should be interpreted with care. Further research is required. At present, the best method to assess publication bias is that proposed by Deeks et al. (41). The authors have shown that the regression test has greater power to detect publication bias than the rank correlation test (41). The authors recommend that systematic reviewers undertake funnel plot investigations to examine the possibility of publication and other sample size–related effects. They recommend testing for asymmetry using regression tests, plotting the log of DOR against the square root of 1/effective sample size (ESS) (41). Deeks et al. showed that these tests
CRONIN ET AL Academic Radiology, Vol 25, No 5, May 2018
are robust when used in meta-analyses of studies of diagnostic test accuracy (41).
should be deployed to resolve these uncertainties. Sensitivity analyses may generate areas for further investigations and future research.
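Two of the checks described in the preceding steps lend themselves to a short illustration: the Spearman correlation between the logit of sensitivity and the logit of 1 − specificity used to look for a threshold effect, and the funnel plot asymmetry regression of Deeks et al. (41). The following is a minimal sketch in Python, not the implementation used by any of the packages discussed in this review. The function names, the 0.5 continuity correction for zero cells, the use of ESS as the regression weight, and the formula ESS = 4·n1·n2/(n1 + n2) are assumptions stated here for illustration; a p value for the slope would additionally require a t distribution with k − 2 degrees of freedom.

```python
import math

def corrected(tp, fp, fn, tn):
    """Assumption: add 0.5 to every cell only when some cell is zero."""
    if 0 in (tp, fp, fn, tn):
        return tp + 0.5, fp + 0.5, fn + 0.5, tn + 0.5
    return float(tp), float(fp), float(fn), float(tn)

def logit(p):
    return math.log(p / (1.0 - p))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def threshold_effect(studies):
    """studies: list of (tp, fp, fn, tn). Returns (Spearman rho, rho squared)."""
    cells = [corrected(*s) for s in studies]
    logit_sens = [logit(a / (a + c)) for a, b, c, d in cells]
    logit_fpr = [logit(b / (b + d)) for a, b, c, d in cells]
    # Spearman rho is the Pearson correlation of the ranks.
    rho = pearson(average_ranks(logit_sens), average_ranks(logit_fpr))
    return rho, rho ** 2

def deeks_regression(studies):
    """Weighted regression of ln(DOR) on 1/sqrt(ESS). Returns (slope, standard error)."""
    x, y, w = [], [], []
    for tp, fp, fn, tn in studies:
        a, b, c, d = corrected(tp, fp, fn, tn)
        n_dis, n_non = tp + fn, fp + tn
        ess = 4.0 * n_dis * n_non / (n_dis + n_non)   # effective sample size
        x.append(1.0 / math.sqrt(ess))
        y.append(math.log((a * d) / (b * c)))        # ln of diagnostic odds ratio
        w.append(ess)                                 # assumption: weight by ESS
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sxx
    intercept = my - slope * mx
    rss = sum(wi * (yi - intercept - slope * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    se = math.sqrt(rss / (len(x) - 2) / sxx)          # needs at least 3 studies
    return slope, se
```

In this sketch, a Spearman coefficient near +1 points toward a threshold effect, and its square gives the rough share of heterogeneity attributable to it, as in the worked example of 0.5 squared giving 0.25. A slope that is clearly different from zero relative to its standard error indicates funnel plot asymmetry, which, as noted above, is not the same thing as publication bias.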
Figure 7. Likelihood ratio or Fagan nomogram for different pretest probabilities of disease (25%, 50%, and 75%) for two tests. Posttest probability is derived by drawing a straight line from the pretest probability on the left vertical axis through the appropriate likelihood ratio and continuing the line to the vertical posttest probability axis. The point where this line intersects the posttest probability axis is the posttest probability. When Bayes theorem is expressed in terms of log-odds, the posterior log-odds are linear functions of the prior log-odds and the log-likelihood ratios. A Fagan plot consists of a vertical axis on the left with the prior log-odds, an axis in the middle representing the log-likelihood ratio, and a vertical axis on the right representing the posterior log-odds (43).
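The straight line on the nomogram corresponds to a simple sum in log-odds. The sketch below shows that arithmetic in Python; the function name is hypothetical and the likelihood ratio values used in the note that follows are illustrative assumptions, not the values of the two tests plotted in Figure 7.

```python
import math

def posttest_probability(pretest_probability, likelihood_ratio):
    """Bayes theorem in log-odds form: posterior log-odds = prior log-odds + log LR."""
    prior_log_odds = math.log(pretest_probability / (1.0 - pretest_probability))
    posterior_log_odds = prior_log_odds + math.log(likelihood_ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-posterior_log_odds))
```

For example, with a pretest probability of 25% and an assumed positive likelihood ratio of 10, the prior odds of 1:3 become posterior odds of 10:3, that is, a posttest probability of 10/13 or about 77%. Setting the pretest probability to the disease prevalence and using the positive likelihood ratio yields the positive-predictive value.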
Figure 8. Likelihood ratio scatter graph shows summary point of likelihood ratios obtained as functions of mean sensitivity and specificity
in the right upper quadrant, suggesting that the test is useful for confirmation of presence of disease (when positive) and not for its exclu-
sion (when negative). Informativeness may also be represented graphically by a likelihood ratio scatter graph or matrix. It defines quadrants
of informativeness based on established evidence-based thresholds: Left upper quadrant, likelihood ratio positive > 10, likelihood ratio neg-
ative < 0.1, confirmation and exclusion, suggesting that the test is useful for confirmation of presence of disease (when positive) and for its
exclusion (when negative). Right upper quadrant, likelihood ratio positive > 10, likelihood ratio negative > 0.1, confirmation only, suggesting
that the test is useful for confirmation of presence of disease (when positive) and not for its exclusion (when negative). Left lower quadrant,
likelihood ratio positive < 10, likelihood ratio negative < 0.1, exclusion only, suggesting that the test is not useful for confirmation of pres-
ence of disease (when positive) but is for its exclusion (when negative). Right lower quadrant, likelihood ratio positive < 10, likelihood ratio
negative > 0.1, no exclusion or confirmation, suggesting that the test is neither useful for confirmation of presence of disease (when pos-
itive) nor for its exclusion (when negative) (44).
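The quadrant rule in this caption can be written out directly. The following Python sketch is illustrative only: the function names are hypothetical, sensitivity and specificity are assumed to lie strictly between 0 and 1, and values falling exactly on a threshold (which the caption does not define) are treated here as not meeting it.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative

def lr_quadrant(lr_positive, lr_negative):
    """Quadrant of the likelihood ratio scatter graph (thresholds 10 and 0.1)."""
    confirms = lr_positive > 10.0   # useful for confirmation when positive
    excludes = lr_negative < 0.1    # useful for exclusion when negative
    if confirms and excludes:
        return "left upper: confirmation and exclusion"
    if confirms:
        return "right upper: confirmation only"
    if excludes:
        return "left lower: exclusion only"
    return "right lower: no exclusion or confirmation"
```

For instance, an assumed summary sensitivity of 0.75 and specificity of 0.95 give a positive likelihood ratio of about 15 and a negative likelihood ratio of about 0.26, which places the summary point in the right upper quadrant.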
This concept is depicted visually with Fagan nomograms (43). When Bayes theorem is expressed in terms of log-odds, the posterior log-odds are linear functions of the prior log-odds and the log-likelihood ratios. A Fagan plot, as shown in Figure 7, consists of a vertical axis on the left with the prior log-odds, an axis in the middle representing the log-likelihood ratio, and a vertical axis on the right representing the posterior log-odds. Lines are then drawn from the prior probability on the left through the likelihood ratios in the center and extended to the posterior probabilities on the right (Fig 7).
Likelihood Ratio Scatter Graph
Informativeness may also be represented graphically by a likelihood ratio scatter graph or matrix (44). It defines quadrants of informativeness based on established evidence-based thresholds. The likelihood ratio scatter graph shows the summary point of likelihood ratios obtained as functions of mean sensitivity and specificity (Fig 8) (5).
Predictive Values and Probability-modifying Plot
The conditional probability of disease given a positive (or negative) test, the so-called positive (or negative)-predictive value, is critically important to clinical application of a diagnostic procedure. It depends not only on sensitivity and specificity but also on disease prevalence (p). The probability-modifying plot is a graphical sensitivity analysis of predictive value across a prevalence continuum defining low- to high-risk populations (Fig 9). It depicts separate curves for positive and negative tests. The user draws a vertical line from the selected pretest probability to the appropriate likelihood ratio line and then reads the posttest probability off the vertical scale. General summary statistics have also been introduced for when it may be of interest to evaluate the effect of p on predictive values: unconditional positive- and negative-predictive values, which permit prevalence heterogeneity (45). These measures are obtained by integrating their corresponding conditional (on p) versions with respect to a prior distribution for p. The prior posits assumptions about the risk level in a hypothetical population of interest (for example, low, moderate, or high risk), as well as the heterogeneity in the population. Figure 9 plots the relationship between pre- and posttest probability based on the likelihood of a positive (above diagonal line) or negative (below diagonal line) test result over the 0–1 range of pretest probabilities.
CONCLUSION
The information from systematic reviews and meta-analyses is important to clinicians, health policy makers, researchers
and developers of diagnostic techniques, patients, and the general public. In this review, the procedural and analytic methods for conducting systematic reviews of diagnostic imaging studies have been discussed. A guide to constructing the research question, literature search strategies, and study selection is provided. Current recommendations for the evidence appraisal process and methodological quality assessment and recommendations on the use of study quality in quantitative synthesis are also discussed. Properties and limitations of the conventional meta-analytic technique of HSROC curves and of mixed-effects models, which simultaneously synthesize sensitivity and specificity pairs, are discussed and summarized. The use of meta-regression to investigate unobserved heterogeneity and covariate effects is addressed. The graphical and statistical elements of a clinically useful report are discussed. Challenges confront investigators undertaking these reviews. However, we encourage radiological investigators to become familiar with these techniques and to collaborate with methodologists who can enhance the design and conduct of diagnostic imaging systematic reviews.
REFERENCES
1. Berman NG, Parker RA. Meta-analysis: neither quick nor easy. BMC Med Res Methodol 2002; 2:10.
2. Whiting P, Rutjes AW, Reitsma JB, et al. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004; 140:189–202.
3. Whiting P, Rutjes AW, Reitsma JB, et al. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003; 3:25.
4. Whiting P, Rutjes AW, Dinnes J, et al. Development and validation of methods for assessing the quality of diagnostic accuracy studies. Health Technol Assess 2004; 8:iii, 1–234.
5. Leeflang MM, Deeks JJ, Gatsonis C, et al. Systematic reviews of diagnostic test accuracy. Ann Intern Med 2008; 149:889–897.
6. Simes RJ. Publication bias: the case for an international registry of clinical trials. J Clin Oncol 1986; 4:1529–1541.
7. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999; 282:1061–1066.
8. McGrath TA, McInnes MDF, Langer FW, et al. Treatment of multiple test readers in diagnostic accuracy systematic reviews-meta-analyses of imaging studies. Eur J Radiol 2017; 93:59–64.
9. Buscemi N, Hartling L, Vandermeer B, et al. Single data extraction generated more errors than double data extraction in systematic reviews. J Clin Epidemiol 2006; 59:697–703.
10. Jones AP, Remmington T, Williamson PR, et al. High prevalence but low impact of data extraction and reporting errors were found in Cochrane systematic reviews. J Clin Epidemiol 2005; 58:741–742.
11. Gotzsche PC, Hrobjartsson A, Maric K, et al. Data extraction errors in meta-analyses that use standardized mean differences. JAMA 2007; 298:430–437.
12. Cook DJ, Sackett DL, Spitzer WO. Methodologic guidelines for systematic reviews of randomized control trials in health care from the Potsdam Consultation on meta-analysis. J Clin Epidemiol 1995; 48:167–171.
13. Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011; 155:529–536.
14. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993; 12:1293–1316.
15. Midgette AS, Stukel TA, Littenberg B. A meta-analytic method for summarizing diagnostic test performances: receiver-operating-characteristic-summary point estimates. Med Decis Making 1993; 13:253–257.
16. van Houwelingen HC, Arends LR, Stijnen T. Advanced methods in meta-analysis: multivariate approach and meta-regression. Stat Med 2002; 21:589–624.
17. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 2004; 57:925–932.
18. Reitsma JB, Glas AS, Rutjes AW, et al. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005; 58:982–990.
19. Chu H, Cole SR. Bivariate meta-analysis of sensitivity and specificity with sparse data: a generalized linear mixed model approach. J Clin Epidemiol 2006; 59:1331–1332, author reply 1332–1333.
20. Rutter CM, Gatsonis CA. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001; 20:2865–2884.
21. Arends LR, Hamza TH, van Houwelingen JC, et al. Bivariate random effects meta-analysis of ROC curves. Med Decis Making 2008; 28:621–638.
22. Harbord RM, Deeks JJ, Egger M, et al. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007; 8:239–251.
23. McGrath TA, McInnes MD, Korevaar DA, et al. Meta-analyses of diagnostic accuracy in imaging journals: analysis of pooling techniques and their effect on summary estimates of diagnostic accuracy. Radiology 2016; 281:78–85.
24. Mulrow CD. Rationale for systematic reviews. BMJ 1994; 309:597–599.
25. Lau J, Ioannidis JP, Terrin N, et al. The case of the misleading funnel plot. BMJ 2006; 333:597–600.
26. Lewis S, Clarke M. Forest plots: trying to see the wood and the trees. BMJ 2001; 322:1479–1480.
27. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989; 29:307–335.
28. Ioannidis JP, Patsopoulos NA, Evangelou E. Uncertainty in heterogeneity estimates in meta-analyses. BMJ 2007; 335:914–916.
29. Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ 2003; 327:557–560.
30. Higgins JP, Thompson SG. Quantifying heterogeneity in a meta-analysis. Stat Med 2002; 21:1539–1558.
31. Zhou Y, Dendukuri N. Statistics for quantifying heterogeneity in univariate and bivariate meta-analyses of binary data: the case of meta-analyses of diagnostic accuracy. Stat Med 2014; 33:2701–2717.
32. McInnes MD, Hibbert RM, Inacio JR, et al. Focal nodular hyperplasia and hepatocellular adenoma: accuracy of gadoxetic acid–enhanced MR imaging—a systematic review. Radiology 2015; 277:413–423.
33. Dinnes J, Deeks J, Kirby J, et al. A methodological review of how heterogeneity has been examined in systematic reviews of diagnostic test accuracy. Health Technol Assess 2005; 9:1–113, iii.
34. Lee J, Kim KW, Choi SH, et al. Systematic review and meta-analysis of studies evaluating diagnostic test accuracy: a practical review for clinical researchers—part II. Statistical methods of meta-analysis. Korean J Radiol 2015; 16:1188–1196.
35. Lijmer JG, Bossuyt PM, Heisterkamp SH. Exploring sources of heterogeneity in systematic reviews of diagnostic tests. Stat Med 2002; 21:1525–1537.
36. Egger M, Davey Smith G, Schneider M, et al. Bias in meta-analysis detected by a simple, graphical test. BMJ 1997; 315:629–634.
37. Sterne JA, Egger M. Funnel plots for detecting bias in meta-analysis: guidelines on choice of axis. J Clin Epidemiol 2001; 54:1046–1055.
38. Egger M, Smith GD. Misleading meta-analysis. BMJ 1995; 311:753–754.
39. Sterne JA, Egger M, Smith GD. Systematic reviews in health care: investigating and dealing with publication and other biases in meta-analysis. BMJ 2001; 323:101–105.
40. Terrin N, Schmid CH, Lau J. In an empirical evaluation of the funnel plot, researchers could not visually identify publication bias. J Clin Epidemiol 2005; 58:894–901.
41. Deeks JJ, Macaskill P, Irwig L. The performance of tests of publication bias and other sample size effects in systematic reviews of diagnostic test accuracy was assessed. J Clin Epidemiol 2005; 58:882–893.
42. Peters JL, Sutton AJ, Jones DR, et al. Comparison of two methods to detect publication bias in meta-analysis. JAMA 2006; 295:676–680.
43. Fagan TJ. Letter: nomogram for Bayes theorem. N Engl J Med 1975; 293:257.
44. Stengel D, Bauwens K, Sehouli J, et al. A likelihood ratio approach to meta-analysis of diagnostic studies. J Med Screen 2003; 10:47–51.
45. Li J, Fine JP, Safdar N. Prevalence-dependent diagnostic accuracy measures. Stat Med 2007; 26:3258–3273.
46. Dwamena B. MIDAS: Meta-analytical integration of diagnostic accuracy studies in Stata, West Coast Stata Users’ Group meetings. University of Michigan, 2007. MIDAS Web site. Published August 15, 2007. Available at: http://sitemaker.umich.edu/metadiagnosis/midas_home.
47. Dwamena B. MIDAS: Meta-analytical integration of diagnostic accuracy studies in Stata, North American Stata Users’ Group meetings. University of Michigan, 2007. MIDAS Web site. Published August 15, 2007. Available at: http://sitemaker.umich.edu/metadiagnosis/midas_home.
48. Dwamena BA. MIDAS: Stata module for meta-analytical integration of diagnostic test accuracy studies. Boston, MA: Boston College Department of Economics, 2008. Available at: http://ideas.repec.org/c/boc/bocode/s456880.html.
49. Van Houwelingen HC, Zwinderman KH, Stijnen T. A bivariate approach to meta-analysis. Stat Med 1993; 12:2273–2284.
50. Riley RD, Abrams KR, Lambert PC, et al. An evaluation of bivariate random-effects meta-analysis for the joint synthesis of two correlated outcomes. Stat Med 2007; 26:78–97.
51. Riley RD, Abrams KR, Sutton AJ, et al. Bivariate random-effects meta-analysis and the estimation of between-study correlation. BMC Med Res Methodol 2007; 7:3.
52. Rabe-Hesketh S. GLLAMM manual. University of California-Berkeley, Division of Biostatistics, Working Paper Series Paper No. 160. 2004.
53. Rabe-Hesketh S, Skrondal A, Pickles A. Reliable estimation of generalized linear mixed models using adaptive quadrature. Stata J 2002; 2:1–21.
54. Littenberg B, Moses LE. Estimating diagnostic accuracy from multiple conflicting reports: a new meta-analytic method. Med Decis Making 1993; 13:313–321.
55. Harbord R, Whiting P, Sterne J. metandi: Stata module for statistically rigorous meta-analysis of diagnostic accuracy studies. In: Methods for evaluating medical tests. Birmingham, UK: Department of Public Health, Epidemiology and Biostatistics, University of Birmingham, 2008. 1st Symposium; July 24–25, 23.
56. Zamora J, Abraira V, Muriel A, et al. Meta-DiSc: a software for meta-analysis of test accuracy data. BMC Med Res Methodol 2006; 6:31.
TABLE A1. Examples of PICOS (Patient, Intervention, Comparator, Outcome, and Study Design) or in the Cochrane Guidelines
for Diagnostic Accuracy Tests as PICTS (Patient, Index Test, Comparator test, Target Disorder and Study Design) Statements
(PICOS) columns: Patient, Population, Problem | Intervention | Comparator | Outcome | Study design
(PICTS) columns: Patient, Population, Problem | Index test | Comparator test | Target disorder | Study design
Symptomatic carotid stenosis | Computed tomographic angiography (CTA) | Magnetic resonance angiography (MRA) | Sensitivity, specificity, and diagnostic accuracy; detection and quantification of carotid stenosis
Known or suspected coronary artery disease | CT coronary angiography | Invasive catheter coronary angiography | Sensitivity, specificity, and diagnostic accuracy; identifying one (or more) potentially or probably hemodynamically significant (≥50% coronary artery luminal diameter) stenosis
A solitary pulmonary nodule | Dynamic contrast material–enhanced CT; dynamic contrast material–enhanced MRI; FDG PET; 99mTc-depreotide SPECT | Histology | Sensitivity, specificity, and diagnostic accuracy; diagnosis of malignancy
Known or suspected rotator cuff tears | Ultrasound | MRI | Sensitivity, specificity, and diagnostic accuracy
 | Low-dose CT colonography (CTC) | Optical colonoscopy (OC) | Sensitivity, specificity, and diagnostic accuracy; clinically meaningful colonic polyps
TABLE A3. The Commonly Used Summary Statistics for Test Accuracy, Including a 2 × 2 Contingency Table with Sensitivity, Specificity, Positive- and Negative-predictive Values, and Accuracy Calculated
                              Disease
                    True                        False
Test outcome
  Positive          True positive               False positive              → Positive predictive value = TP/(TP + FP)
  Negative          False negative              True negative               → Negative predictive value = TN/(TN + FN)
                    ↓                           ↓                           → Accuracy = (TP + TN)/(TP + FP + FN + TN)
                    Sensitivity =               Specificity =               → Prevalence = (TP + FN)/(TP + FP + FN + TN)
                    True positive rate =        True-negative rate =
                    True positive fraction =    True-negative fraction =
                    Detection rate =            TN/(FP + TN)
                    TP/(TP + FN)
FN, false-negative; FP, false positive; TN, true negative; TP, true positive.
Sensitivity = TP/(TP + FN).
Specificity = TN/(TN + FP).
Positive-predictive value = TP/(TP + FP).
Negative-predictive value = TN/(TN + FN).
Accuracy = (TP + TN)/(TP + FP + FN + TN).
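The formulas in Table A3 translate directly into code. The following is a minimal Python sketch; the function name and the dictionary keys are hypothetical, and the counts in the note below are made-up numbers chosen only to make the arithmetic easy to follow.

```python
def accuracy_summary(tp, fp, fn, tn):
    """Summary statistics of Table A3 from the four cells of a 2 x 2 table."""
    total = tp + fp + fn + tn
    return {
        "sensitivity": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),        # true negative rate
        "ppv": tp / (tp + fp),                # positive-predictive value
        "npv": tn / (tn + fn),                # negative-predictive value
        "accuracy": (tp + tn) / total,
        "prevalence": (tp + fn) / total,
    }
```

With an assumed table of 90 true positives, 20 false positives, 10 false negatives, and 180 true negatives, sensitivity, specificity, and accuracy are all 0.90, prevalence is 100/300, the positive-predictive value is 90/110, and the negative-predictive value is 180/190. No zero-cell handling is included, so a row or column total of zero raises a division error.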
who are nonstatisticians. For statisticians, RevMan is an easy tool to perform the meta-analyses and generate the graphs such as forest plots and funnel plots. RevMan 5 is the software used for preparing and maintaining Cochrane Reviews. Thousands of systematic reviews and meta-analyses published on the Cochrane Library are performed using RevMan. RevMan can be downloaded at http://community.cochrane.org/tools/review-production-tools/revman-5/revman-5-download.
dr-ROC
dr-ROC is a highly specialized, commercially available Microsoft Excel workbook file for meta-analysis of diagnostic tests. The standard version of the software is limited to 25 or fewer studies. A version that handles up to 100 studies is available, free of charge, to licensees who contact the publisher. Key strengths of dr-ROC include an easy-to-use, complete, self-contained solution for meta-analysis and automatic generation of publication-quality graphics. The statistical methodology of dr-ROC is based on the SROC approach to meta-analysis (14,54). Data analysis options are set using simple pull-down menus on the same page as data entry. Graph options are set with check boxes right next to the graphs. Analytic options include fixed-effects (Mantel-Haenszel) and random-effects (DerSimonian-Laird) meta-analysis of DORs, pooled sensitivity and specificity with CIs, and Spearman rank and Pearson product-moment correlations for sensitivity vs specificity, along with their statistical significance. The coefficient of determination, r², measures the proportion of variation in specificity that would be accounted for by differences in sensitivity. Graphical options include forest plots of study-by-study sensitivity and specificity, an ROC plot comparing SROC, random-effects, and fixed-effects meta-analysis results, and display of random-effects or fixed-effects results on SROC and logit plots. dr-ROC does heterogeneity analysis, statistical analysis of user-supplied study quality scores, and pretest or posttest probabilities, among other analyses.
metandi
metandi is a user-written Stata program for meta-analysis of diagnostic test accuracy studies, in which both the index test under study and the reference test (gold standard) are dichotomous (55). It takes as input tp fp fn tn (the number of true positives, false positives, false negatives, and true negatives) within each study. It fits a two-level mixed logistic regression model, with independent binomial distributions for the true positives and true negatives conditional on the sensitivity and specificity in each study, and a bivariate normal model for the logit transforms of sensitivity and specificity between studies (19). Estimates are displayed for the parameters of both the bivariate model (18) and the hierarchical summary receiver operating characteristic (HSROC) model (20). In Stata 10, metandi fits the model using the built-in command xtmelogit by default. In Stata 8 or 9, it makes use of the user-written command gllamm. metandi does not allow covariates to be fitted, that is, meta-regression of diagnostic accuracy is not supported. metandiplot graphs the results from metandi on an SROC plot. By default, the display includes a summary point showing the summary sensitivity and specificity, a confidence contour outlining the confidence region for the summary point, one or more prediction contours outlining the prediction region for the true sensitivity and specificity in a future study, and the HSROC curve from the HSROC model. If the optional variables tp fp fn tn are included on the command line, the plot also includes study estimates, indicating the sensitivity and specificity estimated using the data from each study separately. If the model was fitted using gllamm, post-estimation predictions are obtained using gllapred. If the model was fitted using xtmelogit, the predictions are obtained using
predict (see [XT] xtmelogit postestimation, predict). The module is available at http://ideas.repec.org/c/boc/bocode/s456932.html.
Metadas
Metadas is a SAS macro developed as a wrapper for Proc NLMIXED to implement hierarchical or multilevel methods for the meta-analysis of diagnostic accuracy studies. Metadas reduces the problem of selecting starting values for model parameters in Proc NLMIXED. The macro can run any number of tests consecutively and has several options, which include model choice (hierarchical summary receiver operating characteristic or bivariate model), predictions based on the empirical Bayes estimates, covariate inclusion, likelihood ratio tests, and model checking. The output of the analysis is summarized in a Word document with all parameter estimates in a format suitable for input into the Cochrane Collaboration software, RevMan 5, to produce SROC plots. In addition, estimates of summary measures of test accuracy such as the expected sensitivity, specificity, likelihood ratios, and DORs are produced, as well as relative measures when there is a covariate in the model. Metadas is a versatile program that renders meta-analysis of diagnostic accuracy studies in SAS more accessible. It has no graphical capability in terms of SROC plots but provides more flexibility in model fitting and result output. It is available from the authors at y.takwoingi@bham.ac.uk.
mada
The specialized software required and technical difficulty associated with using hierarchic models may be a barrier to their use. With the release of the freeware package "mada" in R in 2012, this readily available software with a clear and concise user guide should considerably reduce the technical and economic barriers to hierarchic methods of data pooling for research groups. The open-source R-package mada provides some established and some current approaches to diagnostic meta-analysis, as well as functions to produce descriptive statistics and graphics. It is assumed that the reader is familiar with central concepts of meta-analysis, such as fixed- and random-effects models and ideas behind diagnostic accuracy meta-analysis and (S)ROC. Once R is installed and an Internet connection is available, the package can be installed from CRAN on most systems by typing > install.packages("mada").
Development of mada is hosted at http://r-forge.r-project.org/projects/mada/.
HSROC
The open-source R-package for joint meta-analysis of diagnostic test sensitivity and specificity with or without a gold standard reference is authored by Ian Schiller and Nandini Dendukuri. This package implements a model for joint meta-analysis of sensitivity and specificity of the diagnostic test under evaluation, while taking into account the possibly imperfect sensitivity and specificity of the reference test. This hierarchical model accounts for both within- and between-study variability. Estimation is carried out using a Bayesian approach, implemented via a Gibbs sampler. The model can be applied in situations where more than one reference test is used in the selected studies. It is available at http://cran.r-project.org/web/packages/HSROC/index.html.
Meta-DiSc
Meta-DiSc is Windows-based, user-friendly, freely available (for academic use), well-validated software for performing dedicated diagnostic meta-analysis (56). Zamora et al. described Meta-DiSc as (1) performing independent statistical pooling of sensitivities, specificities, likelihood ratios, and DORs using fixed- and random-effects models, both overall and in subgroups; (2) allowing exploration of heterogeneity, with a variety of statistics including chi-square, I-squared, and Spearman correlation tests; (3) implementing meta-regression techniques to explore the relationships between study characteristics and accuracy estimates; and (4) producing high-quality figures, including forest plots and linear regression–based summary receiver operating characteristic curves that can be exported for use in manuscripts for publication (56). All computational algorithms have been validated through comparison with different statistical tools and published meta-analyses (56). Meta-DiSc has a Graphical User Interface with roll-down menus, dialog boxes, and online help facilities. The software is publicly available at http://www.hrc.es/investigacion/metadisc-en.htm. Although Meta-DiSc has already been used and cited in several meta-analyses published in high-ranking journals, there is a note of caution. Meta-DiSc has no capacity to perform hierarchic methods (23). As stated previously, we reiterate that the recommended HSROC or bivariate methods for pooling in meta-analyses of diagnostic accuracy studies should be used rather than traditional univariate methods. McGrath et al. showed that, in 120 reviews in which traditional univariate pooling methods for meta-analysis were used, the authors performed their analyses with Meta-DiSc. This represented nearly two-thirds of reviews in this category (23). Unfortunately, Meta-DiSc remains available online and continues to be touted as a tool for use in reviews of diagnostic test accuracy (23).
SENSITIVITY AND SPECIFICITY
Sensitivity and specificity are measures defined conditional on disease status, as they are computed as proportions of the number diseased and the number nondiseased, respectively. The sensitivity of a test is defined as the probability that the index test result will be positive in a diseased case. Sensitivity is also referred to as detection rate (DR), true-positive rate (TPR), or true-positive fraction (TPF) (see Table A3). The specificity of a test is defined as the probability that the index test result will be negative in a nondiseased case. Specificity is also