SUMMARY
An important quality of meta-analytic models for research synthesis is their ability to account for
both within- and between-study variability. Currently available meta-analytic approaches for studies of
diagnostic test accuracy work primarily within a fixed-effects framework. In this paper we describe
a hierarchical regression model for meta-analysis of studies reporting estimates of test sensitivity and
specificity. The model allows more between- and within-study variability than fixed-effect approaches, by
allowing both test stringency and test accuracy to vary across studies. It is also possible to examine the
effects of study-specific covariates. Estimates are computed using Markov chain Monte Carlo simulation
with publicly available software (BUGS). This estimation method allows flexibility in the choice of
summary statistics. We demonstrate the advantages of this modelling approach using a recently published
meta-analysis comparing three tests used to detect nodal metastasis of cervical cancer. Copyright © 2001
John Wiley & Sons, Ltd.
1. INTRODUCTION
The need for systematic review and synthesis of published evidence on the accuracy of diagnostic
tests has increased in recent years. The information from such reviews is a key element
of clinical and health policy decision making regarding the use of diagnostic tests; it is also
essential for guiding the process of technology development and evaluation in diagnostic
medicine [1, 2].
Statistical methods for meta-analysis of diagnostic test evaluations have focused on the
analysis of studies reporting estimates of test sensitivity and specificity, which constitute the
majority of studies in diagnostic test evaluation [1–9]. A key goal in the synthesis of such
studies is to derive summary measures of test performance. These measures must account for
the trade-off between sensitivity and specificity as the threshold for positivity varies along
some explicit or latent scale. This trade-off has been widely recognized in the evaluation of
∗ Correspondence to: Carolyn M. Rutter, Group Health Cooperative, Center for Health Studies, 1730 Minor Avenue,
Suite 1600, Seattle, WA 98101, U.S.A.
† E-mail: rutter.c@ghc.org
diagnostic tests and has led to the development of receiver operating characteristic (ROC)
curves [10]. Briefly, the ROC curve for a diagnostic test is the set of all pairs of sensitivity
and specificity that can be achieved as the positivity threshold varies across the entire range
of possible values.
The experience from research synthesis of diagnostic test evaluations shows that there is
substantial variation among the estimates of a test's sensitivity and specificity across published
studies [1, 2]. Differences in positivity threshold constitute an important source of variability
across studies (and also across diagnosticians, within a given study). In addition, study
characteristics, such as technical aspects of the diagnostic test, patient and disease cohorts, study
settings, experience of readers, and sample size are also potential contributors to between-study
variation in the estimates of diagnostic performance. Simple averaging or pooling
across studies can provide misleading conclusions, as can be readily seen from a simple
example. If three studies report the following estimates of test sensitivity and specificity:
(0.10, 0.90), (0.80, 0.80) and (0.90, 0.10), the average pair of sensitivity and specificity is
(0.60, 0.60) and lies completely outside the domain of the original studies (see also references
[1, 2, 6]).
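The pitfall of naive pooling can be verified with a few lines of arithmetic. The following sketch uses the three hypothetical studies above and shows that the componentwise average falls well away from every reported pair:

```python
# Componentwise averaging of (sensitivity, specificity) pairs from the
# three hypothetical studies in the text.
studies = [(0.10, 0.90), (0.80, 0.80), (0.90, 0.10)]

avg_sens = sum(sens for sens, _ in studies) / len(studies)
avg_spec = sum(spec for _, spec in studies) / len(studies)

# The average pair (0.60, 0.60) is not close to any of the reported pairs:
# naive pooling ignores the threshold-driven trade-off between the rates.
nearest_gap = min(abs(sens - avg_sens) + abs(spec - avg_spec)
                  for sens, spec in studies)
```

The averaged pair suggests an operating point the test was never observed to achieve, which is exactly the behaviour the regression approaches below are designed to avoid.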
As in other areas of meta-analysis, fixed-effects regression models have been proposed to
account for between-study variability in diagnostic test evaluations [4, 5]. The use of regression
models provides a flexible and powerful framework for meta-analysis. However, the number
of covariates that can be accommodated in such models is limited. In addition, these fixed-effects
regression models may not provide realistic accounts of the uncertainty associated with
covariate estimates.
In this paper, we expand on earlier work [5] and present a hierarchical model formulation
of the problem of combining information across studies reporting estimates of test sensitivity
and specificity. The structure of the model is similar to that of models proposed for the meta-analysis
of treatment studies [11–13]. The observed variation is partitioned into within- and
between-study components. Each component consists of a systematic part and a random part,
with the former attributed to covariates and the latter to unexplained variation. The hierarchical
model makes it possible to pool information across studies and derive smoothed estimates of
covariate effects, components of variance and individual study quantities. In addition, simple
extensions of the hierarchical structure can incorporate patient-level information within each
study, when such information is available.
We present our approach using data from a recently published meta-analysis comparing the
diagnostic performance of three imaging modalities for the detection of lymph node metastasis
in women with cervical cancer [14]. In Section 2 we survey fixed effects approaches
to the problem. The hierarchical regression model is presented in Section 3. We take a fully
Bayesian approach to model fitting and checking and use Markov chain Monte Carlo estimation
techniques. Technical issues are discussed in Section 3 and the analysis of the example
and the conclusions we draw are presented in Section 4. The final section summarizes our
methodological conclusions.
The simplest setting for the methods discussed in this paper involves meta-analyses in which
each of m studies contributes a vector z_i of study-level covariates (i = 1, …, m) and a 2 × 2
Copyright © 2001 John Wiley & Sons, Ltd. Statist. Med. 2001; 20:2865–2884
HIERARCHICAL SUMMARY ROC CURVE ESTIMATION 2867
table of summary data, showing the agreement between the binary test result and the definitive
disease information (or reference standard). As noted in the introduction, data of this type are
reported in the vast majority of studies evaluating the performance of diagnostic modalities.
We will use the following notation:

                    Test
                    0 = no     1 = yes
Truth    no         y_i00      y_i01       n_i0
         yes        y_i10      y_i11       n_i1
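In this notation, a study's empirical operating point follows directly from the table counts. A minimal sketch, using hypothetical counts:

```python
# One study's 2x2 table in the notation above: rows are truth (no/yes),
# columns are test result (0 = no, 1 = yes). Counts are hypothetical.
y = [[81, 9],    # truth = no:  y_i00, y_i01 (row total n_i0)
     [6, 54]]    # truth = yes: y_i10, y_i11 (row total n_i1)

n_i0, n_i1 = sum(y[0]), sum(y[1])
sensitivity = y[1][1] / n_i1   # TP rate = y_i11 / n_i1
specificity = y[0][0] / n_i0   # TN rate = y_i00 / n_i0
fp_rate = y[0][1] / n_i0       # FP rate = 1 - specificity
```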
2868 C. M. RUTTER AND C. A. GATSONIS
bias in parameter estimates [15] and summaries of the curve. Further exploration is needed
to determine the effect of ignoring error in S_j on both point estimation and coverage rates of
estimated confidence intervals.
An alternative approach to constructing an SROC curve was proposed by Kardaun and
Kardaun [3], who assumed that (logit(TP_i), logit(FP_i)) follows a bivariate normal distribution
and postulated a linear relationship between the two components of the bivariate mean. Profile
likelihood is used to derive estimates of the slope and intercept in this model, which includes
variability in both rates.
degrees of freedom, there are practical limitations on the number of covariates that can be
included in these models.
The binomial regression model (2) has a formal similarity to the logistic two-parameter
item characteristic curve (ICC) model, with studies corresponding to subjects (raters) and
patients to items. ICC models have long been studied in educational testing [19]. However,
unlike the usual educational testing setting, where multiple subjects respond to each item, in
the meta-analysis setting patients are nested within studies.
3.1. Model
Hierarchical regression analysis extends the binomial regression model to more fully account
for both within- and between-study variability in TP and FP rates. The model allows the
inclusion of patient- and study-level covariates, if such information is available, and has the
following structure.
3.1.1. Level I (Within-study variation). The numbers of positive tests from the ith study,
y_i01 and y_i11, are assumed to be independent and to follow binomial distributions, with the
probability of a positive test given by

logit(π_ij) = (θ_i + α_i X_ij) exp(−β X_ij)    (3)

where X_ij denotes the true disease status for cases in the ijth cell. Under this hierarchical
SROC model (HSROC), both positivity criteria (θ_i) and accuracy parameters (α_i) are allowed
to vary across studies. Because estimation of the scale parameter (β) requires information
from more than one study, β is assumed to be constant across studies. If we were to allow
β to vary across studies, then within-study parameters would be identifiable only through
their prior distributions. The assumption of a constant β can be relaxed somewhat, as we
demonstrate in the example.
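Model (3) can be sketched numerically as follows. The parameter values are illustrative, not estimates from the paper, and the sketch assumes the disease coding X_ij = 1/2 for diseased and −1/2 for not-diseased cases that is used later in this section:

```python
import math

def positive_test_prob(theta_i, alpha_i, beta, diseased):
    # Model (3): logit(pi_ij) = (theta_i + alpha_i * X_ij) * exp(-beta * X_ij),
    # with X_ij = 1/2 for diseased and -1/2 for not-diseased cases.
    x = 0.5 if diseased else -0.5
    logit_pi = (theta_i + alpha_i * x) * math.exp(-beta * x)
    return 1.0 / (1.0 + math.exp(-logit_pi))

# Illustrative study-level parameters (not estimates from the paper):
tp_rate = positive_test_prob(theta_i=0.0, alpha_i=2.0, beta=0.0, diseased=True)
fp_rate = positive_test_prob(theta_i=0.0, alpha_i=2.0, beta=0.0, diseased=False)
# A positive accuracy parameter makes a positive test more likely in the
# diseased group, so tp_rate > fp_rate; theta_i shifts both rates together.
```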
3.1.2. Level II (Between-study variation). The study-level parameters in (3), θ_i and α_i, are
assumed to be Normally distributed, with means determined by a linear function of study-level
covariates. In the case of a single covariate Z affecting both the cutpoint and accuracy
parameters, the model can be written as

θ_i | P, γ, Z_i, σ²_θ ∼ N(P + γZ_i, σ²_θ)

conditionally independent of

α_i | Q, λ, Z_i, σ²_α ∼ N(Q + λZ_i, σ²_α)

The coefficients γ and λ model systematic differences in positivity criteria and accuracy across
studies, due to the covariate Z. However, more general formulations of the model can be
considered in which more than one covariate is included and different covariates are used for
the 'cutpoint' and 'accuracy' regression equations.
The assumed conditional independence of θ_i and α_i reflects assumptions implicit in ROC
analysis. In the context of ROC analysis, positivity threshold and accuracy are independent
test characteristics that together induce correlation between a test's sensitivity and specificity.
Copyright ? 2001 John Wiley & Sons, Ltd. Statist. Med. 2001; 20:2865–2884
2870 C. M. RUTTER AND C. A. GATSONIS
3.1.3. Level III. The specification of the hierarchical model is completed by the choice of
prior distributions for the remaining unknown parameters. In particular, we chose

P ∼ Uniform[p₁, p₂];  Q ∼ Uniform[q₁, q₂];  β ∼ Uniform[b₁, b₂]

σ²_θ ∼ Γ⁻¹(a_θ, b_θ);  σ²_α ∼ Γ⁻¹(a_α, b_α)

The parameters P, Q, β, γ, λ, σ²_θ and σ²_α are assumed to be mutually independent. The
hyperparameters of the uniform and inverse gamma priors are assumed to be fixed and are
chosen to reflect plausible ranges. The choice of prior ranges is discussed in the following
section, and is demonstrated in the data example.
Summary ROC (SROC) curves can be derived using the expected values of Q + λZ and β.
If true disease state is coded 1/2 for diseased cases and −1/2 for not-diseased cases, then
for a given value of the covariate Z_i the model-based true positive rate can be expressed as

TP(FP) = logit⁻¹((logit(FP) e^(−E(β)/2) + E(Q + λZ_i)) e^(−E(β)/2))

The SROC curve is drawn by plotting (FP, TP(FP)) for FP ∈ [0, 1]. Extrapolation beyond the
available data can be discouraged by plotting the curve only over the observed range of FP.
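This curve is straightforward to trace. The sketch below uses illustrative values for the role of E(Q + λZ_i) (here `accuracy`) and E(β) (here `beta`); they are not estimates from the paper:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def sroc_tp(fp, accuracy, beta):
    # TP(FP) = logit^-1((logit(FP) * exp(-beta/2) + accuracy) * exp(-beta/2)),
    # where `accuracy` stands for E(Q + lambda * Z_i) and `beta` for E(beta).
    return inv_logit((logit(fp) * math.exp(-beta / 2.0) + accuracy)
                     * math.exp(-beta / 2.0))

# Plot only over an observed FP range to discourage extrapolation:
observed_fp = [0.05, 0.10, 0.15, 0.20, 0.25]
curve = [(fp, sroc_tp(fp, accuracy=2.0, beta=0.0)) for fp in observed_fp]
```

With `accuracy = 0` and `beta = 0` the curve collapses to the chance line TP = FP, which is a convenient sanity check on the formula.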
3.2.1. Choice of priors. Prior ranges for P, Q and β should be chosen to reflect subject-matter
knowledge about the diagnostic modalities under review. In general, the interval [−10, 10]
covers all plausible values of P. Similarly, the interval [−5, 5] covers all plausible values of
β. Because we expect positive test results, indicating disease, to be more common among
patients with disease, the interval [−2, 20] covers all plausible values of Q.
Selection of the inverse gamma (Γ⁻¹) priors for the between-study variance parameters, σ²_θ
and σ²_α, is more difficult. The sampler is potentially sensitive to the prior distribution used
to model variance parameters, which can affect the width of posterior interval estimates. The
inverse gamma prior for variance parameters should be selected to reflect the expected range
of study-specific accuracy and cutpoint parameters. The goal in choosing an appropriate prior
for variance parameters is to select a relatively diffuse distribution that does not assign too
much probability to very large (unrealistic) values. Fortunately, choice of the Γ⁻¹ priors can be
guided by the realistic ranges of cutpoint and accuracy parameters, noted above. The observed
variability of naive cutpoint and accuracy parameter estimates can also guide selection of Γ⁻¹
priors.
3.2.2. Parameter estimation. The goal of estimation is description of the posterior distribution
of model parameters and of summary statistics that are functions of model parameters. Posterior
95 per cent credible intervals were estimated by the empirical 2.5 per cent and 97.5 per cent
posterior percentiles of simulated draws. The mode of symmetric posterior distributions was
approximated by the mean value across simulated draws (that is, for P, Q and β). The mode
of asymmetric posterior distributions was approximated by the median value across simulated
draws (that is, for σ²_θ and σ²_α).
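These summaries are simple functions of the simulated draws. The sketch below mirrors that convention on stand-in draws, not output from the sampler:

```python
import random
import statistics

random.seed(7)
# Stand-in posterior draws; in practice these come from the MCMC sampler.
acc_draws = [random.gauss(2.0, 0.5) for _ in range(10000)]       # symmetric
var_draws = [random.gauss(0.0, 1.0) ** 2 for _ in range(10000)]  # right-skewed

def credible_interval(draws, level=0.95):
    # Empirical (1 - level)/2 and (1 + level)/2 percentiles of the draws.
    s = sorted(draws)
    lo = s[int(len(s) * (1.0 - level) / 2.0)]
    hi = s[int(len(s) * (1.0 + level) / 2.0) - 1]
    return lo, hi

acc_mode = statistics.mean(acc_draws)    # symmetric posterior: use the mean
var_mode = statistics.median(var_draws)  # asymmetric posterior: use the median
acc_lo, acc_hi = credible_interval(acc_draws)
```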
3.2.3. Assessing convergence. We based estimation on draws from several chains started at
extreme points in the parameter space. The CODA program [24] was used to evaluate convergence
to the target distribution. We used Geweke's diagnostic to evaluate convergence of
individual chains for symmetrically distributed variables [25]. The Geweke statistic is based
on comparison of the mean of draws early in the sequence of iterates versus the mean of
later draws. If the chain is stationary, then these means should be similar. We also used
Heidelberger and Welch's method for evaluating stationarity, based on the Cramér-von Mises
statistic [26]. When there was no evidence against convergence, we next used estimates of
scale reduction proposed by Gelman and Rubin to examine whether the multiple chains
converged to the same distribution [27]. The scale reduction statistic is essentially the ratio of
the between-chain variance to the within-chain variance. As a final check, we compared
parameter estimates from each individual chain to parameter estimates based on pooling across
chains.
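A simplified version of the scale reduction comparison can be sketched as follows; this illustrates the between-chain/within-chain variance ratio, not the exact estimator of reference [27]:

```python
import random
import statistics

random.seed(11)
# Four stand-in chains drawn from the same target distribution, i.e. chains
# that have converged; real chains would come from the Gibbs sampler.
chains = [[random.gauss(0.0, 1.0) for _ in range(2000)] for _ in range(4)]

def scale_reduction(chains):
    # Compare between-chain and within-chain variance; values near 1 are
    # consistent with all chains sampling the same distribution.
    n = len(chains[0])
    w = statistics.mean(statistics.variance(c) for c in chains)       # within
    b = n * statistics.variance(statistics.mean(c) for c in chains)   # between
    var_hat = (n - 1) / n * w + b / n   # pooled estimate of posterior variance
    return (var_hat / w) ** 0.5

r_hat = scale_reduction(chains)   # expected to be close to 1 here
```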
3.2.4. Diagnostics. Diagnostic statistics were used to evaluate possible model misspecification
and overall goodness-of-fit, and to identify outlying and possibly influential data points. Our
approach roughly follows the suggestions of Weiss [28]. Because the number of true positive
and false positive results within studies can safely be assumed to follow a binomial
distribution, checks for model misspecification were restricted to evaluation of the prior
distributions for the study-specific parameters θ_i and α_i. Recall that we assume that both statistics
are normally distributed. Under the exchangeable model (that is, γ = λ = 0), the sums of squares
S_θ = Σ_i (θ_i − P)²/σ²_θ and S_α = Σ_i (α_i − Q)²/σ²_α should follow a χ² distribution with m degrees of
freedom, where m is the number of studies in the meta-analysis. Under the non-exchangeable
model the sums of squares are of the form S_α = Σ_i (α_i − Q − λZ_i)²/σ²_α. Tail probabilities that are
too large (or too small) suggest misspecification of the prior distributions.
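The exchangeable-model check can be sketched as below. The study-specific parameters and hyperparameters are stand-ins for posterior quantities, and the chi-square percentiles quoted in the comment are those of a χ² distribution with 19 degrees of freedom:

```python
import math
import random

random.seed(3)
# Stand-in study-specific accuracy parameters generated from the assumed
# Normal prior, so the diagnostic should look well calibrated here.
m = 19                       # number of studies (e.g. the CT analyses)
Q, sigma2_alpha = 2.0, 0.5   # stand-in hyperparameter values
alpha = [random.gauss(Q, math.sqrt(sigma2_alpha)) for _ in range(m)]

# Exchangeable-model sum of squares; should behave like chi-square(m):
s_alpha = sum((a - Q) ** 2 for a in alpha) / sigma2_alpha

# Compare against chi-square(19) percentiles (5th = 10.12, 95th = 30.14);
# values far outside this range would suggest prior misspecification.
in_range = 10.12 <= s_alpha <= 30.14
```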
We assessed the assumption of conditional independence between θ_i and α_i by estimating
these parameters directly from the data, then examining scatter plots and correlations. These
where y_ij1 is the number of subjects testing positive in the ith study among the not-diseased (j = 0) and diseased (j = 1)
groups. D_count is compared to a χ²_2m distribution. The second global goodness-of-fit statistic is
based on continuity-corrected log-odds ratios,
where log(OR_cc)_i is the observed continuity-corrected log-odds ratio for the ith study. Outliers
and potentially influential points can be identified using plots of sensitivity versus specificity
and by examining chi-squared residuals.
These overall TP and FP rates correspond to the test's expected operating characteristics,
given the observed data.
Likelihood ratio statistics describe the post-test change in the odds of disease. The likelihood
ratio positive (LR+) estimates the change in the odds of disease following a positive test, that
is, P(D+|T+)/P(D−|T+) = LR+ × (pre-test odds). Similarly, the likelihood ratio negative
(LR−) estimates the change in the odds of disease following a negative test. These statistics
are estimated by LR+ = TP/FP and LR− = (1 − TP)/(1 − FP), where TP and FP are given
by (5).
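As a concrete illustration, with made-up rates rather than the paper's estimates, the likelihood ratios convert a pre-test probability to a post-test probability via the odds scale:

```python
# Likelihood ratios from an expected operating point (illustrative values):
tp_rate, fp_rate = 0.68, 0.16

lr_pos = tp_rate / fp_rate                  # LR+ = TP / FP
lr_neg = (1.0 - tp_rate) / (1.0 - fp_rate)  # LR- = (1 - TP) / (1 - FP)

# Post-test odds of disease after a positive test:
pretest_prob = 0.25
pretest_odds = pretest_prob / (1.0 - pretest_prob)
posttest_odds = lr_pos * pretest_odds       # P(D+|T+) / P(D-|T+)
posttest_prob = posttest_odds / (1.0 + posttest_odds)
```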
Lymph node metastasis affects both the prognosis and treatment of invasive cervical cancer
[30, 31]. When disease is limited to the cervix and tumours are relatively small (<4 cm),
surgical treatment is preferred [32]. When the tumour is larger, has spread to nearby organs,
or involves lymph nodes, then radiation therapy (alone or combined with chemotherapy)
is preferred [33]. Nodal metastasis is particularly difficult to detect. When nodal metastasis
is discovered during surgery, additional radiation therapy is recommended. However, surgery
results in greater morbidity for these women relative to women who undergo radiation without
prior surgery.
Good preoperative staging, especially identification of lymph node metastasis, can
improve outcomes by improving treatment plans. Three types of diagnostic images have
been widely used to identify lymphadenopathy: lymphangiography (LAG), computed
tomography (CT), and magnetic resonance imaging (MR). LAG provides information
about the drainage patterns from lymph nodes. CT and MR provide information about the
size of lymph nodes. Unfortunately, these imaging techniques are not believed to be sensitive
enough to adequately determine appropriate treatment, prompting some to advocate the routine
use of surgical staging [34, 35]. Health care providers and policy makers who decide whether
to include these imaging tests as part of preoperative staging need the best possible
estimates of the diagnostic information these tests provide to inform their
decisions.
Scheidler and colleagues combined information from several studies to estimate and compare
the ability of LAG, CT and MR to accurately detect lymph node metastasis. They derived
fixed effect SROC curves for each modality. Tests were compared at the point on the SROC curve
where sensitivity equals specificity, using true positive rates (that is, the Q∗ statistic) and
LR statistics. Scheidler et al. examined overall detection of nodal metastasis, and in sub-analyses
examined detection of pelvic nodes and para-aortic nodes. Because both para-aortic
and pelvic node involvement affect treatment and prognosis, we focus on accuracy of overall
detection of lymph node metastasis. In the following sections, we describe the data used in
the original meta-analysis, the application of the HSROC model to these data, results from
the HSROC model, and conclusions drawn from the HSROC model and how these relate to
the original findings based on the SROC approach.
4.1. Data
Scheidler et al. combined data from 36 studies, of which 17 examined LAG, 19 examined
CT and 10 examined MR. Observed true positive and false positive rates for this data set are
shown in Figure 1.
Nine of the 36 studies examined more than one test. In particular, two studies examined
CT and LAG, four studies examined CT and MR, and two studies examined CT twice. The
two studies that examined CT twice reported data separately for para-aortic and pelvic nodes.
These findings were based on the same group of women, but the published information did not
allow combination of the women's para-aortic and pelvic node findings into an overall 2 × 2 table.
Thus, both tables were included in the analyses.
Figure 1. Detection of lymph node metastasis, using lymphangiography (LAG), computed tomography
(CT) or magnetic resonance (MR) imaging: observed true positive (TP) and false positive (FP) rates
from data reported across 37 studies that were originally meta-analysed by Scheidler and colleagues [14].
Table I. Model diagnostics: estimated sums of squares associated with study-specific parameters. Q_0.05 is
the 5th percentile, Q_0.95 is the 95th percentile; χ²_0.05 represents the 5th percentile of the chi-square with the
appropriate degrees of freedom (LAG: 17, CT: 19, MR: 10). Similarly, χ²_0.95 represents the 95th percentile of
the appropriate chi-square distribution.

Q_0.05    P(S ≤ χ²_0.05)    Q_0.95    P(S ≥ χ²_0.95)
based on multiple sequences with overdispersed starting points. Because of high between-draw
correlation, every 50th iteration was saved from each sequence of 50 000 simulated
draws. We allowed 2500 iterations for burn-in. Eight different chains were run, with starting
points based on β, P and Q: β(0) ∈ {−2.5, 1.5}, P(0) ∈ {−5, 5} and Q(0) ∈ {−1, 10}. Starting
values for covariate parameters were set to zero. Starting values for study-specific cutpoint
and accuracy parameters were calculated from continuity-corrected count data using formulae
(4). Starting values for the prior variability of study-specific cutpoints (σ²_θ) and accuracies
(σ²_α) were set larger than the observed variance of the starting values (all were less than 2).
We set σ²_θ(0) = σ²_α(0) = 5 for all three tests and chose Γ⁻¹(2.1, 2) priors for the variance
parameters. The Γ⁻¹(2.1, 2) distribution has mean 1.82 and standard deviation 5.75, with percentiles
P_0.05 = 0.41, P_0.25 = 0.71, P_0.50 = 1.12, P_0.75 = 1.93 and P_0.95 = 5.05. Neither Geweke statistics
nor results from the Heidelberger and Welch method indicated any failure to converge, and
scale reduction statistics indicated that the independent chains had converged to the same
distribution. Results present estimated posterior modes with 95 per cent credible intervals.
Probabilities corresponding to between-test comparisons were based on estimated posterior
probabilities.
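The quoted moments and percentiles of the Γ⁻¹(2.1, 2) prior can be checked by Monte Carlo, using the fact that the reciprocal of a Gamma(shape 2.1, scale 1/2) variate has the Γ⁻¹(2.1, 2) distribution. This is a verification sketch, not part of the original analysis:

```python
import random

random.seed(42)
# If G ~ Gamma(shape = 2.1, scale = 1/2), then 1/G ~ InverseGamma(2.1, 2).
draws = sorted(1.0 / random.gammavariate(2.1, 0.5) for _ in range(200000))

mc_mean = sum(draws) / len(draws)        # analytic mean: 2 / (2.1 - 1) = 1.82
mc_median = draws[len(draws) // 2]       # reported P_0.50 = 1.12
mc_p95 = draws[int(0.95 * len(draws))]   # reported P_0.95 = 5.05
```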
Table II. Hierarchical ROC parameter estimates: estimated posterior modes with 95 per cent credible
intervals in parentheses.

            LAG                          CT                          MR
            Z1 = −0.413  Z2 = −0.217     Z1 = 0.587  Z2 = −0.217     Z1 = −0.413  Z2 = 0.783
Figure 3. Estimated summary receiver operating characteristic curves and expected operating points for
lymphangiography (LAG), computed tomography (CT) and magnetic resonance (MR) imaging, based
on hierarchical regression modelling. HSROC curves are plotted over the range of observed FP rates.
criteria were not demonstrated by the earlier fixed effect SROC analysis. In the context of
ROC analysis, cutpoints are often viewed as nuisance parameters. However, because cutpoints
directly affect the expected operating point (equation (5)), they also affect other estimates of
overall test performance.
As shown in Table III, we found that LAG was more sensitive and less specific than
both CT and MR. Point estimates of TP and FP rates for each modality are shown in
Figure 3. Because estimated TP and FP rates are based on highly non-linear functions of
HSROC parameters, they differ slightly from estimated operating points derived by substituting
estimates of P̂_test, Q̂_test and β̂_test into equation (5), and lie near, but not on, the SROC curves.
Observed differences in sensitivity and specificity are consistent with our finding that LAG
had a higher overall positivity criterion (P) than both CT and MR. Figure 3 demonstrates
that differences in sensitivity and specificity are also affected by between-test differences in
Table III. Overall rates and likelihood ratios: posterior modes with 95 per cent credible intervals in parentheses,
and probabilities for between-modality comparisons.

Test type       False positive          True positive           Likelihood ratio        Likelihood ratio
                rate (FP)               rate (TP)               negative (LR−)          positive (LR+)
LAG             0.164 (0.089, 0.259)    0.683 (0.585, 0.774)    0.380 (0.270, 0.501)    3.89 (2.63, 7.55)
CT              0.069 (0.041, 0.106)    0.483 (0.310, 0.655)    0.562 (0.374, 0.733)    6.92 (4.44, 11.4)
MR              0.047 (0.019, 0.093)    0.541 (0.286, 0.771)    0.483 (0.246, 0.743)    9.99 (5.70, 25.2)
Prob(LAG>CT)    0.99                    0.98                    0.05                    0.08
Prob(LAG>MR)    0.99                    0.86                    0.23                    0.02
Prob(MR>CT)     0.17                    0.66                    0.31                    0.88
accuracy (Q) and scale (β) parameters. We also found between-test differences in LR statistics.
LAG had a lower (better) LR− than both CT and MR, but also had a lower (worse) LR+
than both CT and MR. We did not find strong evidence for differences in the accuracy of
CT and MR. Though not statistically significant, MR tended to have better accuracy than CT,
with higher sensitivity and specificity than CT, a lower LR− than CT and a higher LR+ than CT.
4.7. Discussion of findings about radiologic evaluation for lymph node metastasis
We found no detectable differences in the parameters that determine the SROC curves (Q, β)
for LAG, CT and MR. However, there was evidence of differences in the positivity criteria
(P) and the performance of the three modalities. Lymphangiography (LAG) was the most
sensitive and least specific test. There was weak evidence that MR might be more sensitive
and more specific than CT. Based on likelihood ratio statistics, LAG was the best modality
for ruling out disease and MR was the best modality for ruling in disease.
SROC analysis allows between-modality comparisons that remove threshold effects and
instead capture the overall trade-off between sensitivity and specificity over a specific range
of thresholds. In this spirit, Scheidler and colleagues used the Q∗ statistic to remove the effect
of threshold on between-study comparisons. Their analysis did not find differences in either
sensitivity or LR statistics across tests, which they calculated at Q∗. We believe this is not
the best approach for these data because it ignores differences in the distribution of true and
false positive rates for LAG, CT and MR. Estimates of Q∗ based on the original analyses were
0.74 for LAG, 0.80 for CT and 0.85 for MR. These are far different from our estimates of the
expected sensitivity and specificity of these tests. For MR in particular, Q∗ lies at the edge of
the data range. Because Scheidler and colleagues calculated LR statistics at Q∗, their estimates
do not reflect the distribution of studies across the estimated summary curve. Estimates of
LR+ based on the original analyses were 2.85 for LAG, 4.00 for CT and 5.67 for MR. Estimates
of LR− based on the original analyses were 0.35 for LAG, 0.25 for CT and 0.18 for MR.
Using the expected sensitivity and specificity, we found that each modality had higher LR
statistics than were found using the fixed effect analysis. That is, the modalities were better at
ruling in disease, but worse at ruling out disease, than the original analyses suggested. Thus,
removing the threshold effect can remove important between-study differences in expected test
performance.
Summary measures of test accuracy and clinical utility can have important implications for
the selection of preoperative diagnostic tests. A recent study comparing diagnostic assessments
of women with cervical cancer to assessments in 1984 and 1990 found that the use of LAG
declined (from 6 per cent to 2.3 per cent) while the use of both CT and MR increased (3.82
per cent to 55.1 per cent for CT and 0.9 per cent to 5.6 per cent for MR) [36]. However,
the HSROC analyses suggest that LAG may remain an important modality for ruling out
lymphadenopathy, that is, identifying women who may be eligible for surgical treatment alone.
Our results also show that MR may be a better diagnostic tool than either LAG or CT for
ruling in disease, that is, identifying women who should have further cytological evaluation
of lymph nodes or who might go to non-surgical treatment without delay. Our study provides
improved estimates of the diagnostic accuracy of these three modalities. Combining this
information about test accuracy with information about costs and treatment effectiveness using
decision analysis would provide further insight into the impact of these different modalities
on health outcomes.
5. DISCUSSION
The hierarchical summary ROC (HSROC) model for combining estimated pairs of sensitivity
and specificity from multiple studies extends the fixed-effects summary ROC (SROC)
model, more appropriately incorporating both within- and between-study variability and allowing
flexibility in the estimation of summary statistics. The HSROC model describes within-study
variability using a binomial distribution for the number of positive tests in diseased and not-diseased
patients. An underlying ROC model that allows variability in both the positivity criteria
and accuracy across studies determines the binomial probabilities. Variation in positivity
criteria and accuracy is modelled using Normal distributions, with a linear regression in the
mean that allows dependence on study-level covariates. More heavy-tailed distributions (such
as the t or Cauchy) can also be used instead of the Normal in the second level of the hierarchical
model. The model could also be extended to allow dependence between positivity criteria and
accuracy. We did not pursue this extension because the independence assumption corresponds
to our underlying ROC assumption that positivity criteria and accuracy are independent test
characteristics. Furthermore, the data we analysed did not demonstrate correlation between
naively estimated cutpoint and accuracy parameters. Even with this possible limitation, our
HSROC model allows a more complete accounting of between-study variability than is possible
with fixed-effects formulations. In addition, the HSROC model provides a more realistic
accounting of within-study variability than the fixed-effects SROC model (4), which uses a
Normal error distribution and does not account for the measurement error in the primary
covariate.
The HSROC approach provides a flexible modelling framework that can be extended when
more information is available. For example, when studies report results from more than one
modality, the hierarchical model can be appropriately extended to incorporate within-study
correlation. This extension requires information about the joint distribution of test results, either
from multiple similar pairs across several studies, from cross-tabulation of test results within
studies, or from patient-level data within studies. When patient-level information is available,
the within-study (level I) model can be extended to incorporate patient-level covariates. This
extended model can also be applied to data from a single study when results are clustered
within participating institutions and/or readers (see reference [37] for a hierarchical analysis
of ROC data).
The HSROC model assumes that disease status is assessed without error. In single-study
settings, errors in the reference standard used to determine disease state can be addressed
using latent variables (for example, references [38, 39]). A latent variable approach has also
been used in a recent extension of the SROC model [40]. Unfortunately, this extended SROC
method introduces other problems. The two-step estimation process first adjusts rates, then
estimates the SROC using these adjusted rates. The method for adjusting rates assumes that
both test characteristics (sensitivity, specificity) and rates of error in the reference standard are
constant across studies. These assumptions are difficult to justify, and estimated SROC standard
errors do not reflect the additional variability from this adjustment. On a more basic level, because
meta-analysis intends to combine similar information across studies, a reasonable reference
standard must be available before studies can be combined. Evaluation of a flawed reference
standard, and evaluation of a test compared to a flawed standard, requires in-depth study that
includes additional information, such as additional test results and clinical outcomes.
The fully Bayesian approach to model fitting, although computationally intensive, yields simulated values from the posterior distribution of the parameters, from which the analyst can easily calculate summaries of the posterior distribution of a broad range of functions of the parameters. For example, in our reanalysis of the Scheidler data we derived estimates of functionals of the posterior distribution of likelihood ratio statistics, and estimated the posterior probabilities of differences between modalities. The fixed-effects SROC model requires selection of a single operating point for calculation of likelihood ratio statistics. In contrast, likelihood ratio statistics estimated using the HSROC model are calculated at the average operating point on the summary ROC curve, so that these estimates incorporate the distribution of data from individual studies across the ROC curve, enabling more accurate estimation of clinical utility.
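To make the preceding point concrete, the sketch below (function names are ours, not from the paper) computes sensitivity, specificity, and likelihood ratios at the mean operating point, assuming the parameterization logit(TPR) = (Θ + Λ/2)e^{−β/2} and logit(FPR) = (Θ − Λ/2)e^{β/2} used in the BUGS program given later in the paper:

```python
import math

def hsroc_operating_point(THETA, LAMBDA, beta):
    """Sensitivity and specificity at the mean operating point of the
    HSROC curve, using the model's logit links for the true- and
    false-positive rates."""
    inv_logit = lambda x: 1.0 / (1.0 + math.exp(-x))
    tpr = inv_logit((THETA + 0.5 * LAMBDA) * math.exp(-beta / 2))
    fpr = inv_logit((THETA - 0.5 * LAMBDA) * math.exp(beta / 2))
    return tpr, 1.0 - fpr  # (sensitivity, specificity)

def likelihood_ratios(sens, spec):
    """Positive and negative likelihood ratios: LR+ = Se/(1 - Sp),
    LR- = (1 - Se)/Sp."""
    return sens / (1.0 - spec), (1.0 - sens) / spec
```

For example, with Θ = 0 and β = 0 the positive likelihood ratio simplifies algebraically to exp(Λ/2).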
The Bayesian model also allows description of sources of variability. The differences we found in the variability of positivity criteria were consistent with the technological development
Copyright © 2001 John Wiley & Sons, Ltd. Statist. Med. 2001; 20:2865–2884
HIERARCHICAL SUMMARY ROC CURVE ESTIMATION 2881
of these three tests. At one extreme, LAG was the standard diagnostic test during the meta-
analysed time period. At the other extreme, MR was a new diagnostic approach during this
time. The estimated variability of cutpoint parameters was low for LAG. The variability of CT
cutpoints was about one and a half times the variability of LAG cutpoints, and the variability
of MR cutpoints was more than twice the variability of LAG cutpoints. This suggests that
MR accuracy could be improved over time.
The advantages of the HSROC model come at a price: estimation requires Markov chain Monte Carlo (MCMC) simulation. MCMC estimation requires programming, simulation, evaluation of convergence and model adequacy, and synthesis of simulation results. The proposed model can be fit within publicly available software [21]. Implementation of MCMC simulation will still entail non-trivial analysis tasks, including evaluation of convergence and of the adequacy of prior distributions, which require some statistical expertise. However, the increased complexity of the proposed analysis must be weighed against the advantages of the approach, including more realistic assumptions, more precise description of the impact of covariates, and greater flexibility in the choice of descriptive statistics.
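As one concrete instance of the convergence evaluation discussed above, the Gelman–Rubin potential scale reduction factor [27] for a scalar parameter can be computed in a few lines. This is an illustrative sketch, not the CODA implementation [24]:

```python
import math

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length chains
    of scalar draws; values near 1 suggest convergence."""
    m = len(chains)      # number of chains
    n = len(chains[0])   # draws per chain
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    # Between-chain variance B and average within-chain variance W.
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m
    var_hat = (n - 1) / n * W + B / n  # pooled variance estimate
    return math.sqrt(var_hat / W)
```

Chains that have mixed give R-hat near 1; chains stuck in different regions give values well above 1.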
APPENDIX

The conditional distributions of the level II parameters ($\Theta$, $\Lambda$, $\sigma_\theta^2$, $\sigma_\alpha^2$, $\gamma$, and $\lambda$) are standard distributions or truncated versions of standard distributions. For example, the conditional distribution of $\Lambda$ given $\alpha$, $Z$, $\sigma_\alpha^2$, and $\lambda$ is proportional to a Normal distribution with mean $\sum_{i=1}^{m}(\alpha_i - \lambda Z_i)/m$ and variance $\sigma_\alpha^2/m$ over the range $[\Lambda_1, \Lambda_2]$. The variance parameters $\sigma_\theta^2$ and $\sigma_\alpha^2$ have conjugate priors, so that

$$(\sigma_\alpha^2 \mid \alpha, Z, \Lambda, \lambda) \sim \Gamma^{-1}\!\left(a_1 + \frac{m}{2},\ \frac{1}{2}\sum_{i=1}^{m}(\alpha_i - \Lambda - \lambda Z_i)^2 + \frac{1}{b_1}\right)$$

where $\Gamma^{-1}$ denotes the inverse-gamma distribution and $(a_1, b_1)$ are the shape and scale parameters of the Gamma prior on the precision $1/\sigma_\alpha^2$. The conditional distributions of $\Theta$ and $\sigma_\theta^2$ are analogous. Finally, the conditional distribution $(\lambda \mid \Lambda, \sigma_\alpha^2, \alpha, Z)$ is proportional to a Normal distribution with mean

$$\frac{\sum_{i=1}^{m} Z_i(\alpha_i - \Lambda)}{\sum_{i=1}^{m} Z_i^2}$$

and variance $\sigma_\alpha^2 / \sum_i Z_i^2$ over the range $[\lambda_1, \lambda_2]$.
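Assuming the truncated-Normal conditional for $\Lambda$ just described, a single Gibbs draw can be sketched as follows (the function name, and the use of simple rejection sampling against the truncation bounds, are our illustrative choices):

```python
import random

def draw_LAMBDA(alpha, Z, lam, sigma_alpha_sq, lo=-2.0, hi=20.0):
    """Gibbs draw for LAMBDA: Normal with mean sum(alpha_i - lam*Z_i)/m
    and variance sigma_alpha^2/m, truncated to [lo, hi] (defaults match
    the dunif(-2,20) prior in the BUGS program)."""
    m = len(alpha)
    mean = sum(a - lam * z for a, z in zip(alpha, Z)) / m
    sd = (sigma_alpha_sq / m) ** 0.5
    while True:  # rejection sampling: redraw until inside the bounds
        draw = random.gauss(mean, sd)
        if lo <= draw <= hi:
            return draw
```

Rejection sampling is adequate here because the truncation region typically carries nearly all of the Normal mass; an inverse-CDF draw would avoid the loop entirely.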
The conditional distributions of the level I parameters ($\theta_i$, $\alpha_i$, and $\beta$) are not standard. The conditional distribution of the study-specific accuracies, $(\alpha_i \mid y_i, n_i, Z_i, \Lambda, \sigma_\alpha^2, \theta_i, \beta)$, is a binomial-Normal product

$$\exp\!\left\{-\frac{(\alpha_i - (\Lambda + \lambda Z_i))^2}{2\sigma_\alpha^2}\right\} \prod_{j=1}^{2} \binom{n_{ij}}{y_{ij}} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{(n_{ij} - y_{ij})}$$

The conditional distribution of $\theta_i$ has a similar form. The conditional distribution of the scale parameter, $(\beta \mid y, n, Z, \theta, \alpha)$, is proportional to the product of $2m$ binomials

$$\prod_{i=1}^{m} \prod_{j=1}^{2} \binom{n_{ij}}{y_{ij}} \pi_{ij}^{y_{ij}} (1 - \pi_{ij})^{(n_{ij} - y_{ij})}$$
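Because these level I conditionals are non-standard, sampling within BUGS relies on methods such as derivative-free adaptive rejection [22] or the griddy Gibbs sampler [23]. Purely as an illustrative alternative (the function names and the random-walk proposal are ours, not from the paper), a Metropolis update for $\alpha_i$ against the binomial-Normal product above can be sketched as:

```python
import math, random

def log_cond_alpha(alpha_i, theta_i, beta, y, n, mu_i, sig2):
    """Log conditional density of alpha_i, up to a constant: Normal(mu_i,
    sig2) term times two binomial likelihoods, with
    logit(pi_1) = (theta_i + alpha_i/2) * exp(-beta/2) and
    logit(pi_2) = (theta_i - alpha_i/2) * exp(+beta/2)."""
    b = math.exp(beta / 2)
    logits = ((theta_i + 0.5 * alpha_i) / b, (theta_i - 0.5 * alpha_i) * b)
    lp = -(alpha_i - mu_i) ** 2 / (2 * sig2)
    for logit, (yj, nj) in zip(logits, zip(y, n)):
        p = 1.0 / (1.0 + math.exp(-logit))
        lp += yj * math.log(p) + (nj - yj) * math.log(1 - p)
    return lp

def metropolis_alpha(alpha_i, theta_i, beta, y, n, mu_i, sig2, step=0.5):
    """One random-walk Metropolis update for alpha_i."""
    prop = alpha_i + random.gauss(0.0, step)
    log_ratio = (log_cond_alpha(prop, theta_i, beta, y, n, mu_i, sig2)
                 - log_cond_alpha(alpha_i, theta_i, beta, y, n, mu_i, sig2))
    return prop if math.log(random.random()) < log_ratio else alpha_i
```

Here y = (tp, fp) and n = (pos, neg) for one study, and mu_i = Λ + λZ_i; the binomial coefficients are omitted since they do not depend on alpha_i.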
model dxmeta;
const
N =46;
var
CT[N],MR[N],fp[N],neg[N],tp[N],pos[N],
theta[N],alpha[N],pi[2,N],t[N],l[N],b[N],test[N],
THETA,LAMBDA,beta,gamma[2],lambda[2],bcov[2],
prec[2,3],sigmasq[2,3];
data CT, MR, test, tp, pos, fp, neg in "dxmeta.dat";
inits in "dxmeta.ini";
{
THETA∼dunif(-10,10);
LAMBDA∼dunif(-2,20);
beta∼dunif(-5,5);
for(i in 1:2){
gamma[i]∼dunif(-10,10);
lambda[i]∼dunif(-10,10);
bcov[i]∼dunif(-5,5);
for(j in 1:3){
prec[i,j] ∼ dgamma(2.1,2); sigmasq[i,j] <- 1.0/prec[i,j];
}
}
for(i in 1:N){
t[i] <- THETA+CT[i]*gamma[1]+MR[i]*gamma[2];
l[i] <- LAMBDA+CT[i]*lambda[1]+MR[i]*lambda[2];
theta[i]∼dnorm(t[i],prec[1,test[i]]);
alpha[i]∼dnorm(l[i],prec[2,test[i]]);
b[i] <- exp((beta+CT[i]*bcov[1]+MR[i]*bcov[2])/2);
logit(pi[1,i]) <- (theta[i] + 0.5*alpha[i])/b[i];
logit(pi[2,i]) <- (theta[i] - 0.5*alpha[i])*b[i];
tp[i] ∼ dbin(pi[1,i],pos[i]);
fp[i] ∼ dbin(pi[2,i],neg[i]);
}
}
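For readers who want to check their understanding of the sampling model without running BUGS, the likelihood portion of the program above can be mirrored in a short forward simulation. This is an illustrative sketch (the function name and defaults are ours), covering the no-covariate case only:

```python
import math, random

def simulate_hsroc(m, THETA, LAMBDA, beta, sd_theta, sd_alpha,
                   pos, neg, seed=0):
    """Simulate m studies from the HSROC model with no covariates:
    theta_i ~ N(THETA, sd_theta^2), alpha_i ~ N(LAMBDA, sd_alpha^2),
    tp_i ~ Bin(pos, inv_logit((theta_i + alpha_i/2) * exp(-beta/2))),
    fp_i ~ Bin(neg, inv_logit((theta_i - alpha_i/2) * exp(+beta/2)))."""
    rng = random.Random(seed)
    inv_logit = lambda x: 1.0 / (1.0 + math.exp(-x))
    b = math.exp(beta / 2)
    data = []
    for _ in range(m):
        theta_i = rng.gauss(THETA, sd_theta)
        alpha_i = rng.gauss(LAMBDA, sd_alpha)
        p_tp = inv_logit((theta_i + 0.5 * alpha_i) / b)
        p_fp = inv_logit((theta_i - 0.5 * alpha_i) * b)
        # Binomial draws via summed Bernoulli trials (stdlib only).
        tp = sum(rng.random() < p_tp for _ in range(pos))
        fp = sum(rng.random() < p_fp for _ in range(neg))
        data.append((tp, pos, fp, neg))
    return data
```

Feeding such simulated (tp, pos, fp, neg) tuples back into the estimation program is a useful check that the fitted posteriors recover the generating parameters.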
REFERENCES
1. Irwig L, Tosteson AN, Gatsonis CA, Lau J, Colditz G, Chalmers TC, Mosteller F. Guidelines for meta-analyses evaluating diagnostic tests. Annals of Internal Medicine 1994; 120(8):667–676.
2. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. Journal of Clinical Epidemiology 1995; 48(1):119–130.
3. Kardaun JWPF, Kardaun OJWF. Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation. Methods of Information in Medicine 1990; 29(1):12–22.
4. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Statistics in Medicine 1993; 12(14):1293–1316.
5. Rutter CM, Gatsonis C. Regression methods for meta-analysis of diagnostic test data. Academic Radiology 1995; 2(1):S48–S56.
6. Hasselblad V, Mosteller F, Littenberg B, Chalmers TC, Hunink MG, Turner JA, Morton SC, Diehr P, Wong JB, Powe NR. A survey of current problems in meta-analysis. Medical Care 1995; 33(2):202–220.
7. Hasselblad V, Hedges LV. Meta-analysis of screening and diagnostic tests. Psychological Bulletin 1995; 117(1):167–178.
8. Shapiro D. Issues in combining independent estimates of sensitivity and specificity of a diagnostic test. Academic Radiology 1995; 2(1):S37–S47.
9. de Vries SO, Hunink M, Polak J. Summary ROC curves as a technique for meta-analysis of the diagnostic performance of duplex ultrasonography in peripheral arterial disease. Academic Radiology 1996; 3(4):361–369.
10. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in
Diagnostic Imaging 1989; 29(3):307–335.
11. DuMouchel W. Bayesian meta-analysis. In Statistical Methodology in the Pharmaceutical Sciences, Berry D (ed.). Dekker: New York, 1990; 509–529.
12. Morris C, Normand ST. Hierarchical models for combining information and for meta-analysis. In Bayesian
Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Oxford University Press: Oxford, 1992;
321–344.
13. Normand ST. Meta-analysis: formulating, evaluating, combining, and reporting. Statistics in Medicine 1999;
18(3):321–359.
14. Scheidler J, Hricak H, Yu KK, Subak L, Segal MR. Radiological evaluation of lymph node metastases in patients with cervical cancer: a meta-analysis. Journal of the American Medical Association 1997; 278(13):1096–1101.
15. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Chapman and Hall: New York, 1995.
16. Baker FD. Item Response Theory. Marcel Dekker: New York, 1992.
17. McCullagh P. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B 1980; 42(2):109–142.
18. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Medical Decision Making 1988; 8(3):204–215.
19. Toledano A, Gatsonis CA. Ordinal regression methodology for ROC curves derived from correlated data. Statistics in Medicine 1996; 15(16):1807–1826.
20. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 1990; 85(419):398–409.
21. Spiegelhalter D, Thomas A, Best N, Gilks W. BUGS: Bayesian inference Using Gibbs Sampling, version 0.5 (version ii). MRC Biostatistics Unit: Cambridge, 1996. Program available at www.mrc-bsu.cam.ac.uk/bugs/.
22. Gilks WR, Richardson S, Spiegelhalter DJ. Derivative-free adaptive rejection sampling for Gibbs sampling. In Bayesian Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Clarendon Press: Oxford, 1992; 641–650.
23. Ritter C, Tanner MA. Facilitating the Gibbs sampler: the Gibbs stopper and the Griddy–Gibbs sampler. Journal of the American Statistical Association 1992; 87(419):861–868.
24. Best N, Cowles MK, Vines K. CODA: Convergence Diagnostics and Output Analysis Software for Gibbs sampling output, Version 0.20. MRC Biostatistics Unit: Cambridge, 1995. Program available at www.mrc-bsu.cam.ac.uk/bugs/.
25. Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bayesian Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Clarendon Press: Oxford, 1992; 169–194.
26. Heidelberger P, Welch P. Simulation run length control in the presence of an initial transient. Operations Research 1983; 31(6):1109–1144.
27. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Science 1992; 7(4):457–511.
28. Weiss RE. Bayesian model checking with applications to hierarchical models. Technical Report, UCLA
Department of Biostatistics, 1996.
29. Boyko EJ. Ruling out or ruling in disease with the most sensitive or specific diagnostic test: short cut or wrong turn? Medical Decision Making 1994; 14(2):175–179.
30. Kristensen GB, Abeler VM, Risberg B, Tropé C, Bryne M. Tumor size, depth of invasion, and grading of the invasive tumor front are the main prognostic factors in early squamous cell cervical carcinoma. Gynecologic Oncology 1999; 74(2):245–251.
31. Delgado G, Bundy B, Zaino R, Sevin BU, Creasman WT, Major F. Prospective surgical-pathological study of
disease-free interval in patients with stage IB squamous cell carcinoma of the cervix: a Gynecologic Oncology
Group study. Gynecologic Oncology 1990; 38(3):352–357.
32. Eifel PJ, Burke TW, Delclos L, Wharton JT, Oswald MJ. Early stage I adenocarcinoma of the uterine cervix:
treatment results in patients with tumors less than or equal to 4 cm in diameter. Gynecologic Oncology 1991;
41(3):199–205.
33. Morris M, Eifel PJ, Lu J, Grigsby PW, Levenback C, Stevens RE, Rotman M, Gershenson DM, Mutch DG.
Pelvic radiation with concurrent chemotherapy compared with pelvic and para-aortic radiation for high-risk
cervical cancer. New England Journal of Medicine 1999; 340(15):1137–1143.
34. Goff BA, Muntz HG, Paley PJ, Tamimi HK, Koh WJ, Greer BE. Impact of surgical staging in women with locally advanced cervical cancer. Gynecologic Oncology 1999; 74(3):436–442.
35. Holcomb K, Abulafia O, Matthews RP, Gabbur N, Lee YC, Buhl A. The impact of pretreatment staging laparotomy on survival in locally advanced cervical carcinoma. European Journal of Gynaecological Oncology 1999; 20(2):90–93.
36. Russell AH, Shingleton HM, Jones WB, Fremgen A, Winchester DP, Clive R, Chmiel JS. Diagnostic assessments
in patients with invasive cancer of the cervix: a National Patterns of Care Study of the American College of
Surgeons. Gynecologic Oncology 1996; 63(2):159–165.
37. Gatsonis CA. Random effects models for diagnostic accuracy data. Academic Radiology 1995; 2(1):S14–S21.
38. Qu Y, Hadgu A. A model for evaluating sensitivity and specificity for correlated diagnostic tests in efficacy studies with an imperfect reference test. Journal of the American Statistical Association 1998; 93(443):920–928.
39. Joseph L, Gyorkos TW. Inference for likelihood ratios in the absence of a ‘gold standard’. Medical Decision Making 1996; 16(4):412–417.
40. Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. Journal of Clinical Epidemiology 1999; 52(10):943–951.