
STATISTICS IN MEDICINE

Statist. Med. 2001; 20:2865–2884 (DOI: 10.1002/sim.942)

A hierarchical regression approach to meta-analysis
of diagnostic test accuracy evaluations

Carolyn M. Rutter1,∗,† and Constantine A. Gatsonis2


1 Group Health Cooperative, Center for Health Studies, 1730 Minor Avenue, Suite 1600, Seattle,
WA 98101, U.S.A.
2 Center for Statistical Sciences, Brown University, Box G-H, Providence, RI 02912, U.S.A.

SUMMARY
An important quality of meta-analytic models for research synthesis is their ability to account for
both within- and between-study variability. Currently available meta-analytic approaches for studies of
diagnostic test accuracy work primarily within a fixed-effects framework. In this paper we describe
a hierarchical regression model for meta-analysis of studies reporting estimates of test sensitivity and
specificity. The model allows more between- and within-study variability than fixed-effect approaches, by
allowing both test stringency and test accuracy to vary across studies. It is also possible to examine the
effects of study-specific covariates. Estimates are computed using Markov chain Monte Carlo simulation
with publicly available software (BUGS). This estimation method allows flexibility in the choice of
summary statistics. We demonstrate the advantages of this modelling approach using a recently published
meta-analysis comparing three tests used to detect nodal metastasis of cervical cancer. Copyright © 2001
John Wiley & Sons, Ltd.

1. INTRODUCTION

The need for systematic review and synthesis of published evidence on the accuracy of diagnostic
tests has increased in recent years. The information from such reviews is a key element
of clinical and health policy decision making regarding the use of diagnostic tests; it is also
essential for guiding the process of technology development and evaluation in diagnostic
medicine [1, 2].
Statistical methods for meta-analysis of diagnostic test evaluations have focused on the
analysis of studies reporting estimates of test sensitivity and specificity, which constitute the
majority of studies in diagnostic test evaluation [1–9]. A key goal in the synthesis of such
studies is to derive summary measures of test performance. These measures must account for
the trade-off between sensitivity and specificity as the threshold for positivity varies along
some explicit or latent scale. This trade-off has been widely recognized in the evaluation of

∗ Correspondence to: Carolyn M. Rutter, Group Health Cooperative, Center for Health Studies, 1730 Minor Avenue,
Suite 1600, Seattle, WA 98101, U.S.A.
† E-mail: rutter.c@ghc.org

Received January 1999


Copyright © 2001 John Wiley & Sons, Ltd. Accepted January 2001

diagnostic tests and has led to the development of receiver operating characteristic (ROC)
curves [10]. Briefly, the ROC curve for a diagnostic test is the set of all pairs of sensitivity
and specificity that can be achieved as the positivity threshold varies across the entire range
of possible values.
The experience from research synthesis of diagnostic test evaluations shows that there is
substantial variation among the estimates of a test's sensitivity and specificity across published
studies [1, 2]. Differences in positivity threshold constitute an important source of variability
across studies (and also across diagnosticians, within a given study). In addition, study
characteristics, such as technical aspects of the diagnostic test, patient and disease cohorts, study
settings, experience of readers, and sample size are also potential contributors to between-study
variation in the estimates of diagnostic performance. Simple averaging or pooling
across studies can provide misleading conclusions, as can be readily seen from a simple
example. If three studies report the following estimates of test sensitivity and specificity:
(0.10, 0.90), (0.80, 0.80) and (0.90, 0.10), the average pair of sensitivity and specificity is
(0.60, 0.60) and lies completely outside the domain of the original studies (see also references
[1, 2, 6]).
As in other areas of meta-analysis, fixed-effects regression models have been proposed to
account for between-study variability in diagnostic test evaluations [4, 5]. The use of regression
models provides a flexible and powerful framework for meta-analysis. However, the number
of covariates that can be accommodated in such models is limited. In addition, these fixed-effects
regression models may not provide realistic accounts of the uncertainty associated with
covariate estimates.
In this paper, we expand on earlier work [5] and present a hierarchical model formulation
of the problem of combining information across studies reporting estimates of test sensitivity
and speci6city. The structure of the model is similar to that of models proposed for the meta-
analysis of treatment studies [11–13]. The observed variation is partitioned into within- and
between-studies components. Each component consists of a systematic part and a random part,
with the former attributed to covariates and the latter to unexplained variation. The hierarchical
model makes it possible to pool information across studies and derive smoothed estimates of
covariate effects, components of variance and individual study quantities. In addition, simple
extensions of the hierarchical structure can incorporate patient-level information within each
study, when such information is available.
We present our approach using data from a recently published meta-analysis comparing the
diagnostic performance of three imaging modalities for the detection of lymph node metastasis
in women with cervical cancer [14]. In Section 2 we survey fixed-effects approaches
to the problem. The hierarchical regression model is presented in Section 3. We take a fully
Bayesian approach to model fitting and checking and use Markov chain Monte Carlo estimation
techniques. Technical issues are discussed in Section 3 and the analysis of the example
and the conclusions we draw are presented in Section 4. The final section summarizes our
methodological conclusions.

2. META-ANALYTIC MODELS FOR DIAGNOSTIC TEST DATA

The simplest setting for the methods discussed in this paper involves meta-analyses in which
each of m studies contributes a vector z_i of study-level covariates (i = 1, …, m) and a 2 × 2

table of summary data, showing the agreement between the binary test result and the definitive
disease information (or reference standard). As noted in the introduction, data of this type are
reported in the vast majority of studies evaluating the performance of diagnostic modalities.
We will use the following notation:

                       Test
                 0 = no    1 = yes
  Truth   no     y_i00     y_i01      n_i0
          yes    y_i10     y_i11      n_i1

The observed rates of true and false positive test results are then defined as TP_i = y_i11/n_i1
and FP_i = y_i01/n_i0. In some meta-analyses more than one 2 × 2 table is available from each
study. For example, patients may be examined using several tests in a study, leading to correlated
binary test results within studies.

2.1. Review of summary ROC (SROC) curve


In the absence of patient and study level covariates, a simple and commonly used graphical
summary of test performance is the summary ROC curve (SROC) [4]. The curve is constructed
by computing the quantities

D_i = logit(TP_i) − logit(FP_i)  and  S_i = logit(TP_i) + logit(FP_i)

for each study and fitting the linear model

D_i = a + bS_i + e_i    (1)

where e_i is random error and logit(p) = ln(p/(1 − p)). Using the estimates of a and b, a
plot of the summary ROC curve can be drawn, with FP on the x-axis and TP on the y-axis.
The SROC model corresponds to the assumption that the observed differences across studies
result from different thresholds for test positivity. When b = 0 the log-odds ratio D_i will be
constant across studies and the SROC curve will be symmetric. Although SROC and ROC
curves are formally similar, they do not have identical interpretations. However, borrowing
from ROC analysis, appropriately defined partial areas under the SROC have been proposed
as summary measures. These partial areas can be rescaled to represent the average sensitivity
over particular ranges of specificity. The Q∗ statistic has also been proposed as a summary
measure. This statistic corresponds to the estimated true positive rate at the point on the SROC
curve where the sensitivity is equal to the specificity of the test.
Some aspects of between-study variability can be handled within the SROC approach. Study-level
covariates can be incorporated in a straightforward manner into equation (1) to provide an
exploratory analysis of the effects of study characteristics. The model can be estimated using
ordinary or weighted least squares, or robust regression methods. Weights can be used to
account for between-study differences in overall sample size or precision. However, weights
cannot simultaneously capture differences in sample size within the disease-positive (n_i1) and
disease-negative (n_i0) groups. These two sample sizes affect the precision of estimated TP
and FP rates independently. In practice, weighted and unweighted models can produce very
different results, and there is no clear way to choose between these models. It should also
be noted that the SROC model does not account for error in S_i, which can be a source of

bias in parameter estimates [15] and summaries of the curve. Further exploration is needed
to determine the effect of ignoring error in S_i on both point estimation and coverage rates of
estimated confidence intervals.
An alternative approach to constructing an SROC curve was proposed by Kardaun and
Kardaun [3], who assumed that (logit(TP_i), logit(FP_i)) follows a bivariate normal distribution
and postulated a linear relationship between the two components of the bivariate mean. Profile
likelihood is used to derive estimates of the slope and intercept in this model, which includes
variability in both rates.
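The SROC construction above can be sketched in a few lines of Python. The (TP, FP) pairs below are hypothetical illustrations, not data from any study in this review, and the unweighted least-squares fit is only one of the estimation options just described:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

# Hypothetical (TP, FP) pairs from four studies, for illustration only.
tp = np.array([0.85, 0.75, 0.90, 0.70])
fp = np.array([0.20, 0.10, 0.35, 0.08])

D = logit(tp) - logit(fp)  # study log-odds ratio
S = logit(tp) + logit(fp)  # proxy for the positivity threshold

# Unweighted least-squares fit of model (1): D = a + b*S + e
b, a = np.polyfit(S, D, 1)

# Q*: the TP rate at the point where sensitivity equals specificity,
# i.e. S = 0 and D = a, so that logit(TP) = a/2.
q_star = expit(a / 2)
```

Weighted variants would simply pass per-study weights to the fit; as noted above, no single weight can reflect both n_i1 and n_i0.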

2.2. Binomial regression model


A regression model for the meta-analysis of (TP_i, FP_i) pairs was first discussed in reference
[5] and was motivated by the ordinal regression formulation of ROC analysis [16–18]. In
brief, if W denotes the degree of suspicion about the presence of an abnormality, elicited
on an ordinal categorical scale with J categories, the parametric ROC model is equivalent
to the ordinal regression model g(P[W ≥ j | X]) = (θ_j + αX) exp(−βX), where X is a covariate
denoting the (binary) true disease status. The conceptual basis of the model is an assumption
that the observed responses W represent a categorization of a latent variable, with distribution
corresponding to the link function, g(·). The probit link implies a Gaussian latent variable
and is commonly used for single-study receiver operating characteristic analysis [10]. We use
the logit link throughout this article because under the logit model, regression parameters
estimate log-odds ratios. In the binomial setting with J = 2, use of the logit link does not
require assumption of an underlying logistic latent variable.
As discussed in reference [5], an ordinal regression model with a logistic link and J = 2
can be used in the meta-analysis of studies reporting pairs of (TP_i, FP_i). Under this model,
the numbers of positive tests from each study, y_ij1, i = 1, …, m, j = 0, 1, are assumed to
follow independent binomial distributions, y_ij1 ∼ Binomial(n_ij, π_ij), in which the probability
of a positive test is modelled as

logit(π_ij) = (θ_i + αX_ij) exp(−βX_ij)    (2)

As in the ROC context, we call the θ_i's 'cutpoint parameters' (or 'positivity criteria'), since
both TP and FP increase with increasing θ. Thus, θ models the trade-off between TP and
FP. We call α the 'accuracy parameter' because it measures the difference between TP and
FP. When β = 0, α models the constant log-odds ratio of a positive test for disease positive
versus disease negative patients. We call β the 'scale parameter', since it allows differences
in the variance of outcomes in disease negative and disease positive populations. When β ≠ 0,
TP and FP increase at different rates as θ increases, allowing asymmetric ROC curves. The
corresponding log-odds ratio varies with θ_i. The binomial regression model is estimated by
maximum likelihood and accounts for error in both TP_i and FP_i rates.
The binomial regression model (2) implies a linear relationship between logit(TP) and
logit(FP). This linearity is a basic assumption in the two SROC models discussed earlier and
implies a natural correspondence between SROC and the binary regression analysis. Like the
SROC model, the simple binomial regression model (2) assumes that across studies, tests
have the same accuracy and scale parameter, with between-study differences resulting from
different positivity thresholds (θ_i). These binomial regression models can be formulated to
allow study-level covariate effects [5]. However, because each study contributes only two


degrees of freedom, there are practical limitations on the number of covariates that can be
included in these models.
The binomial regression model (2) has a formal similarity to the logistic two-parameter
item characteristic curve (ICC) model, with studies corresponding to subjects (raters) and
patients to items. ICC models have long been studied in educational testing [19]. However,
unlike the usual educational testing setting, where multiple subjects respond to each item, in
the meta-analysis setting patients are nested within studies.
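To make the roles of the cutpoint, accuracy and scale parameters in model (2) concrete, the following sketch (with purely illustrative parameter values) evaluates the implied TP and FP rates at the disease-status codes X = ±1/2:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def tp_fp(theta, alpha, beta):
    """TP and FP rates implied by model (2), with disease status coded
    X = 1/2 (diseased) and X = -1/2 (not diseased)."""
    tp = expit((theta + alpha / 2) * np.exp(-beta / 2))
    fp = expit((theta - alpha / 2) * np.exp(beta / 2))
    return tp, fp

theta = np.linspace(-2, 2, 9)  # increasing cutpoints: TP and FP rise together
tp0, fp0 = tp_fp(theta, alpha=3.0, beta=0.0)
tp1, fp1 = tp_fp(theta, alpha=3.0, beta=0.5)

# With beta = 0 the log-odds ratio logit(TP) - logit(FP) equals alpha at every
# cutpoint (a symmetric ROC curve); with beta != 0 it varies with theta.
lor0 = np.log(tp0 / (1 - tp0)) - np.log(fp0 / (1 - fp0))
```

Here alpha = 3 and beta in {0, 0.5} are arbitrary demonstration values, not estimates from any data.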

3. HIERARCHICAL REGRESSION ANALYSIS

3.1. Model
Hierarchical regression analysis extends the binomial regression model to more fully account
for both within- and between-study variability in TP and FP rates. The model allows the
inclusion of patient- and study-level covariates, if such information is available, and has the
following structure.

3.1.1. Level I (Within-study variation). The numbers of positive tests from the ith study,
y_i01 and y_i11, are assumed to be independent and to follow binomial distributions, with the
probability of a positive test given by

logit(π_ij) = (θ_i + α_iX_ij) exp(−βX_ij)    (3)

where X_ij denotes the true disease status for cases in the ijth cell. Under this hierarchical
SROC model (HSROC), both positivity criteria (θ_i) and accuracy parameters (α_i) are allowed
to vary across studies. Because estimation of the scale parameter (β) requires information
from more than one study, β is assumed to be constant across studies. If we were to allow
β to vary across studies, then within-study parameters would be identifiable only through
their prior distributions. The assumption of a constant β can be relaxed somewhat, as we
demonstrate in the example.

3.1.2. Level II (Between-study variation). Study-level parameters in (3), θ_i and α_i, are assumed
to be normally distributed, with mean determined by a linear function of study-level
covariates. In the case of a single covariate Z affecting both the cutpoint and accuracy parameters,
the model can be written as

θ_i | Θ, γ, Z_i, σ_θ² ∼ N(Θ + γZ_i, σ_θ²)
α_i | Λ, λ, Z_i, σ_α² ∼ N(Λ + λZ_i, σ_α²),  conditionally independent

The coefficients γ and λ model systematic differences in positivity criteria and accuracy across
studies, due to the covariate Z. However, more general formulations of the model can be
considered in which more than one covariate is included and different covariates are used for
the 'cutpoint' and 'accuracy' regression equations.
The assumed conditional independence of θ_i and α_i reflects assumptions implicit in ROC
analysis. In the context of ROC analysis, positivity threshold and accuracy are independent
test characteristics that together impose correlation between a test's sensitivity and specificity.

Copyright ? 2001 John Wiley & Sons, Ltd. Statist. Med. 2001; 20:2865–2884
2870 C. M. RUTTER AND C. A. GATSONIS

3.1.3. Level III. The specification of the hierarchical model is completed by the choice of
prior distributions for the remaining unknown parameters. In particular, we chose

Θ ∼ Uniform[a_Θ1, a_Θ2],  γ ∼ Uniform[a_γ1, a_γ2],  σ_θ² ∼ Γ⁻¹(b_θ1, b_θ2)
Λ ∼ Uniform[a_Λ1, a_Λ2],  λ ∼ Uniform[a_λ1, a_λ2],  σ_α² ∼ Γ⁻¹(b_α1, b_α2)
β ∼ Uniform[a_β1, a_β2]

The parameters Θ, Λ, γ, λ, β, σ_θ² and σ_α² are assumed to be mutually independent. The
hyperparameters a_Θ, a_γ, a_Λ, a_λ, a_β, b_θ and b_α are assumed to be fixed and are chosen
to reflect plausible ranges. Choice of prior ranges is discussed in the following section, and is
demonstrated in the data example.
Summary ROC (SROC) curves can be derived using the expected values of Λ + λZ and β.
If true disease state is coded 1/2 for disease positive cases and −1/2 for not diseased cases, then
for a given value of the covariate Z_i, the model-based true positive rate can be expressed as

TP(FP) = logit⁻¹((logit(FP) exp(−E(β)/2) + E(Λ + λZ_i)) exp(−E(β)/2))

The SROC curve is drawn by plotting (FP, TP(FP)) for FP ∈ [0, 1]. Extrapolation beyond the
available data can be discouraged by plotting the curve only over the observed range of FP.
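Assuming the ±1/2 coding above, the model-based curve can be traced numerically; the values used below for E(β) and E(Λ + λZ_i) are illustrative placeholders, not estimates from the example:

```python
import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def expit(x):
    return 1 / (1 + np.exp(-x))

def hsroc_curve(fp, acc_mean, beta_mean):
    """Model-based TP(FP):
    logit(TP) = (logit(FP)*exp(-E(beta)/2) + E(Lambda + lambda*Z)) * exp(-E(beta)/2)."""
    return expit((logit(fp) * np.exp(-beta_mean / 2) + acc_mean)
                 * np.exp(-beta_mean / 2))

# Trace the curve only over a plausible observed FP range, discouraging
# extrapolation beyond the data as recommended in the text.
fp_grid = np.linspace(0.05, 0.45, 50)
tp_grid = hsroc_curve(fp_grid, acc_mean=3.0, beta_mean=0.2)
```

With beta_mean = 0 the expression collapses to logit(TP) = logit(FP) + acc_mean, i.e. a symmetric curve with constant log-odds ratio.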

3.2. HSROC model fitting


Inference from the HSROC model is based on the posterior distributions of model parameters.
Because the models we consider are not conjugate, closed form expressions for posterior
distributions do not exist. Posterior quantities are estimated by simulating observations from
the posterior distribution using Markov chain Monte Carlo (MCMC) simulation [20]. These
simulated values from the full posterior distribution are used to estimate marginal distributions
of interest, such as posterior distributions of particular parameters or functions of parameters.
Conditional distributions are given in Appendix A.
Analyses can be carried out using BUGS version 0.6 [21], publicly available software for
Markov chain Monte Carlo sampling. The program uses derivative-free adaptive rejection
sampling [22] to draw from log-concave distributions (that is, Θ, Λ, σ_θ², σ_α², γ, λ, θ_i and
α_i). The 'Griddy–Gibbs' method is used to estimate draws from non-log-concave distributions
(that is, β) [23]. Essentially, the Griddy–Gibbs method simulates draws from an unknown
distribution by applying the inverse transformation method to an approximate conditional
cumulative distribution function.
To enable estimation, disease status and each covariate included in the level II model must
be centred at zero. When estimating the fixed-effects binomial regression model, this type of
covariate centring is required for model identifiability [16]. Under the hierarchical regression
model, centring the disease status and covariate vectors helps to reduce correlation between
consecutive draws.

3.2.1. Choice of priors. Prior ranges for Θ, Λ and β should be chosen to reflect subject matter
knowledge about the diagnostic modalities under review. In general, the interval [−10, 10]
covers all plausible values of Θ. Similarly, the interval [−5, 5] covers all plausible values of
β. Because we expect positive test results, indicating disease, to be more common among
patients with disease, the interval [−2, 20] covers all plausible values of Λ.


Selection of the inverse gamma (Γ⁻¹) priors for the between-study variance parameters, σ_θ²
and σ_α², is more difficult. The sampler is potentially sensitive to the prior distribution used
to model variance parameters, which can affect the width of posterior interval estimates. The
inverse gamma prior for variance parameters should be selected to reflect the expected range
of study-specific accuracy and cutpoint parameters. The goal in choosing an appropriate prior
for variance parameters is to select a relatively diffuse distribution that does not assign too
much probability to very large (unrealistic) values. Fortunately, choice of the Γ⁻¹ priors can be
guided by the realistic ranges of cutpoint and accuracy parameters, noted above. The observed
variability of naive cutpoint and accuracy parameter estimates can also guide selection of Γ⁻¹
priors.

3.2.2. Parameter estimation. The goal of estimation is description of the posterior distribution
of model parameters and of summary statistics that are functions of model parameters. Posterior
95 per cent credible intervals were estimated by the empirical 2.5 per cent and 97.5 per cent
posterior percentiles of simulated draws. The mode of symmetric posterior distributions was
approximated by the mean value across simulated draws (that is, for Θ, Λ and β). The mode
of asymmetric posterior distributions was approximated by the median value across simulated
draws (that is, for σ_θ² and σ_α²).

3.2.3. Assessing convergence. We based estimation on draws from several chains started at
extreme points in the parameter space. The CODA program [24] was used to evaluate convergence
to the target distribution. We used Geweke's diagnostic to evaluate convergence of
individual chains for symmetrically distributed variables [25]. The Geweke statistic is based
on comparison of the mean of draws early in the sequence of iterates versus the mean of
later draws. If the chain is stationary, then these means should be similar. We also used
Heidelberger and Welch's method for evaluating stationarity based on the Cramér–von Mises
statistic [26]. When there was no evidence against convergence, we next used estimates of
scale reduction proposed by Gelman and Rubin to examine whether the multiple chains converged
to the same distribution [27]. The scale reduction statistic is essentially the ratio of
the between-chain variance to the within-chain variance. As a final check, we compared parameter
estimates from each individual chain to parameter estimates based on pooling across
chains.
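A minimal sketch of the scale reduction computation follows. This is a simplified version of the Gelman–Rubin statistic, not the CODA implementation, which applies additional corrections omitted here:

```python
import numpy as np

def scale_reduction(chains):
    """Simplified potential scale reduction factor for an
    (m chains) x (n draws) array of simulated values."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)               # near 1 when chains agree

rng = np.random.default_rng(42)
mixed = rng.normal(0.0, 1.0, size=(8, 1000))  # chains exploring one target
stuck = mixed + np.arange(8)[:, None] * 5.0   # chains stuck at separate modes
```

For `mixed`, the factor is close to 1; for `stuck`, it is far above 1, flagging failure to converge to a common distribution.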

3.2.4. Diagnostics. Diagnostic statistics were used to evaluate possible model misspecification
and overall goodness-of-fit, and to identify outlying and possibly influential data points. Our
approach roughly follows the suggestions of Weiss [28]. Because the number of true positive
and false positive results within studies can safely be assumed to follow a binomial
distribution, checks for model misspecification were restricted to evaluation of the prior distributions
for study-specific parameters θ and α. Recall that we assume that both statistics are
normally distributed. Under the exchangeable model (that is, γ = λ = 0), the sums of squares
S_θ = Σ_i (θ_i − Θ)²/σ_θ² and S_α = Σ_i (α_i − Λ)²/σ_α² should follow a χ² distribution with m degrees of
freedom, where m is the number of studies in the meta-analysis. Under the non-exchangeable
model the sums of squares are of the form S_α = Σ_i (α_i − Λ − λZ_i)²/σ_α². Tail probabilities that are
too large (or too small) suggest misspecification of the prior distributions.
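This prior check can be sketched as follows; the function below is an illustration for the θ_i, and applies equally to the α_i with Λ, λ and σ_α² substituted:

```python
import numpy as np
from scipy import stats

def prior_fit_check(theta_i, Theta, sigma2, Z=None, gamma=0.0):
    """Sum of squares S for study-specific parameters, with its lower and
    upper tail probabilities against the chi-square reference distribution
    with m degrees of freedom (m = number of studies)."""
    theta_i = np.asarray(theta_i, dtype=float)
    mean = Theta if Z is None else Theta + gamma * np.asarray(Z, dtype=float)
    S = np.sum((theta_i - mean) ** 2) / sigma2
    m = theta_i.size
    return S, stats.chi2.cdf(S, df=m), stats.chi2.sf(S, df=m)

# Degenerate illustration: draws exactly at the prior mean give S = 0.
S0, lower0, upper0 = prior_fit_check(np.zeros(10), Theta=0.0, sigma2=1.0)
```

In practice S would be computed at each MCMC draw, and the proportions of draws in the two tails compared to the nominal 5 per cent, as in Table I.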
We assessed the assumption of conditional independence between θ and α by estimating
these parameters directly from the data, then examining scatter plots and correlation. These


estimates are based on formula (3) with  = 0

i(0) = (logit(TPicc ) − logit(FPicc ))=2


(4)
i(0) = logit(TPicc ) − logit(FPicc )

where the superscript ‘cc’ denotes continuity correction.
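For a single 2 × 2 table these naive estimates can be computed directly. The add-0.5-to-each-cell continuity correction used below is one common convention, assumed here rather than taken from the paper:

```python
import numpy as np

def naive_theta_alpha(y11, n1, y01, n0, cc=0.5):
    """Naive cutpoint and accuracy estimates from formula (4) with beta = 0,
    using continuity-corrected TP and FP rates (cc added to each cell)."""
    tp = (y11 + cc) / (n1 + 2 * cc)
    fp = (y01 + cc) / (n0 + 2 * cc)
    lt = np.log(tp / (1 - tp))  # logit(TP^cc)
    lf = np.log(fp / (1 - fp))  # logit(FP^cc)
    theta0 = (lt + lf) / 2      # naive cutpoint
    alpha0 = lt - lf            # naive accuracy (log-odds ratio)
    return theta0, alpha0

# Hypothetical study: 18/20 diseased and 5/80 non-diseased test positive.
theta0, alpha0 = naive_theta_alpha(18, 20, 5, 80)
```

These per-study values serve both as diagnostics (scatter plots, correlation) and, as described in Section 4.3, as starting values for the sampler.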


We evaluated global goodness-of-fit using two chi-square discrepancy statistics. The first is
based on estimated counts

D²_count = Σ_i Σ_j (y_ij1 − E(y_ij1 | model, data))² / E(y_ij1 | model, data)

where y_ij1 is the number of subjects testing positive in the not-diseased (j = 0) and diseased
(j = 1) groups. D²_count is compared to a χ²_2m distribution. The second global goodness-of-fit
statistic is based on continuity-corrected log-odds ratios

D²_log(OR) = Σ_i (log(OR^cc)_i − E(log(OR^cc)_i | model, data))² / var(log(OR^cc)_i | model, data)

where log(OR^cc)_i is the observed continuity-corrected log-odds ratio for the ith study. Outliers
and potentially influential points can be identified using plots of sensitivity versus specificity
and by examining chi-squared residuals.
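The count-based discrepancy can be sketched as follows; in practice the expected counts E(y_ij1 | model, data) would be posterior means from the MCMC draws, whereas the numbers below are illustrative:

```python
import numpy as np

def d2_count(y_obs, y_exp):
    """Chi-square discrepancy between observed positive counts and their
    posterior expected values; compare to a chi-square with 2m d.f."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_exp = np.asarray(y_exp, dtype=float)
    return np.sum((y_obs - y_exp) ** 2 / y_exp)

# m = 2 studies: observed and model-expected positives in the
# (j = 0, j = 1) cells of each study.
obs = np.array([[4.0, 18.0], [6.0, 25.0]])
exp_ = np.array([[5.0, 17.0], [5.0, 26.0]])
d2 = d2_count(obs, exp_)  # compared here to chi-square with 2m = 4 d.f.
```

The log-odds-ratio version has the same shape, with per-study posterior variances replacing the expected counts in the denominator.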

3.3. Summary statistics


MCMC estimation allows estimation of functions of model parameters, so that results can be presented
using a variety of summary measures. Summary statistics were calculated for each draw of
the sampler, and these values were used to estimate their posterior modes and 95 per cent
credible intervals. We used summaries that are often used when evaluating diagnostic test
performance: TP rate, FP rate, and likelihood ratio statistics [29]. In the meta-analysis context,
these measures are estimated posterior means that combine information across studies. Overall
TP and FP rates are estimated from

TP = logit⁻¹[(Θ + Λ/2) exp(−β/2)]
FP = logit⁻¹[(Θ − Λ/2) exp(β/2)]    (5)

These overall TP and FP rates correspond to the test's expected operating characteristics,
given the observed data.
Likelihood ratio statistics describe the post-test change in odds of disease. The likelihood
ratio positive (LR+) estimates the change in the odds of disease following a positive test, that
is, P(D+ | T+)/P(D− | T+) = LR+ × (pre-test odds). Similarly, the likelihood ratio negative
(LR−) estimates the change in the odds of disease following a negative test. These statistics
are estimated by LR+ = TP/FP and LR− = (1 − TP)/(1 − FP), where TP and FP are given
by (5).
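Given draws (or posterior summaries) of Θ, Λ and β, the quantities in (5) and the likelihood ratios are simple transformations; a sketch with illustrative parameter values:

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def summary_statistics(Theta, Lambda, beta):
    """Overall TP and FP rates from (5), plus LR+ and LR-."""
    tp = expit((Theta + Lambda / 2) * np.exp(-beta / 2))
    fp = expit((Theta - Lambda / 2) * np.exp(beta / 2))
    lr_pos = tp / fp                  # odds multiplier after a positive test
    lr_neg = (1 - tp) / (1 - fp)      # odds multiplier after a negative test
    return tp, fp, lr_pos, lr_neg

# Illustrative values only (a moderately accurate test), not estimates
# from the cervical cancer example.
tp, fp, lr_pos, lr_neg = summary_statistics(Theta=-0.5, Lambda=3.0, beta=0.0)
```

Applying this function to every saved draw, then taking percentiles, yields the posterior modes and 95 per cent credible intervals described above.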


4. EXAMPLE: EVALUATION OF LYMPH NODE METASTASIS IN WOMEN WITH
INVASIVE CERVICAL CANCER

Lymph node metastasis affects both the prognosis and treatment of invasive cervical cancer
[30, 31]. When disease is limited to the cervix and tumours are relatively small (<4 cm),
surgical treatment is preferred [32]. When the tumour is larger, has spread to nearby organs,
or involves lymph nodes, then radiation therapy (alone or combined with chemotherapy)
is preferred [33]. Nodal metastasis is particularly difficult to detect. When nodal metastasis
is discovered during surgery, additional radiation therapy is recommended. However, surgery
results in greater morbidity for these women relative to women who undergo radiation without
prior surgery.
Good preoperative staging, especially identification of lymph node metastasis, can
improve outcomes by improving treatment plans. Three types of diagnostic images have
been widely used to identify lymphadenopathy: lymphangiography (LAG), computed
tomography (CT) and magnetic resonance imaging (MR). LAG provides information
about the drainage patterns from lymph nodes. CT and MR provide information about the
size of lymph nodes. Unfortunately, these imaging techniques are not believed to be sensitive
enough to adequately determine appropriate treatment, prompting some to advocate the routine
use of surgical staging [34, 35]. Health care providers and policy makers who decide whether
to include these imaging tests as part of preoperative staging need the best possible estimates
of the diagnostic information these tests provide to inform their decisions.
Scheidler and colleagues combined information from several studies to estimate and compare
the ability of LAG, CT and MR to accurately detect lymph node metastasis. They derived
fixed-effect SROC curves for each modality. Tests were compared at the point on the SROC
where sensitivity equals specificity, using true positive rates (that is, the Q∗ statistic) and
LR statistics. Scheidler et al. examined overall detection of nodal metastasis, and in sub-analyses
examined detection of pelvic nodes and para-aortic nodes. Because both para-aortic
and pelvic node involvement affect treatment and prognosis, we focus on accuracy of overall
detection of lymph node metastasis. In the following sections, we describe the data used in
the original meta-analysis, the application of the HSROC model to these data, results from
the HSROC model, and conclusions drawn from the HSROC model and how these relate to
the original findings based on the SROC approach.

4.1. Data
Scheidler et al. combined data from 36 studies, of which 17 examined LAG, 19 examined
CT and 10 examined MR. Observed true positive and false positive rates for this data set are
shown in Figure 1.
Nine of the 36 studies examined more than one test. In particular, two studies examined
CT and LAG, four studies examined CT and MR, and two studies examined CT twice. The
two studies that examined CT twice reported data separately for para-aortic and pelvic nodes.
These findings were based on the same group of women, but the published information did not
allow combination of women's para-aortic and pelvic node findings into an overall 2 × 2 table.
Thus, both tables were included in analyses.


Figure 1. Detection of lymph node metastasis, using lymphangiography (LAG), computed tomography
(CT) or magnetic resonance (MR) imaging: observed true positive (TP) and false positive (FP) rates
from data reported across 37 studies that were originally meta-analysed by Scheidler and colleagues [14].

4.2. HSROC model


The HSROC model was estimated using the combined set of data from all modalities but did
not explicitly include correlation terms for data derived from studies that compared two or all
three of the modalities. Although it is possible to extend the model to cover such correlations,
the cross-tabulated data from studies evaluating more than one modality were not available.
Because we expect positive correlation between diagnostic test results, we expect that ignoring
this correlation could cause a slight conservative bias in comparisons between tests.
Our model allowed the expected cutpoint and accuracy parameters to depend on the test
modality via a three-level covariate, expressed via two dummy variables: Z1 = 0.587 if CT and
−0.413 otherwise, Z2 = 0.783 if MR and −0.217 otherwise. (These covariates were centred at
0 for model fitting.) We relaxed our model assumptions and allowed the scale parameter (β) to
depend on modality so that logit(π_ij) = (θ_i + α_iX_ij) exp(−(β + ω_1Z_1i + ω_2Z_2i)X_ij), thus allowing differences
in the shape of the ROC curve across modality. Uniform priors were used for covariates:
U[−10, 10] for cutpoint (γ_k) and accuracy (λ_k) covariates and U[−5, 5] for scale covariates
(ω_k). We also allowed different variance parameters (σ_θ², σ_α²) for each modality. The BUGS
[21] code used for model estimation is included in Appendix B. To aid interpretation, we
present findings for each modality in terms of calculated SROC parameters; for example, for
LAG we present Θ_LAG = Θ − 0.413γ_1 − 0.217γ_2, Λ_LAG = Λ − 0.413λ_1 − 0.217λ_2 and β_LAG =
β − 0.413ω_1 − 0.217ω_2.
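The dummy coding can be verified numerically: with 17 LAG, 19 CT and 10 MR tables (46 in all), 19/46 ≈ 0.413 and 10/46 ≈ 0.217, so the codes below have mean essentially zero by construction. The helper for the LAG contrast is an illustrative sketch:

```python
import numpy as np

# One row per 2x2 table: 17 LAG, 19 CT, 10 MR.
modality = np.array(["LAG"] * 17 + ["CT"] * 19 + ["MR"] * 10)
z1 = np.where(modality == "CT", 0.587, -0.413)  # CT contrast, centred
z2 = np.where(modality == "MR", 0.783, -0.217)  # MR contrast, centred

def lag_value(overall, coef1, coef2):
    """Parameter implied for LAG, where Z1 = -0.413 and Z2 = -0.217.
    Apply with (Theta, gamma1, gamma2), (Lambda, lambda1, lambda2) or
    (beta, omega1, omega2) draws to obtain the LAG-specific parameter."""
    return overall - 0.413 * coef1 - 0.217 * coef2
```

Centring in this way leaves the overall parameters interpretable as (approximate) averages across the pooled tables, and, as noted in Section 3.2, reduces correlation between consecutive MCMC draws.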

4.3. HSROC computations


The sampler was run using eight independent MCMC chains. Experimental runs showed that
the sampler was slowly mixing. To ensure coverage of the target distribution, estimation was


Table I. Model diagnostics: estimated sums of squares associated with study-specific parameters. Q_0.05 is
the 5th percentile, Q_0.95 is the 95th percentile, χ²_0.05 represents the 5th percentile of the chi-square with the
appropriate degrees of freedom (LAG: 17, CT: 19, MR: 10). Similarly, χ²_0.95 represents the 95th percentile of
the appropriate chi-square distribution.

             Q_0.05    P(S ≤ χ²_0.05)    Q_0.95    P(S ≥ χ²_0.95)
S_θ,LAG       9.15        0.232          27.10        0.002
S_α,LAG       6.08        0.039          19.70        0.043
χ²_17         8.67        0.050          27.59        0.050
S_θ,CT        8.45        0.064          25.90        0.024
S_α,CT        9.81        0.116          27.50        0.013
χ²_19        10.12        0.050          30.14        0.050
S_θ,MR        3.92        0.029          16.80        0.032
S_α,MR        4.37        0.054          17.00        0.028
χ²_10         3.94        0.050          18.31        0.050

based on multiple sequences with overdispersed starting points. Because of high between-
draw correlation, every 50th iteration was saved from each sequence of 50 000 simulated
draws. We allowed 2500 iterations for burn-in. Eight different chains were run, with starting
points based on β, Θ and Λ: β(0) ∈ {−2.5, 1.5}, Θ(0) ∈ {−5, 5} and Λ(0) ∈ {−1, 10}. Starting
values for covariate parameters were set to zero. Starting values for study-specific cutpoint
and accuracy parameters were calculated from continuity-corrected count data using formulae
(4). Starting values for the prior variability of study-specific cutpoints (σθ²) and accuracies
(σα²) were set larger than the observed variance of starting values (all were less than 2).
We set σθ²(0) = σα²(0) = 5 for all three tests and chose Γ−1(2.1, 2) priors for variance param-
eters. The Γ−1(2.1, 2) distribution has mean 1.82 and standard deviation 5.75, with percentiles
P0.05 = 0.41, P0.25 = 0.71, P0.50 = 1.12, P0.75 = 1.93 and P0.95 = 5.05. Neither Geweke statistics
nor results from the Heidelberger and Welch method indicated any failure to converge, and
scale reduction statistics indicated that the independent chains had converged to the same
distribution. Results present estimated posterior modes with 95 per cent credible intervals.
Probabilities corresponding to between-test comparisons were based on estimated posterior
probabilities.
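The moments quoted above for the Γ−1(2.1, 2) prior can be checked against the closed-form inverse-gamma formulas, mean = b/(a − 1) and variance = b²/((a − 1)²(a − 2)) for shape a and scale b. A small verification sketch (ours, not part of the original analysis):

```python
import math

def inv_gamma_moments(a: float, b: float) -> tuple:
    """Mean and standard deviation of an inverse-gamma distribution
    with shape a and scale b (both defined only for a > 2)."""
    mean = b / (a - 1.0)
    var = b * b / ((a - 1.0) ** 2 * (a - 2.0))
    return mean, math.sqrt(var)

# Prior used for the variance parameters in the paper:
mean, sd = inv_gamma_moments(2.1, 2.0)
# mean is about 1.82 and sd about 5.75, matching the values in the text
```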

4.4. Model checks


We computed goodness-of-fit measures separately for each modality. There was no evidence of
significant lack of fit in the HSROC model. Estimated TP and FP rates were close to observed
values (LAG: χ²(34) = 5.4; CT: χ²(38) = 9.1; MR: χ²(20) = 4.0; p > 0.999 for all tests) as were estimated
log-odds ratios (LAG: χ²(17) = 9.2, p = 0.93; CT: χ²(19) = 11.0, p = 0.92; MR: χ²(10) = 2.7, p = 0.99).
Normal distributions seemed to reasonably approximate the distribution of cutpoint parameters.
Table I compares the estimated distribution of sums of squares (Sθ, Sα) to their expected chi-
square distributions. Comparisons were based on the proportion of draws that were in the
upper and lower tails of the expected chi-square distributions. Because there were fewer large
sums of squares than expected, and in some cases more small sums of squares than expected,
we examined a model with t-priors for study-level parameters θi and αi. Results from the
model with t-priors are outlined in Section 4.6.
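The chi-square reference points in Table I are easy to reproduce. For even degrees of freedom the chi-square survival function has a closed form, P(X > x) = exp(−x/2) Σ_{k=0}^{ν/2−1} (x/2)^k / k!, which suffices to check the χ²(10) percentiles used for MR (a verification sketch; the function name is ours):

```python
import math

def chi2_sf_even_df(x: float, df: int) -> float:
    """Survival function P(X > x) of a chi-square variable with an
    even number of degrees of freedom, via the closed-form Poisson sum."""
    assert df % 2 == 0 and df > 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Reference points for the chi-square distribution with 10 df (MR row):
p_lower = chi2_sf_even_df(3.94, 10)    # close to 0.95: 3.94 is the 5th percentile
p_upper = chi2_sf_even_df(18.31, 10)   # close to 0.05: 18.31 is the 95th percentile
```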


Figure 2. Plots of naively estimated cutpoint parameters versus accuracy
parameters for each of the three tests: lymphangiography (LAG), computed
tomography (CT) and magnetic resonance (MR) imaging.

We examined the conditional independence assumption for study-specific model parameters
indirectly, via plots of starting values θi(0) versus αi(0). As shown in Figure 2, there does not
appear to be a strong association between cutpoint and accuracy parameters. The observed
correlation between θ(0) and α(0) was 0.12 (not significantly different from zero based on a
t-test). We concluded that the independence model is reasonable for these data.
Finally, we examined our data for outlying data points. None of the studies was identified
as influential based on Weiss' chi-square residuals. However, plots of raw data (Figure 1)
showed two LAG studies with outlying FP rates. Analyses were re-run with these points
excluded. Results from these analyses were similar to results based on the full data set.

4.5. Results from meta-analyses


Table II shows parameter estimates based on MCMC estimation. Although 95 per cent
credible intervals for all three scale parameters covered zero, we included scale parame-
ters in the model to allow asymmetry in estimated ROC curves. The largest estimated dif-
ference in scale parameters was between LAG and CT (β̂LAG − β̂CT = 1.15, 95 per cent
CI (−0.04, 2.37)). Figure 3 shows how differences in estimated scale parameters affect the
shapes of estimated SROC curves. These SROC curves were based on estimated expected
values of (ΛLAG, βLAG), (ΛCT, βCT) and (ΛMR, βMR), and were plotted over the observed ranges
of false positive rates for each test to avoid extrapolation beyond the data. We did not find
evidence of differences in accuracy parameters (Λ) or variance parameters (σθ² and σα²) across
tests, but there was evidence of differences in the positivity criteria. LAG tended to have a
higher positivity criterion than either CT (Θ̂LAG − Θ̂CT = 1.71, with 95 per cent CI (0.70, 2.88))
or MR (Θ̂LAG − Θ̂MR = 1.71, with 95 per cent CI (0.44, 3.17)). These differences in positivity


Table II. Hierarchical ROC parameter estimates: estimated posterior modes with 95 per cent credible
intervals in parentheses.

Parameter            LAG                        CT                         MR
                     Z1 = −0.413, Z2 = −0.217   Z1 = 0.587, Z2 = −0.217    Z1 = −0.413, Z2 = 0.783
Θ + γ1Z1 + γ2Z2      −0.093 (−0.686, 0.533)     −1.800 (−2.806, −1.028)    −1.807 (−3.121, −0.692)
σθ²                   0.456 (0.233, 1.000)       0.765 (0.350, 1.777)       1.014 (0.413, 2.952)
Λ + λ1Z1 + λ2Z2       2.326 (1.642, 3.028)       3.491 (2.169, 5.318)       3.920 (2.305, 6.166)
σα²                   1.060 (0.443, 2.667)       0.604 (0.267, 1.559)       0.929 (0.352, 2.962)
β + b1Z1 + b2Z2       0.600 (−0.261, 1.535)      −0.553 (−1.434, 0.218)     −0.350 (−1.352, 0.583)

Figure 3. Estimated summary receiver operating characteristic curves and expected operating points for
lymphangiography (LAG), computed tomography (CT) and magnetic resonance (MR) imaging, based
on hierarchical regression modelling. HSROC curves are plotted over the range of observed FP rates.

criteria were not demonstrated by the earlier fixed effect SROC analysis. In the context of
ROC analysis, cutpoints are often viewed as nuisance parameters. However, because cutpoints
directly affect the expected operating point (equation (5)), they also affect other estimates of
overall test performance.
As shown in Table III, we found that LAG was more sensitive and less specific than
both CT and MR. Point estimates of TP and FP rates for each modality are shown on
Figure 3. Because estimated TP and FP rates are based on highly non-linear functions of
HSROC parameters, they differ slightly from estimated operating points derived by substituting
estimates of Θ̂test, Λ̂test and β̂test into equation (5), and lie near, but not on, SROC curves.
Observed differences in sensitivity and specificity are consistent with our finding that LAG
had a higher overall positivity criterion (Θ) than both CT and MR. Figure 3 demonstrates
that differences in sensitivity and specificity are also affected by between-test differences in


Table III. Overall rates and likelihood ratios: posterior modes with 95 per cent credible intervals in parentheses,
and posterior probabilities for between-modality comparisons.

Test type        False positive         True positive          Likelihood ratio       Likelihood ratio
                 rate (FP)              rate (TP)              negative (LR−)         positive (LR+)
LAG              0.164 (0.089, 0.259)   0.683 (0.585, 0.774)   0.380 (0.270, 0.501)   3.89 (2.63, 7.55)
CT               0.069 (0.041, 0.106)   0.483 (0.310, 0.655)   0.562 (0.374, 0.733)   6.92 (4.44, 11.4)
MR               0.047 (0.019, 0.093)   0.541 (0.286, 0.771)   0.483 (0.246, 0.743)   9.99 (5.70, 25.2)
Prob(LAG>CT)     0.99                   0.98                   0.05                   0.08
Prob(LAG>MR)     0.99                   0.86                   0.23                   0.02
Prob(MR>CT)      0.17                   0.66                   0.31                   0.88

accuracy (Λ) and scale (β) parameters. We also found between-test differences in LR statistics.
LAG had a lower (better) LR− than both CT and MR, but also had a lower (worse) LR+
than both CT and MR. We did not find strong evidence for differences in the accuracy of
CT and MR. Though not statistically significant, MR tended to have better accuracy than CT,
with higher sensitivity and specificity than CT, lower LR− than CT and higher LR+ than CT.
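The plug-in operating points discussed above can be illustrated by substituting the Table II posterior modes into equation (5). The sketch below is our own illustration, using the LAG modes (−0.093, 2.326, 0.600); as noted in the text, plug-in values land near, but not exactly on, the posterior modes reported in Table III:

```python
import math

def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def operating_point(THETA: float, LAMBDA: float, beta: float) -> tuple:
    """Expected TP and FP rates at the mean cutpoint and accuracy, with
    disease status coded X = +0.5 / -0.5 as in the HSROC model."""
    tp = expit((THETA + 0.5 * LAMBDA) * math.exp(-beta / 2.0))
    fp = expit((THETA - 0.5 * LAMBDA) * math.exp(beta / 2.0))
    return tp, fp

# LAG posterior modes from Table II
tp, fp = operating_point(-0.093, 2.326, 0.600)
lr_pos = tp / fp                    # likelihood ratio positive
lr_neg = (1.0 - tp) / (1.0 - fp)    # likelihood ratio negative
# tp and fp fall close to the Table III modes (0.683 and 0.164)
```

The small gap between these plug-in values and the Table III modes reflects the non-linearity emphasized in the text: the posterior mode of a non-linear function is not the function of the posterior modes.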

4.6. Sensitivity analysis: choice of priors


Because the sums of squares statistics Sθ and Sα suggested that a heavier-tailed prior for
study-level parameters θ and α might be appropriate, we fit a model using a t-distribution
with 2 degrees of freedom for θ and α. We undertook this as a sensitivity check. Estimation
was based on a single long chain, with estimates from the normal-prior model (Table II)
used as starting points. The resulting model parameters were similar to estimates from the
normal-based model, and did not change the conclusions drawn from the model results. These
findings suggest that the normal model provides a reasonable approximation to these data.
We also explored the sensitivity of our model to the inverse gamma prior used for variance
parameters. As a sensitivity check, we fit a model using a Γ−1(2.1, 5) prior. Estimation based
on the Γ−1(2.1, 5) prior demonstrates the sensitivity of our model under fairly extreme prior
assumptions. The Γ−1(2.1, 5) puts little weight on small values, with a 5th percentile of 1.02.
We based estimation on a single long chain, with σθ²(0) = σα²(0) = 5 for all three tests and
estimates from the main model (Table II) used as starting points for remaining parameters.
Using the Γ−1(2.1, 5) prior, our estimated standard deviations increased and credible intervals for
estimated parameters widened. However, these changes did not affect any of our conclusions
about between-test comparisons. While variance parameters were somewhat sensitive to choice
of variance prior, other parameter estimates were not strongly affected by the choice of priors.

4.7. Discussion of findings about radiologic evaluation for lymph node metastasis
We found no detectable differences in the parameters that determine the SROC curves (Λ, β)
for LAG, CT and MR. However, there was evidence of differences in the positivity criteria
(Θ) and the performance of the three modalities. Lymphangiography (LAG) was the most
sensitive and least specific test. There was weak evidence that MR might be more sensitive
and more specific than CT. Based on likelihood ratio statistics, LAG was the best modality
for ruling out disease and MR was the best modality for ruling in disease.


SROC analysis allows between-modality comparisons that remove threshold effects and
instead capture the overall trade-off between sensitivity and specificity over a specific range
of thresholds. In this spirit, Scheidler and colleagues used the Q∗ statistic to remove the effect
of threshold on between-study comparisons. Their analysis did not find differences in either
sensitivity or LR statistics across tests, which they calculated at Q∗. We believe this is not
the best approach for these data because it ignores differences in the distribution of true and
false positive rates for LAG, CT and MR. Estimates of Q∗ based on original analyses were
0.74 for LAG, 0.80 for CT and 0.85 for MR. These are far different from our estimates of the
expected sensitivity and specificity of these tests. For MR in particular, Q∗ lies at the edge of
the data range. Because Scheidler and colleagues calculated LR statistics at Q∗, their estimates
do not reflect the distribution of studies across the estimated summary curve. Estimates of
LR+ based on original analyses were 2.85 for LAG, 4.00 for CT and 5.67 for MR. Estimates
of LR− based on original analyses were 0.35 for LAG, 0.25 for CT and 0.18 for MR.
Using the expected sensitivity and specificity, we found that each modality had higher LR
statistics than were found using the fixed effect analysis. That is, the modalities were better at
ruling in disease, but worse at ruling out disease, than the original analyses suggested. Thus,
removing the threshold effect can remove important between-study differences in expected test
performance.
Summary measures of test accuracy and clinical utility can have important implications for
selection of preoperative diagnostic tests. A recent study comparing diagnostic assessments
of women with cervical cancer to assessments in 1984 and 1990 found that the use of LAG
declined (from 6 per cent to 2.3 per cent) while the use of both CT and MR increased (3.82
per cent to 55.1 per cent for CT and 0.9 per cent to 5.6 per cent for MR) [36]. However,
HSROC analyses suggest that LAG may remain an important modality for ruling out lym-
phadenopathy, that is, identifying women who may be eligible for surgical treatment alone.
Our results also show that MR may be a better diagnostic tool than either LAG or CT for
ruling in disease, that is, identifying women who should have further cytological evaluation
of lymph nodes or who might go to non-surgical treatment without delay. Our study pro-
vides improved estimates of the diagnostic accuracy of these three modalities. Combining this
information about test accuracy with information about costs and treatment effectiveness using
decision analysis would provide further insight into the impact of these different modalities
on health outcomes.

5. DISCUSSION

The hierarchical summary ROC (HSROC) model for combining estimated pairs of sensi-
tivity and specificity from multiple studies extends the fixed-effects summary ROC (SROC)
model, more appropriately incorporating both within- and between-study variability, and allow-
ing flexibility in estimation of summary statistics. The HSROC model describes within-study
variability using a binomial distribution for the number of positive tests in diseased and not
diseased patients. An underlying ROC model that allows variability in both the positivity cri-
teria and accuracy across studies determines the binomial probabilities. Variation in positivity
criteria and accuracy is modelled using Normal distributions, with a linear regression in the
mean that allows dependence on study-level covariates. More heavy-tailed distributions (such
as t or Cauchy) can also be used instead of a Normal in the second level of the hierarchical

Copyright ? 2001 John Wiley & Sons, Ltd. Statist. Med. 2001; 20:2865–2884
2880 C. M. RUTTER AND C. A. GATSONIS

model. The model could also be extended to allow dependence between positivity criteria and
accuracy. We did not pursue this extension because the independence assumption corresponds
to our underlying ROC assumption that positivity criteria and accuracy are independent test
characteristics. Furthermore, the data we analysed did not demonstrate correlation between
naively estimated cutpoint and accuracy parameters. Even with this possible limitation, our
HSROC model allows a more complete accounting of between-study variability than is pos-
sible with fixed-effects formulations. In addition, the HSROC model provides more realistic
accounting of within-study variability than the fixed-effects SROC model (4), which uses a
Normal error distribution and does not account for the measurement error in the primary
covariate.
The HSROC approach provides a flexible modelling framework that can be extended when
more information is available. For example, when studies report results from more than one
modality, the hierarchical model can be appropriately extended to incorporate within-study
correlation. This extension requires information about the joint distribution of test results, either
from multiple similar pairs across several studies, from cross-tabulation of test results within
studies, or from patient-level data within studies. When patient-level information is available,
the within-study (level I) model can be extended to incorporate patient-level covariates. This
extended model can also be applied to data from a single study when results are clustered
within participating institutions and/or readers (see reference [37] for a hierarchical analysis
of ROC data).
The HSROC model assumes that disease status is assessed without error. In single-study
settings, errors in the reference standard used to determine disease state can be addressed
using latent variables (for example, references [38, 39]). A latent variable approach has also
been used in a recent extension of the SROC model [40]. Unfortunately, this extended SROC
method introduces other problems. The two-step estimation process first adjusts rates and then
estimates the SROC using these adjusted rates. The method for adjusting rates assumes that
both test characteristics (sensitivity, specificity) and rates of error in the reference standard are
constant across studies. These assumptions are difficult to justify, and estimated SROC standard
errors do not reflect the additional variability from this adjustment. On a more basic level, because
meta-analysis intends to combine similar information across studies, a reasonable reference
standard must be available before studies can be combined. Evaluation of a flawed reference
standard, and evaluation of a test compared to a flawed standard, requires in-depth study that
includes additional information, such as additional test results and clinical outcomes.
The fully Bayesian approach to model fitting, although computationally intensive, leads
to simulated values from the posterior distribution of the parameters, on the basis of which
the analyst can easily calculate summaries of the posterior distribution of a broad range of
functions of the parameters. For example, in our reanalysis of the Scheidler data we derived
estimates of functionals of the posterior distribution of likelihood ratio statistics, and estimated
the posterior probabilities of differences between modalities. The fixed effect SROC model
requires selection of a single operating point for calculation of likelihood ratio statistics. In
contrast, likelihood ratio statistics estimated using the HSROC model are calculated at the
average operating point on the summary ROC curve, so that these estimates incorporate the
distribution of data from individual studies across the ROC curve, enabling more accurate
estimation of clinical utility.
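Posterior probabilities of between-modality differences, such as those reported in Table III, are simple functionals of the saved MCMC draws. A minimal sketch of the idea follows, using simulated stand-in draws since the original chains are not reproduced here (the means and spreads are illustrative only):

```python
import random

random.seed(1)

# Stand-in posterior draws for the TP rates of two modalities; in the
# real analysis these would be the saved MCMC iterations for each test.
tp_lag = [random.gauss(0.68, 0.05) for _ in range(4000)]
tp_ct = [random.gauss(0.48, 0.09) for _ in range(4000)]

# Posterior probability that LAG is more sensitive than CT: the
# proportion of paired draws in which TP_LAG exceeds TP_CT.
prob = sum(a > b for a, b in zip(tp_lag, tp_ct)) / len(tp_lag)
```

Any functional of the parameters (likelihood ratios, differences, ratios) can be summarized the same way, by transforming each draw and then tabulating the transformed sample.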
The Bayesian model also allows description of sources of variability. The differences we
found in the variability of positivity criteria were consistent with the technological development


of these three tests. At one extreme, LAG was the standard diagnostic test during the meta-
analysed time period. At the other extreme, MR was a new diagnostic approach during this
time. The estimated variability of cutpoint parameters was low for LAG. The variability of CT
cutpoints was about one and a half times the variability of LAG cutpoints, and the variability
of MR cutpoints was more than twice the variability of LAG cutpoints. This suggests that
MR accuracy could be improved over time.
The advantages of the HSROC model come at a price: estimation requires Markov chain
Monte Carlo (MCMC) simulation. MCMC estimation requires programming, simulation, eval-
uation of convergence and model adequacy, and synthesis of simulation results. The proposed
model can be fit within publicly available software [21]. Implementation of MCMC sim-
ulation will still entail non-trivial analysis tasks, including evaluation of convergence and
the adequacy of prior distributions, and this requires some statistical expertise. However, the
increased complexity of the proposed analysis must be measured against the advantages of
the approach, including more realistic assumptions, more precise description of the impact of
covariates, and greater flexibility in choice of descriptive statistics.

APPENDIX A: CONDITIONAL DISTRIBUTIONS

The conditional distributions of level II parameters (Θ, Λ, σθ², σα², γ and λ) are standard
distributions or truncated versions of standard distributions. For example, the conditional dis-
tribution of Λ given α, Z, σα² and λ is proportional to a Normal distribution with mean
Σ_{i=1}^m (αi − λZi)/m and variance σα²/m over the range [Λ1, Λ2]. Variance parameters σθ² and σα²
have conjugate priors, so that

    (σα² | α, Z, λ, Λ) ∼ Γ−1( a1 + m/2, [ (1/2) Σ_{i=1}^m (αi − Λ − λZi)² + 1/a2 ]−1 )

The conditional distributions of Θ and σθ² are analogous. Finally, the conditional distribution
(λ | Λ, σα², α, Z) is proportional to a Normal distribution with mean

    Σ_{i=1}^m Zi(αi − Λ) / Σ_{i=1}^m Zi²

and variance σα²/(Σi Zi²) over the range [λ1, λ2].
The conditional distributions of level I parameters (θi, αi and β) are not standard. The
conditional distribution of study-specific accuracies, (αi | yi, ni, Zi, Λ, σα², θi, β), is a binomial-
Normal product

    exp( −(αi − (Λ + λZi))² / (2σα²) ) ∏_{j=1}^2 C(nij, yij) πij^yij (1 − πij)^(nij − yij)

where C(nij, yij) denotes the binomial coefficient. The conditional distribution of θi has a
similar form. The conditional distribution of the scale parameter, (β | y, n, Z, θ, α), is
proportional to the product of 2m binomials

    ∏_{i=1}^m ∏_{j=1}^2 C(nij, yij) πij^yij (1 − πij)^(nij − yij)

with positive probability over the range [β1, β2].
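The truncated-Normal conditionals described in this appendix can be sampled directly. For instance, the conditional for λ is a Normal with mean ΣZi(αi − Λ)/ΣZi² and variance σα²/ΣZi², restricted to its prior range. A rejection-sampling sketch with toy data (ours, not the paper's implementation):

```python
import math
import random

random.seed(7)

def draw_lambda(alpha, Z, LAMBDA, sigma2_alpha, lo, hi):
    """One Gibbs draw of the covariate parameter lambda from its
    truncated-Normal full conditional, via simple rejection sampling."""
    sz2 = sum(z * z for z in Z)
    mean = sum(z * (a - LAMBDA) for a, z in zip(alpha, Z)) / sz2
    sd = math.sqrt(sigma2_alpha / sz2)
    while True:
        draw = random.gauss(mean, sd)
        if lo <= draw <= hi:
            return draw

# Toy data: four study accuracies with centred +/-0.5 covariate codes.
alpha = [2.5, 1.5, 3.0, 1.0]
Z = [0.5, -0.5, 0.5, -0.5]
draws = [draw_lambda(alpha, Z, LAMBDA=2.0, sigma2_alpha=0.5,
                     lo=-10.0, hi=10.0) for _ in range(1000)]
```

With a wide truncation range, rejection is cheap; for tight ranges an inverse-CDF draw from the truncated Normal would be preferable.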


APPENDIX B: BUGS CODE

model dxmeta;
const
N =46;
var
CT[N],MR[N],test[N],fp[N],neg[N],tp[N],pos[N],
theta[N],alpha[N],pi[2,N],t[N],l[N],b[N],
THETA,LAMBDA,beta,gamma[2],lambda[2],bcov[2],
prec[2,3],sigmasq[2,3];
data CT, MR, test, tp, pos, fp, neg in "dxmeta.dat";
inits in "dxmeta.ini";
{
THETA∼dunif(-10,10);
LAMBDA∼dunif(-2,20);
beta∼dunif(-5,5);
for(i in 1:2){
gamma[i]∼dunif(-10,10);
lambda[i]∼dunif(-10,10);
bcov[i]∼dunif(-5,5);
for(j in 1:3){
prec[i,j] ∼ dgamma(2.1,2); sigmasq[i,j] <- 1.0/prec[i,j];
}
}
for(i in 1:N){
t[i] <- THETA+CT[i]*gamma[1]+MR[i]*gamma[2];
l[i] <- LAMBDA+CT[i]*lambda[1]+MR[i]*lambda[2];
theta[i]∼dnorm(t[i],prec[1,test[i]]);
alpha[i]∼dnorm(l[i],prec[2,test[i]]);
b[i] <- exp((beta+CT[i]*bcov[1]+MR[i]*bcov[2])/2);
logit(pi[1,i]) <- (theta[i] + 0.5*alpha[i])/b[i];
logit(pi[2,i]) <- (theta[i] - 0.5*alpha[i])*b[i];
tp[i] ∼ dbin(pi[1,i],pos[i]);
fp[i] ∼ dbin(pi[2,i],neg[i]);
}
}

REFERENCES
1. Irwig L, Tosteson AN, Gatsonis CA, Lau J, Colditz G, Chalmers TC, Mosteller F. Guidelines for meta-analyses
evaluating diagnostic tests. Annals of Internal Medicine 1994; 120(8):667– 676.
2. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. Journal of
Clinical Epidemiology 1995; 48(1):119 –130.
3. Kardaun JWPF, Kardaun OJWF. Comparative diagnostic performance of three radiological procedures for the
detection of lumbar disk herniation. Methods of Information in Medicine 1990; 29(1):12 – 22.


4. Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test into a summary
ROC curve: data-analytic approaches and some additional considerations. Statistics in Medicine 1993; 12(14):
1293 – 1316.
5. Rutter CM, Gatsonis C. Regression methods for meta-analysis of diagnostic test data. Academic Radiology
1995; 2(1):S48 – S56.
6. Hasselblad V, Mosteller F, Littenberg B, Chalmers TC, Hunink MG, Turner JA, Morton SC, Diehr P, Wong JB,
Powe NR. A survey of current problems in meta-analysis. Medical Care 1995; 33(2):202– 220.
7. Hasselblad V, Hedges LV. Meta-analysis of screening and diagnostic tests. Psychological Bulletin 1995;
117(1):167–178.
8. Shapiro D. Issues in combining independent estimates of sensitivity and specificity of a diagnostic test. Academic
Radiology 1995; 2(1):S37–S47.
9. de Vries SO, Hunink M, Polak J. Summary ROC curves as a technique for meta-analysis of the diagnostic
performance of duplex ultrasonography in peripheral arterial disease. Academic Radiology 1996; 3(4):361– 369.
10. Hanley JA. Receiver operating characteristic (ROC) methodology: the state of the art. Critical Reviews in
Diagnostic Imaging 1989; 29(3):307–335.
11. DuMouchel W. Bayesian meta-analysis. In Statistical Methodology in the Pharmaceutical Sciences, Berry D
(ed.). Dekker: New York, 1990; 509 – 529.
12. Morris C, Normand ST. Hierarchical models for combining information and for meta-analysis. In Bayesian
Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Oxford University Press: Oxford, 1992;
321–344.
13. Normand ST. Meta-analysis: formulating, evaluating, combining, and reporting. Statistics in Medicine 1999;
18(3):321–359.
14. Scheidler J, Hricak H, Yu KK, Subak L, Segal MR. Radiological evaluation of lymph node metastases in patients
with cervical cancer: a meta-analysis. Journal of the American Medical Association 1997; 278(13):1096 –1101.
15. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Chapman and Hall: New York,
1995.
16. Baker FD. Item Response Theory. Marcel Dekker: New York, 1992.
17. McCullagh P. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B 1980;
42(2):109 –142.
18. Tosteson ANA, Begg CB. A general regression methodology for ROC curve estimation. Medical Decision
Making 1988; 8(3):204 –215.
19. Toledano A, Gatsonis CA. Ordinal regression methodology for ROC curves derived from correlated data.
Statistics in Medicine 1996; 15(16):1807– 1826.
20. Gelfand AE, Smith AFM. Sampling-based approaches to calculating marginal densities. Journal of the American
Statistical Association 1990; 85(419):398 – 409.
21. Spiegelhalter D, Thomas A, Best N, Gilks W. BUGS: Bayesian inference Using Gibbs Sampling, version 0.5,
(version ii). MRC Biostatistics Unit: Cambridge, 1996. Program available at www.mrc-bsu.cam.ac.uk/bugs/.
22. Gilks WR, Richardson S, Spiegelhalter DJ. Derivative-free adaptive rejection sampling for Gibbs sampling. In
Bayesian Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Clarendon Press: Oxford, 1992;
641– 650.
23. Ritter C, Tanner MA. Facilitating the Gibbs sampler: the Gibbs stopper and the Griddy–Gibbs sampler. Journal
of the American Statistical Association 1992; 87(419):861– 868.
24. Best N, Cowles MK, Vines K. CODA: Convergence Diagnostics and Output Analysis Software for
Gibbs sampling output, Version 0.20. MRC Biostatistics Unit: Cambridge, 1995. Program available at
www.mrc-bsu.cam.ac.uk/bugs/.
25. Geweke J. Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bayesian
Statistics 4, Bernardo JM, Berger JO, Dawid AP, Smith AFM (eds). Clarendon Press: Oxford, 1992; 169 –194.
26. Heidelberger P, Welch P. Simulation run length control in the presence of an initial transient. Operations
Research 1983; 31(6):1109 –1144.
27. Gelman A, Rubin DB. Inference from iterative simulation using multiple sequences. Statistical Sciences 1992;
7(4):457– 511.
28. Weiss RE. Bayesian model checking with applications to hierarchical models. Technical Report, UCLA
Department of Biostatistics, 1996.
29. Boyko EJ. Ruling out or ruling in disease with the most sensitive or specific diagnostic test: short cut or wrong
turn? Medical Decision Making 1994; 14(2):175 –179.
30. Kristensen GB, Abeler VM, Risberg B, Tropé C, Bryne M. Tumor size, depth of invasion, and grading of the
invasive tumor front are the main prognostic factors in early squamous cell cervical carcinoma. Gynecologic
Oncology 1999; 74(2):245 –251.
31. Delgado G, Bundy B, Zaino R, Sevin BU, Creasman WT, Major F. Prospective surgical-pathological study of
disease-free interval in patients with stage IB squamous cell carcinoma of the cervix: a Gynecologic Oncology
Group study. Gynecologic Oncology 1990; 38(3):352–357.


32. Eifel PJ, Burke TW, Delclos L, Wharton JT, Oswald MJ. Early stage I adenocarcinoma of the uterine cervix:
treatment results in patients with tumors less than or equal to 4 cm in diameter. Gynecologic Oncology 1991;
41(3):199–205.
33. Morris M, Eifel PJ, Lu J, Grigsby PW, Levenback C, Stevens RE, Rotman M, Gershenson DM, Mutch DG.
Pelvic radiation with concurrent chemotherapy compared with pelvic and para-aortic radiation for high-risk
cervical cancer. New England Journal of Medicine 1999; 340(15):1137–1143.
34. Go8 BA, Muntz HG, Paley PJ, Tamimi HK, Koh WJ, Greer BE. Impact of surgical staging in women with
locally advanced cervical cancer. Gynecologic Oncology 1999; 74(3):436–442.
35. Holcomb K, Abulafia O, Matthews RP, Gabbur N, Lee YC, Buhl A. The impact of pretreatment staging
laparotomy on survival in locally advanced cervical carcinoma. European Journal of Gynaecological Oncology
1999; 20(2):90 – 93.
36. Russell AH, Shingleton HM, Jones WB, Fremgen A, Winchester DP, Clive R, Chmiel JS. Diagnostic assessments
in patients with invasive cancer of the cervix: a National Patterns of Care Study of the American College of
Surgeons. Gynecologic Oncology 1996; 63(2):159–165.
37. Gatsonis CA. Random e8ects models for diagnostic accuracy data. Academic Radiology 1995; 2(1):S14 –S21.
38. Qu Y, Hadgu A. A model for evaluating sensitivity and specificity for correlated diagnostic tests in efficacy studies
with an imperfect reference test. Journal of the American Statistical Association 1998; 93(443):920 – 928.
39. Joseph L, Gyorkos TW. Inference for likelihood ratios in the absence of a ‘gold standard’. Medical Decision
Making 1996; 16(4):412 – 417.
40. Walter SD, Irwig L, Glasziou PP. Meta-analysis of diagnostic tests with imperfect reference standards. Journal
of Clinical Epidemiology 1999; 52(10):943 – 951.

