
Biostatistics (2001), 2, 4, pp. 463–471
Printed in Great Britain
On the bias produced by quality scores in


meta-analysis, and a hierarchical view of proposed
solutions
SANDER GREENLAND
Department of Epidemiology, UCLA School of Public Health and Department of Statistics, UCLA College of Letters and Science, 22333 Swenson Drive, Topanga CA 90290, USA
lesdomes@ucla.edu

KEITH O'ROURKE
Department of Surgery and Clinical Epidemiology Unit, University of Ottawa, Ottawa, Canada
SUMMARY
Results from better quality studies should in some sense be more valid or more accurate than results
from other studies, and as a consequence should tend to be distributed differently from results of other
studies. To date, however, quality scores have been poor predictors of study results. We discuss possible
reasons and remedies for this problem. It appears that quality (whatever leads to more valid results) is
of fairly high dimension and possibly non-additive and nonlinear, and that quality dimensions are highly
application-specific and hard to measure from published information. Unfortunately, quality scores are
often used to contrast, model, or modify meta-analysis results without regard to the aforementioned
problems, as when used to directly modify weights or contributions of individual studies in an ad hoc
manner. Even if quality could be captured in one dimension, use of quality scores in summarization
weights would produce biased estimates of effect. Only if this bias were more than offset by variance
reduction would such use be justified. From this perspective, quality weighting should be evaluated
against formal bias-variance trade-off methods such as hierarchical (random-coefficient) meta-regression.
Because it is unlikely that a low-dimensional appraisal will ever be adequate (especially over different
applications), we argue that response-surface estimation based on quality items is preferable to quality
weighting. Quality scores may be useful in the second stage of a hierarchical response-surface model, but
only if the scores are reconstructed to maximize their correlation with bias.
Keywords: Empirical Bayes; Hierarchical regression; Meta-analysis; Mixed models; Multilevel modeling; Pooling;
Quality scores; Random-coefficient regression; Random effects; Risk assessment; Risk regression; Technology
assessment.

1. INTRODUCTION
A common objection to meta-analytic summaries is that they combine results from studies of disparate
quality. Study quality is not given formal definition in the literature, but quality appraisals usually
involve classifying the study according to a number of traits or items that are reported in or determinable
from the published paper. These traits are presumed to predict the accuracy of study results, where
accuracy is a function of both systematic and random error (Rothman and Greenland, 1998).

*To whom correspondence should be addressed.

© Oxford University Press (2001)

Often,





each of these quality items is assigned a number of points based on the a priori judgment of clinical
investigators, then summed into a quality score that purports to capture the essential features of the
multidimensional quality space (Chalmers et al., 1981). This score is then used as a weighting factor in
averaging across studies (Fleiss and Gross, 1991; Berard and Bravo, 1998; Moher et al., 1998), or used as
a covariate for stratifying or predicting study results.
Despite scientific objections to amalgamating quality items into a univariate score (Greenland,
1994a,b), and empirical findings of deficiencies in such scores (Emerson et al., 1990; Juni et al., 1999),
quality scoring and weighting remain common in meta-analysis (Cho and Bero, 1994; Moher et al., 1995,
1998). Rigorous justifications for weighting by quality scores are lacking or in error (Detsky et al.,
1992; Greenland, 1994a,b). Moher et al. (1998) claimed that weighting by quality scores resulted in the
least statistical heterogeneity, but the measure of heterogeneity used (deviance) was arbitrarily multiplied
by the quality weights; for instance, if the quality scores were all equal to 0.5, the pooled estimate would
not change but this new measure of heterogeneity would drop by 50%! Given the lack of valid justification
for quality weighting, it is worrisome that some statisticians as well as clinicians continue to use and even
promote such scores, even though alternatives have long been available; for example, a weighting scheme
based on precision and on magnitude of bias was proposed by Cox (1982).
We will show that, in general, quality-score weighting methods produce biased effect estimates. This
is so even when their component quality items capture all traits contributing to bias in study-specific
estimates. Furthermore, one cannot rationalize this bias as being a worthwhile trade-off against variance
without assessment of its size and its performance against other methods. We argue that quality-score
weighting methods need to be replaced by direct modeling of quality dimensions, in part because the best
estimate may not be any weighted average of study results. As an alternative to current uses of quality
scores we propose mixed-effects (hierarchical) regression analysis of quality items with a summary quality
score in the second stage. For this to work well the quality score needs to be constructed to maximize its
correlation with bias.
2. A FORMAL EVALUATION OF QUALITY SCORING
The following evaluation is based on a response-surface model for meta-analysis (Rubin, 1990). The
model posits that the expected study-specific effect estimate is a function of the true effect in the study,
plus bias terms that are functions of quality items. To illustrate, let i = 1, . . . , I index studies, and let X
be a row vector of J quality items; for example, X could contain indicators of study design such as use of
blinding (of patients, of treatment administration, of outcome evaluation) in meta-analyses of randomized
trials, or type of sampling (cohort, case-control, cross-sectional) in meta-analyses of observational studies.
Also define βi = expected study-specific effect estimate, θi = true effect of study treatment, and xi = value of X for study i. The study-specific estimate β̂i is usually the observed difference in an outcome measure among treated and untreated groups in study i, such as a difference in means, proportions, or log odds (in the latter case the expectations and biases discussed here are asymptotic); β̂i may also be a fitted coefficient from a regression of outcome on treatment type. In contrast, θi represents the object of inquiry, the actual impact of treatment on the outcome measure (Rubin, 1990; Vanhonacker, 1996).

Let x0 be the value of X that is a priori taken to represent the best quality with respect to the measured quality items. The response-surface approach models βi as a regression function of θi and xi. As a simple example, consider the additive model

βi = θi + bi + γi    (1)

where bi is the unknown bias produced by deviation of xi from the ideal x0. A common (implicit) assumption in quality scoring is that this ideal value is specified accurately, in the sense that there would be no bias contributed by measured quality items in a study with the ideal value x0. While this assumption

may be reasonable, there will still be the bias component γi of βi that is due to other, unmeasured quality items.

Model 1 has 3I parameters; because there are only I observations β̂1, ..., β̂I, one needs some severe constraints to identify effects. Many analyses employ models in which neither bi nor γi is present and in which the θi are constrained to be either a constant or random effects drawn from a normal distribution with unknown mean and variance. These models have been criticized for ignoring known factors that lead to variation in the effects θi across studies (heterogeneity of effects), such as differences in treatment protocols and subjects (e.g. differences in the age and sex composition of the study populations), as well as for ignoring known sources of bias and for increasing sensitivity of the analysis to publication bias (Cox, 1982, p. 48; L'Abbé et al., 1987, p. 231; Greenland, 1994a; Poole and Greenland, 1999). Those critics advocate instead a meta-regression approach to effect variation, in which measured study-specific covariates are used to model variation in θi or bi (e.g. Vanhonacker, 1996; Greenland, 1998).

Although we agree with the aforementioned criticisms, there is some uncertainty that fixed-effect summaries do not capture, and the addition of a fictional residual random effect to the model, however unreal, is often the only attempt made to capture that uncertainty. Hence, we begin by assuming that the θi can be treated as random draws from a distribution with mean μ and variance τ², and that the γi are zero (i.e. X captures all bias information), as do many authors (e.g. Bérard and Bravo, 1998). With these assumptions, model 1 simplifies to

βi = θi + bi = μ + bi + εi    (2)

where E(θi) = μ, E(εi) = 0 and Var(εi) = τ². Here, μ is the average effect of treatment in a hypothetical superpopulation of studies, of which the included studies are supposed to be a random sample.

Under model 2, bi may be viewed as the perfect but unknown quality score, in that if it were known one could unbiasedly estimate μ from the bias-adjusted estimates β̂i − bi, assuming that the random effects εi and the sampling-error residuals β̂i − βi are uncorrelated. In the same manner, unbiased estimation of the bi would also suffice for unbiased estimation of μ, assuming that errors in estimating bi are independent of other errors and of model parameters. Conversely, if the average study-specific bias were nonzero, information on the average bi would be necessary as well as sufficient for unbiased estimation of μ under model 2.
Quality-scoring methods
Quality-scoring methods replace the vector of quality items X with a unidimensional scoring rule s(X), where s(x) is a fixed function specified a priori that typically varies from 0 for a useless study to s(x0) for a perfect one (e.g. 0–100%). Define si = s(xi) for i = 0, 1, ..., I. The two main uses of the quality scores si are:

Quality weighting: average the study-specific estimates β̂i after multiplying the usual inverse-variance weight by a function of si. For example, if vi = var(β̂i | βi), the unconditional variance of β̂i is vi + τ², the usual random-effects weight for β̂i is ŵi = 1/(v̂i + τ̂²), and the usual quality-adjusted weight is si·ŵi (Bérard and Bravo, 1998).

Quality-score stratification or regression: stratify or regress the estimates β̂i on the si, then estimate μ as the predicted value at s0. For example, assuming bias to be proportional to s0 − si, fit the submodel of model 2,

βi = μ + (s0 − si)δ + εi,    (3)

then take μ̂, the predicted estimate for a perfect study, as the estimate of the average effect μ.
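As a concrete sketch of these two weighting rules, the following Python fragment (our illustration with invented numbers, not data from the paper; all variable names are ours) computes the usual random-effects weights ŵi = 1/(v̂i + τ̂²) and the quality-adjusted weights si·ŵi, and forms the corresponding weighted averages:

```python
# Hypothetical toy inputs: three studies with estimated within-study
# variances v_i, an assumed heterogeneity variance tau2, and quality
# scores s_i rescaled to the 0-1 range.
v = [0.04, 0.10, 0.25]   # estimated var(beta_hat_i | beta_i)
tau2 = 0.05              # assumed tau^2 estimate
s = [1.0, 0.8, 0.5]      # quality scores s_i

# Usual random-effects weights w_i = 1 / (v_i + tau^2).
w = [1.0 / (vi + tau2) for vi in v]

# Quality-adjusted weights s_i * w_i, as in Berard and Bravo (1998).
qw = [si * wi for si, wi in zip(s, w)]

def weighted_mean(estimates, weights):
    """u-weighted average of the study estimates."""
    return sum(u * b for u, b in zip(weights, estimates)) / sum(weights)

beta_hat = [0.10, 0.30, 0.50]       # hypothetical study estimates
print(weighted_mean(beta_hat, w))   # usual random-effects summary
print(weighted_mean(beta_hat, qw))  # quality-weighted summary
```

Note that the quality weights shift mass toward the high-scoring study; the next section shows why this reweighting does not remove bias.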


A problem quality weighting shares with most weighting schemes is that it produces biased estimates, even if X contains all bias-related covariates (as in model 2) and our quality score is perfectly predictive of bias. To see this, let ui be any study-specific weight and let a subscript u denote averaging over studies using the ui as weights: for example,

βu = Σi ui βi / Σi ui.    (4)

Under model 2,

βu = θu + bu = μ + εu + bu.    (5)

In words, βu equals the u-average effect θu plus the u-average bias bu. Thus, any unbiased estimator of βu will be biased both for the conditional average effect θu of the studies observed and for the superpopulation average effect μ (unless of course no study is biased or there is a fortuitous cancellation of biases). This problem arises whether we use the usual estimator, for which ui = ŵi = 1/(v̂i + τ̂²), or the quality-weighted estimator, for which ui = ŵi·si.
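The identity in (5) can be checked numerically: for any choice of weights, the u-average of the expected estimates exceeds μ by exactly the u-average bias. A minimal sketch with invented biases and weights (our code, not the paper's):

```python
# Whatever weights u_i are used, the u-weighted average of the expected
# estimates beta_i = mu + b_i equals mu + b_u (equation 5 with the
# random effects set to their zero expectations).
mu = 0.2                      # hypothetical superpopulation mean effect
b = [0.0, 0.15, -0.05, 0.30]  # hypothetical study biases b_i
beta = [mu + bi for bi in b]  # expected study estimates under model 2

def u_average(values, u):
    return sum(ui * x for ui, x in zip(u, values)) / sum(u)

for u in ([1, 1, 1, 1],           # unweighted
          [25, 10, 5, 2],         # inverse-variance-like weights
          [2.5, 4.0, 4.5, 0.2]):  # quality-adjusted-like weights
    # beta_u - mu equals exactly the u-average bias b_u:
    assert abs(u_average(beta, u) - mu - u_average(b, u)) < 1e-12
    print(u, u_average(beta, u) - mu)  # nonzero unless biases cancel
```

Reweighting only changes which average bias bu is incurred; it cannot make that bias vanish unless the biases happen to cancel.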
The best one can do with quality weighting is reduce bias by shifting weight to less biased studies. Ordinarily, however, this shift would increase the variance of the weighted estimator, possibly past the point of an optimal bias–variance trade-off. To appreciate the issue here, if bias reduction were our only goal and the bias bi declined monotonically with increasing quality score si, the minimum-bias estimator based on the scores would have weights ui = 1 for si = max(si), 0 otherwise. In other words, if bias is all that matters and we trust our quality ranking, we should restrict our average solely to studies with the maximal score (even if there is only one such study) and throw out the rest (O'Rourke, 2001). That this restriction is not made must reflect beliefs that there is useful information in studies of less-than-maximal quality, and that some degree of bias is tolerable if it comes with a large variance reduction. This is a pragmatic viewpoint, especially if one considers the inevitable inaccuracies of any real score for bias prediction. It also recognizes that errors due to bias are not necessarily more costly than errors due to chance, and are certainly not infinitely more costly (as is implicitly assumed by limiting one's attention to unbiased estimators).
Under model 2, the quality-score regression estimator μ̂ from model 3 will incorporate a bias E(μ̂) − μ. This summary bias will be zero if the s0 − si are proportional to the true biases bi; otherwise μ̂ can easily be far from μ, even if the scores are a monotone function of the bias bi and so properly rank the studies with respect to bias. As a simple numerical example, suppose the β̂i are log relative risks; there is no true effect (θi = μ = εi = 0), so βi = bi; one-quarter of studies have si = s0 = 100, bi = 0; one-half have si = 90, bi = ln(0.6); and one-quarter have si = 60, bi = ln(0.5). If vi is constant across studies, then (from a weighted least-squares regression of β̂i on s0 − si) we get E(μ̂) = −0.225, which results in a geometric mean relative risk from the quality regression of exp(−0.225) = 0.80, rather than the true relative risk of 1.00.
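This example can be reproduced directly. The following sketch (our code; it uses the stated fractions of studies as the regression weights) fits model 3 by weighted least squares and recovers the stated values:

```python
import math

# Weighted least-squares fit of model 3 to the paper's example:
# fractions 1/4, 1/2, 1/4 of studies with scores 100, 90, 60 and
# biases 0, ln(0.6), ln(0.5); no true effect, so beta_i = b_i.
x = [0.0, 10.0, 40.0]                    # s0 - s_i with s0 = 100
y = [0.0, math.log(0.6), math.log(0.5)]  # biases b_i (log relative risks)
w = [0.25, 0.50, 0.25]                   # fractions of studies

xbar = sum(wi * xi for wi, xi in zip(w, x))
ybar = sum(wi * yi for wi, yi in zip(w, y))
sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))

delta = sxy / sxx             # fitted coefficient of (s0 - s_i)
mu_hat = ybar - delta * xbar  # predicted value at s_i = s0

print(round(mu_hat, 3))   # -0.225
print(math.exp(mu_hat))   # about 0.80, versus the true value 1.00
```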
3. PROBLEMS WITH QUALITY SCORES AS BIAS PREDICTORS
When a quality scoring rule contains many items irrelevant to bias (e.g. an indicator of whether power calculations were reported), these items will disproportionately reduce some of the resulting scores. Inclusion of such items can distort rankings as well. The bias E(μ̂) − μ can be viewed as arising from misspecification of the true bias function b(X) by a one-parameter model [s0 − s(X)]δ. Given the multidimensional nature of X, one should see that this surrogate involves an incredibly strict constraint on b(X), with misspecification all but guaranteed.

The scoring rule s(X) is usually a weighted sum Xq = Σj qj Xj of the J quality items in X, where q is a column vector of quality weights specified a priori. Inspection of common scoring rules reveals

that the weights qj are fairly arbitrary item scores assigned by a few clinical experts, without reference to
data on the relative importance of the items in determining bias. Score validations are typically circular,
in that the criteria for validation hinge on score agreement with opinions about which studies are high
quality, not on actual measurements of bias.
Perhaps the worst problem with common quality scores is that they fail to account for the direction of bias induced by quality deficiencies (Greenland, 1994b). This failing can virtually nullify the value of a quality score in regression analyses. As a simple numerical example, suppose that trials are given 50 points out of 100 for having placebo controls and 50 points out of 100 for having validated outcome assessment, and that lack of outcome validation results in nondifferential misclassification. Suppose also that lack of placebo controls induces a bias in the risk difference of 0.1 (because then the placebo effect occurs only in the treated group), that lack of outcome validation induces a bias of −0.1 (because the resulting misclassification induces a bias towards the null), and that the biases are approximately additive. If n1 studies lack placebo controls but have validated outcomes and n2 lack validated outcomes but have placebo controls, the average biases among studies with quality scores of 0, 50, and 100 will be 0, 0.1(n1 − n2)/(n1 + n2), and 0. Thus, the score will not even properly rank studies with respect to bias, even though it will properly rank studies with respect to number of deficiencies.
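A short sketch of this cancellation (our code; the counts n1 = 3 and n2 = 1 are arbitrary illustrations, not from the paper):

```python
# Two 50-point quality items whose deficiencies bias the risk
# difference in opposite directions, with the biases taken as additive.
BIAS_NO_PLACEBO = 0.1      # bias when placebo controls are absent
BIAS_NO_VALIDATION = -0.1  # bias when outcome validation is absent

def bias(has_placebo, has_validation):
    """Approximate additive bias of a trial's risk-difference estimate."""
    return ((0.0 if has_placebo else BIAS_NO_PLACEBO)
            + (0.0 if has_validation else BIAS_NO_VALIDATION))

n1, n2 = 3, 1  # studies lacking only placebo / only validation
avg_bias = {
    0: bias(False, False),    # lacks both items: biases cancel
    50: (n1 * bias(False, True) + n2 * bias(True, False)) / (n1 + n2),
    100: bias(True, True),    # has both items: no bias
}
print(avg_bias)
# Scores 0 and 100 both have zero average bias; the mid-score group is
# the only biased one, so the score misranks the studies by bias.
```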
While the preceding example is grossly oversimplified, we think such bias-cancellation phenomena are inevitable when quality scores are composed of many items combined without regard to bias direction (e.g. those in Cho and Bero (1994) or Moher et al. (1995)). To avoid this problem, a quality score would have to be reconstructed as a bias score, with signs attached to the contributing items. The resulting summary score should then be centered so that zero indicates no bias expected a priori, and values above and below zero should indicate relative amounts of expected positive and negative bias in βi. Thus, if the β̂i are log relative risks but expected biases are elicited on the relative-risk scale, the latter expectations must be translated to the log scale when constructing the quality scores.
4. BEYOND QUALITY SCORES
Expert opinions may be helpful in determining items related to bias (i.e. what to include in X), although some checklist items are indefensible (e.g. study power conveys no information beyond that in the variance vi, which is already used in the study weight). Expert opinions may even provide a rough idea of the ranking of the items in importance. Nonetheless, a more robust approach, one less dependent on the vagaries and prejudices of such opinions, would employ a more flexible surrogate for bias than that provided by quality scores.
Quality-item regression
If the number of quality items is not too large relative to the number and variety of studies in the meta-analysis (which may often be the case upon dropping items irrelevant to bias), one can fit a quality-item regression model, such as

βi = μ + (x0 − xi)γ + εi    (6)

where γ is a vector of unknown parameters. This approach is equivalent to treating the item weight vector q as an unknown parameter (with δ absorbed into q). Conversely, quality-score regression (model 3) is just a special case of model 6 with γ specified a priori up to an unknown scalar multiplier δ, as may be seen from

(s0 − si)δ = (x0q − xiq)δ = (x0 − xi)qδ.    (7)


In other words, quality-score regression employs model 6 with the constraint γ = qδ, with δ unknown. From a Bayesian perspective, this constraint is a dogmatic prior distribution on γ that is concentrated entirely on the line qδ in the J-dimensional parameter space for γ.

Model 6 also corresponds to a J-parameter linear model for the bias function b(X). Under model 2, the bias inherent in this linearity assumption is again of the form E(μ̂) − μ. Because model 3 is a submodel of model 6, however, the bias of the model-6 estimator will be no more than the bias inherent in the quality-score regression; on the other hand, its variance may be much larger than that of the model-3 estimator.
A compromise model
The possible bias–variance trade-off between the model-3 and model-6 estimators suggests use of an empirical-Bayes compromise. One such approach relaxes the dogmatism of the quality-score constraint γ = qδ by using this constraint as a second-stage mean for γ. For example, we could treat γ as a random-coefficient vector with mean qδ and covariance matrix π²D², where δ and π² are second-stage parameters and D is a diagonal matrix of prespecified scale factors. D is often implicitly left as an identity matrix, but detailed subject-matter specifications are possible; see Witte et al. (1994) for an example. Reparametrizing by γ = qδ + η, where E(η) = 0 and cov(η) = π²D², the compromise model is then

βi = μ + (x0 − xi)(qδ + η) + εi = μ + (s0 − si)δ + (x0 − xi)η + εi.    (8)

This model may be fit by penalized quasi-likelihood (Breslow and Clayton, 1993; Greenland, 1997), generalized least squares (Goldstein, 1995), or Markov chain Monte Carlo methods (which require explicit specification of distributional families for η and εi). One can also pre-specify π² based on subject matter (Greenland, 1993, 2000); models 3 and 6 can then be seen as limiting cases of model 8 as π² → 0 and π² → ∞, respectively.
Theory, simulations, and applications suggest that the compromise model (8) should produce more accurate estimators than the extremes of models 3 or 6 if J ≥ 4 and the score s(X) correlates well with the bias b(X); otherwise, if the score correlates poorly with bias, the unconstrained model (6) or one that shrinks coefficients to zero might do better (Morris, 1983; Greenland, 1993, 1997; Witte and Greenland, 1996; Breslow et al., 1998). Hence, second-stage use of quality scores requires reconstructing the scores to maximize their correlation with bias, as discussed above. That reconstruction, a critical point of subject-matter and logical input to the meta-analytic model, thus appears to us as essential for any intelligent use of quality scores.
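To make the limiting behaviour of model 8 concrete, the following sketch (invented data and our own function names; it assumes equal study weights, D = I, J = 2 binary items, and q = (50, 50)) fits model 8 by least squares with penalty 1/π² on the η coefficients and checks the two limits against direct fits of models 3 and 6:

```python
import numpy as np

# Invented example: 6 studies, 2 binary quality items.
x = np.array([[1, 1], [1, 0], [0, 1], [1, 0], [0, 0], [0, 1]], float)
y = np.array([0.0, 0.1, -0.05, 0.2, 0.05, 0.0])  # hypothetical estimates
q = np.array([50.0, 50.0])  # assumed a priori item weights
x0 = np.ones(2)             # a priori best item values
dX = x0 - x                 # rows (x0 - xi)
ds = dX @ q                 # s0 - si, since si = xi q and s0 = x0 q

def fit_model8(pi2):
    """Coefficients (mu, delta, eta1, eta2) minimizing RSS + |eta|^2/pi2."""
    Z = np.column_stack([np.ones(len(y)), ds, dX])
    P = np.diag([0.0, 0.0, 1.0 / pi2, 1.0 / pi2])
    return np.linalg.solve(Z.T @ Z + P, Z.T @ y)

# Direct fits of the limiting models 3 and 6:
mu3 = np.linalg.lstsq(np.column_stack([np.ones(6), ds]), y, rcond=None)[0][0]
mu6 = np.linalg.lstsq(np.column_stack([np.ones(6), dX]), y, rcond=None)[0][0]

print(fit_model8(1e-8)[0], mu3)  # pi2 -> 0 recovers model 3
print(fit_model8(1e8)[0], mu6)   # pi2 -> infinity recovers model 6
```

The penalty term is what keeps the combined design identifiable: the score column ds is an exact linear combination of the item columns dX, so the unpenalized model 8 would be rank-deficient.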
5. DISCUSSION
Models 3, 6 and 8 are based on an assumption that the true effect θi varies in a simple random manner across studies; this assumption rarely has any support or even credibility in the light of background knowledge (Greenland, 1994a, 1998; Poole and Greenland, 1999). Nonetheless, this problem can be addressed by adding potential effect modifiers (factors that affect θi) to the model. Let Z be a row vector of modifiers, such as descriptors of the treatment protocol and the patient population, and let zi be the value of Z in study i. The true effects are now modeled as a function of Z, such as

θi = μ + ziα + εi,    (9)

which expands model 8 to

βi = μ + ziα + (s0 − si)δ + (x0 − xi)η + εi.    (10)

This model may be extended further by specifying a second-stage model for α, so that α (like γ) may have both fixed and random components.
All the models described above are special cases of mixed-effects meta-regression models described and studied by earlier authors (Cochran, 1937; Yates and Cochran, 1938; Cox, 1982; Raudenbush and Bryk, 1985; Stram, 1996; Platt et al., 1999). From this perspective, common quality-score methods are not only biased; they also do not exploit the full flexibility of these models. This flexibility offers an alternative to the strict reliance on expert judgment inherent in common methods, and should be sufficient to justify a shift in current practice. There are several major obstacles to this shift, however. Among them are a general lack of familiarity with mixed models, especially in the semi-Bayesian form that is arguably most useful in epidemiology and medical research (Greenland, 1993, 2000), and a lack of generalized-linear mixed-modeling modules within most packages used by health researchers. We regard as unrealistic the response we commonly get from academic statisticians, to the effect that everyone should just be using BUGS or some such Markov chain Monte Carlo software; this response seems oblivious to the subtle specification and convergence issues that arise in such use. Much more transparent and stable methods are available, such as mixed modeling with data-augmentation priors (Bedrick et al., 1996), which can be implemented easily with ordinary regression software by adding prior data to the data set (Greenland, 2001; Greenland and Christensen, 2001). If necessary, one may use extensions of likelihood methods to model variation in heterogeneity (Smyth and Verbyla, 1999).
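The data-augmentation idea can be sketched in a few lines (our illustration with simulated data, not the cited authors' code): a mean-zero normal prior with variance π² on a coefficient is imposed by appending one pseudo-observation per coefficient, with outcome 0 and design row ej/√π², and then running ordinary least squares on the augmented data:

```python
import numpy as np

# Simulated design with an intercept and two coefficients (eta1, eta2)
# that receive mean-zero normal priors with variance pi2.
rng = np.random.default_rng(0)
Z = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)
pi2 = 0.5  # assumed prior variance for the two eta coefficients

# Penalized (ridge-on-eta) solution, leaving the intercept unpenalized:
P = np.diag([0.0, 1.0 / pi2, 1.0 / pi2])
b_pen = np.linalg.solve(Z.T @ Z + P, Z.T @ y)

# The same solution via prior pseudo-data appended to the data set:
prior_rows = np.array([[0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]]) / np.sqrt(pi2)
Z_aug = np.vstack([Z, prior_rows])
y_aug = np.concatenate([y, [0.0, 0.0]])
b_aug = np.linalg.lstsq(Z_aug, y_aug, rcond=None)[0]

print(np.allclose(b_pen, b_aug))  # True: the two fits coincide
```

The augmented normal equations are Zᵀ Z + P exactly, which is why ordinary regression software suffices once the prior rows are added.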
Finally, we note that the overall validity of any meta-analysis is limited by the detail of information that
can be obtained on included studies, and by the completeness of study ascertainment (Light and Pillemer,
1984; Begg and Berlin, 1988). Such problems will be best addressed by improved editorial requirements
for reporting (Meinert, 1998) and by efforts to identify all studies and enter them into accessible online
registries.

REFERENCES

Bedrick, E. J., Christensen, R. and Johnson, W. (1996). A new perspective on generalized linear models. Journal of the American Statistical Association 91, 1450–1460.

Begg, C. B. and Berlin, J. A. (1988). Publication bias: a problem in interpreting medical data. Journal of the Royal Statistical Society, Series A 151, 419–463.

Bérard, A. and Bravo, G. (1998). Combining studies using effect sizes and quality scores: application to bone loss in postmenopausal women. Journal of Clinical Epidemiology 51, 801–807.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.

Breslow, N., Leroux, B. and Platt, R. (1998). Approximate hierarchical modelling of discrete data in epidemiology. Statistical Methods in Medical Research 7, 49–62.

Chalmers, T. C., Smith, H. Jr., Blackburn, B., Silverman, B., Schroeder, B., Reitman, D. and Ambroz, A. (1981). A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2, 31–49.

Cho, M. K. and Bero, L. A. (1994). Instruments for assessing the quality of drug studies published in the medical literature. Journal of the American Medical Association 272, 101–104.

Cochran, W. G. (1937). Problems arising in the analysis of a series of similar experiments. Journal of the Royal Statistical Society 4, 102–118.

Cox, D. R. (1982). Combination of data. In Kotz, S. and Johnson, N. L. (eds), Encyclopedia of Statistical Sciences 2, New York: Wiley.

Detsky, A. S., Naylor, C. D., O'Rourke, K., McGeer, J. A. and L'Abbé, K. A. (1992). Incorporating variations in the quality of individual randomized trials into meta-analysis. Journal of Clinical Epidemiology 45, 255–265.

Emerson, J. D., Burdick, E., Hoaglin, D. C., Mosteller, F. and Chalmers, T. C. (1990). An empirical study of the possible relation of treatment differences to quality scores in controlled randomized clinical trials. Controlled Clinical Trials 11, 339–352.

Fleiss, J. L. and Gross, A. J. (1991). Meta-analysis in epidemiology, with special reference to studies of the association between exposure to environmental tobacco smoke and lung cancer: a critique. Journal of Clinical Epidemiology 44, 127–139.

Goldstein, H. (1995). Multilevel Statistical Models, 2nd edn. London: Edward Arnold.

Greenland, S. (1993). Methods for epidemiologic analyses of multiple exposures: a review and a comparative study of maximum-likelihood, preliminary testing, and empirical-Bayes regression. Statistics in Medicine 12, 717–736.

Greenland, S. (1994a). A critical look at some popular meta-analytic methods. American Journal of Epidemiology 140, 290–296.

Greenland, S. (1994b). Quality scores are useless and potentially misleading. American Journal of Epidemiology 140, 300–301.

Greenland, S. (1997). Second-stage least squares versus penalized quasi-likelihood for fitting hierarchical models in epidemiologic analysis. Statistics in Medicine 16, 515–526.

Greenland, S. (1998). Meta-analysis. In Rothman, K. J. and Greenland, S. (eds), Modern Epidemiology, Chapter 32. Philadelphia: Lippincott, pp. 643–672.

Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics 56, 915–921.

Greenland, S. (2001). Putting background information about relative risks into conjugate priors. Biometrics 57, in press.

Greenland, S. and Christensen, R. (2001). Data-augmentation priors for Bayesian and semi-Bayes analyses of conditional logistic and proportional hazards regression. Statistics in Medicine 20, in press.

Jüni, P., Witschi, A., Bloch, R. and Egger, M. (1999). The hazards of scoring the quality of clinical trials for meta-analysis. Journal of the American Medical Association 282, 1054–1060.

L'Abbé, K. A., Detsky, A. S. and O'Rourke, K. (1987). Meta-analysis in clinical research. Annals of Internal Medicine 107, 224–233.

Light, R. J. and Pillemer, D. B. (1984). Summing Up: The Science of Reviewing Research. Cambridge, MA: Harvard University Press.

Meinert, C. L. (1998). Beyond CONSORT: the need for improved reporting standards for clinical trials. Journal of the American Medical Association 279, 1487–1489.

Moher, D., Jadad, A. R., Nichol, G., Penman, M., Tugwell, P. and Walsh, S. (1995). Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Controlled Clinical Trials 16, 62–73.

Moher, D., Pham, B., Jones, A., Cook, D. J., Jadad, A. R., Moher, M., Tugwell, P. and Klassen, T. P. (1998). Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 352, 609–613.

Morris, C. N. (1983). Parametric empirical Bayes: theory and applications (with discussion). Journal of the American Statistical Association 78, 47–65.

O'Rourke, K. (2001). Meta-analysis: conceptual issues of addressing apparent failure of individual study replication or inexplicable heterogeneity. In Ahmed, S. E. and Reid, N. (eds), Empirical Bayes and Likelihood Inference, New York: Springer, pp. 161–183.

Platt, R. W., Leroux, B. G. and Breslow, N. (1999). Generalized linear mixed models for meta-analysis. Statistics in Medicine 18, 643–654.

Poole, C. and Greenland, S. (1999). Random-effects meta-analyses are not always conservative. American Journal of Epidemiology 150, 469–475.

Raudenbush, S. W. and Bryk, A. S. (1985). Empirical-Bayes meta-analysis. Journal of Educational Statistics 10, 75–98.

Rothman, K. J. and Greenland, S. (1998). Accuracy considerations in study design. In Rothman, K. J. and Greenland, S. (eds), Modern Epidemiology, 2nd edn. Philadelphia: Lippincott, pp. 135–145.

Rubin, D. B. (1990). A new perspective. In Wachter, K. W. and Straf, M. L. (eds), The Future of Meta-Analysis, New York: Russell Sage, pp. 155–166.

Schulz, K. F., Chalmers, I., Hayes, R. J. and Altman, D. G. (1995). Empirical evidence of bias: dimensions of methodological quality associated with estimates of treatment effects in controlled trials. Journal of the American Medical Association 273, 408–412.

Senn, S. (1996). The AB/BA cross-over: how to perform the two-stage analysis if you can't be persuaded that you shouldn't. In Hansen and de Ridder (eds), Liber Amicorum Roel van Strik, Rotterdam: Erasmus University, pp. 93–100.

Smyth, G. K. and Verbyla, A. P. (1999). Adjusted likelihood methods for modeling dispersion in generalized linear models. Environmetrics 10, 695–709.

Stram, D. O. (1996). Meta-analysis of published data using a linear mixed-effects model. Biometrics 52, 536–544.

Vanhonacker, W. R. (1996). Meta-analysis and response surface extrapolation: a least squares approach. American Statistician 50, 294–299.

Witte, J. S., Greenland, S., Haile, R. W. and Bird, C. L. (1994). Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 5, 612–621.

Witte, J. S. and Greenland, S. (1996). Simulation study of hierarchical regression. Statistics in Medicine 15, 1161–1170.

Yates, F. and Cochran, W. G. (1938). The analysis of groups of experiments. Journal of Agricultural Sciences 28, 556–580.
[Received 6 November 2000; revised 4 April 2001; accepted for publication 6 April 2001]
