
Journal of Clinical Epidemiology 75 (2016) 6-15

Interpreting GRADE’s levels of certainty or quality of the evidence: GRADE for statisticians, considering review information size or less emphasis on imprecision?
Holger J. Schünemann*
Department of Clinical Epidemiology & Biostatistics, McMaster University Health Sciences Centre, Room 2C16, 1280 Main Street West, Hamilton,
ON L8N 4K1, Canada
Accepted 29 February 2016; Published online 6 April 2016

Abstract
This article responds to issues raised by Anttila et al. in the Journal of Clinical Epidemiology about the Grading of Recommendations Assessment, Development and Evaluation (GRADE) Working Group’s approach to rating imprecision and GRADE’s use of statistics. They argue that GRADE confuses statistical terms and should provide a stepwise rating of imprecision for making decisions. Here, a clarification of those perceptions is provided. GRADE’s rating of imprecision and other quality of evidence domains is an iterative process that may or may not consider people-important thresholds of effects when systematic review authors rate imprecision. Regardless of ratings in systematic reviews, those suggesting decisions, such as guideline panels, should consider whether they agree with or need to revise these suggested thresholds to make informed ratings about imprecision. Decision-relevant thresholds are the result of a complex interplay between critical outcomes for decision making. The certainty in the evidence of one critical outcome and the resulting possible certainty range, which I conceptualize in this article, may influence ratings of other outcomes. To relieve systematic review authors of the often challenging burden of defining worthwhile or important effects for judging precision based on the optimal information size (OIS), a modified OIS or review information size (RIS) could be used to rate imprecision at the systematic review stage. The RIS focuses only on plausible rather than plausible and worthwhile effects. The advantages of using the RIS include avoiding the reliance on statistical significance alone and the varying thresholds resulting from the importance and the baseline risk of the outcome on which the OIS relies. Finally, I argue that GRADE’s certainty in the evidence is related to the statistical definition of accuracy but, given GRADE’s broad application to other ratings of certainty such as qualitative evidence, statistical accuracy does not serve as a definition for GRADE’s quality or certainty in the evidence. © 2016 Elsevier Inc. All rights reserved.

Keywords: GRADE; Certainty of evidence; Optimal information size; Quality of evidence; Review information size; Systematic reviews

1. What issues are raised by Anttila et al.?

The article by Anttila et al. in this issue of the Journal of Clinical Epidemiology provides a welcome opportunity to reflect, from the perspective of an individual member of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) working group, on the alternative ways to conceptualize, interpret, and optimize the GRADE domains used to assess the quality of evidence, now preferentially labeled as “certainty in the evidence” (but also called “confidence in the effect estimates”; [1]). This reflection is based on lectures and workshops given over the last 5 or more years. Although I am co-chair of the GRADE working group, this article, written in response to an invitation to comment on the paper by Anttila et al., also reflects my experience with using GRADE in many guidelines and systematic reviews. Formal endorsement of some of the unpublished concepts described here will require discussion and debate with the GRADE working group.

Conflict of interest: The author is co-chair of the GRADE working group. He has no direct financial conflict of interest.
Part of the work has been presented at scientific conferences and at GRADE meetings by the author. However, the article is not an official statement of the GRADE Working Group.
* Corresponding author. Tel.: +1-905-525-9140x24931.
E-mail address: schuneh@mcmaster.ca
http://dx.doi.org/10.1016/j.jclinepi.2016.03.018
0895-4356/© 2016 Elsevier Inc. All rights reserved.

In an attempt to clarify terminology, GRADE now uses the term “criteria” for elements that lead to grading the
strength and direction of a recommendation or decision in the GRADE Evidence to Decision (EtD) Frameworks [2-4]. Domains describe the overarching considerations within these criteria. For example, the eight domains (limitations in the detailed study design and execution/risk of bias, imprecision, indirectness, etc.) determine the “certainty in the evidence” as one important criterion in an EtD framework. Individual items, for example, concealment of randomization as part of the risk of bias domain or I² as part of the inconsistency domain, are considered by raters to make judgments within these domains. Here, we are primarily concerned with the GRADE domains comprised in the GRADE criterion “certainty in the evidence” (quality of evidence), although they are inevitably tied to other criteria as I will describe in the following paragraphs. And, in the context of the certainty in the evidence criterion, Anttila et al. focused on the challenges systematic review authors (or those working on other evidence syntheses including health technology assessments) face in judging imprecision in the context of health care interventions. I gather from their article that they raise three key issues about the GRADE approach: (1) when, in the GRADE process, to assess imprecision; (2) how GRADE relates to statistical concepts of bias, precision, and accuracy; and (3) how to reduce possible confusion about the process of assessing imprecision. I will address these questions in general. My responses and the figures in this article are based on prior lectures, but the concept of the modified optimal information size (OIS), or review information size (RIS), on which I will expand here, has not been previously discussed elsewhere.

First, I believe that Anttila et al. are right in identifying challenges in rating imprecision related to thresholds and GRADE’s responsibility of making its approach more transparent. But, second, as I will argue in the following paragraphs, there is possible misunderstanding about GRADE’s intended use of the domain imprecision, its impact on the certainty of the evidence, and when and how often imprecision is considered and rated in the application of GRADE. GRADE’s approach is not in conflict but requires further guidance or revision that results from the practical application of GRADE. I will argue that the alternative suggested by Anttila et al. is already considered in GRADE’s stepwise approach of supporting decision making. Third, I believe that the key message of the article by Anttila et al. is a repackaging of GRADE’s principles in statistical terms, but one that mixes: (1) evidence assessment without the context of decision making about a specific Population, Intervention, Comparator(s) and Outcomes (PICO) question and (2) evidence assessment in the context of actual decision making about a specific PICO question. The repackaging may be helpful to achieve translation of principles for certain target audiences, but the use of statistical terms to describe its approach has not been a primary goal of GRADE and may neither clarify nor alter its principles for most other GRADE users. The mixing of the steps in the evidence assessment is more problematic and I will attempt to clarify this confusion. Fourth, there are inconsistencies in the line of argument and interpretation of Anttila et al.’s figures that possibly cause more confusion, which I will also try to clarify (e.g., there is no guidance about how to rate imprecision without meaningful effect sizes).

Finally, I will present some suggestions for how GRADE can move forward in response to these questions. Although the GRADE approach expands beyond interventions, here we focus on interventions rather than prognostic certainty or certainty in test accuracy.

2. Clarification about concepts and use of GRADE

To begin, several matters should be considered by those using GRADE and by those looking for statistical underpinnings in GRADE. I present these matters as background to my commentary:

1. GRADE attempts to be practical for systematic review authors and decision makers when they evaluate the certainty in the evidence. This evaluation will support a claim of high, moderate, low, or very low confidence in the estimates of an effect or an association, or support a decision.
2. Even after 16 years of intense work, GRADE’s development is not completed and the GRADE Working Group’s “modus operandi” includes a focus on dialog about methodological challenges related to assessing health evidence and applying this evidence to developing recommendations and making decisions. I reflect that what Anttila et al. describe as confusing (the determination of imprecision) seems to be largely captured in GRADE’s current alternative approaches to addressing imprecision as part of the systematic review and decision-preparation (e.g., guideline development) contexts. Further debate of the authors with the GRADE working group may alleviate concerns or stimulate additional work. The authors correctly point out that identifying “important benefits and harms” is challenging for systematic review authors. In fact, this challenge has been debated and there remain alternative beliefs about whether or not systematic review authors are generally equipped to do this. The debate is likely a result of the many different users of GRADE: those who can determine thresholds for important benefit and harm during the systematic review process and those who find it impossible. Rather than being burdened and forced to identify thresholds and to judge what is suggested as “conclusiveness,” systematic review authors should choose whether they want to and can complete a very final assessment of imprecision. GRADE leaves that choice by providing alternative (sometimes called “rule of thumb”) options [5].


Fig. 1. Iterative use of GRADE: judgments about imprecision and other domains take place at various stages during the process of assessing evidence. For decision making and in the preparation of recommendations, reconsideration of the work of systematic review authors and that of other guideline developers (during guideline adaptation) takes place in an iterative process. Systematic review authors will summarize the evidence in an evidence profile or summary of findings table, which will require review by decision makers. These decision makers will make or inform final judgments about the certainty in the evidence by (re)judging imprecision, indirectness, and other domains (and produce a final evidence summary and rating, e.g., in a summary of findings table, that support a recommended decision). *Usually applicable to non-randomized studies; &For instance a guideline panel or other decision makers.

3. Regardless of the approach systematic review authors take to determining or using thresholds for important benefit and harm, the assessments of certainty in the evidence must be reviewed or finalized by those developing recommendations or making decisions if they use that systematic review. Occasionally, this final group includes authors of the original systematic reviews but, more commonly, it happens later in the use of evidence syntheses as presented in Fig. 1. Fig. 1 emphasizes that the assessment of the GRADE domains is iterative and that a final rating of what Anttila et al. call conclusiveness is, in the practical application of GRADE, indeed taking place already. It often happens when balancing desirable and undesirable consequences using GRADE EtD Frameworks [2,6,7], formerly called decision tables [7-10]. Fig. 2 shows a hypothetical example of how in GRADE a judgment about an outcome might be influenced by the assessment of other outcomes. It emphasizes that the evaluation of margins and CI can take place during several steps in the application of GRADE. Let us assume that a rating of imprecision by systematic review authors has occurred (e.g., by using the OIS criteria currently described in GRADE; [5,11]). As a side note, the OIS is particularly useful for systematic review authors, but setting the delta for its calculation remains challenging. The delta (effect size) to calculate the OIS was originally defined by Pogue and Yusuf as “the minimal effect of treatment that would be considered to be worthwhile and biologically plausible” [12]. Review authors often struggle with determining what is worthwhile. However, a rating of imprecision by systematic review authors should facilitate the understanding that decision makers must carefully evaluate thresholds and margins based on the context they are dealing with. Moreover, the rating of a single outcome that users of GRADE are focused on in their initial rating of the certainty in the evidence (e.g., when preparing an evidence profile) may be revised later because it is influenced by the magnitude of the effect of other outcomes. Returning to Fig. 2, let us further assume systematic review authors are determining the delta (effect size) and OIS for a presumed harm resulting from an intervention that comes with several benefits. Results of the true effects for the benefits can influence a final judgment about the certainty in the harm because precision may be judged differently (based on CI) when revised with real estimates (rather than assumptions) about the benefits. In fact, the certainty in the evidence for the harmful outcome could be rated down by a guideline developer because of imprecision owing to shifting thresholds for acceptable harm. Shifting of thresholds can result from better estimates about benefits or differing importance ratings of the benefits (i.e., values attached to the benefits) (Fig. 2). More precise evidence would increase the guideline developer’s certainty about the balance of health benefits and harms. Knowing the effect for that harm (with certainty) could also influence the rating for the imprecision of benefits. In fact, once an intervention has more than one associated critical benefit and harm and the emphasis is on avoiding net harm [13], determining the OIS based on an acceptable delta also becomes complex for a decision maker such as a guideline developer. The preceding ratings of the systematic review author (for the purpose of the review) based on the OIS remain unaffected but could be altered if the systematic review author accepts the new delta in future revisions. Hence, rating imprecision is an iterative process, as is the rating of several of the other GRADE domains that influence the overall certainty in the evidence. For guidance development, it begins with those who conduct the evidence synthesis making initial judgments and is complemented by those making decisions, such as multidisciplinary guideline development processes or the like.

Fig. 2. Influence of ratings of outcomes on ratings of others during the preparation of decisions such as guideline development. The rating for one outcome is influenced by the rating of other outcomes, in particular because of issues with imprecision. Owing to thresholds set for serious adverse events after considering the plausible reduction in mortality, decision makers could alter the certainty in the rating for serious adverse events. This would also influence the overall certainty of evidence used for decision making. It would become moderate given that there is remaining uncertainty for balancing health benefits and harms. Hypothetical example.

4. Applying GRADE may help resolve confusion. Taking GRADE as scripture without witnessing its application will leave many questions unanswered and may lead to confusion among those who read GRADE guidance without real life application. In fact, Anttila et al. present a real but somewhat idealistic situation by focusing on systematic reviews out of context. Systematic reviews are intended to inform decision making, but disconnecting the task of rating certainty from making decisions presents real challenges for conceptualization of GRADE’s usefulness. This is because in many, certainly not all, situations the separation of doing the review from using
the review for decisions is void of setting appropriate thresholds for what is a person-important effect, despite being intimately connected to the latter. Anttila et al. may be appropriately confused as GRADE has not sufficiently emphasized the requirement for repeated assessment of the domains, in particular the imprecision domain, in the evidence synthesis to decision to evidence update cycle (Fig. 1). Furthermore, judgments about an individual certainty domain cannot and should not be made in isolation from another certainty domain [14]. For example, judgments about imprecision or indirectness may be influenced by judgments about inconsistency, and that is not solved by making a final judgment about conclusiveness based on confidence intervals (CIs) and margins because this assessment would require a reconsideration of the importance of indirectness and other elements. Furthermore, those applying GRADE in the context of decision making realize that the GRADE approach to assessing certainty or quality is an iterative and, moreover, repetitive process that is influenced by the nuances of the question a systematic review author or guideline developer attempts to answer. I believe that we should not lose sight of this important recognition and fundamental principle of the GRADE approach.
5. Despite the aforementioned description of the application of GRADE, the use of the OIS [12] with its original use in GRADE requires further clarification and perhaps alteration. I will expand on this issue with suggestions for discussion and resolution in detail in the following sections. The use of the OIS for rating imprecision is also currently debated by the GRADE Working Group and official guidance will be forthcoming.
6. Anttila et al. are probably correct that GRADE’s definition of certainty in the evidence resembles the statistical concept of “accuracy,” but for GRADE it is either accuracy of the effect estimate or accuracy sufficient for decision making after the final assessment of all domains by decision makers. The latter involves thresholds in a different way, and sometimes accuracy is sufficient when it allows certainty that an effect crosses a threshold rather than knowing the exact effect. Under those circumstances, it is really a greater degree of uncertainty that is tolerated. This uncertainty can be a result of imprecision (random error) or bias (systematic error). GRADE provides that definition of the certainty in the evidence. I will expand on the conceptualization of certainty ranges rather than focusing on imprecision alone. Pressing the GRADE domains into simplistic statistical or mathematical concepts or formulas is unlikely to solve the conceptual issues GRADE has dealt with in supporting development of thousands of recommendations; in particular, it will not solve conceptual issues in the context of qualitative evidence [15]. What Anttila et al. label as bias, precision, accuracy, and conclusiveness are statistical and interpretative expressions that GRADE has captured in its certainty domains and iterative process. In fact, GRADE attempts to label the different types of sources of bias and random error (imprecision) in its downgrading domains and acknowledges that judgments about each of these factors are strongly context dependent and require repeated, iterative assessment. The GRADE working group also emphasizes that the magnitude and direction of uncertainty introduced by downgrading are largely unknown (at present). This direction and magnitude are better understood for the domain imprecision, but I will elaborate in the following paragraphs on the limits of the assessment of imprecision in relation to the other domains.
7. Clarity is required when referring to effect sizes and CIs as to whether one is referring to relative or absolute effect estimates. GRADE has emphasized that judgments about imprecision should focus on absolute effects despite the use of relative estimates for plausible effects (they are applied to baseline risks to derive absolute effects) [5].
8. Most of the domains influencing GRADE’s certainty of evidence ratings are quantitatively unexplored (e.g., magnitude of the uncertainty resulting from serious risk of bias or indirectness). This uncertainty makes quantitative estimates of the degree of bias challenging and requires judgments that are often influenced by the importance of the health outcomes. The issue of whether or not imprecision and various sources of bias should be part of an overall certainty in the evidence rating is a (clinical) epidemiologic and policy question, not a statistical one. GRADE makes (at present) no attempt to press its certainty in the evidence domains into a purely quantitative approach, but I will offer an interpretation of GRADE’s certainty in the evidence that is semi-quantitative and encompasses imprecision and other certainty domains. Decision makers should understand this uncertainty.

Fig. 3. Downgrading would occur both for a summary effect of A and B if the threshold effect size corresponded to the red vertical line. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.) Adapted from Fig. 1 by Anttila et al.

9. The GRADE working group carries the responsibility, together with users of GRADE, to provide clear guidance about the entire process of using GRADE.
GRADE has made progress in that field with its to the confusion, the concept of imprecision deployed in
DECIDE project [2e4,16]. But, despite prolific the Grade framework is not well defined’’). Although clar-
writing by members of the GRADE working group, ification of the use of the imprecision criterion is helpful,
important gaps remain. GRADE probably never attempted to use imprecision only
in the context of its presumed statistical meaning. GRADE
3. Comments on figures is not for statisticians alone, nor is it a pure statistical
method. It is a health research application method and
Anttila et al. should have probably indicated if they are approach. It contains elements of statistics and benefits
referring to relative or absolute effects (for simplicity I will from the discipline but cannot depend on it. Part of the
use the figures the authors provided). GRADE describes reason for using concepts the way GRADE describes them
how this consideration will be important for decision mak- is the reality its users face in decision making and making
ing [5]. For their Fig. 1, the authors state ‘‘Assume that the judgments, for example, about thresholds that differ based
OIS criterion is met for both results A and B. A excludes on the specific clinical or public health questions which
‘‘no effect,’’ while B includes it. Therefore, the systematic imprecision in the statistical sensu strictu often fails to
reviewer should rate down for imprecision in case B, but consider. GRADE also is for qualitative approaches to as-
not in case A.’’ Given my arguments and elaboration previ- sessing evidence [15].
ously, this statement is not correct without knowing where
the review author sets the delta or effect size. If the effect
size is set to a vertical line touching CIs of A and B (indi- 5. Relation of judgments about imprecision to judg-
cated by red vertical line), downgrading would occur in ments about other domains
both instances (Fig. 3). Anttila et al. state, related to their Anttila et al. are correctly stating ‘‘The quality of
Fig. 1, ‘‘It is obviously confusing that the reviewer should evidence, therefore, appears to be the reviewers’ degree
have more trust in the closeness in case A than he or she of confidence in the closeness of a parameter value to
does in B.’’ A review author may not find it at all confusing an estimated value.’’ However, because of the confusion
depending on where she or he sets the effect that is defined that is caused by the word confidence and its use in CIs,
by the delta of a plausible effect. If the plausible effect is GRADE suggests using certainty. In fact, I believe cer-
included in the CI, with regard to imprecision, she might tainty intervals that are independent of the domain that
appropriately lower the certainty in the evidence regardless is being assessed to describe the conceptual underpinnings
of the width of the CI. Whether a decision maker would of certainty in the effect based on all domains may be
agree with this judgment depends on the required magni- helpful. This approach helps de-emphasizing the apparent
tude of the effect for decision making and if thresholds importance of the imprecision domain. Any of the cer-
are crossed. That is the effect lies beyond a set threshold tainty domains influence the possible probability distribu-
which is dependent on the effects an intervention has on tion and certainty in the true estimate of effect. However,
other outcomes and other criteria in the EtD framework. only for imprecision is the distribution of that certainty in-
As to Anttila et al.’s Fig. 2, given the definition of the terval known (given by the CI). And, importantly, it is on-
OIS previously, the OIS may or may not be met in D and ly known because we usually assume a Gaussian (normal)
F because it depends on the question if harm or benefit distribution or some other known distribution. For the
are the outcome of interest and how much harm or benefit other domains, neither the shape nor the direction or de-
is required or acceptable for decision making (in D the gree of distortion (bias) is definitely known, and alterna-
delta for harm is not crossed). It underlines the importance tives are conceivable. As soon as a reason for lowering
of balancing health outcomes and desirable and undesirable the certainty in the evidence is added to imprecision, also
consequences during decision preparation or making. In the distribution and width of the CI becomes unknown.
that context and with clear assumptions and descriptions, The direction of bias and distribution for the other do-
I do think confusion is not present. The issue is not if the mains could be surmised sometimes, for example, one
estimate is close to the parameter estimate but where to can assume the direction (overestimation) of the effect
set the thresholds for accepting imprecision or uncertainty. when publication bias based on small, for profit-funded
Furthermore, the use of the OIS is of primary relevance in studies that indicate an effect is present. Similarly, judg-
the context of large effects [5]. ments about indirectness, inconsistency, and risk of bias
may be directional. But they can rarely (until now) be
quantified to the degree, we quantify judgments about
4. GRADE’s use of terms
imprecision. Consider the update of a systematic review
Anttila et al. describe GRADE’s use of statistical terms assessing the impact of heparin in patients with solid can-
(‘‘The Grade framework contains terms familiar from clas- cer and no other indication for anticoagulation on the out-
sical statistics, but these terms are used in nonstandard comes death, symptomatic venous thromboembolism
ways. Notably, ‘‘imprecision’’ does not have the meaning (VTE) and bleeding. Nine randomized controlled trials
in the GRADE framework that it has in statistics. Adding in almost 6,000 patients indicated that the relative risk

Descargado para Anonymous User (n/a) en ClinicalKey Espanol Colombia, Ecuador & Peru Flood Relief de ClinicalKey.es por Elsevier en mayo 09, 2017.
Para uso personal exclusivamente. No se permiten otros usos sin autorización. Copyright ©2017. Elsevier Inc. Todos los derechos reservados.
12 H.J. Sch€unemann / Journal of Clinical Epidemiology 75 (2016) 6e15

Fig. 4. Conceptualizing certainty ranges based on GRADE’s certainty in evidence domains. Hypothetical modification of the certainty for the impact
of heparins on venous thromboembolism based on nine randomized controlled trials. Each GRADE domain can lead to lowering the certainty.
Except for the imprecision domain the width and distribution of the certainty range is unknown at present (for imprecision it is known due to dis-
tribution assumptions such as Gaussian distribution). The confidence intervals are widened and transformed to certainty ranges by serious or very
serious concerns about any of the downgrading domains. (A) presents a symmetrical distribution for different levels of certainty. (B) presents (the
more likely) asymmetrical distributions for different levels of certainty.

for VTE is reduced by 43% (95% CI, 19%–60%; [17]). For patients with a plausible baseline risk of approximately 4.6% per year, this relative effect suggests that heparin leads to an absolute risk reduction of 20 fewer VTEs (95% CI, 9–27 fewer) per year [17]. Figs. 4A and 4B represent the effect estimate and a distribution of the likelihood of the effects based on different levels of certainty. Now, consider that (hypothetically) the review authors would have lowered the certainty in the evidence as a result of indirectness. Although the CI would remain unchanged, the certainty in that CI and in the point estimate will be lowered. In fact, a certainty range of unknown shape and width results. The certainty range will take a different shape than the CI, resulting in mostly unknown distributions (see Fig. 4B for a hypothetical example). The lower the certainty in the evidence, the less will be known about the shape and width of this certainty range. However, decision makers must consider that uncertainty related to the effect measure. Furthermore, it makes evident that the CI (and imprecision) is only one domain that influences overall uncertainty. Uncertainty resulting from imprecision (although it can be calculated) may be not different from that of indirectness, or any other domain, in the context of decision making. Conceptualizing the certainty in the evidence using this approach has been suggested for consideration to the GRADE working group but will require further operationalization and discussion.
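
The step from the relative effect to the absolute effect quoted above can be retraced with a short calculation. The sketch below is illustrative only: it assumes that the "fewer VTEs" figures are counts per 1,000 patients (the reading that matches the arithmetic) and that the absolute effect is simply the baseline risk multiplied by the relative risk reduction. The function name is hypothetical and not part of any GRADE tool.

```python
def fewer_events_per_1000(baseline_risk: float, relative_risk_reduction: float) -> float:
    """Absolute risk reduction expressed as events avoided per 1,000 patients.

    Assumes the simple relation ARR = baseline risk x relative risk reduction.
    """
    return 1000 * baseline_risk * relative_risk_reduction


BASELINE_RISK = 0.046  # approximately 4.6% per year, as stated in the text

# Relative risk reduction: point estimate and 95% CI limits from the text.
point = fewer_events_per_1000(BASELINE_RISK, 0.43)
lower = fewer_events_per_1000(BASELINE_RISK, 0.19)
upper = fewer_events_per_1000(BASELINE_RISK, 0.60)

print(f"{point:.1f} fewer per 1,000 (95% CI, {lower:.1f} to {upper:.1f} fewer)")
```

With these inputs the calculation gives roughly 19.8, 8.7, and 27.6 fewer events per 1,000, which is in line with the reported 20 (9–27) fewer. The small difference at the upper limit presumably reflects rounding of the baseline risk ("approximately 4.6%") and of the relative effect.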

6. Resolving challenges with rating imprecision

A viable, and frequently discussed, alternative to judgments about imprecision based on thresholds is to focus solely on the width of the CIs that indicates the presence of an effect or focus on the probability of chance suggesting the effect (P values). This latter approach could be implemented by simple reliance on conventional statistical significance about the effect of an intervention on an outcome at the systematic review stage. This approach would relieve the systematic review author of the burden of determining effect sizes that the OIS requires. GRADE believes the reliance on statistical significance alone is dangerous, in part because of spuriously significant and large effects [5]. Thus, the current approach is to rely, in part, during an initial rating on the OIS. However, the OIS requires specifying a delta or effect size (in addition to a realistic baseline event rate and the alpha and beta error) to calculate the sample size for a single trial. As a reminder, the delta (effect size) to calculate the OIS is "the minimal effect of treatment that would be considered to be worthwhile and biologically plausible" [12]. Thus, the delta, unfortunately, in the original definition of the OIS, includes the element of "importance" by focusing on a worthwhile effect. This worthiness, in the final balance of health benefits and harms, is strongly dependent on the magnitude of the intervention effect on other outcomes. Thus, while challenging for decision makers, arriving at the best delta is even more challenging for systematic review authors who may be little involved in health care decisions (they should consult with relevant clinicians, of course).

GRADE suggests, beyond CIs, using a 25% relative risk reduction (which is empirically based) to facilitate the judgment of systematic review authors about imprecision and calculation of the OIS. GRADE acknowledges and emphasizes that this is a simple, perhaps overly simple, "rough guide" [5]. Perhaps GRADE has failed by emphasizing the use of arbitrary relative intervention effects as such rough guides. In fact, although this suggestion is based on commonly observed (relative) effect sizes, it may be misleading because the choice of an important relative effect in the OIS is tied to the baseline risk and the importance of the outcome (leading to an absolute rather than a relative effect). With the OIS inevitably tied to "importance" of the outcome and considerations about baseline risk, the use of a 25% or similar relative risk reduction or relative risk increase as guidance might be insufficiently context specific. The important relative risk reduction would be smaller for outcomes with higher baseline risk and greater importance and vice versa. While meant as a facilitator, the question arises whether GRADE should abandon this "rule of thumb" or rough guidance because it is too confusing.

As a dogmatic alternative, the systematic review author could simply rely on statistical significance of existing evidence, and users of the systematic review could assess imprecision on the basis of important effects (the original OIS) in a second step. Another, not so dogmatic, alternative is to focus on plausible effect sizes and use a sample size calculation approach that would be entirely based on a plausible (and not worthwhile) delta or effect size, perhaps called a modified OIS or RIS. These plausible effects or RIS would be based on existing assumptions and evidence (sometimes already done for sample size calculations in trials). For instance, when evaluating the effects of a new oral anticoagulant as replacement for warfarin for patients with atrial fibrillation, the most likely delta or effect size for the RIS should be based on the effects of established anticoagulants, not on a 25% standard relative risk reduction. In this case, the assumption may be equivalence in a head-to-head comparison. In the hypothetical case (because it would be inappropriate to not provide treatment to the patients at elevated risk) of a comparison against no specific treatment such as placebo, the plausible effect size would be based on the comparison of warfarin against placebo (approximately a 67% relative risk reduction for the occurrence of stroke). This avoids reference to important benefits and harms, which must be judged on the basis of detailed knowledge about the baseline risk and outcome importance for the target population. It focuses on plausible effects. The source and derivation of this plausible effect, of course, has been a topic of existing debates including for clinical trials. At the stage of evaluating the imprecision domain in the context of a systematic review this may be a reasonable solution to reduce burden on authors who are not in a position to determine the OIS based on importance or worthiness of the effect. For users of GRADE, the solution should be flexible, iterative and, most importantly, transparent (by stating the RIS explicitly), correct, and helpful.

GRADE (probably correctly) assumes, as mentioned previously, that some systematic review authors will be able to make assumptions about important effects (the original OIS), but others, perhaps many, will not. The second alternative of using an RIS (both for excluding an effect and concluding that there is an effect during the systematic review process) based on plausible effect sizes alone is reasonable and possibly already practiced by some systematic review authors. Thus, contrary to the approach of rating conclusiveness only after the assessment of the evidence for imprecision (how?) and bias as suggested by Anttila et al., precluding systematic review authors from using important or plausible effect sizes to judge imprecision seems unrealistic. GRADE must deal with the true definition and implications of using the original OIS definition that includes worthiness or importance of the effect. The GRADE approach should consider having those who are able to make assumptions about the effect size based on importance and plausibility use the original OIS and describe their assumptions. Those who feel uncomfortable with making judgments about the importance of an effect can use the plausible (rather than plausible and worthwhile) effect size in a modified OIS or RIS approach. Thus, systematic review authors would assess imprecision by evaluating
the RIS based on plausible estimates and CI. When plausible estimates cannot be obtained, as per GRADE guidance they can stay with the rough guide of a 25% relative risk reduction or increase in the absence of other plausible evidence. In any case, raters must make their assumptions about effect sizes for judging imprecision transparent. Guideline users can then apply the OIS and other concepts described previously to judge imprecision.

7. No alternative framework

Anttila et al. suggest an alternative approach to GRADE that implies that bias and imprecision can be simply calculated as accuracy. Anttila et al.'s Fig. 3 presents a misunderstanding of the GRADE approach in that it makes it appear as if the certainty is not influenced by how conclusive the evidence is based on margins that are set. In fact, final judgments about imprecision and its rating depend on the type of question and the situation one faces. I do not argue that, as is the case of the statistical approach to GRADE that Anttila et al. describe, imprecision (as conclusiveness) would be reconsidered after other domains (statistically summed up as accuracy) have been assessed. I argue that correct application of GRADE in the decision-making context already includes this consideration. In fact, an assessment of imprecision before the decision-making situation may have a different purpose (indicating that more research should be done). Through the use of the OIS and related guidance, GRADE attempts to support systematic review authors in making judgments about this domain for single outcomes. It also allows systematic review authors to provide initial judgments to support those developing recommendations. But the final consideration of imprecision (in an evidence profile or summary of findings table) that informs a recommendation or decision is question dependent and depends on the total number and effect sizes of other outcomes and on the overall evaluation of the desirable and undesirable consequences, which sometimes can be expressed as "time, effect, cost, quality" products (Fig. 2). In fact, some GRADE members would argue that CI margins are entirely dependent on the balance of all criteria in the GRADE EtD frameworks and, therefore, the final certainty rating depends on that balance. Given the many nuances in balancing these criteria, a strictly statistical approach seems, at present, infeasible. Furthermore, although Anttila et al. include imprecision in the rating exercise, they do not provide an alternative for how to judge it. That is, it is unclear how a judgment about imprecision would be made in their alternative framework (Anttila et al.'s Fig. 3). Is it by considering statistical significance and, if so (or even if not), what criteria would be considered?

Anttila et al. state "In this example bias, precision, and accuracy, are all explicitly expressed, and the relationships between the concepts are clear." I do not think they are clear beyond the formulas. Consider the example of heparin in solid cancers previously mentioned. The certainty in the evidence was rated as high by the systematic review authors as there were no serious concerns about any of the GRADE domains for the certainty of evidence. All studies strongly focused on patients with solid tumors. What if a guideline group would ask about the impact of heparin in patients with hematologic tumors based on this evidence (no direct evidence exists)? GRADE would acknowledge that bias (and/or imprecision) exists in applying the results of the nine randomized trials to patients with hematologic cancer. Thus, guideline developers would identify population indirectness and probably downgrade the evidence to moderate or even low. What does this downgrading imply? The downgrading implies that the effect estimate described previously would be altered as a result of possible bias resulting from indirectness. In many instances, the direction and magnitude of this bias are unknown. A systematic review author will not be able to make this judgment without consultation. In fact, I doubt that for this and other reasons described previously "Conclusiveness could also replace quality of evidence as the final step for a systematic reviewer" as suggested in the article. An additional problem is that the size and direction of the bias remain unknown and, therefore, a systematic review author could express uncertainty, but cannot express the degree of uncertainty. I believe that with sufficient empirical evidence we will, over time, be able to estimate the magnitude and direction based on large registries of meta-epidemiologic data.

8. Way forward

Anttila et al. state that "Our analysis suggests that in the GRADE guideline articles the key notions of quality of evidence and imprecision are, independently, but especially when taken together, a source of serious confusion, and that this may impede the practical process of evidence formation." I emphasize the importance of expressing concerns, but the success of GRADE and its application in probably hundreds of guidelines suggests that GRADE has not failed completely. GRADE, as much else in the biomedical sciences, can and should be improved. This is why GRADE continues to exist and develop. Empirical data about the authors' suggested confusion rather than theoretical concerns are required. The issues addressed here are not new in terms of challenges when applying imprecision. In fact, the detailed considerations required to judge imprecision (that are described by the GRADE working group) in the different contexts are testimony to the complexity of the concepts. However, I agree that additional guidance will be helpful for review authors and those applying GRADE in the decision-making context. Anttila et al. state "Several other issues in the GRADE guidance need to be discussed, but not here. Possibly, the most important issue concerns the difference between evaluating evidence in a systematic review and giving guidelines [10], and specifically, the question how an evidential value is to be transformed into a value used for regulatory purposes." Peer-reviewed exchanges are one approach to moving the field forward. However, they
seem a bit unwieldy and somewhat counterintuitive to GRADE's ethos of work. GRADE prides itself on its inclusiveness. In fact, the foremost principle of the GRADE working group is to enable participation and discussion of ideas. GRADE meetings, held at least twice yearly, are open to suggestions and agenda items from the scientific community and many of us believe that the progress in GRADE has been fundamentally influenced by this approach. Most conclusions in GRADE are based on careful consideration that took place over the past decades, while GRADE members have been involved in systematic review, health technology, biostatistics, and guideline development methodology and application. I would encourage the authors to take advantage of the existing opportunities. They will inform thinking of other members of the GRADE Working Group and the community at large.

9. Summary

In conclusion, it is possible that GRADE's certainty in the evidence assessment contains elements of statistical accuracy, but the judgments about whether the accuracy is sufficient for decision making are influenced by context that extends beyond statistical CIs and margins. For GRADE, it will be prudent to continue to consider the challenges of setting thresholds for important benefits and harms as highlighted by Anttila et al. An alternative approach for systematic review authors (but not for those making decisions and aggregating the effects across outcomes) is to use plausible (but not worthwhile) effects to judge imprecision, such as an RIS. The choice of the thresholds for imprecision and certainty ranges is influenced by the final considerations about importance of the effects made by those who suggest decisions (e.g., in recommendations), together with other considerations about how the evidence applies to the question that is posed. Expressing the results as certainty ranges eliminates the artificial concentration on CIs, places importance on all GRADE domains, and allows setting certainty thresholds that allow expressing certainty in the decision.

References

[1] Guyatt GH, Oxman AD, Vist GE, Kunz R, Falck-Ytter Y, Alonso-Coello P, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ 2008;336:924–6.
[2] Alonso-Coello P, Schunemann H, Moberg J, Brignardello-Petersen R, Akl E, Davoli M, et al. GRADE Evidence to Decision frameworks: 1. Introduction. BMJ. in press.
[3] Alonso-Coello P, Oxman AD, Moberg J, Brignardello-Petersen R, Akl E, Davoli M, et al. GRADE Evidence to Decision frameworks: 2. Clinical practice guidelines. BMJ. in press.
[4] Schuenemann HJ, Mustafa R, Brozek J, Santesso N, Alonso-Coello P, Guyatt G, et al. GRADE Guidelines: 16. GRADE evidence to decision frameworks for tests in clinical practice and public health. J Clin Epidemiol 2016. in press.
[5] Guyatt GH, Oxman AD, Kunz R, Brozek J, Alonso-Coello P, Rind D, et al. GRADE guidelines 6. Rating the quality of evidence–imprecision. J Clin Epidemiol 2011;64:1283–93.
[6] Schunemann HJ, Wiercioch W, Etxeandia I, Falavigna M, Santesso N, Mustafa R, et al. Guidelines 2.0: systematic development of a comprehensive checklist for a successful guideline enterprise. CMAJ 2014;186(3):E123–42.
[7] Andrews JC, Schunemann HJ, Oxman AD, Pottie K, Meerpohl JJ, Coello PA, et al. GRADE guidelines: 15. Going from evidence to recommendation-determinants of a recommendation's direction and strength. J Clin Epidemiol 2013;66:726–35.
[8] Schunemann HJ, Oxman AD, Akl EA, Brozek JL, Montori VM, Heffner J, et al. Moving from evidence to developing recommendations in guidelines: article 11 in Integrating and coordinating efforts in COPD guideline development. An official ATS/ERS workshop report. Proc Am Thorac Soc 2012;9(5):282–92.
[9] Andrews J, Guyatt G, Oxman AD, Alderson P, Dahm P, Falck-Ytter Y, et al. GRADE guidelines: 14. Going from evidence to recommendations: the significance and presentation of recommendations. J Clin Epidemiol 2013;66:719–25.
[10] Santesso N, Schunemann H, Blumenthal P, De Vuyst H, Gage J, Garcia F, et al. World Health Organization guidelines: use of cryotherapy for cervical intraepithelial neoplasia. Int J Gynaecol Obstetrics 2012;118(2):97–102.
[11] Schünemann HJ, Brozek JL, Oxman AD, Guyatt GH. Handbook for Grading the Quality of Evidence and the Strength of Recommendations Using the GRADE Approach. GRADEpro; 2015. Available from http://gdt.guidelinedevelopment.org/central_prod/_design/client/handbook/handbook.html. Accessed February 5, 2016.
[12] Pogue JM, Yusuf S. Cumulating evidence from randomized trials: utilizing sequential monitoring boundaries for cumulative meta-analysis. Control Clin Trials 1997;18:580–93. discussion 661–6.
[13] Schunemann HJ. Guidelines 2.0: do no net harm-the future of practice guideline development in asthma and other diseases. Curr Allergy Asthma Rep 2011;11(3):261–8.
[14] Guyatt G, Oxman AD, Sultan S, Brozek J, Glasziou P, Alonso-Coello P, et al. GRADE guidelines: 11. Making an overall rating of confidence in effect estimates for a single outcome and for all outcomes. J Clin Epidemiol 2013;66:151–7.
[15] Lewin S, Glenton C, Munthe-Kaas H, Carlsen B, Colvin CJ, Gulmezoglu M, et al. Using qualitative evidence in decision making for health and social interventions: an approach to assess confidence in findings from qualitative evidence syntheses (GRADE-CERQual). PLoS Med 2015;12(10):e1001895.
[16] Treweek S, Oxman AD, Alderson P, Bossuyt PM, Brandt L, Brozek J, et al. Developing and Evaluating Communication Strategies to Support Informed Decisions and Practice Based on Evidence (DECIDE): protocol and preliminary results. Implementation Sci 2013;8:6.
[17] Akl EA, Schunemann HJ. Routine heparin for patients with cancer? One answer, more questions. N Engl J Med 2012;366:661–2.
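
The difference between an information size based on the 25% rough guide and one based on a plausible effect (the RIS discussed in Section 6) can be illustrated numerically. The sketch below is not the article's method: it assumes the conventional normal-approximation sample size formula for comparing two proportions, a two-sided alpha of 0.05, 80% power, and a hypothetical control-group risk of 5%. The function name and all defaults are illustrative. It covers superiority comparisons only, not the equivalence scenario mentioned for head-to-head comparisons.

```python
from math import ceil
from statistics import NormalDist


def information_size(control_risk: float, relative_risk_reduction: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """Total patients (two equal arms) needed to detect the given relative effect.

    Uses the normal-approximation formula for two proportions:
    n per arm = (z_{1-alpha/2} + z_{power})^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1 = control_risk
    p2 = control_risk * (1 - relative_risk_reduction)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n_per_arm = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return 2 * ceil(n_per_arm)


CONTROL_RISK = 0.05  # hypothetical baseline risk, chosen only for illustration

rough_guide_size = information_size(CONTROL_RISK, 0.25)  # 25% "rough guide"
plausible_size = information_size(CONTROL_RISK, 0.67)    # warfarin vs placebo, from the text

print(f"25% RRR: {rough_guide_size} patients; 67% RRR: {plausible_size} patients")
```

Under these assumptions the information size based on the plausible 67% relative risk reduction is far smaller than the one based on the 25% rough guide, which shows how strongly the imprecision judgment depends on the chosen delta and why the assumptions should be stated explicitly.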
