17. Developing guideline recommendations for tests or diagnostic tools
17.1 Introduction
The prior chapters of the WHO Handbook for Guideline Development (1) have addressed developing recommendations for interventions. Although tests can be considered interventions, research on tests typically takes the form of accuracy studies (2,3). This chapter therefore addresses how to develop guideline recommendations for tests from accuracy studies when direct evidence about the test’s effect on patient-important outcomes is lacking. It is applicable to screening, monitoring and diagnostic tests, but for clarity we will refer to tests as “diagnostic tests” and test accuracy studies as “diagnostic accuracy studies”.
PPV and NPV, unlike sensitivity and specificity, are affected by the prevalence of a disease.
■ “True positives” refers to the number of people that the index test
correctly identified as having the condition (i.e. the number of par-
ticipants that tested positive with both the index test and reference
standard).
■ “False positives” refers to the number of people that the index test
incorrectly identified as having the condition (i.e. the number of
people that tested positive with the index test but negative with the
reference standard).
■ “True negatives” refers to the number of people that the index test
correctly identified as not having the condition (a negative result with
both the index and the reference standard test).
■ “False negatives” refers to the number of people that the index test
incorrectly identified as not having the condition (a negative result on
the index test but positive with the reference standard).
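These four counts determine sensitivity, specificity, PPV and NPV. A minimal worked sketch, using a hypothetical 2x2 table (the counts are illustrative, not taken from any study):

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Accuracy measures from a 2x2 table of index test vs reference standard."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion with the condition who test positive
        "specificity": tn / (tn + fp),  # proportion without the condition who test negative
        "ppv": tp / (tp + fp),          # probability of the condition given a positive test
        "npv": tn / (tn + fn),          # probability of no condition given a negative test
    }

# Hypothetical study: 100 people with the condition and 900 without,
# as classified by the reference standard
m = accuracy_metrics(tp=90, fp=45, fn=10, tn=855)
print(m)  # sensitivity 0.90, specificity 0.95, PPV ~0.67, NPV ~0.99
```

Note that repeating the calculation with the same sensitivity and specificity but a different prevalence changes PPV and NPV, which is why those two measures cannot be transported across settings.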
Figure 2. Example of a GRADE evidence profile for diagnostic tests
Image from: World Health Organization guideline: The use of molecular line probe assays for the detection of resistance to isoniazid and rifampicin. 2016 (16).
There are no specific rules for what proportion of low, high or unclear risk-of-bias assessments with the QUADAS-2 tool constitutes very serious, serious or no serious limitations in the GRADE assessment of the certainty of the body of evidence for each outcome. This is a judgement that must be made in the context of the individual situation; however, the reasons for the decision need to be explicit, transparent and clear. In Figure 3, for example, the body of evidence was assessed as having serious limitations because “the sampling of patients (participants) was often not stated”, reflecting the unclear risk of bias in the majority of included studies (approximately 60%) in the patient selection domain of the QUADAS-2 tool (16). In addition, “it was often not stated whether investigators were blinded between the index and reference standard” (unclear risk of bias in the majority of included studies (approximately 70%) in the index test and reference standard domains of the QUADAS-2 tool) (16).
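Tallying the proportion of risk-of-bias judgements across included studies, as in the example above, is simple bookkeeping. A sketch (the per-study judgements below are hypothetical, not those of the cited guideline):

```python
from collections import Counter

# Hypothetical QUADAS-2 judgements for the patient selection domain
# across ten included studies
patient_selection = ["unclear", "unclear", "low", "unclear", "high",
                     "unclear", "low", "unclear", "unclear", "low"]

counts = Counter(patient_selection)
n = len(patient_selection)
for judgement in ("low", "high", "unclear"):
    print(f"{judgement}: {counts[judgement]}/{n} ({counts[judgement] / n:.0%})")
# A majority of "unclear" judgements (here 60%) might support rating down
# for risk of bias, with the reason stated explicitly in the evidence profile.
```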
Figure 3. The proportion of individual studies with low, high or unclear risk of bias across the domains. Image from: World Health Organization. WHO Guideline: The use of molecular line probe assays for the detection of resistance to isoniazid and rifampicin. 2016 (16).
17.5.2.2 Indirectness
The assessment of the indirectness of a body of evidence refers to two
concepts.
1. How well does the underlying body of evidence match the clinical or
public health question (PIRT)?
2. How certain can the reviewer be that the consequences of using the
test will lead to improvement in patient-important outcomes?
A concern regarding either concept can warrant downgrading for indirectness. For instance, if the underlying body of evidence does not match the clinical or public health PIRT, the evidence should be downgraded, even if there are no concerns regarding question 2 (and vice versa). To address
the first question, the Guideline Development Group needs to compare the
PIRT question they generated in Section 17.3 to the PIRT of the underlying
body of evidence (from their systematic search). Differences in the partici-
pants, index and/or reference standard tests and setting between the PIRT
developed a priori by the guideline developers and the PIRT of the under-
lying evidence may mean that the results from the underlying evidence do
not translate to the intended patients or populations. For instance, con-
sider if a Guideline Development Group commissioned a systematic review
to determine the accuracy of brain natriuretic peptide (BNP) to diagnose
heart failure in patients with signs and symptoms of heart failure present-
ing to general practice, compared to a reference standard of echocardiogra-
phy. The Guideline Development Group may consider downgrading if the
available evidence only included patients with shortness of breath and no
other symptoms of heart failure, did not use echocardiography as a reference
standard (be careful not to double count this in the risk of bias assessment),
or included patients presenting to the emergency department. Evidence may
be additionally downgraded if it was conducted in a high-resource setting
and the guideline is intended for low-resource settings.
It should be noted that PIRT questions that include two or more index
tests present an exceptional situation. In these circumstances, an additional
concept pertaining to indirectness should be considered: were the two index
tests compared directly in the included studies using the same reference
standard? If this is not the case, downgrading should occur.
Furthermore, the outcomes from diagnostic accuracy studies, sensitiv-
ity and specificity, are inherently indirect: improved diagnostic accuracy
does not always translate into improvements for patients or populations (2).
Improved diagnostic accuracy can lead to benefits for patients, but only under assumptions about the management that follows testing.
17.5.2.3 Inconsistency
This domain in the GRADE approach refers to an assessment of the inconsistency of the effect estimates (in our case, sensitivity and specificity) across the included studies. Substantial heterogeneity among test accuracy estimates may warrant downgrading; two questions guide the assessment:
1. How similar are the point estimates across the primary studies?
2. Do the confidence intervals overlap?
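The second question can be checked numerically by computing a confidence interval for each study's sensitivity (or specificity) and testing whether the intervals share a common region. A sketch using the Wilson score interval and hypothetical study counts (published reviews typically use exact or hierarchical-model intervals instead):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score 95% CI for a proportion (e.g. one study's sensitivity)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical per-study sensitivities: (true positives, total with condition)
studies = [(45, 50), (88, 100), (27, 30)]
intervals = [wilson_ci(tp, n) for tp, n in studies]

# Crude inconsistency check: do all the confidence intervals overlap?
lo = max(ci[0] for ci in intervals)
hi = min(ci[1] for ci in intervals)
print("intervals overlap" if lo <= hi else "intervals do not overlap")
```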
17.5.2.4 Imprecision
This component guides panels to judge whether the pooled sensitivity and specificity are precise enough to support a recommendation.
Figure 4. Example of a forest plot of sensitivity and specificity
Reproduced from Mustafa R. et al. Systematic reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic acid, cytology, and colposcopy. Int J Gynaecol Obstet. 2016 Mar;132(3):259–65. doi: 10.1016/j.ijgo.2015.07.024. Epub 2015 Nov 12 (22). Licensed under the Creative Commons Attribution, Non-Commercial, No Derivative works licence (CC BY-NC-ND 4.0); http://creativecommons.org/licenses/by-nc-nd/4.0/. Note that the I² and Q estimates have been removed from the original figure. Pooled estimates are derived from a hierarchical model.
Table 1. Components of the GRADE evidence-to-decision framework

Each line lists the question followed by the possible judgements.

■ Is the problem a priority? Don't know / Varies / No / Probably no / Probably yes / Yes
■ Test accuracy: Don't know / Varies / Very inaccurate / Inaccurate / Accurate / Very accurate
■ Desirable effects: Don't know / Varies / Trivial / Small / Moderate / Large
■ Undesirable effects: Don't know / Varies / Large / Moderate / Small / Trivial
■ Certainty of test accuracy: No studies / Very low / Low / Moderate / High
■ Certainty of critical or important outcomes: No studies / Very low / Low / Moderate / High
■ Certainty of the management guided by test results: No studies / Very low / Low / Moderate / High
■ Certainty of the link between test results and management: No studies / Very low / Low / Moderate / High
■ Overall certainty about effects of the test: No studies / Very low / Low / Moderate / High
■ Values: Important uncertainty or variability / Possibly important uncertainty or variability / Probably no important uncertainty or variability / No important uncertainty or variability
■ Balance of desirable and undesirable effects: Don't know / Varies / Favours the comparison / Probably favours the comparison / Does not favour either / Probably favours the index test / Favours the index test
■ Resource requirements: Don't know / Varies / Large costs / Moderate costs / Negligible costs or savings / Moderate savings / Large savings
■ Certainty of resource requirements: No included studies / Very low / Low / Moderate / High
■ Cost-effectiveness: No included studies / Varies / Favours the comparison / Probably favours the comparison / Does not favour either / Probably favours the index test / Favours the index test
■ Equity: Don't know / Reduced / Probably reduced / Probably no impact / Probably increased / Increased
■ Acceptability: Don't know / Varies / No / Probably no / Probably yes / Yes
■ Feasibility: Don't know / Varies / No / Probably no / Probably yes / Yes
17.6.2.5 What is the certainty of the evidence for any critical or important
outcomes?
Critical or important outcomes in the GRADE approach are those related to
harms and benefits caused to patients (26) and these are specified a priori by
guideline developers (26,28). This component prompts guideline developers
to assess the quality (certainty) of evidence on the direct benefits and harms
of the test (part of the assessment in section 17.6.2.3). Direct benefits can
include faster diagnosis, for example, whereas direct harms refer to specific
harms from the test, for instance allergic reactions to radioactive contrast
dye. Often these outcome data will be found in the diagnostic accuracy studies included in the systematic review; however, they may also come from other primary studies or systematic reviews. Panels will have to make a
judgement on the quality of evidence from these additional studies (very low,
low, moderate or high).
treatment (this is particularly relevant for persons that test FP). This
evidence should be addressed in a systematic review of RCTs or obser-
vational studies, with a corresponding GRADE evidence profile for
interventions.
2. The evidence supporting the natural history (prognosis) of the target
condition: Improvement or deterioration without treatment or fur-
ther management is relevant to those that test FN. The evidence on
the natural history of the target condition should generally come
from the control arms of RCTs or observational studies and the qual-
ity of evidence is judged using the GRADE approach for questions of
prognosis (29).
17.6.2.7 How certain is the link between test results and management
decisions?
Guideline Development Groups must make a judgement about the likelihood
that the appropriate management (such as treatment decisions) will follow
on from test results. Important features of a test, such as test turnaround
time and interpretability of results can pose barriers to patients receiving
the appropriate treatment after obtaining a test result. Further, there may be factors external to the test that reduce the likelihood of patients receiving appropriate management after a test, such as out-of-pocket expenses; access to quality, coordinated services; and health literacy, among many others.
Guideline Development Groups should consider the literature broadly and seek relevant contextual knowledge about the target healthcare settings to identify barriers that prevent appropriate follow-up and management after a test. Although it is highly preferable to consider published
research studies to inform this judgement, often guideline developers have
to rely on their own experience concerning the likelihood a test result is
managed appropriately and this can still be considered “high” certainty. For
instance, the 2018 American Society of Hematology guidelines for management of venous thromboembolism: diagnosis of venous thromboembolism (30)
considered that “with [a pulmonary embolus] (PE) diagnosis, positive results
will be treated with anticoagulation (regardless of the chances of false posi-
tives)” (30). They assumed this because the “intervention is relatively simple
to apply and few patients would be missed in health care systems that are
equipped to offer testing” (26,30). The guideline panel thus considered the
certainty of the link between the diagnostic test results and management
decisions as high.
17.6.2.8 What is the overall certainty of the evidence about the effects of
the test?
This component requires Guideline Development Groups to make an overall judgement about the certainty (quality) of the evidence, taking the lowest certainty across the previous four components (17.6.2.4 to 17.6.2.7), which together reflect the entire test-to-treatment pathway: from the accuracy of the test, to the likelihood that patients who test positive receive treatment, to the effectiveness of treatment.
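Because the GRADE certainty levels are ordered, "lowest of the four components" reduces to taking a minimum over that ordering. A minimal sketch with illustrative judgements:

```python
# GRADE certainty levels, ordered from lowest to highest
LEVELS = ["very low", "low", "moderate", "high"]

def overall_certainty(*components):
    """Overall certainty is the lowest certainty among the components of the
    test-to-treatment pathway (sections 17.6.2.4 to 17.6.2.7)."""
    return min(components, key=LEVELS.index)

# Illustrative component judgements: accuracy, outcomes, management, link
print(overall_certainty("high", "moderate", "low", "high"))  # -> low
```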
17.6.2.10 Does the balance between desirable and undesirable effects favour
the index test or the comparison?
Guideline Development Groups are prompted to make an overall judgement
about the benefits and harms of the test. This assessment is based on the
accuracy of the test, direct benefits (e.g. faster diagnosis) and harms (e.g.
radiation) from the test, the benefits and harms from management follow-
ing the test results, and the certainty of the bodies of evidence informing
these assessments. For instance, a WHO guideline on the use of line probe assays to diagnose multidrug-resistant TB judged the balance of benefits
and harms to favour the line probe assay because of its high sensitivity and
specificity and subsequent small numbers of FN and FP results, coupled
with documented reductions in diagnostic and treatment delays (16). To aid
decision-making, modelling can be performed to determine the number of
TP, FP, FN and TN based on different disease prevalences (pre-test probabilities). Similarly, the number or rate of harmful effects of tests can be determined and compared. In the absence of data to support a decision, the Guideline Development Group should usually still make an assessment; in this situation, a judgement of "probably favours" either the intervention or the comparison should be made.
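The modelling described above, deriving expected TP, FP, FN and TN counts per 1000 people tested at different pre-test probabilities, is straightforward arithmetic. A sketch (the accuracy figures are hypothetical, not those of the line probe assay):

```python
def expected_counts(sensitivity, specificity, prevalence, n=1000):
    """Expected TP, FP, FN and TN per n people tested at a given
    pre-test probability (prevalence)."""
    diseased = n * prevalence
    healthy = n - diseased
    tp = sensitivity * diseased
    fn = diseased - tp
    tn = specificity * healthy
    fp = healthy - tn
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# A highly sensitive and specific test at low and high pre-test probability
for prev in (0.05, 0.40):
    counts = expected_counts(0.97, 0.98, prev)
    print(prev, {k: round(v) for k, v in counts.items()})
```

Presenting such counts side by side lets a panel weigh, for example, the absolute number of false negatives expected in each intended setting.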
strategies (32). This study concluded that “test accuracy is rarely, if ever, sufficient to base guideline recommendations” and thus evidence-to-decision frameworks are necessary to help developers consider important issues beyond
test accuracy. Diagnostic test experts did, however, note four potential situ-
ations when test accuracy is likely to be sufficient to extrapolate the effects
of tests on patient-important outcomes (32):
Acknowledgements
This chapter was prepared by Dr Jack W. O’Sullivan and Dr Susan L Norris
with peer review by Professor Reem Mustafa, Dr Mariska (MMG) Leeflang
and Dr Alexei Korobitsyn.
References
1. World Health Organization. WHO handbook for guideline development, 2nd edition. Geneva. 2014. Available from: http://apps.who.int/iris/bitstream/10665/145714/1/9789241548960_eng.pdf?ua=1, accessed 22 February 2019.
2. Siontis KC, Siontis GCM, Contopoulos-Ioannidis DG, Ioannidis JPA. Diagnostic tests often
fail to lead to changes in patient outcomes. J Clin Epidemiol. 2014;67(6):612–21. https://doi.
org/10.1016/j.jclinepi.2013.12.008. PMID: 24679598
3. Ferrante Di Ruffano L, Davenport C, Eisinga A, Hyde C, Deeks JJ. A capture-recapture
analysis demonstrated that randomized controlled trials evaluating the impact of diag-
nostic tests on patient outcomes are rare. J Clin Epidemiol. 2012;65(3):282–7. http://dx.doi.
org/10.1016/j.jclinepi.2011.07.003. PMID: 22001307
4. O’Sullivan JW, Banerjee A, Heneghan C, Pluddemann A. Verification bias. BMJ Evidence-
Based Med. 2018;23(2):54-55. http://ebm.bmj.com/content/23/2/54. PMID: 29595130
5. Kendrick D, Fielding K, Bentley E, Kerslake R, Miller P, Pringle M. Radiography of the lumbar
spine in primary care patients with low back pain: randomised controlled trial. BMJ.
2001;322:400–5. https://doi.org/10.1136/bmj.322.7283.400. PMID: 11179160
6. Felker GM, Anstrom KJ, Adams KF, Ezekowitz JA, Fiuzat M, Houston-Miller N, et al.
Effect of Natriuretic Peptide–Guided Therapy on Hospitalization or Cardiovascular
Mortality in High-Risk Patients With Heart Failure and Reduced Ejection Fraction. JAMA.
2017;318(8):713. http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.2017.10565.
PMID: 28829876
7. Lijmer JG, Bossuyt PMM. Various randomized designs can be used to evaluate medical
tests. J Clin Epidemiol. 2009;62(4):364–73. http://dx.doi.org/10.1016/j.jclinepi.2008.06.017.
PMID: 18945590
8. Mustafa RA, Wiercioch W, Cheung A, Prediger B, Brozek J, Bossuyt P, et al. Decision
making about healthcare-related tests and diagnostic test strategies. Paper 2: a review
of methodological and practical challenges. J Clin Epidemiol. 2017;92:18–28. https://doi.
org/10.1016/j.jclinepi.2017.09.003. PMID: 28916488
9. Schünemann HJ, Oxman AD, Brozek J, Glasziou P, Jaeschke R, Vist GE, et al. GRADE: grading
quality of evidence and strength of recommendations for diagnostic tests and strategies.
BMJ. 2008;336(7654):1106–10. https://doi.org/10.1136/bmj.39500.677199.AE. PMID: 18483053
10. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for
systematic reviews and meta-analyses: the PRISMA statement. BMJ. 2009;339:b2535.
https://doi.org/10.1136/bmj.b2535. PMID: 19622551
11. Whiting P, Savović J, Higgins JPT, Caldwell DM, Reeves BC, Shea B, et al. ROBIS: A new tool
to assess risk of bias in systematic reviews was developed. J Clin Epidemiol. 2016;69:225–
34. https://doi.org/10.1016/j.jclinepi.2015.06.005. PMID: 26092286
12. Shea BJ, Reeves BC, Wells G, Thuku M, Hamel C, Moran J, et al. AMSTAR 2: a critical
appraisal tool for systematic reviews that include randomised or non-randomised studies
of healthcare interventions, or both. BMJ. 2017;358:j4008. https://doi.org/10.1136/bmj.j4008. PMID: 28935701
13. Leeflang MMG, Scholten RJPM, Rutjes AWS, Reitsma JB, Bossuyt PMM. Use of method-
ological search filters to identify diagnostic accuracy studies can lead to the omission
of relevant studies. J Clin Epidemiol. 2006;59(3):234–40. https://doi.org/10.1016/j.
jclinepi.2005.07.014. PMID: 16488353
14. McInnes MDF, Moher D, Thombs BD, McGrath TA, Bossuyt PM, Clifford T, et al. Preferred
reporting items for a systematic review and meta-analysis of diagnostic test accuracy
studies. JAMA. 2018;319(4):388. https://doi.org/10.1001/jama.2017.19163. PMID: 29362800
15. Guyatt G, Oxman AD, Akl EA, Kunz R, Vist G, Brozek J, et al. GRADE guidelines: 1.
Introduction - GRADE evidence profiles and summary of findings tables. J Clin Epidemiol.
2011;64(4):383–94. https://doi.org/10.1016/j.jclinepi.2010.04.026. PMID: 21195583
16. World Health Organization. WHO Guideline: The use of molecular line probe assays
for the detection of resistance to isoniazid and rifampicin. 2016. Available from:
http://apps.who.int/iris/bitstream/handle/10665/250586/9789241511261-eng.
pdf?sequence=1&isAllowed=y, accessed 22 February 2019.
17. Whiting PF, Rutjes AWS, Westwood ME, Mallet S, Deeks JJ, Reitsma JB, et al. QUADAS-2: A
Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies. Ann Intern Med.
2011;155(4):529–36. https://doi.org/10.7326/0003-4819-155-8-201110180-00009. PMID:
22007046
18. Rolfe A, Burton C. Reassurance After Diagnostic Testing With a Low Pretest Probability of
Serious Disease. JAMA Intern Med. 2013;173(6):407. https://doi.org/10.1001/jamaintern-
med.2013.2762. PMID: 23440131
19. Macaskill P, Gatsonis C, Deeks JJ, Harbord RM, Takwoingi Y. Chapter 10: Analysing and
Presenting Results. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane Handbook for
Systematic Reviews of Diagnostic Test Accuracy Version 1.0. The Cochrane Collaboration,
2010. Available from: http://srdta.cochrane.org/, accessed 21 February 2019.
20. Leeflang MMG. Systematic reviews and meta-analyses of diagnostic test accuracy. Clin
Microbiol Infect. 2014;20(2):105–13. https://doi.org/10.1111/1469-0691.12474. PMID:
24274632
21. Higgins JPT. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557-60.
https://doi.org/10.1136/bmj.327.7414.557. PMID: 12958120
22. Mustafa RA, Santesso N, Khatib R, Mustafa AA, Wiercioch W, Kehar R, et al. Systematic
reviews and meta-analyses of the accuracy of HPV tests, visual inspection with acetic
acid, cytology, and colposcopy. Int J Gynecol Obstet. 2016;132(3):259–65. http://dx.doi.
org/10.1016/j.ijgo.2015.07.024. PMID: 26851054
23. Ahmed I, Sutton AJ, Riley RD. Assessment of publication bias, selection bias, and unavail-
able data in meta-analyses using individual participant data: a database survey. BMJ.
2012;344:d7762. https://doi.org/10.1136/bmj.d7762. PMID: 22214758
24. Rogozinska E, Khan K. Grading evidence from test accuracy studies: what makes it
challenging compared with the grading of effectiveness studies? Evid Based Med.
2017;22(3):81–4. https://doi.org/10.1136/ebmed-2017-110717. PMID: 28600330
25. World Health Organization. WHO recommendations on antenatal care for a positive
pregnancy experience. Geneva; 2016. Available from: https://www.who.int/reproductive-
health/publications/maternal_perinatal_health/anc-positive-pregnancy-experience/en/,
accessed 20 February 2019.
26. Schünemann HJ, Mustafa R, Brozek J, Santesso N, Alonso-Coello P, Guyatt G, et al.
GRADE Guidelines: 16. GRADE evidence to decision frameworks for tests in clinical
practice and public health. J Clin Epidemiol. 2016;76:89–98. https://doi.org/10.1016/j.
jclinepi.2016.01.032. PMID: 26931285
27. Alonso-Coello P, Schünemann HJ, Moberg J, Brignardello-Petersen R, Akl EA, Davoli M, et
al. GRADE Evidence to Decision (EtD) frameworks: A systematic and transparent approach
to making well informed healthcare choices. 1: Introduction. Gac Sanit. 2018;32(2):166.
e1-166.e10. https://doi.org/10.1016/j.gaceta.2017.02.010. PMID: 28822594
28. Andrews J, Guyatt G, Oxman AD, Alderson P, Dahm P, Falck-Ytter Y, et al. GRADE guide-
lines: 14. Going from evidence to recommendations: The significance and presentation
of recommendations. J Clin Epidemiol. 2013;66(7):719–25. https://doi.org/10.1016/j.jclinepi.2012.03.013. PMID: 23312392
29. Iorio A, Spencer FA, Falavigna M, Alba C, Lang E, Burnand B, et al. Use of GRADE for assess-
ment of evidence about prognosis: rating confidence in estimates of event rates in broad
categories of patients. BMJ. 2015;350:h870. https://doi.org/10.1136/bmj.h870. PMID:
25775931
30. Lim W, Le Gal G, Bates SM, Righini M, Haramati LB, Lang E, et al. American Society of
Hematology 2018 guidelines for management of venous thromboembolism: diagnosis of
venous thromboembolism. Blood Adv. 2018 Nov 27;2(22):3226–56. https://doi.org/10.1182/
bloodadvances.2018024828. PMID: 30482764
31. World Health Organization. Guidelines for screening and treatment of precancerous
lesions for cervical cancer prevention. 2013. Available from: http://www.who.int/repro-
ductivehealth/publications/cancers/screening_and_treatment_of_precancerous_lesions/
en/index.html, accessed 20 February 2019.
32. Mustafa RA, Wiercioch W, Ventresca M, Brozek J, Schünemann HJ, Bell H, et al. Decision
making about healthcare-related tests and diagnostic test strategies. Paper 5: a qualitative
study with experts suggests that test accuracy data alone is rarely sufficient for decision
making. J Clin Epidemiol. 2017;92:47–57. https://doi.org/10.1016/j.jclinepi.2017.09.005.
PMID: 28917629
33. Reitsma JB, Rutjes AWS, Whiting P, Vlassov VV, Leeflang MMG, Deeks JJ. Chapter 9:
Assessing methodological quality. In: Deeks JJ, Bossuyt PM, Gatsonis C (editors), Cochrane
Handbook for Systematic Reviews of Diagnostic Test Accuracy Version 1.0.0. The
Cochrane Collaboration, 2009. Available from: https://methods.cochrane.org/sites/meth-
ods.cochrane.org.sdt/files/public/uploads/ch09_Oct09.pdf, accessed 20 February 2019.