
ORIGINAL RESEARCH / SPECIAL REPORT

Overinterpretation and Misreporting of Diagnostic Accuracy Studies: Evidence of "Spin"

Eleanor A. Ochodo, MBChB, MIH; Margriet C. de Haan, MD; Johannes B. Reitsma, MD, PhD; Lotty Hooft, PhD; Patrick M. Bossuyt, PhD; Mariska M. G. Leeflang, PhD

Purpose: To estimate the frequency of distorted presentation and overinterpretation of results in diagnostic accuracy studies.

Materials and Methods: MEDLINE was searched for diagnostic accuracy studies published between January and June 2010 in journals with an impact factor of 4 or higher. Articles included were primary studies of the accuracy of one or more tests in which the results were compared with a clinical reference standard. Two authors scored each article independently by using a pretested data-extraction form to identify actual overinterpretation and practices that facilitate overinterpretation, such as incomplete reporting of study methods or the use of inappropriate methods (potential overinterpretation). The frequency of overinterpretation was estimated in all studies and in a subgroup of imaging studies.

Results: Of the 126 articles, 39 (31%; 95% confidence interval [CI]: 23%, 39%) contained a form of actual overinterpretation, including 29 (23%; 95% CI: 16%, 30%) with an overly optimistic abstract, 10 (8%; 95% CI: 3%, 13%) with a discrepancy between the study aim and conclusion, and eight with conclusions based on selected subgroups. In our analysis of potential overinterpretation, authors of 89% (95% CI: 83%, 94%) of the studies did not include a sample size calculation, 88% (95% CI: 82%, 94%) did not state a test hypothesis, and 57% (95% CI: 48%, 66%) did not report CIs of accuracy measurements. In 43% (95% CI: 34%, 52%) of studies, authors were unclear about the intended role of the test, and in 3% (95% CI: 0%, 6%) they used inappropriate statistical tests. A subgroup analysis of imaging studies showed that 16 (30%; 95% CI: 17%, 43%) and 53 (100%; 95% CI: 92%, 100%) contained forms of actual and potential overinterpretation, respectively.

Conclusion: Overinterpretation and misreporting of results in diagnostic accuracy studies is frequent in journals with high impact factors.

© RSNA, 2013

Supplemental material: http://radiology.rsna.org/lookup/suppl/doi:10.1148/radiol.12120527/-/DC1

From the Department of Clinical Epidemiology, Biostatistics & Bioinformatics (E.A.O., P.M.B., M.M.G.L.) and the Department of Radiology (M.C.d.H.), Academic Medical Centre, University of Amsterdam, Meibergdreef 9, 1105 AZ Amsterdam, the Netherlands; the Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, the Netherlands (J.B.R.); and the Dutch Cochrane Centre, Academic Medical Centre, University of Amsterdam, Amsterdam, the Netherlands (L.H.). Received March 17, 2012; revision requested April 23; revision received August 21; accepted September 12; final version accepted October 15. Address correspondence to E.A.O. (e-mail: e.a.ochodo@amc.uva.nl).


Advances in Knowledge

- Overinterpretation occurs in about three in 10 diagnostic accuracy studies in journals with an impact factor of 4 or higher.
- The most frequent form of overinterpretation is an overly optimistic conclusion in the abstract, which occurs in about one in four studies.
- The most common practices that facilitate overinterpretation (misreporting of results) are not reporting a sample size calculation (n = 112 of 126, 89%) and not stating a test hypothesis (n = 111 of 126, 88%).
- The proportion of overinterpretation and misreporting of results in the subgroup of imaging studies is similar to that of diagnostic accuracy studies as a whole.

Published online before print: 10.1148/radiol.12120527. Radiology 2013; 267:581–588.

Abbreviations: CI = confidence interval; STARD = Standards for Reporting of Diagnostic Accuracy.

Author contributions: Guarantors of integrity of entire study, E.A.O., P.M.B., M.M.G.L.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; literature research, E.A.O., J.B.R., L.H., M.M.G.L.; statistical analysis, E.A.O., P.M.B.; and manuscript editing, all authors. Conflicts of interest are listed at the end of this article.

See also the editorial by Levine et al in this issue.

Reporting that distorts or misrepresents study results to make the interventions look favorable is called overinterpretation, which is also referred to as "spin" (1). Authors may use exaggerated language, present an overly optimistic conclusion in the abstract compared with that in the main text, or draw favorable conclusions from results of selected subgroups (2,3). Overinterpretation may also be introduced because of methodologic shortcomings, such as failure to specify a study hypothesis or make a sample size calculation, or by the choice of statistical tests that produce the desired results (1,3–6). These forms of misleading representation of the results of scientific research may compromise decision making in health care and the well-being of patients.

Overinterpretation has been shown to be common in randomized controlled trials. Boutron and colleagues (7) identified overinterpretation in reports of randomized clinical trials with a clearly identified primary outcome that showed statistically insignificant results. More than 40% of the reports had distorted interpretation in at least two sections of the main text.

Overinterpretation may also exist in diagnostic accuracy studies. Authors of such studies evaluate the ability of a test or marker to correctly identify subjects with the target condition. The clinical use of tests based on inflated conclusions may cause physicians to make incorrect clinical decisions, thereby compromising patient safety. Exaggerated conclusions could also lead to unnecessary testing and avoidable health care costs (8). The purpose of this study was to estimate the frequency of distorted presentation and overinterpretation of results in diagnostic accuracy studies.

Materials and Methods

This study was based on a systematic search of the literature for diagnostic accuracy studies and the evaluation of the studies for actual and potential overinterpretation.

Literature Search

Two authors (E.A.O. and M.M.G.L., with 2 and 8 years of experience, respectively) independently searched MEDLINE in January 2011 for diagnostic accuracy studies published between January and June 2010 in journals with an impact factor of 4 or higher. We focused on these journals because Lumbreras et al (9) found that overinterpretation of the clinical applicability of molecular diagnostic tests was more likely in journals with higher impact factors. This impact factor cut-off was based on a previously published analysis of accuracy studies (10). We limited our search to the most recent studies indexed in MEDLINE.

The search was a combination of a previously validated search strategy for diagnostic accuracy studies ("sensitivity AND specificity.sh" OR "specificit*.tw" OR "false negative.tw" OR "accuracy.tw", where ".sh" indicates subject heading, ".tw" indicates text word, and an asterisk [*] is a wildcard) (11) and a list of 622 international standard serial numbers of journals with an impact factor of 4 or higher obtained from the medical library of the University of Amsterdam. Appendix E1 (online) includes details of the search strategy.
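For orientation, the validated filter can be approximated in present-day PubMed syntax. The following minimal Python sketch uses Biopython's Entrez client; the query translation and the date restriction are illustrative assumptions, not the exact Ovid MEDLINE strategy given in Appendix E1 (online):

    from Bio import Entrez  # Biopython

    Entrez.email = "you@example.org"  # NCBI requires a contact address

    # Approximate PubMed translation of the validated filter (reference 11),
    # restricted to the study's publication window.
    term = (
        '("sensitivity and specificity"[mh] OR specificit*[tw] '
        'OR "false negative"[tw] OR accuracy[tw]) '
        'AND ("2010/01/01"[dp] : "2010/06/30"[dp])'
    )

    handle = Entrez.esearch(db="pubmed", term=term, retmax=100)
    record = Entrez.read(handle)
    print(record["Count"])        # total number of hits
    print(record["IdList"][:10])  # first ten PMIDs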


Eligible for inclusion were primary studies in which the authors evaluated the diagnostic accuracy of one or more tests compared with a clinical reference standard. Excluded were non-English studies, animal studies, and studies in which accuracy measurements were not reported.

One author (E.A.O.) independently identified potentially eligible articles by reading titles and abstracts. To ensure that no articles were missed, a second author (M.M.G.L.) independently screened a random sample of 1000 titles and abstracts from all those that the search strategy yielded. We aimed to evaluate a random sample of the potentially eligible articles, as outlined in the Analysis and Results sections. A summary of the search process is outlined in Figure 1.

Figure 1: Flowchart of search results.

Definition of Overinterpretation in Diagnostic Accuracy Studies

Diagnostic accuracy studies vary by study question, design, and type and number of tests evaluated (12,13). We used a definition of overinterpretation based on common features that could apply to a wide range of tests.

We defined overinterpretation as reporting of diagnostic accuracy studies that makes tests look more favorable than the results justify. We further distinguished between actual and potential forms of overinterpretation. We defined actual overinterpretation as explicit false-positive interpretation of study results, and potential overinterpretation as practices that facilitate overinterpretation, such as incomplete reporting of applied study methods and assumptions or the use of inappropriate methods. Incomplete reporting of data may hinder objective appraisal of an article and may mislead readers into thinking that test results are favorable (3,4,14).

This definition of overinterpretation was based on items extracted from published literature (1–3,5,6), the Standards for Reporting of Diagnostic Accuracy (STARD) (8), and the experience of content experts on the team. We first listed the potential items that could introduce overinterpretation in diagnostic accuracy studies on the basis of the experience of content experts on the team. We then searched MEDLINE for key literature on poor reporting and interpretation of scientific reports that had been published up to January 2011. We identified key articles on misrepresentation of study findings in randomized controlled trials (7), in molecular diagnostic tests (9), and in scientific research generally (1–3,5,6,15,16). From these articles, we extracted a list of potential items that could help identify overinterpretation in diagnostic accuracy studies.

We then designed and pretested a data collection form by applying the listed items to 10 diagnostic accuracy studies published in 2011, which were independently evaluated by five authors. These studies were not included in the final analysis. This process of identifying overinterpretation is outlined in Figure 2.

Figure 2: Flowchart of development of definition of overinterpretation. * = For interventional studies, P values are used to evaluate overinterpretation (7). In diagnostic tests, overinterpretation can occur when a low reference value or poor comparator is chosen, which can result in reporting of nonsignificant differences as significant. Because reference values and comparators were not stated in our sample tests, scoring was difficult.

Data Extraction

We extracted data on study characteristics and items that can introduce overinterpretation by using the pretested data-extraction form (Appendix E2 [online]). We looked for items that can introduce overinterpretation in the abstract (with special focus on the results and conclusion sections) and in the main text (introduction, methods, results, and conclusion sections). The actual forms of overinterpretation that we extracted included the following:

An overly optimistic abstract.—This was considered to be present when the abstract reported only the best results but the main text had an array of results, or when stronger test recommendations or conclusions were reported in the abstract than in the main text. For the latter, we evaluated the language that was used to make the recommendations. If the authors used affirmative language in the abstract but conditional language in the main text to make the recommendations or conclusions, we scored the article as having a stronger abstract. Affirmative language included words such as "is definitely," "should be," "excellent for use," and "strongly recommended"; conditional language included words such as "appears to be," "may be," "could be," and "probably should."

Favorable study conclusions or test recommendations based on selected subgroups.—We scored this as overinterpretation when multiple subgroups were reported in the methods or results section but the recommendations were based on only one of these subgroups. Caution must be used in the evaluation of results of subgroup analyses because they may have a high false-positive rate due to the effect of multiple testing (17,18).

Discrepancy between study aim and conclusion.—A conclusion that does not reflect the aim of the report may be indicative of flawed results (3,5). We evaluated the main text of the article to determine whether the conclusion was in line with the specified aim of the study.

The potential forms of overinterpretation included the following:

Not stating a test hypothesis.—We evaluated whether a statistical test hypothesis was stated, such as the hypothesis that one test was superior to another or that a measure of diagnostic accuracy would surpass a prespecified value. The minimally acceptable accuracy values (null hypothesis) and anticipated performance values (alternative hypothesis) of a test under evaluation depend on the clinical context. These performance values can be obtained from pilot data, prior studies, or, in cases of novel studies, from experts who may give estimates of clinically relevant performance values. A priori specification of a hypothesis limits the chances of post hoc or subjective judgment about the accuracy and intended role of the test (4,19). The anticipated or desired performance measurements also guide sample-size calculations.
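Once stated, such a hypothesis can be tested directly against the observed counts. A minimal Python sketch with hypothetical data, assuming a one-sided exact binomial test of a minimally acceptable sensitivity of 80%:

    from scipy.stats import binomtest

    # Hypothetical data: the index test detects 93 of 105 diseased patients.
    # H0: sensitivity <= 0.80 (minimally acceptable); H1: sensitivity > 0.80.
    result = binomtest(k=93, n=105, p=0.80, alternative="greater")
    print(f"sensitivity = {93 / 105:.3f}, one-sided p = {result.pvalue:.4f}")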


Not reporting a sample size calculation.—We assessed whether the sample size required gave a study the power to allow estimation of test accuracy with sufficient precision and whether the method used to calculate the sample size was reported. Without knowledge of the calculation used to determine sample size, readers cannot know if the sample size is sufficient to estimate the accuracy measurements with precision (19).
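For illustration, a minimal Python sketch of one common precision-based approach follows; the anticipated sensitivity, the desired CI half-width, and the prevalence are hypothetical, and individual studies may justifiably use other methods, such as power calculations against a comparator:

    import math

    def n_for_sensitivity(p_sens: float, half_width: float,
                          prevalence: float, z: float = 1.96) -> tuple[int, int]:
        """Return (diseased subjects, total subjects) needed to estimate a
        sensitivity of p_sens with a two-sided 95% Wald CI of +/- half_width,
        inflating by disease prevalence to obtain the total sample size."""
        n_diseased = math.ceil(z ** 2 * p_sens * (1 - p_sens) / half_width ** 2)
        n_total = math.ceil(n_diseased / prevalence)
        return n_diseased, n_total

    # e.g., anticipated sensitivity 85%, precision +/-5%, prevalence 20%
    print(n_for_sensitivity(0.85, 0.05, 0.20))  # (196, 980)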
Not stating or unclearly stating the intended role of the test under evaluation.—We assessed whether the role of the test was clearly defined in the main text. Before evaluation and recommendation of a test, its intended role must be clearly defined. A new test may be used to replace the existing test or may be used before (triage) or after (add-on) an existing test (20,21). The preferred accuracy value of a test depends on its role.

Not prespecifying groups for subgroup analysis a priori in the methods section.—We assessed whether the subgroups presented in the results section were prespecified at the start of the study. Failure to prespecify subgroups can lead to post hoc analyses motivated by initial inspection of data and may give room for manipulation of results to look favorable (17,18,22).

Not prespecifying positivity thresholds of tests.—For continuous tests, we determined whether the optimal threshold or cut-off value at which a test result is considered either positive or negative was prespecified before the start of the study. Stating a threshold value after data collection and analysis may give room for manipulation to maximize a test characteristic (4,23,24).

Not stating CIs of accuracy measurements.—We determined whether the confidence intervals (CIs) of the accuracy estimates were reported. CIs enable readers to appreciate the precision of the accuracy measurements. Without these data, it is difficult to assess the clinical applicability of the tests (25–27).
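Such an interval is simple to compute. A minimal Python sketch using the exact (Clopper-Pearson) method, with hypothetical counts that echo the small-sample example in the Discussion (all 10 of 10 diseased patients test positive):

    from scipy.stats import binomtest

    # Hypothetical: 10 of 10 diseased patients test positive (sensitivity 100%).
    ci = binomtest(k=10, n=10).proportion_ci(confidence_level=0.95,
                                             method="exact")
    print(f"sensitivity = 1.00, 95% CI: {ci.low:.2f} to {ci.high:.2f}")
    # Prints a CI of roughly 0.69 to 1.00: the interval, not the point
    # estimate, conveys how little 10 cases actually establish.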


Using inappropriate statistical tests.—We evaluated the tests of significance used to compare the accuracy measurements of the index test and reference standard or those that compared the accuracy measurements of multiple tests. In diagnostic accuracy studies, the test of significance to be used depends on the role of the test under evaluation and whether the tests are performed in different groups of patients (unpaired design) or in the same patients (paired design) (28).
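For the paired case, one widely used choice is McNemar's test, which compares the two tests on their discordant pairs only. A minimal Python sketch with hypothetical counts (an illustration of one such test, not a prescription; reference 28 discusses which test fits which design):

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical paired design: both tests applied to the same 100
    # diseased patients. Rows: test A +/-; columns: test B +/-.
    table = np.array([[60, 15],   # A+B+, A+B-
                      [5, 20]])   # A-B+, A-B-
    # The test uses only the discordant cells (15 vs 5).
    result = mcnemar(table, exact=True)
    print(f"two-sided p = {result.pvalue:.3f}")  # ~0.041 for these counts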
Two authors scored each article independently. The abstract and main text sections (introduction, methods, results, discussion, and conclusion) were read. One author (E.A.O.) scored all the selected articles. The other five authors (M.C.d.H., J.B.R., L.H., P.M.B., with 4, 18, 13, and 25 years of experience, respectively, and M.M.G.L.) scored the same articles in predetermined proportions. Disagreement was resolved by consensus or by consultation with a third party when needed.

Analysis

We calculated the proportion of articles with actual and potential overinterpretation, and the level of interrater agreement in scoring both. We anticipated that 40% of articles would have a distorted presentation or overinterpretation (7), so we calculated that evaluating 100 articles would produce a two-sided 95% CI of 30%-50% by using the exact (Clopper-Pearson) method (29,30).

We analyzed all the included studies and the subset of imaging studies to estimate the frequencies of actual and potential overinterpretation by using SAS version 9.2 (SAS Institute, Cary, NC).
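The exact-interval arithmetic behind that sample-size choice can be reproduced from the beta-quantile form of the Clopper-Pearson interval (29). A minimal Python sketch (the study itself used PASS [30] and SAS; this re-implementation is illustrative only):

    from scipy.stats import beta

    def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
        """Exact (Clopper-Pearson) two-sided CI for a binomial proportion."""
        lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
        hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
        return lo, hi

    # 40 of 100 articles overinterpreted: CI of about 30% to 50%, as anticipated
    print(clopper_pearson(40, 100))  # ~(0.303, 0.503)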


Results

Search Results

Our initial search yielded 6978 articles. After the titles and abstracts were read, 6558 articles were deemed to be ineligible. Of the remaining 420 potentially eligible articles, we randomly selected 140 articles for evaluation by using STATA version 10.0 (Stata, College Station, Tex) with the code "sample 140, count". We included more articles than indicated by our sample size calculation to compensate for any false-positive articles in the random selection. After the full texts of these 140 articles were assessed, 14 studies were excluded (Fig 1). A total of 126 studies were included in the final analysis.

Characteristics of Included Articles

Details of the study characteristics are outlined in Table 1. In summary, the median impact factor of all the included articles was 5.5 (range, 4.0–16.2), and the median sample size was 151 (range, 12–20 765). Of all the tests evaluated in the included articles, imaging tests formed the largest group (n = 53 of 126, 42%).

Table 1: Summary of Study Characteristics

Characteristic | All Studies (n = 126)
Journal impact factor, median* | 5.491 (4.014–16.225)
Sample size, median (range) | 150.5 (12–20 765)
Type of test evaluated
  Clinical diagnosis (history and physical examination) | 13 (10) [5, 16]
  Imaging test | 53 (42) [33, 51]
  Biochemical test | 13 (10) [5, 16]
  Molecular test | 14 (11) [6, 17]
  Immunodiagnostic test | 14 (11) [6, 17]
  Other | 19 (15) [9, 21]
Number of tests evaluated
  Single test | 64 (51) [42, 60]
  Comparator test | 62 (49) [40, 58]
Study design
  Cross sectional | 102 (81) [74, 88]
  Longitudinal (with follow-up) | 17 (13) [7, 19]
  Diagnostic randomized design | 1 (1) [0, 2]
  Case control | 2 (2) [0, 4]
  Unclear | 5 (4) [0, 6]
Sampling method
  Consecutive series | 60 (48) [39, 56]
  Random series | 6 (5) [1, 8]
  Convenience sampling | 23 (18) [11, 25]
  Multistage stratified sampling | 1 (1) [0, 2]
  Unclear | 36 (28) [21, 37]
Number of groups from which patients were sampled
  One group | 87 (69) [61, 77]
  Different groups | 30 (24) [16, 31]
  Unclear | 9 (7) [3, 12]

Note.—Unless otherwise indicated, data are numbers of studies, with percentages in parentheses and 95% CIs in brackets.
* Numbers in parentheses are the range.

Agreement

Interrater agreement for scoring both actual and potential overinterpretation is outlined in Appendix E3 (online).

Actual Overinterpretation

Of 126 articles, 39 (31%; 95% CI: 23%, 39%) contained a form of actual overinterpretation (Table 2). The most frequent form of overinterpretation was an overly optimistic abstract (n = 29 of 126; 23%; 95% CI: 16%, 30%). Of these, 22 (17%; 95% CI: 11%, 24%) articles had stronger test recommendations or conclusions in the abstract than in the main text, and seven (6%; 95% CI: 2%, 10%) articles had selectively reported favorable results in the abstract.

Of the 53 included imaging studies, 16 articles (30%; 95% CI: 17%, 43%) contained a form of actual overinterpretation (Table 2). As in the overall sample, the most frequent form of actual overinterpretation was an overly optimistic abstract, in 13 articles (25%; 95% CI: 13%, 37%).

Table 2: Actual Overinterpretation in Diagnostic Accuracy Studies

Form of Actual Overinterpretation | All Studies (n = 126) | Imaging Studies (n = 53)
Overly optimistic abstract | 29 (23) [16, 30] | 13 (25) [13, 37]
  Stronger conclusion in abstract | 22 (17) [11, 24] | 9 (17) [9, 30]
  Selective reporting of results in abstract | 7 (6) [2, 10] | 4 (8) [3, 20]
Study conclusions based on selected subgroups | 8* (10) [5, 19] | 3† (7) [2, 21]
Discrepancy between aim and conclusion | 10 (8) [3, 13] | 2 (4) [0, 9]
Articles with one or more forms of actual overinterpretation | 39 (31) [23, 39] | 16 (30) [17, 43]

Note.—Data are numbers of studies, with percentages in parentheses and 95% CIs in brackets. Studies can have more than one form of actual overinterpretation.
* Eighty articles included analysis of subgroups (n = 80).
† Forty-one imaging articles included analysis of subgroups (n = 41).

Potential Overinterpretation

Details of the potential forms of overinterpretation are outlined in Table 3. Of the 126 included articles, only 14 (11%) reported a sample size calculation, and only 15 (12%) reported an explicit study hypothesis. All imaging studies (n = 53) contained a form of potential overinterpretation; of these, only five (9%) articles reported a sample size calculation, and only six (11%) reported an explicit study hypothesis. Examples of overinterpretation are provided in Table 4.

Table 3: Potential Overinterpretation in Diagnostic Accuracy Studies

Form of Potential Overinterpretation | All Studies (n = 126) | Imaging Studies (n = 53)
Sample size calculation not reported | 112 (89) [83, 94] | 48 (91) [82, 99]
Test hypothesis not stated | 111 (88) [82, 94] | 47 (89) [80, 98]
CIs of accuracy measurements not reported | 72 (57) [48, 66] | 26 (49) [35, 63]
Role of test not stated or unclear | 54 (43) [34, 52] | 17 (32) [19, 45]
Groups for subgroup analysis not prespecified in the methods section of main text | 25/80* (31) [21, 42] | 8/41† (20) [8, 36]
Positivity thresholds of continuous tests not reported | 22/63‡ (35) [24, 48] | 6/25§ (24) [10, 45]
Use of inappropriate statistical tests | 4 (3) [0, 6] | 3 (6) [0, 12]
Overall proportion of articles with one or more forms of potential overinterpretation | 125 (99) [98, 100] | 53 (100) [92, 100]

Note.—Data are numbers of studies, with percentages in parentheses and 95% CIs in brackets. Studies can have more than one form of potential overinterpretation.
* Eighty articles included analysis of subgroups.
† Forty-one imaging articles included analysis of subgroups.
‡ Sixty-three articles included evaluation of continuous tests.
§ Twenty-five articles included evaluation of continuous tests.

Table 4: Examples of Actual Overinterpretation

1. An abstract with a stronger conclusion (31).
   Conclusion in main text: "Detection of antigen in BAL using the MVista antigen appears to be a useful method (…) Additional studies are needed in patients with pulmonary histoplasmosis."
   Conclusion in abstract: "Detection of antigen in BAL fluid complements antigen detection in serum and urine as an objective test for histoplasmosis."
2. Conclusions drawn from selected subgroups (32). A study evaluated the aptness of F-desmethoxyfallypride (F-DMFP) PET for the differential diagnosis of idiopathic parkinsonian syndrome (IPS) and non-IPS in a series of 81 patients with a clinical diagnosis of parkinsonism. The authors compared several F-DMFP PET indexes for the discrimination of IPS and non-IPS and reported only the best sensitivity and specificity estimates. They concluded that F-DMFP PET was an accurate method for differential diagnosis.
3. Disconnect between the aim and conclusion of the study (33). The study design described in this paper aimed to evaluate the sensitivity and specificity of the IgM anti-EV71 assay; however, the conclusion is not about accuracy but focuses on other measurements of diagnostic performance.
   Aim of study: "The aim of this study was to assess the performance of detecting IgM anti-EV71 for early diagnosis of patients with HFMD."
   Conclusion: "The data here presented show that the detection of IgM anti-EV71 by ELISA affords a reliable convenient and prompt diagnosis of EV71. The whole assay takes 90 mins using readily available ELISA equipment, is easy to perform with low cost which make it suitable in clinical diagnosis as well as in public health utility."

Note.—BAL = bronchoalveolar lavage, PET = positron emission tomography, IgM = immunoglobulin M, EV71 = enterovirus 71, HFMD = hand, foot, and mouth disease, ELISA = enzyme-linked immunosorbent assay.

Discussion

Our study results showed that about three of 10 studies of the diagnostic accuracy of biomarkers or other medical tests published in journals with an impact factor of 4 or higher contained overinterpreted results, and 99% of studies contained practices that facilitate overinterpretation. The most common form of actual overinterpretation was an overly optimistic abstract (approximately one in four studies), with stronger conclusions or test recommendations than those in the main text (approximately one in five). In terms of practices that facilitate overinterpretation, or potential overinterpretation, most of the studies did not report an a priori formulated test hypothesis, did not include a corresponding sample size calculation, and did not include CIs for accuracy measurements. In a closely related study, Lumbreras and colleagues (9) evaluated overinterpretation of the clinical applicability of molecular diagnostic tests only; of the 108 evaluated articles, 61 (56%) had overinterpretations of the clinical applicability of the molecular test under study.


The defining strengths of our study were that we analyzed a sample of diagnostic accuracy studies evaluating a wide range of tests and that we defined overinterpretation in terms of common features that apply to most tests. To limit subjectivity, we systematically searched for diagnostic studies with a validated search strategy, and the articles were scored by two authors independently by using a pretested data-extraction form.

The forms of overinterpretation that we found in our study may have several implications for diagnostic research and practice. One of the most important consequences might be that diagnostic accuracy studies with optimistic conclusions may be highly cited, leading to a cascade of inflated and questionable evidence in the literature. Subsequently, this may translate to the premature adoption of tests in clinical practice. In a recently published review, Ioannidis and Panagiotou (34) reported that highly cited biomarker studies often had inflated results: of the highly cited studies included in their review, 86% had larger effect sizes than those in the largest study and 83% had larger effect sizes than those in a corresponding meta-analysis evaluating the same biomarker. Of note, the larger studies and meta-analyses received fewer citations (35).

Our study was largely limited to what was reported. For instance, there was no guarantee that, because subgroups or thresholds were listed in the methods, they were indeed prespecified. An alternative is to look at study protocols, but unlike in randomized trials, protocols of diagnostic accuracy studies are not always registered. Another limitation of our study was the considerable variation in the interrater agreement for scoring. The overall scoring of articles was difficult because many articles were poorly reported; many authors had not used the STARD guidelines in preparing their manuscripts. The suboptimal use of STARD has also been documented in previous reports (36–41).

Comprehensively evaluating overinterpretation in diagnostic studies depends on the context in which the test is used. For instance, overinterpretation can occur when positive recommendations are made for the clinical use of tests even if the accuracy measurements do not justify this. Because of the wide range of tests evaluated in our study, it was difficult to determine a standard cutoff measure to define low and high accuracy measurements. Preferred accuracy measurements differ and depend on the type of test used, the role of the test, the target condition, the setting in which the test is being evaluated, and the accuracy of other available methods. A sensitivity of 80% may be "definitely useful" in one situation but useless in another.

In addition, not reporting CIs may be regarded as either actual or potential overinterpretation, depending on the context. For instance, reporting very high point estimates, such as 99%, without CIs on the basis of a small sample size, such as 10 cases, may be regarded as actual overinterpretation. On the other hand, not reporting CIs for moderate or high estimates with large sample sizes, or in comparative evaluations of two tests when one is statistically superior, can be regarded as potential overinterpretation.

To curb the occurrence of overinterpretation and misreporting of results in diagnostic accuracy studies, we recommend that journals continually emphasize that submitted manuscripts must be reported according to the STARD reporting guidelines. This may also diminish the methodologic conditions that can lead to overinterpretation. Readers largely depend on abstracts to draw conclusions about an article, and sometimes, when full texts are not available, they may make decisions on the basis of abstracts alone (7,42,43). Hence, reviewers must be more stringent when reading abstracts of submitted manuscripts to ensure that the abstracts are fair representations of the main texts. We hope that highlighting the forms of overinterpretation will enable peer reviewers to correctly identify overly optimistic reports of diagnostic accuracy studies and encourage investigators to be clearer in designing, more transparent in reporting, and more stringent in interpreting test accuracy studies.

Acknowledgments: We thank René Spijker, MSc (Dutch Cochrane Centre, University of Amsterdam), for assisting in the development of the search strategy of this project. We also thank Gordon Guyatt, BSc, MD, MSc, FRCPC (McMaster University, Canada), and Paul Glasziou, PhD, FRACGP, MRCGP (Bond University), for their comments on an earlier version of this manuscript. Mr Spijker, Dr Guyatt, and Dr Glasziou received no compensation for their contributions.

Disclosures of Conflicts of Interest: E.A.O. No relevant conflicts of interest to disclose. M.C.d.H. No relevant conflicts of interest to disclose. J.B.R. No relevant conflicts of interest to disclose. L.H. No relevant conflicts of interest to disclose. P.M.B. No relevant conflicts of interest to disclose. M.M.G.L. No relevant conflicts of interest to disclose.

References

1. Fletcher RH, Black B. "Spin" in scientific writing: scientific mischief and legal jeopardy. Med Law 2007;26(3):511–525.
2. Horton R. The rhetoric of research. BMJ 1995;310(6985):985–987.
3. Marco CA, Larkin GL. Research ethics: ethical issues of data reporting and the quest for authenticity. Acad Emerg Med 2000;7(6):691–694.
4. Bossuyt PM, Reitsma JB, Bruns DE, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Ann Intern Med 2003;138(1):W1–W12.
5. Zinsmeister AR, Connor JT. Ten common statistical errors and how to avoid them. Am J Gastroenterol 2008;103(2):262–266.
6. Scott IA, Greenberg PB, Poole PJ. Cautionary tales in the clinical interpretation of studies of diagnostic tests. Intern Med J 2008;38(2):120–129.
7. Boutron I, Dutton S, Ravaud P, Altman DG. Reporting and interpretation of randomized controlled trials with statistically nonsignificant results for primary outcomes. JAMA 2010;303(20):2058–2064.


8. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD Initiative. Ann Intern Med 2003;138(1):40–44.
9. Lumbreras B, Parker LA, Porta M, Pollán M, Ioannidis JP, Hernández-Aguado I. Overinterpretation of clinical applicability in molecular diagnostic research. Clin Chem 2009;55(4):786–794.
10. Smidt N, Rutjes AW, van der Windt DA, et al. Quality of reporting of diagnostic accuracy studies. Radiology 2005;235(2):347–353.
11. Devillé WL, Bezemer PD, Bouter LM. Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. J Clin Epidemiol 2000;53(1):65–69.
12. Knottnerus JA, Muris JW. Assessment of the accuracy of diagnostic tests: the cross-sectional study. In: Knottnerus JA, ed. The evidence base of clinical diagnosis. London, England: BMJ Publishing Group, 2002; 39–59.
13. Irwig L, Bossuyt P, Glasziou P, Gatsonis CA, Lijmer JG. Designing studies to ensure that estimates of test accuracy will travel. In: Knottnerus JA, ed. The evidence base of clinical diagnosis. London, England: BMJ Publishing Group, 2002; 95–116.
14. Chalmers I. Underreporting research is scientific misconduct. JAMA 1990;263(10):1405–1408.
15. Ioannidis JP. Why most published research findings are false. PLoS Med 2005;2(8):e124.
16. Young NS, Ioannidis JP, Al-Ubaydli O. Why current publication practices may distort science. PLoS Med 2008;5(10):e201.
17. Wang R, Lagakos SW, Ware JH, Hunter DJ, Drazen JM. Statistics in medicine—reporting of subgroup analyses in clinical trials. N Engl J Med 2007;357(21):2189–2194.
18. Lagakos SW. The challenge of subgroup analyses—reporting without distorting. N Engl J Med 2006;354(16):1667–1669.
19. Pepe MS, Feng Z, Janes H, Bossuyt PM, Potter JD. Pivotal evaluation of the accuracy of a biomarker used for classification or prediction: standards for study design. J Natl Cancer Inst 2008;100(20):1432–1438.
20. Fritz JM, Wainner RS. Examining diagnostic tests: an evidence-based perspective. Phys Ther 2001;81(9):1546–1564.
21. Bossuyt PM, Irwig L, Craig J, Glasziou P. Comparative accuracy: assessing new tests against existing diagnostic pathways. BMJ 2006;332(7549):1089–1092.
22. Montori VM, Jaeschke R, Schünemann HJ, et al. Users' guide to detecting misleading claims in clinical research reports. BMJ 2004;329(7474):1093–1096.
23. Ewald B. Post hoc choice of cut points introduced bias to diagnostic research. J Clin Epidemiol 2006;59(8):798–801.
24. Leeflang MM, Moons KG, Reitsma JB, Zwinderman AH. Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions. Clin Chem 2008;54(4):729–737.
25. Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 1999;318(7194):1322–1323.
26. Habbema JDF, Eijekmans R, Krijnen P, Knottnerus JA. Analysis of data on the accuracy of diagnostic tests. In: Knottnerus JA, ed. The evidence base of clinical diagnosis. London, England: BMJ Publishing Group, 2002; 117–144.
27. Altman DG. Why we need confidence intervals. World J Surg 2005;29(5):554–556.
28. Hayen A, Macaskill P, Irwig L, Bossuyt P. Appropriate statistical methods are required to assess diagnostic tests for replacement, add-on, and triage. J Clin Epidemiol 2010;63(8):883–891.
29. Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 1934;26(4):404–413.
30. Hintze JL. PASS 11 (Power Analysis and Sample Size). Kaysville, Utah: NCSS, 2011.
31. Hage CA, Davis TE, Fuller D, et al. Diagnosis of histoplasmosis by antigen detection in BAL fluid. Chest 2010;137(3):623–628.
32. la Fougère C, Pöpperl G, Levin J, et al. The value of the dopamine D2/3 receptor ligand 18F-desmethoxyfallypride for the differentiation of idiopathic and nonidiopathic parkinsonian syndromes. J Nucl Med 2010;51(4):581–587.
33. Xu F, Yan Q, Wang H, et al. Performance of detecting IgM antibodies against enterovirus 71 for early diagnosis. PLoS ONE 2010;5(6):e11388.
34. Ioannidis JP, Panagiotou OA. Comparison of effect sizes associated with biomarkers reported in highly cited individual articles and in subsequent meta-analyses. JAMA 2011;305(21):2200–2210.
35. Bossuyt PM. The thin line between hope and hype in biomarker research. JAMA 2011;305(21):2229–2230.
36. Paranjothy B, Shunmugam M, Azuara-Blanco A. The quality of reporting of diagnostic accuracy studies in glaucoma using scanning laser polarimetry. J Glaucoma 2007;16(8):670–675.
37. Bossuyt PM. STARD statement: still room for improvement in the reporting of diagnostic accuracy studies. Radiology 2008;248(3):713–714.
38. Wilczynski NL. Quality of reporting of diagnostic accuracy studies: no change since STARD statement publication—before-and-after study. Radiology 2008;248(3):817–823.
39. Fontela PS, Pant Pai N, Schiller I, Dendukuri N, Ramsay A, Pai M. Quality and reporting of diagnostic accuracy studies in TB, HIV and malaria: evaluation using QUADAS and STARD standards. PLoS ONE 2009;4(11):e7753.
40. Areia M, Soares M, Dinis-Ribeiro M. Quality reporting of endoscopic diagnostic studies in gastrointestinal journals: where do we stand on the use of the STARD and CONSORT statements? Endoscopy 2010;42(2):138–147.
41. Selman TJ, Morris RK, Zamora J, Khan KS. The quality of reporting of primary test accuracy studies in obstetrics and gynaecology: application of the STARD criteria. BMC Womens Health 2011;11:8.
42. Pitkin RM, Branagan MA, Burmeister LF. Accuracy of data in abstracts of published research articles. JAMA 1999;281(12):1110–1111.
43. Beller EM, Glasziou PP, Hopewell S, Altman DG. Reporting of effect direction and size in abstracts of systematic reviews. JAMA 2011;306(18):1981–1982.
