
Fundamentals of Clinical Research
for Radiologists

Editors' Introduction to the Series

Craig A. Beam 1
C. Craig Blackmore 2
Steven Karlik 3
Caroline Reinhold 4

    Research is the "eccentric uncle" of radiology. The specialty acknowledges "his" presence, brings "him" out at appropriate times to be viewed and admired, and, when the mood strikes, pays homage to "his" importance. However, the specialty has always treated research at arm's length, outside the greater, clinical concerns of organized radiology [1].

The preceding telling quote was uttered by Charles Putman and comes from a special article reporting the findings of the 1991 Radiology Summit Meeting [1]. This meeting, one in a series of annual events sponsored by the Intersociety Commission of the American College of Radiology (ACR), was held in Asheville, NC. For this meeting, radiology leaders from the United States and Canada were invited to discuss the issue of how to improve the research performed by radiologists. Obviously, the point of the quotation is that leaders in radiology think it is time to assign research a role greater than that of the too often ignored and impotent relative.

The group reached the consensus that research has important intrinsic values both to the specialty and to individual radiologists and made the following recommendation [1]:

    To improve understanding of the value and methods of research, all trainees and faculty should receive basic instruction in critically reading the medical literature, experimental design, and biostatistics. Those wishing to conduct research should receive more extensive training.

The Value of Research to Radiology

Examples of the value of research to the specialty of radiology are not hard to find. The intimate synergistic relationship with technology is obvious. Isn't it equally apparent that research is the means by which radiologists maintain leadership of technical innovation and utilization? From a more pedestrian perspective, research can be seen as a means to protect and expand "turf." As an example, consider the fact that research by radiologists in minimally invasive therapies, and development of these techniques, has allowed radiologists to assume a dominant role in this area. However, many believe that this area of interventional radiology is currently at risk of being swallowed by the surgical specialties. Active research and continued leadership in innovation and technology improvement by members of our specialty will help radiology maintain a primary role and prevent attrition in the many areas of radiology practice.

Finally, from a loftier perspective, research is essential for practicing good medicine. We all have anecdotes about how cautious we must be in drawing conclusions from limited and subjective experience. For example, that we have diagnosed a case of pericardial tamponade from CT findings does not mean that CT is the imaging modality of choice for this condition, or that all patients at risk for pericardial tamponade should undergo CT. Good medicine requires decision making based on evidence, and research is the method by which this evidence is acquired, synthesized, and put into action. Sometimes this pattern of research is codified into practice guidelines and disseminated for the benefit of other practitioners. Greater effort in conducting the research for developing practice guidelines in radiology is needed.

Received July 14, 2000; accepted without revision July 14, 2000.
Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.
This is the introduction to the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).
Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.
1 Department of Radiology, Medical College of Wisconsin, 8701 Watertown Plank Rd., Milwaukee, WI 53226. Address correspondence to C. A. Beam.
2 Department of Radiology, University of Washington, 325 Ninth Ave., Box 359728, Seattle, WA 98104.
3 Diagnostic Radiology and Nuclear Medicine, University of Western Ontario, London Health Sciences Center, University Campus, 339 Windermere Rd., London, Ontario N6A 5A5, Canada.
4 Department of Radiology, Montreal General Hospital, 1650 Cedar Ave., Montreal P. Q. H3G 1A4, Canada.
AJR 2001;176:323–325
0361–803X/01/1762–323
© American Roentgen Ray Society

AJR:176, February 2001 323


Goals and Overview

Recently, the ACR and the Canadian Association of Radiologists (CAR) formed a joint executive panel to address the need for improved training in clinical research methods. The outgrowth of the work of this panel was the commitment of these two organizations and the American Journal of Roentgenology to publish a series of articles and to develop interactive software to meet this training need. It was decided that the goal was to establish a program that will allow the progressive education of trainees and junior faculty interested in clinical research so that they can proceed from a level of nearly total ignorance to one of methodologic sophistication capable of critically understanding the literature, intelligently applying the results of research to clinical practice, communicating with methodology experts, and directing independent research.

To meet this goal, the coeditors of the series designed 22 articles with associated software that form modules of self-instruction. Each journal article and its associated software are intended to be complementary, not repetitive, learning experiences. The software, which will be available on the ACR Web site, will help readers better understand, evaluate, and refine their mastery of the material, as well as allow them to practice what has been learned. As outlined in the Appendix, it is our plan to offer an initial series of six modules at a basic level, eight modules at an intermediate level, and eight advanced-level modules.

The editors are sensitive to the ease with which methodology articles can become user-unfriendly when discussing statistical aspects of research. To ensure that modules are applicable to all readers, articles about statistical methods will show relevance to clinical radiology research by providing examples of the methods from the radiology literature within the past 10 years. These articles will accentuate concepts, definitions, and rules for use. Pictures and diagrams will be encouraged. Formulas will be discouraged and, if absolutely necessary, will be limited to an appendix.

The goal for the more advanced statistical articles is to give readers a basic understanding of the research methods available and to evaluate their appropriateness when used in the literature. This will allow readers to critically review the statistical methods section of an article or proposal and the resultant interpretation of results. This is a critical skill, not only for the researcher but also for the clinical radiologist, who must continuously reassess his or her daily practice on the basis of new information available in the literature. Another goal is to provide the reader with a sufficient base with which to conduct independent radiology research. However, these modules are not designed to replace formal training in epidemiology and biostatistics. It is our goal that these modules provide enough background so that radiologists know when to seek statistical expertise and to facilitate communication with experts in methodology.

Discussion

Any radiologist wanting to conduct or lead research must be knowledgeable about the research tradition that has arisen in medicine over the past 100 years and about the methodologic advances that have been made in the past 10 years. In addition, every radiologist needs to be able to critically appraise the medical literature—within and outside the specialty—to make the best use of new information for their patients. Because much of the medical literature is the reporting of research, a fundamental knowledge of research is essential to every practicing radiologist. In brief, to understand the message of research, radiologists must understand the methods of research.

Training residents, fellows, and junior faculty in facets of research and critical inquiry by radiology departments in both the United States and Canada is recognized by leaders of our specialty to be a critical need. However, exposure to the discipline of research has been sporadic in distribution and nonuniform in content. The recent evaluation of the introduction to research program for second-year residents by the Radiological Society of North America, Association of University Radiologists, and American Roentgen Ray Society indicated that such a program encourages development of research careers in those individuals who are oriented to research independently of participating in the training program [2].

Although there is considerable background information available for teaching aspects of critical inquiry, these materials are tailored to the academic disciplines from which they arise and are sometimes too esoteric for the specific needs of radiologists and radiology residents. What is needed is a thorough introduction to the topics with radiology-specific examples cast in a professor- and student-friendly manner.

Stolberg et al. [3] have recently detailed aspects of a core curriculum in the evaluative sciences for diagnostic imaging. The list of desirable areas of interest includes clinical epidemiology, scientific method and study design, evaluation of diagnostic tests and screening, biostatistics and health economics, and technology assessment. One of the significant issues they identified was the mechanism to teach the evaluative sciences to radiology residents. Specifically, their discussion focused on problem-based or lecture-based alternatives, with an argument being made that some combination of the two would likely be optimal, depending on the individual program. Our ACR–CAR program will be a resource for all radiology residency programs that could be presented by local experts. The software will provide an interactive, problem-based adjunct to this presentation.

Our goal is to commission 22 modules. This is a prodigious list of concepts that most practicing radiologists in the United States and Canada have likely not had the opportunity to study. These topics remain significant by their absence in many radiology training programs today. Without an appreciation of these issues and their vital role in producing research excellence, radiology publications will continue in their time-honored and out-of-date series descriptions.

However, the materials we are proposing in this series can be considered only as expert resources. They will give specific and comprehensive information at the junior level to form a basis for the teaching of critical inquiry to radiology residents, fellows, and junior faculty. The materials are meant to support, not replace, institutional instruction in these disciplines. With such materials easily available throughout the radiology community, it will become a far easier task to ensure exposure of radiologists and residents to these very important topics.

These efforts are not meant to dilute any of the essential aspects of the radiology training program. On the contrary, this series will provide a specific, highly concentrated, and relevant primer in critical inquiry. It is time that radiology incorporates these effective and scientific aspects into the discipline. Otherwise, our research efforts might come to be regarded as the "eccentric uncle" of medicine.

References

1. Hillman BJ, Putman CE. Fostering research by radiologists: recommendations of the 1991 summit meeting. Radiology 1992;182:315–318
2. Hillman BJ, Nash KD, Witzke DB, Fajardo LL, Davis D. The RSNA-AUR-ARRS introduction to research program for 2nd year radiology residents: effect on career choice and early academic performance. Radiology 1998;209:323–326
3. Stolberg HO, Norman GR, Moran LA, Gafni A. A core curriculum in the evaluative sciences for diagnostic imaging. Can Assoc Radiol J 1998;49:295–306


APPENDIX: Series Modules

Basic Modules
• Introduction to clinical research for radiologists
• The research framework
• How to develop and critique a research protocol—meeting the "so what?" challenge
• Selecting a study population
• Collecting data
• Statistically engineering the study for success

Intermediate Modules
• Critical literature review
• Screening
• Exploring, presenting, and summarizing data
• Probability and samples
• Clinical evaluation of diagnostic technology
• Observational studies
• Decision analysis and simulation modeling
• Outcomes studies

Advanced Modules
• Inference on means and medians
• Estimating and comparing proportions
• Reader agreement studies
• Correlation and regression
• Multivariate statistical methods
• Receiver operating characteristic curve analysis
• Survival analysis
• Assessing the evidence: methods for combining published data


Fundamentals of Clinical Research Series
(July 2000 – February 2006)

Contents

Introduction
The Challenge of Clinical Radiology Research
The Research Framework
How to Develop and Critique a Research Protocol
Data Collection in Radiology Research
Population and Sample
Statistically Engineering the Study for Success
Screening for Preclinical Disease: Test and Disease Characteristics
Exploring and Summarizing Radiologic Data
Visualizing Radiologic Data
Introduction to Probability Theory and Sampling Distribution
Observational Studies in Radiology
Randomized Controlled Trials
Clinical Evaluation of Diagnostic Tests
ROC Analysis
Statistical Inference for Continuous Variables
Statistical Inference for Proportions
Reader Agreement Studies
Correlation and Regression
Survival Analysis
Multivariate Statistical Methods
Decision Analysis and Simulation Modeling for Evaluating Diagnostic Tests on the Basis of Patient Outcomes
Radiology Cost and Outcomes Studies: Standard Practice and Emerging Methods
Meta-Analysis of Diagnostic and Screening Test Accuracy Evaluations: Methodologic Primer
Fundamentals of Clinical Research
for Radiologists

The Challenge of Clinical Radiology Research

C. Craig Blackmore 1

Received July 14, 2000; accepted without revision July 14, 2000.
C. C. Blackmore received salary support as a General Electric–Association of University Radiologists Academic Fellow.
Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.
This is the first in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology.
1 Department of Radiology, University of Washington, Harborview Medical Center, 325 Ninth Ave., Box 359728, Seattle, WA 98104. Address correspondence to C. A. Beam, Department of Radiology, Medical College of Wisconsin, 8701 Watertown Plank Rd., Milwaukee, WI 53226.
AJR 2001;176:327–331
0361–803X/01/1762–327
© American Roentgen Ray Society

The development of new technology traditionally has been the lifeblood of radiology. Many of the spectacular advances in medicine over the past few decades have centered around radiology. One does not have to go far into the past to predate the development of CT, MR imaging, and sonography, technologies that now are omnipresent, critical components of medical care. Yet for all the advances in the development of imaging technology, radiology research has come under deserved criticism in its efforts to assess the effectiveness and appropriate use of such imaging technology [1–5]. Production of a technologically adequate image is a starting point, but it is only the first step in determining whether such a technology should be used in clinical care. To be useful, an imaging study must also be accurate and provide information that has the potential to change the medical care, and ultimately the health, of the patient [6, 7].

This article is the first of an ongoing series that, taken together, will form a comprehensive teaching primer on basic and advanced concepts in technology assessment and outcomes research as described in the introductory article in this month's issue of the American Journal of Roentgenology (AJR) [8]. This series of articles published in the AJR will form one component of the research course cosponsored by the American College of Radiology and the Canadian Association of Radiologists on the fundamentals of clinical research for radiologists. Tightly linked with these articles will be Web-based interactive teaching modules. The intent of this integrated series is to be progressive, starting with basic introductory concepts and gradually adding complexity through intermediate and more advanced modules. The objective is to provide a pathway for the novice researcher to learn to critically appraise the literature and to conduct evidence-based radiology, to communicate effectively with methodology experts, and finally, to perform or direct independent, scientifically valid, and clinically useful research.

The concepts introduced in this first article will be by design simplistic. The intent of this first module is to introduce the scope of the material that is to be covered in much greater detail in the sessions to come. Many of the major concepts of rigorous technology assessment will be introduced, with detailed discussions to follow in future modules. This introduction describes the problems of research in radiology and attempts to provide the radiology investigator with an understanding of some of the potential pitfalls to be avoided.

Evidence-Based Radiology

Every day in the clinical practice of radiology, we make observations and adjust our practice accordingly. Many of the great advances in science have arisen from just such observations. The fortuitous observation that bacteria colonies did not grow around bread mold led Alexander Fleming to discover penicillin. In radiology, we constantly observe the imaging appearances of diseases and healthy states and subtly adjust our thresholds for interpretation. However, at the same time, this simple anecdote and experience is, by definition, limited to what we personally have seen and is most strongly influenced by what we have seen recently. We have all observed the phenomenon that after a patient

AJR:176, February 2001 327


Blackmore

presents with a rare and difficult-to-diagnose disease, the next group of patients that appear at all similar will be examined for that same malady. Our belief that a disease is rare is shaken by the fact that we have seen it, and have seen it recently. The same is true for the use of diagnostic technology. For example, that we have diagnosed a case of testicular seminoma from CT findings does not mean that CT is the imaging modality of choice for this condition, or that all patients at risk for testicular seminoma should undergo CT.

To supersede this practice based on anecdote, the field of evidence-based medicine has evolved and has become the standard for medical practice [9, 10]. Although less established in radiology than in other areas of medicine, this evidence-based paradigm is no less relevant for radiology [11]. The construct underlying evidence-based medicine is that one individual’s experience is limited. Decisions should be based on the best evidence from the medical literature rather than one’s own limited experience [9, 11]. As a corollary, as physicians we tend to cling to what we were taught in residency or fellowship, often by acknowledged experts in the field. However, the evidence-based paradigm suggests that the experts are also individuals, and we should trust their anecdotal experience only somewhat more than we trust our own. Instead, practice should be guided by rigorous scientific investigation [9, 11, 12].

The major source for the evidence on which to base practice is the medical literature. With the rapid proliferation in radiology technology has come a parallel increase in the volume of the radiology literature. There are now more than 40 radiology journals and more than 4000 articles published each year [13]. However, the published literature has its own perils and should be interpreted with a critical eye. First, case reports, even if published, are essentially anecdotes that are codified in print. Although they are often interesting, may be provocative, and can invoke questions for scientific study, they should not form the basis for practice. Second, and more insidious, are published reports that, although well intended, contain biases or flaws in the methodology that attenuate the applicability of the results into practice. A central tenet of evidence-based medicine is that the literature must be analyzed critically, and only those studies that are robust should be used as the basis for practice [11, 14]. A useful framework for evaluating the value of a literature article is promoted by Kent et al. [2], who propose a four-grade scale (Appendix). At the top level (grade A) are methodologically rigorous studies with broad generalizability, including large randomized clinical trials and prospective comparisons of diagnostic test results to an appropriate gold standard. At the bottom level of this hierarchy are grade D studies, which include multiple methodologic flaws, biases in study design, or unsubstantiated opinion [2, 15]. Most of the radiology literature relates to development of new techniques and descriptive work. Actual assessment of these new technologies and determination of any impact on patient outcome is relatively uncommon [4]. Few grade A or B studies exist. New radiology technologies have been rapidly developed and disseminated, often without adequate proof of efficacy [1, 16, 17]. Although radiologists may not have paid great attention to the shortcomings in their research efforts, these limitations may have been more apparent to the remainder of the medical community.

Early studies of MR imaging represent an illustrative example of how radiology research has come under external criticism, particularly for methodologic deficiencies. Developed in the 1970s and early 1980s, MR imaging was initially greeted with a variety of investigations and reports in the radiology literature in particular, describing the exciting potential of this new modality. However, most of this early research was merely descriptive. Those studies that attempted to assess even accuracy were limited in size and generally suffered from important design flaws [2, 16, 18, 19]. A 1988 article by Cooper et al. [16] noted that none of the initial 54 research reports on the efficacy of MR imaging met accepted contemporary standards for research design. The article concluded that “health care professionals paying for expensive innovative technology should demand better research on diagnostic efficacy.” In 1994, Kent et al. [2] found that of 142 studies of MR neuroimaging published through 1993, only one provided grade A information, 28 provided grade B or C, and most (113) provided only grade D information. Kent et al. concluded that despite the fact that more than 2000 MR imaging scanners had been installed, the evidence supporting the use of MR imaging in clinical practice was weak.

The credibility of the radiology research community was shaken by these criticisms, with some nonradiologists questioning whether conflicts of interest would influence radiologists and organized radiology [17]. Similar methodologic deficiencies have also been reported for radiology economic analyses [3, 20]. Today, more sophisticated and dependable research methods have been applied to MR imaging and assessment of efficacy with this modality for a number of indications. However, most of the research literature on the use of radiology techniques remains descriptive, with little published work on the influence of radiology on patient treatment or outcome [4]. One of the reasons for these deficiencies is the lack of research training of the individual radiology investigators. Unfortunately, training in research methodology has been underemphasized in radiology residency training in the United States [21]. Many radiologists, although highly skilled clinicians, have only a rudimentary background in research methodology and lack many of the basic tools required to perform a critical review of the medical literature. The objective of this discussion is to introduce some major concepts in research design and in critical literature review. More detailed discussion will be included in subsequent modules.

Anatomy of a Research Project

It is useful to review the anatomy of a research project. This standard framework is the foundation of the scientific literature. In brief, a research question is formulated, methods are derived to answer the question, data are collected and analyzed, and conclusions are drawn. Within this framework are several key concepts that are discussed in the following text, including formulation of the research question, use of efficient study design, avoidance of error and bias, and appropriate data analysis.

The Research Question

The first step in any research endeavor is to frame an appropriate research question. This question must be important (or it is not worth our efforts), but it also must be precise [22, 23]. As an example, we can start with a common and vexing clinical problem that has been the cause of considerable interest in the radiology literature, “Which test is better in patients with possible appendicitis, CT or sonography?” This question is certainly important and clinically relevant, but as framed above it cannot be answered. The question must be defined more precisely with respect to the type of patients in whom the question is being raised, the target population, and what is actually being asked. The imaging accuracy and usefulness of sonography and CT will likely vary on the basis of a number of patient-specific variables. Are the patients we are interested in adults or children? Are they thin or fat? Are they cooperative or uncooperative? Are they men or women? Disease-specific factors may also affect the imaging. Has the patient been symptomatic for a few hours and we suspect simple unperforated appendicitis, or has the patient been symptomatic for 4 days and appears septic, leading us to suspect an abscess? These factors also might affect the performance of sonography and CT.

Finally, how we are using the findings of an imaging study might affect the determination of optimal imaging modality. Are we using imaging to confirm appendicitis en route to the operating room, or are we using imaging to look for other abnormalities that might mimic appendicitis, such as ureteral calculi, diverticulitis, or even abdominal aortic aneurysm? A better defined research question might be, “In nonpregnant women younger than 40 years with symptoms suggestive of appendicitis but no peritoneal signs, what is the preferred imaging modality to exclude the presence of an abdominal condition that might require surgical intervention?” This reformulated research question is perhaps less “sexy” than “Which test is better?” but it is also much more useful. The reformulated question is no longer an issue of comparing radiology tests. Instead, we are asking a clinical question about a specific group of patients that can potentially affect the health of those patients [22–25]. Some experienced researchers believe that formulating and framing the research question is the most challenging aspect of doing research [22].

Study Design

Having determined the question to be answered, the next issue is the research methodology itself. To produce evidence that will appropriately drive decision making, experimental design is of critical importance and will be the focus of much of this article series. The goal of study design is to achieve the most with the least (i.e., to achieve efficiency). Fortunately, we have the experience of clinical epidemiologists and biostatisticians with decades of experience from which to draw to determine the most efficient way of designing studies and the most appropriate way to productively critique research. Prospective comparisons of diagnostic test results with a well-defined reference test and randomized double-blinded clinical trials are the study designs that provide the best information to guide clinical practice [2, 26]. However, other study designs, including cohort and case-control investigations and modeling studies can also provide useful information [4, 26]. These study designs will be discussed in detail in future modules.

Error

The research design is intended to arrive at the truth for the question under study. One of the major driving factors of research design is the effort to avoid or control error. Error can be divided into two general categories: random error, and systematic error, also known as bias. Random error, as the name implies, is due to chance events that have the potential to lead to false conclusions. The field of statistics has evolved in large part to deal with the random and therefore unpredictable error that can occur in any study design. Statistics is a methodology for drawing inference about populations from data collected on samples [27]. In medicine, we generally accept events as being true (not related to random chance) if the probability of their random occurrence is less than 5%, expressed as the common statistical p value of 0.05. Of course, unlikely events do occur. Type I (also known as alpha) error occurs when we conclude that a difference exists when in fact two groups are the same. At a significance threshold of p less than 0.05, we will make such type I errors in 5% of comparisons. However, if a study involves multiple comparisons (i.e., comparing six different MR imaging pulse sequences), then the probability of a type I error also increases [28].

The opposite of type I error, known as type II error, is when we conclude that two populations are the same when in fact they are not. Unfortunately, the commonly reported p value gives no information about the potential for this type II, or beta, error. There is a common misconception that a p value greater than 0.05 indicates that two groups are the same. However, this is only true if the study sample has sufficient size to have the power to detect a difference if it is present [27]. Sufficient sample size is determined by the size of difference we are interested in detecting, usually the amount of difference that would be clinically significant, and by the desired power of the study [27, 29]. Power is the chance the study will reveal the clinically significant difference when it exists and equals one minus the type II error probability. As an example, a study might report 90% power to detect a difference of 5%.

Bias

The opposite of random error is systematic error that is introduced through inadequacy in the study design, subject selection, or analysis. Statistics are for the most part unable to compensate for systematic error. Avoidance of such systematic error, or bias, is one of the major challenges of research design. Unfortunately, many of the apparently simple research designs that are common in the radiology literature succumb to bias. As an example, one could imagine a study designed to compare CT and MR imaging for detection of liver metastases in patients with known adenocarcinoma of another organ. To identify patients for such a study, one might review all the patients who underwent both tests, and using some external gold standard, make a comparison. However, would this study design be free of bias? Likely, there would be significant bias in the selection of the subjects. For example, if at a given center CT is generally used as the initial imaging modality for the evaluation of possible liver metastases, then the patients who undergo both imaging studies would be the ones in whom the initial CT was equivocal. The comparison would not be CT versus MR imaging, but rather, CT versus MR imaging in patients in whom the CT was equivocal. Of course, the results of such a study would underestimate the accuracy of CT, because only those cases that are difficult to diagnose with CT were included. This is a simple but unfortunately common example of selection bias in recruiting patients for a study. Selection bias occurs when the subjects studied are not representative of the target population. In the previous example, the target population is all patients with known adenocarcinoma of another organ. However, the study group is only those patients with known adenocarcinoma who underwent both CT and MR imaging. To avoid this bias, subject selection should be based on clinical criteria (i.e., all subjects with a new diagnosis of adenocarcinoma) rather than availability of imaging studies [14, 22].

When using a test to screen a population, selection bias can be more subtle but equally problematic. Intuitively, one would expect that if a cohort of subjects is randomly selected to undergo a radiologic screening test, we could compare the subjects who actually undergo screening with those who elect not to undergo screening and make reasonable conclusions. However, convincing evidence from previous screening studies indicates that differences exist between subjects who elect screening and those who refuse. Subjects who elect to undergo screening may be more health conscious, or more optimistic, or there may be some other factor that is not understood [4, 30]. Thus, in a research study designed to investigate patient outcome for a new screening study, comparison of those who undergo screening with those who elect against screening could show improved outcomes in the screened group even if the test has no benefit, or is even harmful. Therefore, to investigate the effectiveness of a screening study, it is essential to compare patients who are randomized to be invited for screening to those who are randomized not to be invited. In the analysis, all subjects are included, regardless of whether they actually undergo the screening study. This is known as an intention-to-treat analysis and avoids the subtle bias I have described [4].

Other bias can develop from the way in which data are collected. All humans have preconceived notions, both conscious and unconscious. These preconceptions alter the way in which we observe our surroundings and can unintentionally affect data that we collect, which is referred to as review bias. To remove any review bias, it is necessary to ensure that the individual who collects the data is unaware of the outcome under study. For example, the individual who determines if a test is positive should not know whether the subject truly has the disease in question. Also, when comparing two tests, the results of the first test should not be known before interpretation of the second. A recent analysis of research on diagnostic tests performed by Reid et al. [1], which included some radiology studies, reported that 62% of research studies did not document that appropriate steps had been taken to avoid such review bias.

Similarly, if different gold standards are used for patients with disease than for those without, then results of accuracy studies may be overestimated. Lijmer et al. [31] found that the reported accuracy of diagnostic studies was significantly greater if different verification standards were applied to patients with and without disease than if the same gold standard was applied to all. The term “verification bias” has been applied to this problem [31, 32].

Additional potential biases in diagnostic test evaluation include spectrum bias, in which only patients with overt disease are used in assessment of a diagnostic test. Not including subtle or indeterminate cases can also lead to overestimation of diagnostic accuracy [31, 32]. Prospective data collection is generally less subject to bias than retrospective collection and is therefore preferred when designing a study. However, retrospective data collection may be preferred in a few circumstances, such as when prospective data collection would remove the ability to blind the observers and would therefore potentially introduce greater bias.

The effect of these various biases has been documented. In general, studies with bias tend to report more encouraging results than those without bias [31]. In addition, preliminary studies of a diagnostic technology, performed with small sample size and vulnerable to bias, often will be highly optimistic about the capabilities of that technology. Subsequent reports may present a more realistic appraisal [32].

Data Analysis

Research is conducted on samples. We measure outcome or accuracy on a relatively small number of subjects. Yet the intent of research is (eventually) to influence clinical care. To achieve this, the research results must be valid on subjects other than those included in the study. Statistics is the science that allows us to make inferences about populations from measurements made on samples. A vast array of tools is available to the biostatistician to enable such inference. These tools must be familiar to the research radiologist and will be discussed in future modules. In this discussion I will limit myself to introduction of the concepts of validity and reliability.

Validity can be divided into internal validity and external validity, which is also known as generalizability. Internal validity refers to the extent to which the results and conclusions of a study actually relate to true events in the sample under study. Some of the biases and study design considerations described previously relate to validity. For example, an observer who is aware of the results of the reference test might unintentionally overestimate the accuracy of the diagnostic test under study. Thus, the recorded results might not be an internally valid representation of the actual sample. The method of data analysis and the statistical tests used are also critical to the internal validity of the study, because use of inappropriate analysis can lead to false conclusions.

Similarly, the external validity of a study is dependent on both the research design and the analytic methods. The extent to which the sample selected truly reflects the target population is a strong determinant of the generalizability of a study [22]. Also, the use of appropriate statistics allows determination of what inferences can be drawn about the target population on the basis of the sample data.

A final consideration is study reliability. Reliability refers to the extent to which the study is reproducible [1, 24]. The opposite of reliability is variability. Interpretation of some diagnostic tests can be quite subjective. If different observers cannot agree on the test result on the same subject, then interobserver variability is high. Similarly, if the same observer determines the results of the same test to be different at different times, then intraobserver variability is high. If a test has low reliability, then the test cannot achieve high accuracy in general practice [1].

Conclusion

Performing methodologically rigorous scientific research is not a trivial task. The optimal research study will be directed at an important, precisely defined clinical question, with a specified target population matched by the subject selection. The most efficient study design will be used and the sample size will be sufficient to limit type II error to an acceptable level. Further, bias will be avoided, and the results will be reliable, internally valid, and generalizable to the target population and possibly beyond. Success at such demanding research endeavors is certainly within the reach of radiologists and radiology researchers. However, training—the goal of this series of articles—is necessary.

In this article, I have attempted to introduce the problem—the need for improved research methodology in radiology research. I have also begun to outline the solution through briefly introducing the concept of evidence-based radiology and discussing the basics of research methodology: posing the research question, study design, error, bias, and data analysis. I am certain that this discussion has been too basic for some and too sophisticated for others. However, in the modules that follow, increasing depth, clarity, and detail will be added to the rough outline that has been described in this article. By the conclusion of this project, the radiology investigator will have a comprehensive resource to aid the transition from relative novice to skilled researcher.

References

1. Reid MC, Lachs MS, Feinstein AR. Use of methodological standards in diagnostic test research: getting better but still not good. JAMA 1995;274:645–651
2. Kent DL, Haynor DR, Longstreth WT Jr, Larson EB. The clinical efficacy of magnetic resonance imaging in neuroimaging. Ann Intern Med 1994;120:856–871
3. Blackmore CC, Magid DJ. Methodologic evaluation of the radiology cost-effectiveness literature. Radiology 1997;203:87–91
4. Blackmore CC, Black WB, Jarvik JG, Langlotz CP. A critical synopsis of the diagnostic and screening radiology outcomes literature. Acad Radiol 1999;6[suppl 1]:S8–S18
5. Hillman BJ. Outcomes research and cost-effectiveness analysis for diagnostic imaging. Radiology 1994;193:307–310
6. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991;11:88–94
7. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. (Eugene W. Caldwell lecture) AJR 1994;162:1–8
8. Beam CA, Blackmore CC, Karlik S, Reinhold C. Fundamentals of clinical research for radiologists: editors’ introduction to the series. AJR 2001;176:323–325
9. Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. New York: Churchill Livingstone, 1997:2–3
10. Evidence-Based Medicine Working Group. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA 1992;268:2420–2425
11. Wood BP. What's the evidence? Radiology 1999;213:635–637
12. Eisenberg JM. Ten lessons for evidence-based technology assessment. JAMA 1999;282:1865–1869
13. Index to imaging literature. Radiology 1999;210[suppl]:iv–v
14. Jaeschke R, Guyatt G, Sackett DL. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. A. Are the results of the study valid? JAMA 1994;271:389–391
15. Kent DL, Larson EB. Disease, level of impact, and quality of research methods: three dimensions of clinical efficacy assessment applied to magnetic resonance imaging. Invest Radiol 1992;27:245–254
16. Cooper LS, Chalmers TC, McCally M, Berrier J, Sacks HS. The poor quality of early evaluations of magnetic resonance imaging. JAMA 1988;259:3277–3280
17. Kent DL, Larson EB. Diagnostic technology assessments: problems and prospects. Ann Intern Med 1988;108:759–761
18. Beam CA, Sostman HD, Zheng J. Status of clinical MR evaluations 1985–1988: baseline and design for future assessments. Radiology 1991;180:265–270
19. Kent DL, Larson EB. Magnetic resonance imaging of the brain and spine: is clinical efficacy established after the first decade? Ann Intern Med 1988;108:402–424 [Erratum in Ann Intern Med 1988;109:438]
20. Blackmore CC, Smith WJ. Economic analyses of radiological procedures: a methodological evaluation of the medical literature. Eur J Radiol 1998;27:123–130
21. Hillman BJ, Putman CE. Fostering research by radiologists: recommendations of the 1991 summit meeting. Radiology 1992;182:315–318
22. Eng J, Siegelman SS. Improving radiology research methods: what is being asked and who is being studied? Radiology 1997;205:651–655
23. Hulley SB, Cummings SR. Designing clinical research. Baltimore: Williams & Wilkins, 1988:12–18
24. Jaeschke R, Guyatt GH, Sackett DL. Users’ guides to the medical literature. III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? JAMA 1994;271:703–707
25. Black WC. How to evaluate the radiology literature. AJR 1990;154:17–22
26. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Boston: Little, Brown, 1991:51–68
27. Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991
28. Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York: Wiley, 1981:121
29. Obuchowski NA. Testing for equivalence of diagnostic tests. AJR 1997;168:13–17
30. Black WC, Welch HG. Screening for disease. AJR 1997;168:3–11
31. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061–1066
32. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–930

APPENDIX: Quality of Research Methods

Grade A: Studies with broad generalizability
• No significant flaws
• Prospective comparison of a diagnostic test with a well-defined diagnosis
• Large randomized, blinded clinical trial assessing therapeutic efficacy or patient outcome

Grade B: Studies with narrower spectrum of generalizability
• Few well-described flaws with definable impact on the results
• Prospective study of diagnostic tests
• Randomized trial of therapeutic effects and patient outcomes

Grade C: Studies with limited generalizability
• Multiple flaws in research methods, small sample size, incomplete reporting
• Retrospective studies of diagnostic accuracy

Grade D: Studies with multiple flaws in research methods
• Obvious selection bias
• Opinions without substantiating data

(Modified from Kent et al. [2, 15])
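The sample-size reasoning in the Error section can be made concrete with a short calculation. The sketch below uses the standard normal-approximation formula for comparing two proportions; the specific accuracies (85% vs 90%, echoing the article's example of 90% power to detect a difference of 5%) are illustrative assumptions, not data from the article.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Subjects needed per group to detect a difference between proportions
    p1 and p2 with two-sided type I error alpha and the stated power
    (power = 1 - type II error), by the usual normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 1.28 for 90% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical example: detecting a 5% difference in accuracy (85% vs 90%)
# with 90% power requires on the order of 900 subjects per group.
n = sample_size_two_proportions(0.85, 0.90)
```

Note how quickly the required sample size grows as the clinically significant difference shrinks, and how lowering the desired power shrinks it; this is why the small preliminary studies criticized above are so prone to type II error.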

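The interobserver variability described in the Data Analysis section is commonly quantified with Cohen's kappa, which corrects raw percentage agreement for the agreement expected by chance alone. The statistic is not named in the article; this sketch, using made-up readings from two hypothetical observers, shows why 80% raw agreement can correspond to much more modest chance-corrected agreement.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two observers rating the same
    subjects (e.g., positive/negative reads of the same imaging studies)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the two observers' marginal frequencies,
    # summed over the rating categories.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Two hypothetical observers each call 10 studies positive (1) or negative (0).
reader1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
reader2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
# Raw agreement is 8/10, but chance agreement is 0.5, so kappa is only 0.6.
kappa = cohens_kappa(reader1, reader2)
```

A kappa of 1.0 indicates perfect agreement and 0 indicates agreement no better than chance, so a test whose readers achieve only modest kappa cannot achieve high accuracy in general practice, whatever its performance in expert hands.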

Fundamentals of Clinical Research for Radiologists

Jeffrey G. Jarvik 1

The Research Framework

In recent years, the evaluation of diagnostic technologies has become more demanding. It is no longer sufficient to show that a new diagnostic technology can better depict anatomy or function. From the perspective of either a single hospital or society as a whole, the purchase of new technology, such as an upgrade for an MR scanner, competes directly with resources that could be spent on other aspects of health care, such as childhood immunizations. A key question in an environment of scarce resources is always, “What is the most cost-effective expenditure of our dollars?” or put another way, “Where can we get the biggest bang for our buck?” The most comprehensive evaluations try to answer this question.

In 1977, Fineberg [1] described a hierarchal scheme for evaluating diagnostic tests that consisted of four levels of efficacy. Fryback and Thornbury [2] and Thornbury [3] later revised this scheme into a model consisting of six tiers of diagnostic efficacy (Table 1).

In addition to a hierarchy for what to evaluate, there is also a hierarchy for how to evaluate it. The randomized clinical trial is the “gold standard” in the realm of clinical trials, although few have actually been performed for diagnostic tests. This is in part because of the expense and difficulty conducting randomized clinical trials. Although the randomized clinical trial is the best scientific method to combat bias, other strategies exist for evaluating diagnostic tests. These […] introduction to some of the issues involved in diagnostic screening.

Levels of Diagnostic Efficacy

The six-tiered model of Fryback and Thornbury [2] is based on efficacy, which has been defined as the benefit from technology applied under ideal circumstances [4]. This is in distinction to effectiveness, which refers to the use of a technology in everyday, usual circumstances. Efficacy must be shown before effectiveness, because a test that cannot perform well under ideal circumstances has no chance of succeeding under less-than-ideal conditions.

Once the decision has been made to concentrate on efficacy, the next question is on which aspect of efficacy to focus. Guyatt et al. [5] made the observation that “…we must go beyond accuracy and try to determine if our patients are better off as a result of new technologies.” However, the link between patient outcomes and a diagnostic test is frequently tenuous. One may fail to observe a beneficial effect on patient outcome because a test is truly worthless, meaning that it is not accurate. However, there are other possibilities. The information from an accurate test may be used incorrectly by the clinician. Or there may be no effective therapy. Or the patient does not comply with effective therapy. Or the patient may not have adequate access to effective therapy. The six-tiered model disaggregates the overall effect

Received September 21, 2000; accepted after revision October 5, 2000.

Supported in part by grant HS-094990 from the Agency for Healthcare Research and Quality and by a Veterans Administration ERIC (Epidemiology Research and Information Center) grant.

Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the second in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

1 Departments of Radiology, Neurosurgery and Health Services, and the Center for Cost and Outcomes Research,
of a diagnostic test in an attempt to discern and
University of Washington, Seattle, WA. Address strategies include case series, case-control account for these various possibilities.
correspondence to J. G. Jarvik, Department of Radiology, studies, cohort studies, and modeling.
University of Washington, Box 357115, 1959 N.E. Pacific St., In this article, I review the hierarchal Technical Efficacy
Seattle, WA 98195.
scheme for assessing the efficacy of diagnos- Technical efficacy refers to the ability to
AJR 2001;176:873–877
tic technologies and the various study de- produce an image and is generally measured
0361–803X/01/1764–873 signs that can be used to evaluate the through the physical characteristics of the
© American Roentgen Ray Society different levels of efficacy. I end with a brief image (e.g., signal-to-noise ratio, resolution).

AJR:176, April 2001 873
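Technical efficacy is reported through physical image metrics such as signal-to-noise ratio. As a minimal illustration, the sketch below uses one common region-of-interest convention (mean intensity of a signal region divided by the standard deviation of a background region); the function name and all pixel values are invented for the example, not taken from the article.

```python
import random
import statistics

# One common (but not the only) ROI-based convention:
# SNR = mean intensity in a signal region / SD of intensities in a background region.
def estimate_snr(signal_pixels, noise_pixels):
    return statistics.mean(signal_pixels) / statistics.stdev(noise_pixels)

# Invented pixel samples standing in for two regions of interest.
random.seed(1)
background = [random.gauss(10, 2) for _ in range(500)]   # air outside the patient
structure = [random.gauss(110, 2) for _ in range(500)]   # bright anatomy

print(round(estimate_snr(structure, background), 1))     # roughly 110 / 2, i.e. about 55
```

Higher values indicate that the structure of interest stands out more clearly from background noise; resolution and reliability would be assessed with analogous physical measurements.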


TABLE 1: Six-Tiered Model of Diagnostic Efficacy

  Stage of Efficacy      Definition
  Technical capacity     Resolution, sharpness, reliability
  Diagnostic accuracy    Sensitivity, specificity, predictive values, ROC curves
  Diagnostic impact      Ability of a diagnostic test to affect the diagnostic workup
  Therapeutic impact     Ability of a diagnostic test to affect therapeutic choices
  Patient outcomes       Ability of a diagnostic test to increase the length or quality of life
  Societal outcomes      Cost-effectiveness and cost-utility

Note.—Data adapted from [3]. ROC = receiver operating characteristic.

This phase of investigation should be exploratory to determine the possible uses for a diagnostic test. One should explore a wide range of conditions and patients. At this stage, blinded interpretations should be avoided to allow the discovery of unexpected correlations and to refine interpretations. The danger of being too stringent at this stage of evaluation is that the development of promising technologies might actually be delayed if a rigorous but inappropriately early evaluation is negative. This phase can also be thought of as the laboratory phase of investigation, at which time technical parameters are optimized for clinical use.

Diagnostic Accuracy Efficacy

To be useful, not only must an image be produced, it also must be interpreted. The ability to differentiate normal from abnormal in the interpretation of a test is diagnostic accuracy. Diagnostic tests are ideally compared with a gold standard to determine accuracy. The two-by-two table is the standard way to display the comparison of a new diagnostic test—usually called the index test—with that of a gold standard test, called the reference test (Table 2). The results of the reference test determine the presence or absence of disease. The parameters of sensitivity, specificity, positive predictive value, and negative predictive value can all be derived from a two-by-two table. The cells of the two-by-two table define four possible test results: true-positives, false-positives, false-negatives, and true-negatives. A case is a true-positive (TP) result when the diagnostic test is positive and the subject has the disease. Similarly, a true-negative (TN) result is when the diagnostic test is negative and the subject does not have the disease. False-positive (FP) results occur when a patient without the disease has positive test findings, and false-negative (FN) results occur when a patient with the disease has negative test findings.

TABLE 2: Typical Two-by-Two Table Comparing a New Test (Index Test) with a Reference Test

                       Reference Test
  Index Test      Positive               Negative               Row Total
  Positive        A (True-positive)      B (False-positive)     A + B
  Negative        C (False-negative)     D (True-negative)      C + D
  Column total    A + C                  B + D                  A + B + C + D

The sensitivity of a diagnostic test is defined as the number of true-positive cases divided by all cases with the disease (TP / [TP + FN]) (Fig. 1). Specificity is the number of true-negative cases divided by all cases without the disease (TN / [TN + FP]) (Fig. 2). Sensitivity and specificity are related to the columns of the two-by-two table and are stable characteristics of a diagnostic test. This means that they do not change with varying disease prevalence. Positive predictive value refers to the number of patients with the disease with a positive test divided by all those with a positive test (TP / [TP + FP]) (Fig. 3). Negative predictive value is the number of patients without the disease with a negative test divided by all those with negative findings (TN / [TN + FN]) (Fig. 4).

Fig. 1.—Diagram shows that test sensitivity focuses on first column of two-by-two table. Sensitivity equals A / (A + C), or number of patients with true-positive (TP) findings divided by all patients with positive reference test findings. + = positive test result, – = negative test result, FP = false-positive, FN = false-negative, TN = true-negative.

Fig. 2.—Diagram shows how test specificity focuses on second column of two-by-two table. Test specificity equals D / (B + D), or number of patients with true-negative (TN) findings divided by all patients with negative findings on reference test. + = positive test result, – = negative test result, TP = true-positive, FP = false-positive, FN = false-negative.

Predictive values are in one respect more clinically relevant than sensitivity and specificity because they answer the question, "If a test is positive or negative, what is the likelihood of a patient having the disease?" In contrast, sensitivity and specificity address the question, "Given that the patient does or doesn't have the disease, what is the probability that the test will be positive or negative?" One important characteristic of predictive values is that, unlike sensitivity and specificity, they vary with disease prevalence. Tables 3 and 4 illustrate this point. Table 3 is a two-by-two table for a diagnostic test with 90% sensitivity and specificity that is applied to a population with a high (50%) prevalence of disease. In this setting, the predictive values are also quite high (90%). However, take the same diagnostic test and apply it to a population with a much lower disease prevalence (1%), and the positive predictive value decreases precipitously.

Fig. 3.—Predictive values are calculated from table rows rather than table columns. Positive predictive value equals A / (A + B), or number of true-positive (TP) findings divided by number of all patients with positive findings on index test. + = positive test result, – = negative test result, FP = false-positive, FN = false-negative, TN = true-negative.

Fig. 4.—Negative predictive value is calculated from second row of table and equals D / (C + D), or number of true-negative (TN) findings divided by number of all patients with negative index test results. + = positive test result, – = negative test result, TP = true-positive, FP = false-positive, FN = false-negative.

TABLE 3: Disease Prevalence 50%

                     Reference Test
  Index Test      Positive    Negative    Row Total
  Positive        90          10          100
  Negative        10          90          100
  Column total    100         100         200

Note.—Diagnostic test with 90% sensitivity, specificity, and positive and negative predictive values. Prevalence of disease is a relatively high 50%.

TABLE 4: Disease Prevalence 1%

                     Reference Test
  Index Test      Positive    Negative    Row Total
  Positive        9           99          108
  Negative        1           891         892
  Column total    10          990         1000

Note.—Decreasing the disease prevalence to 1% leaves the sensitivity and specificity at 90%; however, the positive predictive value has decreased to 8% and the negative predictive value has increased to 99.9%.

The two-by-two table assumes that a test result is dichotomous (either positive or negative). However, there are frequently many cut points to define a positive or negative test. This situation can be summarized using a receiver operating characteristic (ROC) curve. The ROC curve is a plot of sensitivity versus 1 – specificity for a family of cut points that define positive and negative for a test. For example, a degenerated disk loses signal on T2-weighted MR images. One can create a scale of 1–5 to describe this signal loss, with 1 being no signal loss and 5 being complete signal loss. Now assume that we have a direct line to a divine, omniscient being who tells us gold standard truth as to whether a disk is desiccated. We could then construct an ROC curve using each level of signal abnormality as a cutoff for normal versus abnormal. In the first instance, 1 represents normal and 2–5 represent abnormal. The second cutoff would be 1 or 2 are normal and 3–5 are abnormal, and so forth. An advantage of ROC curves is that diagnostic accuracy can be quantified for the complete range of cut points by calculating the area under the curve (Az). A perfect diagnostic test would have an Az of 1. A diagnostic test that conveyed no useful information would have an Az of 0.5. Such quantification facilitates the comparison of diagnostic tests.

Diagnostic Impact Efficacy

A diagnostic test can be quite accurate and yet still not provide clinically useful information. Measures of diagnostic impact attempt to quantify the importance of a diagnostic test to diagnostic thinking. This is usually assessed using questionnaires that clinicians complete before and after receiving the results of the diagnostic test. Clinicians can be asked to rank diagnostic possibilities or even to assign probabilities to given diagnoses. If the probabilities converge on a given diagnosis, or important diagnoses are excluded, then the test has diagnostic merit. Diagnostic entropy is a concept that stems from the work of Shannon and Weaver [6] in the 1940s, based on engineering information theory. The probability for a given diagnosis is compared with the spread of probabilities over all diagnoses. Diagnostic entropy increases as the probabilities become more evenly spread across the diagnoses. Entropy decreases as probabilities concentrate around a single or a few possibilities. The problem with assessing diagnostic entropy, as well as other schemes to quantify diagnostic impact, is that it requires clinicians to make reliable and valid estimates of disease probabilities, something in which few physicians have training.

Therapeutic Impact Efficacy

Just as diagnostic impact assesses the ability of a diagnostic test to affect a diagnosis, therapeutic impact assesses the degree to which a diagnostic test influences subsequent therapeutic choices. This is also generally measured with questionnaires to physicians; but with appropriate study design, subsequent therapies can be measured, and differences in therapies can be attributed to diagnostic tests.

Fineberg [1] examined the impact that CT of the head had on diagnostic and therapeutic plans. All physicians requesting a head CT were asked to list the probabilities of the diagnoses being considered. They were also asked, if no CT were available, what diagnostic tests they would definitely and probably require and what their treatment plan would be. Medical records were then reviewed at discharge to determine which diagnostic tests were actually performed and what therapies were instituted. Fineberg found that between 41% and 73% fewer diagnostic tests were performed than were projected by the physician before CT. The therapeutic plan changed in 19% of patients. This study was one of the first published examples measuring the diagnostic and therapeutic impact of a radiologic intervention, and it helped to define the paradigm later adopted by Fryback and Thornbury [2].

Patient Outcome Efficacy

Measures of patient outcome have traditionally been limited to mortality and morbidity. However, in recent years researchers have focused more attention on health-related quality of life, which refers to the patient's appraisal of and satisfaction with his current level of functioning as compared with what the patient perceives to be possible or ideal [7]. A physician's estimate of the success or failure of an intervention is no longer sufficient. The patient's perspective as well has become important in determining efficacy. This is seen in the study by Dixon et al. [8], in which the researchers compared quality-adjusted life years (QALYs), as well as diagnostic and therapeutic impact, before and after brain and spine MR imaging. A QALY indicates a patient's willingness to trade off length of life for quality of life. There are a variety of methods to quality-adjust life years, including the standard gamble, time trade-off, and rating scales [9]. These methods will be described in detail in future articles. Dixon et al. [8] used a questionnaire (the QALY toolkit [10]) to estimate the adjusted quality of life for different health states. The key point is that quality adjustment is from the patient's and not the physician's perspective. Although Dixon et al. found important effects on the clinicians' diagnostic confidence and therapeutic plans, there was no change in the patients' quality of life.

Societal Efficacy

In the era of constrained resources, those who pay for health care demand value. This implies that a new technology not only must improve patient outcomes, but also must maximize the health that can be bought for a dollar. Cost-effectiveness analyses are now commonly incorporated into the evaluation of new technologies and in all likelihood will remain an important aspect of technology assessment. An excellent example of this sort of study was described by Colice et al. [11]. The researchers used decision analytic modeling to compare the cost-effectiveness of screening asymptomatic patients with lung cancer for brain metastases using head CT versus scanning patients only when they became symptomatic. They determined that the cost per QALY ($70,000) with the screening strategy would be substantially higher than that of many accepted medical interventions, and thus not justified given the assumptions used in their model.

Methods of Assessing Diagnostic Technologies

Randomized trials focusing on patient outcomes are the only way to investigate these issues with the absolute assurance that bias is being avoided, and such trials should be conducted when the stakes are high enough. However, other research tools are available that can be quite powerful in their own right, and because they are easier and cheaper, they should be the study design of choice for certain situations. In addition to the randomized controlled trial, we will consider three other study designs: the case-control study, the cross-sectional study, and the cohort study.

In choosing a study design, the first decision for researchers is whether they have a question that should be answered with a descriptive or an analytic study. Descriptive studies, which can also be regarded as hypothesis generating, include case reports, case series, and cross-sectional studies. They usually describe the epidemiologic characteristics of diseases, or in the case of radiology, how imaging findings relate to patient characteristics. Measuring all variables at a single time is the distinguishing characteristic of cross-sectional studies. The classic study by Jensen et al. [12] of MR imaging findings in patients without lower back pain is an example of a cross-sectional study. The researchers identified 98 subjects, performed MR imaging on them, and determined the lack of lower back pain at one time point. In fact, most imaging investigations are cross-sectional in nature. Although cross-sectional studies are relatively easy to perform, a disadvantage is that it is frequently impossible to determine if the exposure preceded the disease or the disease preceded the exposure. For example, it has been observed that individuals with spinal stenosis are more likely to have lower activity levels, but it is impossible to determine from cross-sectional data if it is the stenosis that leads to less activity or less activity that leads to spinal stenosis.

Unlike descriptive studies, analytic studies allow hypothesis testing to determine the association between an exposure (risk factor) and an outcome (disease). Analytic studies can be divided into observational and experimental. Observational studies can be further divided into case-control and cohort studies. Patients in case-control studies are selected on the basis of whether they have the disease (or outcome) of interest. The proportion of cases with the exposure of interest is then compared with the controls. For example, if sciatica is the disease of interest and nerve root compression is the exposure, a case-control study would identify patients with sciatica and then a matched group of patients without sciatica.

In contrast, a cohort study chooses subjects on the basis of the exposure (or risk factor) and then examines the proportion of subjects in each exposure group with and without the outcome of interest. These studies are usually done prospectively, with the exposure identified and the subjects then followed up over time for the development of an outcome. However, cohort studies can also be retrospective. Risk factors can be identified in the past and then the cohort assembled on the basis of these past data. One can then look at the subjects' current disease status to determine if a relevant outcome has occurred. An example of a prospective cohort study in radiology is the study by Nevitt et al. [13], who assembled a cohort of subjects with and without new osteoporotic vertebral compression fractures (the risk factor) and looked at the proportion of patients in each group who developed subsequent back pain and functional limitation (the outcomes). They found that new vertebral fractures were strongly associated with increased pain and limitations in functional status.

Case-control studies are particularly useful for examining rare outcomes, because subjects are selected on the basis of their having the outcome of interest. Conversely, cohort studies are useful for rare risk factors, because subjects are chosen on the basis of their having a particular exposure.

Experimental or intervention studies are also prospective cohort studies, because participants are enrolled on the basis of risk factors. However, experimental studies differ from observational studies in that the exposure status is assigned by the investigator. We at the University of Washington are currently conducting a randomized trial comparing rapid MR imaging with radiography as the initial imaging technique in patients with lower back pain. The exposure we are studying is the imaging study, to which patients are randomly assigned. We are measuring a variety of outcomes, but a back-pain-specific functional status measure, the modified Roland scale [14], is our primary outcome of interest. We will monitor patients for 1 year and determine if one exposure group has significantly different outcomes from the other.

Although observational studies can control for known risk factors, both at the design and analysis stages, a researcher can never be confident that all important risk factors that influence outcome have been identified. The unique strength of a randomized trial is that, on average, all factors, known and unknown, are controlled. Deyo [15] provided the interesting example of comparing two batches of fruit and matching them on characteristics that would seem important, such as shape, source, edibility, size, and weight (Table 5). It might appear to some that the two groups were well matched, but ultimately you're still comparing apples with oranges.

TABLE 5: Why Not Find "Matching" Controls?

  Characteristic    Apples     Oranges
  Shape             Round      Round
  Source            Tree       Tree
  Edible            Yes        Yes
  Size              Handled    Handled
  Weight            .23 kg     .23 kg

Note.—Adapted from [15].

Randomized trials are the most powerful study design for excluding bias, but because they are generally difficult to conduct and are quite expensive, it is neither practical nor desirable to do randomized trials for every diagnostic imaging question. An alternative study design that is potentially widely applicable is modeling. Modeling refers to the use of decision analytic techniques to model clinical situations. Frequently used for cost-effectiveness analysis, decision modeling usually refers to constructing a decision tree that incorporates, in a quantifiable manner, various aspects of clinical practice. The advantage of decision analysis is that it deals systematically with complex situations, although failure to account for all aspects of a complex situation is a potential weakness.

The first step in constructing a decision model is to identify the clinical starting point, which identifies the group of patients for whom the analysis is conducted. Second, the diagnostic and therapeutic choices that can be applied to that population are defined. Third, probabilities are assigned to the information derived from diagnostic tests and intermediate clinical states resulting from treatments. Fourth, patient outcomes are defined that form the end points for the analysis.

Screening refers to examining people who do not have signs or symptoms for the presence of disease. Black and Welch [16, 17] have highlighted three problems with screening: lead-time bias, length bias, and pseudodisease.

Lead time refers to the interval between detection of clinically occult disease by screening and the point when the disease would have manifested clinically. This lead time causes an apparent increase in survival, known as lead-time bias, in all screening programs. This increase in survival would be equal to the lead time if testing were continuous, but is one-half the lead time for single episodes of screening [18]. Adjusting for lead-time bias usually is not possible, because lead times for new tests are not known, and there is no guarantee that disease detected by screening progresses at the same rate as disease that appears clinically.

Disease that progresses more slowly will be more likely to be identified by a screening test than rapidly progressive disease simply because slower-growing cases are in the detectable preclinical stage for a longer time. Thus, screening preferentially detects disease with slower progression compared with disease that manifests clinically. Not surprisingly, this bias, termed length bias, may result in an apparent improvement in survival, when in fact the screening program has only increased the identification of slowly progressive cases relative to the clinically more important rapidly progressive ones.

Perhaps the ultimate example of length bias is when a screening test detects "disease" that would never manifest itself clinically. Some subjects may have disease that progresses so slowly that the individual would have died from other causes before the disease became clinically apparent. This effect is termed pseudodisease, and it causes an apparent improvement in survival attributable to screening.

I have reviewed a variety of research methods that can be applied to evaluating diagnostic tests. Each has relative advantages and disadvantages that must be weighed before deciding which to use. In addition, a test can be evaluated at several possible levels ranging from diagnostic accuracy to cost-effectiveness. Without a doubt, demand will be increasing for data that can show that a new technology improves patient outcomes. As Guyatt [19] has written:

We must go beyond accuracy and try to determine if our patients are better off as a result of new technologies. Randomized trials focusing on patient outcomes are the only way to investigate these issues convincingly and definitively and should be conducted when the stakes are high enough.

Acknowledgments

I thank Peter Jucovy and Craig Blackmore for their insightful comments.

References

1. Fineberg H. Computerized cranial tomography: effect on diagnostic and therapeutic plans. JAMA 1977;238:224–227
2. Fryback D, Thornbury J. The efficacy of diagnostic imaging. Med Decis Making 1991;11:88–94
3. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. (Eugene W. Caldwell lecture) AJR 1994;162:1–8
4. Brook RH, Lohr KN. Efficacy, effectiveness, variations, and quality: boundary-crossing research. Med Care 1985;23:710–722
5. Guyatt GH, Tugwell PX, Feeny DH, Haynes RB, Drummond M. A framework for clinical evaluation of diagnostic technologies. CMAJ 1986;134:587–594
6. Shannon CE, Weaver W. The mathematical theory of communication. Chicago: University of Illinois Press, 1949
7. Cella DF, Tulsky DS. Measuring quality of life today: methodological aspects. Oncology 1990;4:29–38
8. Dixon AK, Southern JP, Teale A, et al. Magnetic resonance imaging of the head and spine: effective for the clinician or the patient? BMJ 1991;302:79–82
9. Drummond MF, Stoddart GL, Torrance GW. Methods for the economic evaluation of health care programmes. Oxford, England: Oxford Medical, 1987:112–148
10. Gudex C, Kind P. The QALY toolkit. York, England: University of York, 1988
11. Colice GL, Birkmeyer JD, Black WC, Littenberg B, Silvestri G. Cost-effectiveness of head CT in patients with lung cancer without clinical evidence of metastases. Chest 1995;108:1264–1271
12. Jensen MC, Brant-Zawadzki MN, Obuchowski N, Modic MT, Malkasian D, Ross JS. Magnetic resonance imaging of the lumbar spine in people without back pain. N Engl J Med 1994;331:69–73
13. Nevitt MC, Ettinger B, Black DM, et al. The association of radiographically detected vertebral fractures with back pain and function: a prospective study. Ann Intern Med 1998;128:793–800
14. Roland M, Morris R. A study of the natural history of back pain. 1. Development of a reliable and sensitive measure of disability in low back pain. Spine 1983;8:141–144
15. Deyo RA. Practice variations, treatment fads, rising disability: do we need a new clinical research paradigm? Spine 1993;18:2153–2162
16. Black WC, Welch HG. Advances in diagnostic imaging and overestimations of disease prevalence and the benefits of therapy. N Engl J Med 1993;328:1237–1243
17. Black WC, Welch HG. Screening for disease. AJR 1997;168:3–11
18. Black WC, Ling A. Is earlier diagnosis really better? The misleading effects of lead time and length biases. AJR 1990;155:625–630
19. Guyatt GH. Critical evaluation of radiologic technologies. (editorial) Can Assoc Radiol J 1992;43:6–7
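The dependence of predictive values on prevalence shown in Tables 3 and 4 is easy to verify numerically. The sketch below (the function name is ours) rebuilds the two-by-two cell proportions from sensitivity, specificity, and prevalence, using only figures taken from those tables:

```python
# Recompute the predictive values of Tables 3 and 4 from sensitivity,
# specificity, and prevalence.
def predictive_values(sensitivity, specificity, prevalence):
    tp = sensitivity * prevalence                 # true-positive fraction
    fp = (1 - specificity) * (1 - prevalence)     # false-positive fraction
    fn = (1 - sensitivity) * prevalence           # false-negative fraction
    tn = specificity * (1 - prevalence)           # true-negative fraction
    ppv = tp / (tp + fp)                          # positive predictive value
    npv = tn / (tn + fn)                          # negative predictive value
    return ppv, npv

for prev in (0.50, 0.01):                         # Table 3 vs. Table 4
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With 90% sensitivity and specificity, the computed positive predictive value falls from 90% at 50% prevalence to about 8% (9/108) at 1% prevalence, matching the tables.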


Fundamentals of Clinical Research for Radiologists

How to Develop and Critique a Research Protocol

Stephen J. Karlik

Received October 20, 2000; accepted after revision December 4, 2000.

Series editors: Craig A. Beam, C. Craig Blackmore, Stephen J. Karlik, and Caroline Reinhold.

This is the third in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

Department of Diagnostic Radiology, London Health Sciences Center-University Campus, Rm. 2MR21, 339 Windermere Rd., London, Ontario, N6A 5A5 Canada. Address correspondence to S. J. Karlik.

AJR 2001;176:1375–1380

0361–803X/01/1766–1375

© American Roentgen Ray Society

Imagine that I am working in the sonography suite. It is 2:35 A.M., and I have just spent a fruitless 45 min assessing the perfusion of a recently transplanted liver. If I do not detect any flow in the portal circulation, the patient must have an angiogram with the risk of hepatorenal toxicity or return to surgery. New contrast material is available, but expensive, and the hospital does not sanction its routine use. What are the criteria I would use to judge the effectiveness of this change in procedure to include contrast agents so that I can justify it to the hospital and for the examination of the patient? The manufacturer of the contrast agent has provided a variety of sales material that shows the apparent excellent ability of the contrast material to show perfusion at low flow rates. A recent refresher course about contrast media had no reference to portal venous assessment. However, at a specialty meeting, one of my residency classmates presented a case report in which she claimed to have had great success. I have heard about "evidence-based" medicine and realize a quick literature search may assist. Unfortunately, relevant citations in MEDLINE are virtually nonexistent.

A hypothetic example perhaps, but consider the outcome of this quandary. I could simply administer the contrast material, but do I know the limitations and actual measurable flow rates attainable with its use? What would be the outcome of negative findings? What patients would be the best subjects for this contrast material? Who would benefit the most from the injection? Is there sufficient scientific backup to identify this usage? Unfortunately, many choices in radiology rest on such slim justifications and unknowns. How many times have radiologists succumbed to a manufacturer's glossy brochure or an impressive pilot study presented at a meeting by a colleague with the promise of the "holy grail" of imaging advances without solid statistically verified scientific support of the advantages of the latest and greatest? When faced with such a quandary, radiologists should consider all the options, evaluate the existing evidence, and possibly investigate the problem themselves. The purpose of this module is to introduce the concepts involved in turning an interesting and valuable question into a reasonable and effective research protocol. I will briefly introduce some essential concepts that will be expanded in detail in later modules in the series. At the end, I will use the preceding clinical scenario to focus my ideas and generate a summary of my research protocol.

Defining the Question

Research is a personal issue. A key feature in defining the question to be addressed is the value of the research to the discipline and practice of radiology. In some respects, the wider the applicability of the new technique, procedure, or algorithm, the greater the importance of the work to the discipline. However, there are certainly individual or location-specific problems that can only be settled by a rigorous scientific examination, no matter how limited the importance to others.

Motivation is a significant additional component of the personal nature of the research. The definition of a research question is based on knowledge, skills, and the perceived issue. An inquiring mind would probably see questions in the special areas of interest, asking "I wonder if there is a better way to do this?" Choosing the topic is a matter of interest, perceived need, and remembering the

AJR:176, June 2001 1375

protocol includes the following: a strong personal interest and motivation; a determination of originality, relevance, and lack of triviality or predictability; wide potential interest; definite clinical importance; and risk factors ad-

tions are based on statistical analyses that are performed casually and unconsciously on the basis of observations of the world. Generalizations are useful in daily life because they have predictive value. The highway home has been
fact that research requires time, effort, and dressed. In the selection of this list, other key jammed at the end of nearly every day. If the
money (TEaM) to succeed. factors beyond importance, novelty, and an- highway is jammed at the end of every day, then
Radiology research questions fall into four swerability have been emphasized [2]. A re- it would be reasonable to predict that it will be
general categories: evaluation of equipment cent editorial in Radiology, written to offer a jammed today and that avoiding it altogether
(e.g., technology assessment, as in the value series of guidelines for manuscript review, ad- would be faster. Predicting future events from
of helical CT), discovery of and evaluation of dressed the elements of both substance and past occurrences, is “statistical thinking,” which
techniques (e.g., platinum embolization coils style [3]. It would be wise to consider the can help make decisions about the future.
or accuracy of an imaging sign), reevaluation strengths and weaknesses of the protocol and What is the best way to answer the question
of old techniques or procedures (e.g., the as- advances in knowledge mentioned in this arti- “What proportion of patients who receive
sessment of ionic and nonionic contrast cle when planning a project. contrast agents will have a serious reaction?”
agents or cost-effectiveness of an evaluative Sometimes the question just does not seem If “best” means most accurate, then logging
pathway), and application of radiologic tech- to warrant publication, yet is still important to every reaction for every procedure for every
niques to investigate changes in treatment the investigator. An example might be the bottle of contrast agent manufactured would
(e.g., the use of diffusion MR imaging in usefulness of a new piece of equipment be the best way. Although this procedure
early stroke treatment). All topics can provide brought to the practice, such as an add-on ste- would be ideal, it is obviously not practical.
significant opportunities to contribute to the reotaxic unit for mammography. Does it im- For most, “best” means as accurate as one can
advancement of the discipline of radiology. prove diagnostic ability compared with the afford to be, and accuracy can be expensive in
How can the question be evaluated and put in previous technique and equipment? This data time and money (TEaM). Therefore, generali-
perspective? How is the “so what” challenge could be valuable to practice management, zations are usually made from incomplete in-
met? The questions can come from many perhaps without a wider range of applicability formation.
places: an interesting patient, a new piece of or publication. However, the same scientific
equipment, a new contrast agent, or a clinical skills required for publication-quality research
collaborator. Once the problem becomes inter- should be used in this investigation. Why Statistics?
esting, radiologists must evaluate its value to the In a more formal sense, the primary objec-
patient population and their discipline. The in- tive of statistics is to infer the characteristics
Scientific Inquiry Loop
vestment in TEaM places the decision directly of a whole, on the basis of the characteristics
on the potential investigators. A thorough re- The formulation of a specific research topic observed in a part. Gaining a complete
view of all existing available literature is essen- involves scientific reasoning. The first goal is to knowledge of the whole is usually impossi-
tial, and “Module 7” will address the issues express the question in a succinct way and to jus- ble for practical, technical, or financial rea-
related to an effective critical review of the liter- tify the query as a worthwhile expenditure of sons. Although statistics may not reveal the
ature. Obviously, existing studies should not be time, effort, and money. It is essential to evaluate absolute truth about the whole, they will al-
repeated if they are well done and give an ade- thoroughly the existing evidence relevant to the low the estimation of the truth. How close an
quate answer to the question. Unfortunately, the question. Then the question must be formulated estimate is to the truth is affected by many
radiology research literature has often not met in a clear and succinct hypothesis. The study factors, and under certain conditions, the
this criterion [1]. should then be designed with sufficient statistical probability that the estimate is in error may
One of the good ways to approach a re- power to be unequivocal. After evaluating the fi- be quantified. Statistics refers to methodolo-
search inquiry is to think from the beginning nal results, choose the null (the statement that gies used to interpret quantitative data with
about publication because peer review is a crit- groups do not differ) or alternate hypothesis (the special calculated values that describe a col-
ical filter for research. Does the project war- belief that the null hypothesis is unlikely). The lection of data and then to assess error in
rant a paper to describe the results? Is the work selection of the correct one would lead directly these values. Statistical methods are useful in
trivial, predictable, or unoriginal? Sometimes to the formulation of a new testable hypothesis. scientific and clinical research because they
the issue could be outdated or irrelevant. Does Then the loop of science continues in these re- include tools that can make accurate general-
the study show true innovation? Similarly, a peated small steps (Appendix). izations and meaningful comparisons be-
study with a narrow interest or directed at a This loop is the foundation for our research tween groups of observations [4]. Statistical
highly specialized target population may be of work. After a discussion of the individual back- methods enable the evaluation of treatment
less interest. All studies must have a clinical ground components below, I will return to the effectiveness and diagnostic test perfor-
importance, whether directly or indirectly, initial quandary about sonographic contrast ma- mance and assist in the development of new
with significant implications for patients. In terial and use the information to structure an ap- drugs or therapies. “Module 6” will examine
the discipline of radiology, it is important to propriate research protocol. these methods in detail.
ask if a new technique or procedure carries ad- The sensitivity (or more properly, the
ditional risk factors that make a study of mar- Making Generalizations power) of statistical methods depends on the
ginal importance a poor choice. A summary of Generalizations are often used in science and amount of data collected. Because statistical
the key considerations for assessing a research in everyday life. Many day-to-day generaliza- conclusions are based on incomplete infor-

1376 AJR:176, June 2001


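The logic of inferring a characteristic of the whole from a part can be made concrete with a small calculation. The sketch below is not from the article: the audit counts are invented, and a simple normal-approximation confidence interval stands in for more careful methods. It estimates a serious-reaction rate from samples of increasing size and shows how the error band around the estimate narrows as more data are collected.

```python
import math

def proportion_ci(events: int, n: int, z: float = 1.96):
    """Point estimate and 95% normal-approximation CI for a proportion."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the estimated proportion
    return p, max(0.0, p - z * se), p + z * se

# Hypothetical audits of serious contrast reactions, increasing in size
for events, n in [(3, 1_000), (30, 10_000), (300, 100_000)]:
    p, lo, hi = proportion_ci(events, n)
    print(f"n={n:>6}: rate {p:.3%}, 95% CI {lo:.3%} to {hi:.3%}")
```

Each run shows the same estimated rate of 0.3%, but the 95% interval tightens by roughly a factor of three for every 10-fold increase in sample size: accuracy is purchased with time, effort, and money.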
Because statistical conclusions are based on incomplete information, studies with small samples can fail to determine that a large observed difference is statistically significant. Similarly, using a large sample size can also make a small difference statistically significant. After doing the statistical analysis, radiologists still must judge their results and those of others in terms of the clinical significance of the investigation. There might be highly important differences between our groups, but the sample size is too small to detect them. An example was the need to use large numbers of cases to compare the incidence of adverse effects in nonionic and ionic contrast agents because the actual incidences were small. In a paper that finds no significant difference, did the study have sufficient numbers to determine if a truly important difference existed? Conversely, studies with large samples can reveal significant results that have no substance. Thus, in a study reporting statistical significance, is the result statistical in origin and possibly not important [4, 5]? This latter scenario refers to the "so what" challenge on a completed protocol, but not on a new one.

Hypothesis Testing

Choosing the Right Hypothesis

A hypothesis is a fundamental basis for generating a successful research project. Generating a testable hypothesis from a question leads directly to a definition of the specific studies needed to prove or disprove the hypothesis [6]. The statistical tests to determine the potential differences between groups also directly follow. To formulate a test, usually some theory has been proposed as the truth, such as that MR imaging is better than CT for diagnosing spinal tumors, or an idea is proposed as true but is unproven, such as claiming that a new barium contrast agent is superior to the old formulation.

Medical science has adopted the scientific method for determining differences between groups by testing statistical hypotheses. Usually, the question of interest is divided into two competing hypotheses, and a study must be designed to provide evidence for choosing between them. These are the null hypothesis (H0) and the alternative hypothesis (H1). Additionally, if the null hypothesis is to be disproved, studies must be designed so that it cannot be rejected unless the evidence is sufficiently strong. For example, the hypothesis that there is no difference in the adverse reactions between nonionic and ionic contrast agents (H0) is opposed to the hypothesis that there is a difference (H1).

Formulating Hypotheses for Testing

To simplify the interpretation of the results of any statistical test, what is being compared and the expected outcome, if possible, must be clearly defined. The rule to follow is to assume that no difference exists between treatments, groups, and procedures. Assume that any difference that does exist between the groups is entirely attributable to chance (sampling error, in particular) [7]. This assumption will be maintained until a statistical test can show that it is unlikely that chance alone can account for the difference. This rationality is analogous to a court of law in which someone is innocent until proven guilty. Because absolute proof is rare in the courts, guilt that is shown beyond a reasonable doubt is good enough. So it is in statistical analysis. Absolute proof that a difference between groups is not due to chance is rare, so thresholds are set beyond which one can no longer reasonably believe that the difference is due to chance alone. Conventionally, the scientific community has used a p value less than 0.05 as sufficiently small to call a result statistically significant.

The statement that the groups do not differ is called the null hypothesis (H0). If the null hypothesis is shown to be sufficiently unlikely, the belief to which one switches is called the alternate hypothesis (H1) [8]. The final outcome of a hypothesis test is to either reject or not reject H0. Statisticians give the null hypothesis priority over the alternative hypothesis as it relates to the statement being tested. Often the null hypothesis is set up as a straw man to be rejected in the study. However, if H0 is not rejected, the data from the experiment do not prove that the null hypothesis is true; the data only suggest that there might not be sufficient evidence against H0 in favor of H1.

A type I error occurs in a hypothesis test when a true null hypothesis is rejected (false-positive). An example would be if a study reported a difference between MR imaging and sonography for the evaluation of carotid stenosis when in fact, there was no difference. A type II error occurs when the null hypothesis is not rejected when it should be (false-negative). A type II error would occur if it were concluded that two MR imaging contrast agents produced the same enhancement when in fact, they produced different effects. A small sample size frequently leads to a type II error. Type I and type II errors are inversely related: that is, a smaller risk of one type is accompanied by a higher risk of the other. The objective is to obtain the lowest chance of a type I error, while minimizing the possibility of a type II error.

The type I error is more serious and, therefore, should be avoided. Thus, when an experiment is proposed, the hypothesis test procedure is adjusted to produce a low probability of incorrectly rejecting H0. The probability of a type I error is the "significance level" (commonly 0.05 or 5%). Therefore, a significance level of 0.05 defines the probability level that we accept to mistakenly reject the null hypothesis. The way statistical science limits a type I error to 5% is to reject the null hypothesis only if a statistic called the p value is less than 5%. The p value measures the likelihood of observing the data, or something further removed, assuming that the null hypothesis is true. The null hypothesis is rejected when the data are a rare event (i.e., when p is small). The smaller the p value, the more it suggests that the null hypothesis is unlikely to be correct and should be rejected. How small is small? Because we consider the significance level as 5%, an event that occurs one in 20 times is rare enough to make us reject the null hypothesis. Examples of rare events are the following: being hit by lightning, one in 2,000,000; winning a state lottery, one in 14,000,000; or being killed in an automobile accident, one in 5000. All these rare events are substantially less frequent than the one in 20 criterion for a rare event in scientific research.

Type II errors occur when the null hypothesis is accepted as true, although it is false. Suppose MR angiography was compared with angiography for detection of carotid stenosis. A type II error would occur if we concluded that the two imaging modalities were the same when in fact, the performance was different. A strategy to minimize the type II error is to have sufficient numbers of studies or patients. Obtaining larger study groups is a two-edged sword because the larger the numbers, the higher the risk of finding differences (or a type I error). The size of the risk of a type II error is β, and the power of the study (the probability of drawing a true-positive conclusion when the conclusion is true) is 1-β. Table 1 shows these concepts in a manner familiar to radiologists, the two-by-two diagram; the power of the study is analogous to the sensitivity of a diagnostic test [7].
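The 1-in-20 convention can also be read operationally: if the null hypothesis is true and the test is conducted at a significance level of 0.05, about 5% of repetitions will still reject it. The brief simulation below is a sketch, not part of the published protocol; the "measurements" are random numbers, and the z test assumes a known variance of 1 for simplicity.

```python
import math
import random

def two_sample_z(xs, ys):
    """Two-sided p value for equal means (z test, known unit variance)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    se = math.sqrt(1 / nx + 1 / ny)          # true SD is 1 in this simulation
    z = (mx - my) / se
    # two-sided p value from the standard normal distribution
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
trials, rejections = 2000, 0
for _ in range(trials):
    # H0 is true: both "modalities" measure the same underlying quantity
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if two_sample_z(a, b) < 0.05:
        rejections += 1                      # a false-positive (type I error)

print(f"False-positive rate under H0: {rejections / trials:.3f}")  # close to 0.05
```

Raising or lowering the 0.05 threshold moves this false-positive rate directly, which is exactly the trade against type II error described above.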
Because we have a convention that accepts an error of 5%, the standard acceptable β error is 20% (the risk of finding no difference when one exists), and the power, 1-β, is an 80% chance of finding a statistically significant difference when one exists.

TABLE 1: Labeling the Erroneous Conclusions from a Study

                                    Reality
  Conclusion Drawn       Test A Better         Test A No Better
  from Study             Than Test B           Than Test B
  --------------------------------------------------------------
  Test A better          True-positive         False-positive
  than test B            Correct               Type I error
                         1-β = power           Risk of error = α

  Test A no better       False-negative        True-negative
  than test B            Type II error         Correct
                         Risk of error = β
  --------------------------------------------------------------
  Note.—Adapted from [7].

Primary and Secondary Hypotheses

The discussion so far has concentrated on the concept of testing one hypothesis. A scientific protocol is divided into primary and secondary hypotheses. A hypothesis can be expressed in terms of "guiding":

CT is better than MR imaging for spinal disease.

—or "testable":

CT is superior to MR imaging for lumbar spinal stenosis in asymptomatic individuals.

A study can be designed to investigate more than one hypothesis. For example, a study comparing the effectiveness of sonography versus MR angiography for carotid stenosis could have a primary null hypothesis that:

MR imaging and sonography are equivalent for the diagnosis of carotid stenosis.

Perhaps secondary hypotheses could include a comparison of enhanced sonography and enhanced MR imaging on the evaluation:

Enhanced sonography is equivalent to enhanced MR angiography for the evaluation of carotid stenosis.

—or that there is equivalence only for certain degrees of stenosis:

Enhanced sonography is equivalent to enhanced MR angiography for the evaluation of carotid disease in the range of 50-80% stenosis.

Perhaps the patient's medical condition or symptoms could also be the focus of a secondary hypothesis:

Enhanced sonography is equivalent to enhanced MR angiography for the evaluation of carotid stenosis in patients with bruits.

Each one of these new ideas potentially adds to the TEaM. Sometimes, simpler is better. Answer one hypothesis, go entirely through our scientific loop as shown in the Appendix, propose a second hypothesis on the basis of the results, and continue the scientific progression [6]. A statistician collaborator should assist in making that determination on the basis of the study in question.

Similarly, specific aims should be identifiable for each of the protocol hypotheses. For example, if we hypothesize that enhanced sonography is equivalent to enhanced MR angiography for the evaluation of carotid stenosis, then we need to understand that a specific aim also should be considered, perhaps something like the following: to perform contrast-enhanced MR angiography and sonography on 100 consecutive patients with suspected carotid stenosis using carotid angiography as a standard of reference (previously called the gold standard). Subsequent secondary hypotheses should also have identifiable associated aims.

Defining a Protocol

Remember the opening scenario, assessing the perfusion of a recently transplanted liver. The steps to produce a summary of the research protocol are the following: identify the problem, answer the question of whether it is generalized or specific, evaluate the existing evidence, construct an appropriate hypothesis, establish one or more aims to test the hypothesis, and define a research plan that provides sufficient statistical power to answer the hypothesis.

The researcher has a valid clinical question and a specific and relevant issue in the practice of sonography. Evaluation of portal perfusion posttransplantation is a reasonable and valuable clinical diagnostic test for an important patient population.

Our Basic Query

Can enhanced sonography help detect low flow rates in vessels that are apparently below the detection threshold for conventional Doppler sonography? Why else would the manufacturers invest so much time and money in their development? However, is the clinical use scientifically proven?

Some Other Relevant Questions

What is the minimal flow level at which the contrast agent will work? Is the effectiveness of the contrast agent machine-dependent? What are the best techniques for visualization of low flow? Does the contrast agent work for all vessels, or are there anatomic limitations? Are there specific patients who should not have this contrast material?

Assessing the Existing Evidence

A contrast agent that permits visualization and quantification of low flow velocities could potentially improve examination on sonography of the patient with a transplanted liver. Unfortunately, portal venous thrombosis is a common complication of liver transplantation, leading to high mortality rates, difficult surgeries, and more postoperative complications [9]. In the diagnostic armamentarium, contrast-enhanced studies have proved effective in the assessment of hepatic allografts with MR imaging and angiography [10]. Although MR angiography has been compared with unenhanced sonography in the examination of liver transplants [10], only preliminary studies have been performed with sonographic contrast agents to determine the blood flow in the portal circulation [11, 12]. Contrast-enhanced MR imaging has already been used with Doppler sonography in the preoperative assessment of the portal venous system [13]. Clearly, potential exists for the use of sonographic agents for the examination of the portal venous system after transplantation in the patient. Therefore, this new technique should be applied to the assessment of the transplanted hepatic allograft, especially in the patients in whom a conventional unenhanced sonogram detects low flow or fails to detect perfusion at all. Such an added discrimination could prevent unneeded surgical procedures, such as mesoportal jump graft or splanchnic tributary, in lieu of thrombectomy [9].

Honing the Hypothesis

Hypothesis 1: Enhanced sonography is better than unenhanced sonography for the detection of low flow rates.

This statement seems reasonable; however, this hypothesis can be tested only with great difficulty because the statement is too generic. Some defining questions are the following: in what patients, tissue, or structures? What does low flow mean? These issues are addressed in hypothesis 2.

Hypothesis 2: Enhanced sonography is better than unenhanced sonography for the detection of greater than 50% thrombosis in the portal venous system.
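The protocol requirement that the research plan "provides sufficient statistical power" ultimately becomes a calculation of how many subjects are needed. The sketch below applies the standard two-proportion sample-size formula at a two-sided α of 0.05 and power of 0.80; the detection rates of 60% and 80% plugged in are purely illustrative assumptions, not values from the liver-transplant literature.

```python
import math

def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Subjects per group to detect p1 vs p2 with a two-sided two-proportion z test."""
    # inverse normal CDF via bisection on the erf-based CDF (stdlib only)
    def z_quantile(q: float) -> float:
        lo, hi = -10.0, 10.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if 0.5 * (1 + math.erf(mid / math.sqrt(2))) < q:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    z_a = z_quantile(1 - alpha / 2)          # about 1.96
    z_b = z_quantile(power)                  # about 0.84
    p_bar = (p1 + p2) / 2
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)

# Assumed detection rates of 60% (unenhanced) vs 80% (enhanced), for illustration
print(n_per_group(0.60, 0.80))   # 82 subjects per group at these assumed rates
```

Note how quickly the requirement grows as the difference to be detected shrinks; this is the calculation on which a statistician collaborator should be consulted before any patients are enrolled.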
This hypothesis is better, but questions remain. For example, what does "better" mean? Does it mean less expensive, faster, more specific, more sensitive, easier, or less risky to the patient? In the discipline of radiology, the value of a diagnostic test must rest solidly on the concepts of sensitivity and specificity (to be discussed in "Module 11") [14, 15]. A procedure is valueless if it does not show significant sensitivity and specificity. In this instance, the technique must be sensitive to flow rates currently undetected by conventional means—a valuable extension of the existing technology. This consideration leads us to hypothesis 3.

Hypothesis 3: Enhanced sonography is more sensitive than unenhanced sonography for the detection of stenotic vessels (greater than 50% stenosis) in portal venous vessels.

If the determination of sensitivity and specificity is added to the protocol, it is essential to propose some type of a standard of reference. This can be a difficult issue in radiology; a discussion of this topic will be found in "Module 9." The determination of a standard of reference for a diagnostic procedure usually involves postsurgical examination of the relevant tissues. However, other diagnostic tests with established sensitivity and specificity have also been used. In appropriate conditions, follow-up clinical diagnosis may also be appropriate. These considerations speak directly to the relevant knowledge of the investigators and their ability to choose an appropriate standard of reference, and lead to hypothesis 4:

Hypothesis 4: Enhanced sonography is more sensitive than unenhanced sonography for the detection of greater than 50% stenosis in portal venous vessels, in which angiography is used as the standard of reference.

Do normal livers have stenoses? The original inquiry and postulate concerned a transplanted liver. This problem is relevant and gives the opportunity to generate a final testable hypothesis.

Hypothesis 5: Enhanced sonography is more sensitive than unenhanced sonography for the detection of greater than 50% stenosis in liver allograft portal vessels, in which conventional angiography is used as the standard of reference.

With this hypothesis, the specific aim can be defined, incorporating a patient population with a transplanted liver, sonographic investigation with and without contrast agents, quantification of stenosis with sonography and angiography, and determination of sensitivity and specificity. As experts in the field, radiologists know the patients and appropriate radiologic measures. However, the statistical methods and sample size that will achieve the desired power must be determined. This stage is absolutely critical in the design of the study. If an investigator does not have competence in statistical design, a statistician should be consulted to determine how the observations will be compared and how many subjects will be needed.

Aim 1.—to determine and compare the sensitivity and specificity of enhanced and unenhanced sonography for the detection of portal venous stenosis in patients with transplanted livers, with angiography as the standard of reference.

Additional aims from the same study could be the following:

Aim 2.—to determine the highest degree of stenosis on sonography and contrast-enhanced sonography when flow is still visible.

Aim 3.—to evaluate the incremental cost and benefit of the addition of contrast material to the routine examination of newly transplanted livers.

Aim 4.—to determine the predictive value of the detection of stenoses below the threshold of conventional Doppler sonography to the failure of hepatic allografts.

The final step in the definition of our research protocol is to generate a research plan that incorporates the relevant experiments needed to fulfill the aims of the study. The following is an example of a research plan to fulfill the primary aims of our project.

Research Plan

Consecutive patients referred to the sonographic service for routine examination of a liver allograft will have a conventional Doppler sonogram, a contrast-enhanced sonogram (with 10 mg/kg Dopplerview), and a conventional angiogram with the administration of 30 mL radiographic contrast agent. The percentage of stenosis will be determined on all three modalities, and the sensitivities and specificities for enhanced and unenhanced sonography will be determined and compared. Scatterplots of detection and degree of stenosis will be used to establish lower cutoff levels for stenosis detection with and without contrast medium administration. A cutoff level based on angiography will be established to perform a receiver operator characteristic curve analysis of the ability of contrast-enhanced sonography to reveal pathologically important lower levels of liver flow. A significance level of p less than 0.05 will be used to evaluate the differences with receiver operator characteristic curve, chi-square, regression, and t tests, if appropriate.

Protocol Summary

Background

A contrast agent that permits visualization and quantification of low flow velocities could potentially improve examination on sonography of the patient with a transplanted liver. Although contrast-enhanced MR imaging has been compared with unenhanced sonography in hepatic allografts, only preliminary studies have been performed with contrast agents to determine the blood flow in the transplanted liver. This added discrimination could significantly improve the care of the patient with a liver transplant by preventing unneeded surgical intervention.

Hypothesis

Enhanced sonography is more sensitive than unenhanced sonography for the detection of greater than 50% stenosis in liver allograft portal vessels, in which conventional angiography is used as the standard of reference.

Specific Aims

Aim 1.—to determine and compare the sensitivity and specificity of sonography with and without contrast agents for the detection of portal venous stenosis in patients with transplanted livers, with angiography as the standard of reference.

Aim 2.—to determine the largest stenosis on sonography and enhanced sonography when flow is still visible.

Aim 3.—to evaluate the incremental cost and benefit of the addition of contrast medium administration to the routine examination of newly transplanted livers.

Aim 4.—to determine the predictive value of stenoses below the threshold of conventional Doppler sonography for the failure of hepatic allografts.

Research Plan

In consecutive patients referred to the sonography service for routine examination of a liver allograft, a conventional Doppler sonogram, a contrast-enhanced sonogram (with 10 mg/kg Dopplerview), and a conventional angiogram with 30 mL radiographic contrast agent will be obtained. Inclusion and exclusion criteria for patient participation will be defined.
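Against the angiographic standard of reference, the comparison called for in Aim 1 reduces, for each technique, to the same two-by-two form radiologists use for diagnostic tests. A minimal sketch, with counts invented purely for illustration:

```python
def sens_spec(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Sensitivity and specificity of a test against a standard of reference."""
    sensitivity = tp / (tp + fn)   # true-positives among patients with stenosis
    specificity = tn / (tn + fp)   # true-negatives among patients without stenosis
    return sensitivity, specificity

# Invented counts: enhanced vs unenhanced sonography read against angiography
enhanced = sens_spec(tp=27, fp=4, fn=3, tn=66)
unenhanced = sens_spec(tp=21, fp=3, fn=9, tn=67)
print(f"enhanced:   sens {enhanced[0]:.2f}, spec {enhanced[1]:.2f}")
print(f"unenhanced: sens {unenhanced[0]:.2f}, spec {unenhanced[1]:.2f}")
```

Whether an observed gap in sensitivity of this size is statistically significant is then a hypothesis test of the kind described above, which is why the sample-size planning must precede enrollment.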
The degree of stenosis will be determined for all three modalities, and the sensitivities and specificities for enhanced and unenhanced sonography will be determined and compared, with an angiogram as the standard of reference. Scatterplots of detection and stenosis will be used to establish lower cutoff levels for stenosis detection with and without contrast administration. A percentage stenosis cutoff level will be established to perform a receiver operator characteristic curve analysis of the ability of contrast-enhanced sonography to reveal pathologically important portal flow. The costs of the procedures will be established and compared. Patients will be followed up clinically for 6 months to determine the relationship between allograft survival and stenosis detected. A significance level of p less than 0.05 will be used to evaluate the differences with receiver operator characteristic curve, chi-square, regression, and t tests, if appropriate.

Conclusion

The generation of this protocol has addressed a number of the key issues that define the scientific approach to radiologic investigation. In this module, an important question has been raised, the relevant back- […] -esis has been produced. This exercise has illustrated the thinking behind the scientific method, the basis of which is statistical hypothesis testing. The purpose of this module is, therefore, to give the structural basis to take a question, evaluate its importance, and structure it in a manner suitable for testing. The other modules in this series will address various aspects of defining and understanding the ideas behind specific techniques for specific research protocols.

Acknowledgments

We thank Donal Downey and Craig Beam for their helpful comments.

References

1. Kent DL, Haynor DR, Longstreth WT Jr, Larson EB. The clinical efficacy of magnetic resonance imaging in neuroimaging. Ann Intern Med 1994;120:856-871
2. Eng J, Siegelman SS. Improving radiology research methods: what is being asked and who is being studied? Radiology 1997;205:651-655
3. Proto AV. Radiology 2000: reviewing for radiology. Radiology 2000;215:619-621
4. Giere RA. Justifying statistical hypothesis. In: Understanding scientific reasoning. Fort Worth, TX: Holt, Rinehart & Winston, 1984:230-272
5. Gilbert N. Scientific tests. In: Biometrical interpretation. Oxford, England: Oxford University
7. Sackett DL, Haynes RB, Tugwell P. Deciding on the best therapy. In: Clinical epidemiology. Boston: Little Brown, 1985:162-165
8. Clarke GM. Statistics and experimental design. London: Edward Arnold, 1994:171-197
9. Yerdel MA, Gunson B, Mirza D, et al. Portal vein thrombosis in adults undergoing liver transplantation: risk factors, screening, management, and outcome. Transplantation 2000;69:1873-1881
10. Glockner JF, Forauer AR, Solomon H, Varma CR, Perman WH. Three-dimensional gadolinium-enhanced MR angiography of vascular complications after liver transplantation. AJR 2000;174:1447-1453
11. Venz S, Gutberlet M, Eisele RM, et al. The diagnosis and imaging of the a. hepatica after orthotopic liver transplantation: a comparison of frequency-modulated and amplitude-modulated color Doppler sonography [in German]. Rofo Fortschr Geb Rontgenstr Neuen Bildgeb Verfahr 1998;169:284-289
12. Leutloff UC, Scharf J, Richter GM, et al. Use of the ultrasound contrast medium Levovist in after-care of liver transplant patients: improved vascular imaging in color Doppler sonography [in German]. Radiologe 1998;38:399-404
13. Naik KS, Ward J, Irving HC, Robinson PJ. Comparison of dynamic contrast enhanced MRI and Doppler sonography in the pre-operative assessment of the portal venous system. Br J Radiol 1997;70:43-49
14. Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Is this evidence about a diagnostic test important? In: Evidence-based medicine. New York: Churchill Livingston, 1997:118-128
15. Jaeschke R, Guyatt GH, Sackett DL. Users' guides to the medical literature. III. How to use an article
ground information has been examined, a
Press, 1989:69–79 about a diagnostic test. B. What are the results and
testable hypothesis has been honed, a series 6. Medina LS. Study design and analysis in neurora- will they help me in caring for my patients? The
of aims have been generated, and a possible diology: a practical approach. AJNR 1999;20: Evidence-Based Medicine Working Group. JAMA
set of experimental studies to test the hypoth- 1584–1596 1994;271:703–707

APPENDIX: Loop of Science Algorithm


1. Ask a question
2. Assess the importance: motivation, originality, innovation, significance
3. Evaluate the existing evidence
4. Generate a specific testable hypothesis
5. State specific aims
6. Design the study
7. Evaluate data with appropriate statistical methods
8. Choose null or alternative hypothesis
9. Return to 3
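Step 7 of the loop ("evaluate data with appropriate statistical methods") is where an analysis plan like the protocol's would be executed. The sketch below is a hedged illustration only: all counts are invented, and for simplicity the two tests are compared as if they were independent samples, whereas a paired design (both tests performed on the same patients) would call for a matched analysis such as McNemar's test.

```python
# Hedged sketch of step 7, "evaluate data with appropriate statistical
# methods". All counts below are invented for illustration; they are NOT
# study data. For simplicity the two tests are treated as independent
# samples -- a paired design (both tests on the same patients) would call
# for a matched analysis such as McNemar's test instead.

def sens_spec(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Invented counts against the angiographic standard of reference
enhanced = dict(tp=42, fn=3, tn=50, fp=5)     # contrast-enhanced sonography
unenhanced = dict(tp=33, fn=12, tn=48, fp=7)  # unenhanced sonography

for name, counts in (("enhanced", enhanced), ("unenhanced", unenhanced)):
    se, sp = sens_spec(**counts)
    print(f"{name}: sensitivity = {se:.2f}, specificity = {sp:.2f}")

# Detected (TP) versus missed (FN) stenoses for the two techniques;
# 3.84 is the chi-square critical value for 1 df at the 0.05 level.
chi2 = chi2_2x2(enhanced["tp"], enhanced["fn"],
                unenhanced["tp"], unenhanced["fn"])
print(f"chi-square = {chi2:.2f}; significant at 0.05: {chi2 > 3.84}")
```

A full analysis would add the receiver operator characteristic curve, regression, and t tests named in the protocol; this fragment illustrates only the hypothesis-testing step of the loop.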

1380 AJR:176, June 2001


Fundamentals of Clinical Research
for Radiologists
Philip E. Crewson 1
Kimberly E. Applegate 2,3
Data Collection in Radiology
Research

Received April 3, 2001; accepted after revision April 24, 2001.

Series editors: Craig A. Beam, C. Craig Blackmore, Stephen J. Karlik, and Caroline Reinhold.

This is the fourth in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

1 American College of Radiology, 1891 Preston White Dr., Reston, VA 20191. Address correspondence to P. E. Crewson.

2 Department of Radiology, Rainbow Babies & Children's Hospital, 11100 Euclid Ave., Cleveland, OH 44106-5056.

3 Present address: Department of Radiology, Riley Hospital for Children, 702 Barnhill Dr., Indianapolis, IN 46202-5200.

AJR 2001;177:755–761
0361–803X/01/1774–755
© American Roentgen Ray Society

This paper introduces the basic principles essential for a successful data collection effort. Data collection must begin with a clear research question. The researcher should then carefully identify data needs, anticipate problems with data measurement and missing data, design and pilot test a data collection system, establish quality control, and plan both data entry and statistical analyses. To be successful, all aspects of data collection must focus on the goal of obtaining substantively important data that are consistent, accurate, and unbiased.

"On being asked to talk on the principles of research, my first thought was to arise after the chairman's introduction, to say, 'Be careful', and to sit down…" by J. Cornfield [1].

Universally lamented by experienced clinical researchers as an important but often ignored aspect of medical research, good study design and data collection are critical to the success of any clinical study [2, 3]. Although most researchers are, by their very nature, excited by experimentation and analysis, few find enjoyment in the design and implementation of data collection, although these factors are critical to successful research. Too often, researchers pay little attention to how data will be collected, whether the data are available or can be measured, or how much data will be incorrect or missing. Even fewer researchers carefully train the data collectors and periodically check their work.

This paper outlines seven basic elements of data collection. We discuss defining the research question, deciding on what data to collect, obtaining institutional review board (IRB) approval, planning statistical analyses, designing the data collection system, establishing quality control, and organizing data entry. This article is by no means comprehensive but provides guidelines that we believe will improve clinical research in radiology.

Three general rules of data collection underlie this discussion. First, researchers should assume they will underestimate the amount of time and effort involved in data collection. Second, the more complex the data collection process is, the longer it will take to acquire and enter the data. Finally, systematic and individual data collection errors must be addressed early in the process, because it is unwise to trust human memory or a statistician's creativity to resolve errors in the data.

Define the Primary Research Question

The first step in designing data collection is formulating the research question or questions [4]. The research question should identify the study's end points, also known as the response or outcome variables (see Appendix 1 for a glossary of terms). Examples of common end points in diagnostic imaging studies are diagnostic accuracy, patient quality of life, patient satisfaction, patient comfort, safety, morbidity, impact on patient care, and costs.

The end point is the dependent variable, the variable you wish to better understand. Identifying other variables becomes an exercise in determining what factors might explain variations in the study's end point [4–6]. These factors, known as independent variables, usually include basic demographics such as age, sex, and race. Other independent variables could include comorbidity, stage of disease, signs or symptoms, laboratory test results, imaging test results, clinician experience and training, type of imaging equipment, and patient movement, to name only a few. To be worthy of inclusion in the study, independent

AJR:177, October 2001 755


variables should either relate directly to the research question or provide useful controls for defining the study population and sample.

Identify Data Requirements

Deciding what, how, and when to collect data may be the most difficult part of the data collection design [2]. Every principal investigator will be faced with the dilemma of either collecting too little data, thereby weakening the study's results, or attempting to collect too much data and becoming so overwhelmed that the study is never completed or participation of institutions and personnel wanes from exhaustion. Collecting insufficient data may also significantly impact the statistical analyses. Few studies have the luxury of retrospectively obtaining data after the study has been closed. In turn, endeavoring to collect too much data can result in lack of participation by patients and institutions, excessive amounts of missing data, fatigue of support personnel, and cascading delays in patient accrual, data cleaning, and analysis [2]. This trade-off is particularly important to address in multidisciplinary research, in which there will be greater demands to collect extraneous data. In characterizing these trade-offs, one author suggests that the right amount of data is "as many as necessary and as few as possible" [4].

Three additional elements must also be considered when determining what data will be collected. Essential for designing data collection forms and creating data files, these elements are the unit of analysis, data precision, and the collection sequence. The involvement of a statistician at this stage of data collection design cannot be overemphasized. Statisticians can provide guidance on determining the unit of analysis, data precision, and many other research design issues essential to a defensible statistical analysis.

Unit of Analysis

Determining the unit of analysis is a basic task in designing a study, not only for methodologic reasons but also because it affects the design of data collection forms, the storage and linking of documentation, and the design of electronic data files. The most common unit of analysis is the individual patient, but there are many other possibilities, such as the institution, the type of procedure, the images, or, in the case of reader studies, even individual radiologists.

Data Precision

The degree of accuracy needed in the collected data also deserves early attention [2, 4]. There are likely to be several different ways to measure the data you collect. For example, when recording carotid stenosis, is it sufficient to record stenosis to one decimal place (0.4), two decimal places (0.44), or three (0.435)? Obviously, the more precise the measure the better, but the goal of precision may need to be tempered by consideration of the cost in both time and money and the substantive importance of the measure.

Whenever possible, use well-established measurements and common terminology to reduce design time and improve comparability with other studies [6]. In addition, good research design must address reliability (consistency and reproducibility, such as the extent to which a measure obtains similar results on identical patients) and validity (how often the positive test result is correct) [2, 4–6]. Both are important for establishing the accuracy of the study outcome.

Collection Sequence

Finally, study collaborators should consider the sequence of data collection early in the study design. This will allow for thoughtful preparation of data forms, the design of an adequate data file format, and development of a suitable analysis plan. Many studies incorporate patient follow-up, often at multiple intervals. Follow-up measurements should be recorded on well-focused forms that are coded with a common linking identifier (generally the case identification number) to ensure they can be aggregated with previously collected data.

IRB Approval

Local institutional review board approval is required in most, if not all, clinical studies and deserves to be a major consideration when designing data collection (Appendix 2). Regardless of whether the study is a retrospective analysis of collected data or a prospective clinical trial, no data collection should be initiated until all ethical, procedural, and legal requirements are satisfied [7, 8]. IRB requirements will vary, but you should be prepared to address patient confidentiality, potential risks to the patient, and procedures for obtaining informed consent and monitoring for adverse events. Some IRBs will give quick administrative approval for a retrospective study of medical records, whereas others require both full IRB committee review and patient informed consent for this type of study. There is increasing public concern over confidentiality of medical records and increased scrutiny of medical research by the federal government. This concern cannot be overemphasized and is exemplified by a recent New York Times report that a computer hacker accessed thousands of medical records in a cardiology research database at the University of Washington [9].

The Statistical Analysis Plan

The precision and scale of the data (i.e., nominal, ordinal, and interval) will determine or limit the statistical techniques used in the data analysis portion of the study. Collecting patient age in years is fairly precise, but collecting date of birth allows computation of age to years, weeks, and days. Similarly, collecting age by grouping (<35, 36–55, ≥56) converts the scale of data from interval (suitable for computing mean age in years) to ordinal (not useful for computing a mean). As a result, developing a clear statistical analysis plan at this early stage can be very useful [10], not only in providing focus for the data collection effort (such as specifying sample size estimation) but also in pointing out weaknesses in the scale and precision of the data before data collection begins.

The statistical analysis plan is a detailed outline of what data will be analyzed and how. This plan should include clear definitions of variables and statistical end points (descriptive or inferential), a description of the required subgroup analyses, and identification of the most appropriate statistical techniques and their relationship to the research hypotheses. Although it is poor research methodology, data are frequently collected without a clear understanding of how they will be analyzed or of the scale of data necessary for a particular statistical technique. A useful but time-consuming tool in designing a statistical plan is to draft the tables you will use to present the results of your analysis [5]. This approach is helpful in identifying important comparisons while clarifying statistical method requirements and data needs.

Designing the Data Collection System

Most data are collected either from secondary data sources, such as patient records and other administrative databases, or from primary sources, such as patient interviews, patient surveys, and interpretation of imaging studies by clinical personnel. Collection instruments can be as simple as handwriting data on a paper ledger or as complex as creating a complete computerized internet-accessible direct entry system. Do not assume, however, that sophisticated electronic collection systems will produce error-free data [11]. They will be susceptible to



the same problems as paper forms, such as entry errors and missing information. In addition, errant programming could lead to a multitude of problems, such as improperly formatted data fields, unsaved entries, and a confusing data file design that requires extensive and time-consuming manipulation.

Regardless of the complexity, factors to consider in designing the data collection system include creating data forms, avoiding systematic bias, and preparing a plan for data administration.

Data Collection Forms

The case report form is a common tool used to collect multiple sources of data (patients, physicians, records) into one document. In developing and designing this type of data form, it is wise to allow for detailed notes, regardless of the number of investigators involved in the study [12]. These notes may or may not be entered into an electronic data file, but they can become invaluable in explaining otherwise unexplainable variations in the data later in the study. Examples could be patient movement or other uncooperative activity, equipment malfunctions, previously undisclosed comorbidities, or exceptions to protocol guidelines, which can occur for many reasons, including human error and clinical necessity.

Form development is both an art and a science, but there are a few basic rules to follow. First, forms should be self-explanatory to the person entering the data. Second, data should not require extensive interpretation before recording. Third, the unit of measurement should be defined. Using time as an example, specify which unit of measurement is required (hours, days, weeks, months, or years). The level of precision should be evident (fractions of hours, round to nearest full day).

Also, consistent and complete responses should be required for each section of the form. Never leave a section blank. Leaving a section blank may mean the issues are not applicable (which is important to code) or that the originator of the data forgot to respond. In the case of missing data, an assumption of irrelevance may be entirely wrong.

Finally, the form should be visually appealing, easy to navigate, and conducive to data entry [2, 4]. It is often helpful to have the coding conventions for data entry included directly on the form (female [0], male [1]).

Pilot Testing

Pretest the forms on individuals who are characteristically the same as those who will fill out the forms in the study: have physicians fill out physician forms, technologists fill out technologist forms, and someone who is not a physician or healthcare worker fill out the patient questionnaires. Involve everyone who will be handling the data collection, data entry, or analysis in the form design and testing process. The initial data collection form may be piloted on a small sample of potential patients to determine whether the desired data are available and whether the data form is easy to complete and enter into the database. The data form can then be revised before the full study has begun. Finally, the principal investigator should not rely on memory to recall study design issues such as units of analysis, data measurement techniques, definitions of each measurement and variable, and time sequence. All members of the research team need a copy of the methodology and a "code book" to serve as a reminder of the methods established for data collection.

Avoid Systematic Bias

Although some biases can be corrected in the analysis, some are fatal and may render the study invalid. Therefore, it is always best to design the study to avoid or minimize these biases. There are many sources of bias to consider, however; some are more closely related to the data collection effort than others. In particular, steps should be taken to maintain objectivity in the data collection system while avoiding bias in patient recruitment and minimizing the effects of "interpretation bias" and "response bias."

Objective measurement of data reduces the likelihood of collection bias, but the degree of objectivity can vary, depending on the means of measurement. As an example, measures of body weight on a calibrated digital scale are unlikely to vary depending on who weighs the patient. In contrast, if surveyors are asked to indicate whether a patient is underweight, normal in weight, or overweight, much will depend on individual perceptions. It is a subjective measure.

Subjective measures are particularly susceptible to prior knowledge of the treatment arm of a clinical trial [10]. In blinded studies, neither the patient nor the data collector knows who is in the control group or in the treatment group, thereby minimizing bias. In open studies, both the patient and the data collector know who is given treatment. In the event that complete blinding is not possible, a blinded clinician could be used to review the data from both groups for consistency. Although a complete discussion is beyond the scope of this paper, be aware that a clinical study may require procedures that fail to completely mirror clinical practice, such as having all available patient information before making an assessment.

If your study calls for patient randomization into multiple study arms, it is essential that the randomization, if at all possible, be done either by a third party or by automation. Before obtaining patient consent, the study monitor should have no knowledge of that patient's study arm placement.

In diagnostic imaging, comparisons often involve the same patient receiving two diagnostic tests [13]. Ensuring that the technologists and radiologists are unaware of the competing test results is essential to prevent interpretation bias. This sort of blinding will require two separate, and probably different, data report forms.

Response bias can occur during the follow-up stage of a study because of incomplete responses or patients lost to follow-up [14]. Ill patients may be more likely to complete a follow-up quality of life survey than patients who are not ill. As a result, aggregate quality of life estimates may be lower than they would have been if all participants had responded.

Plan for Data Management

Investigators must develop a plan for ensuring the confidentiality and storage of paper forms and documentation. The preparation of a coding system that provides individual identifiers (case IDs) for each patient is one way to keep data confidential and to support blinding efforts. The same case ID should be used consistently for all data forms related to a particular patient. Maintaining consistent case IDs is especially important for follow-up efforts, because these data may have to be entered separately and then merged with the primary data file at a later date.

In general, all personally identifiable information should be collected and stored separately from the case report form. Even though patient identifying information should be separate from the case report form, each set of data should be stored in a secure location with limited access. Assign responsibilities for data storage and for maintenance of the master list that contains both patient information (names and addresses) and assigned case IDs. Similarly, a computer specialist must secure the confidentiality of the electronic files.

Quality Control

Data integrity is the bedrock of any clinical study [2, 10, 14]. Early and ongoing review and cleaning of data during the collection



process, while being alert for systematic biases in data collection and processing, is a critical element of ensuring quality control [15]. Preventable errors should be identified and, ideally, corrected early in the study, instead of consuming expensive resources and time cleaning the database after the study has been closed.

Data Cleaning

Cleaning data requires developing a scheme for ensuring that the data are consistent and accurate. Much depends on the study design, but consider monitoring for the following: out-of-range data values; missing data; lack of variability (survey questionnaires can include reversed questions to see whether the respondent is using the scale appropriately); logic traps (check combinations of responses for inconsistency, such as a female record that lists chronic prostatitis as a comorbidity); and date checking (verify forms are completed in sequential order) [12]. An entry error in the year field is much easier to catch early on than after the data are entered and combined with those of other participants.

All members of the research team should understand the goals and design of a study so that they may flag questionable data [2]. The study design should clearly identify the target and study populations and the patient selection criteria so that variability among centers and the individual investigators who enroll patients is minimized. Develop and enforce consistent rules for data review and cleaning, including specification of how to handle missing data. These rules should be delineated before data aggregation, when the temptation to justify certain decisions in favor of a particular outcome is strongest [10]. The principal investigators should also determine whether an "interim analysis" is necessary, determine prospectively when and what is analyzed, and identify the decision rules for discontinuing the study [10, 16]. An interim analysis is generally done when it is important to monitor the efficacy or safety of two treatments.

Once the data are clean, the "database lock" occurs: the point at which no additional cases or data will be added to the data file. Always assume, however, that there will be data errors even after a complete quality control plan is used [4, 10]. During the statistical analysis, do not be surprised if it becomes necessary to pull original case report forms to answer questions from the analysis. Outliers can be very revealing in a statistical analysis, and it is not unusual to want to verify data integrity when the results run counter to theory or prior experience.

Amoral Consequence of Dishonesty

Our discussion of bias thus far has assumed that errors in data collection are the result of unintentional practices, such as misunderstanding instructions or rationalizing postprotocol changes in study design that result in collecting and reporting inaccurate information. In contrast, dishonesty biases data collection through deliberate falsification of either the raw data or the conditions essential for maintaining the integrity of the clinical study (such as proper patient recruitment). Regardless of whether data collection errors are accidental, well-intentioned, or the result of deliberate fabrication, the amoral consequence is bias [3]. Some will conclude that quality control is a necessary evil to prevent the errors caused by others involved in the study. For most studies, however, the danger of introducing error in data collection rests less with the dishonest than with those well-intentioned researchers who fail to recognize and take steps to mitigate their own potential for bias.

Electronic Data File

It would be hard to conceive of a clinical study that does not require statistical analysis. Regardless of whether the statistical needs are modest (counts, percents, means) or more demanding (multivariate techniques, survival analysis, complicated error estimation), most forms of statistical analysis require the creation of an electronic data file. Therefore, planning the format of the data file at the beginning of the study is essential [1]. There can be a disconnect, however, between what the clinician visualizes as data and what the statistician needs. Although most clinicians are likely to view data in raw form as patient records, lab results, responses on case report forms, and interview sheets, statisticians view data as numbers in an array of rows and columns.

Although there are many available data file formats and complex organizational structures, such as relational databases, most statisticians prefer the traditional rectangular data file (Table 1). Analogous to the common spreadsheet, each row typically represents one case (a patient) and each column represents a variable (a data element). Ideally, most data entries are numerical codes [1, 4]. As an example, although it is possible to enter "male" or "female" for the sex variable, data entry and subsequent statistical programming are much simpler if numbers (numerical fields) are used in place of words (string fields). In Table 1, female patients are coded "0" and male patients are coded "1." If possible, avoid open-ended entries (e.g., free text comment or description fields) because they will inevitably lead to interpretation error. Popular electronic data file formats include delimited text files, Excel (Microsoft, Redmond, WA), SAS (SAS Institute, Cary, NC), SPSS (SPSS, Chicago, IL), Access (Microsoft), and Epi Info (Centers for Disease Control and Prevention, Atlanta, GA).

Conclusion

Our discussion has been based primarily on experiences related to collecting data from traditional sources, such as paper case report forms and questionnaires. Much of what has been presented, however, will also be useful as

TABLE 1 Data File Design and Formatting

Case ID Treatment Group Recruiting Site Start Date Sex Age Imaging Result Pathology Report
1001 1 23 10212000 1 45 3 0
1002 0 15 10252000 0 32 1 1
1003 0 7 12032000 1 56 5 0
1004 1 12 01052001 1 28 3 1
Note. —The data are coded numerically so that statistical analyses can be easily performed. Female patients are coded “0” and male patients are coded “1.” A code explanation book allows
all members of the research team, including the statistician, to understand each numeric code in the database. Confidentiality is maintained by removing the patient names and assigning a case
identification number.
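The rectangular layout in Table 1 can be assembled and checked with a few lines of code. The sketch below is hypothetical: the lowercase column names and the plausible-age rule are ours, not the article's. It reads the Table 1 rows from a delimited text string, validates the numeric codes against a small code book, and computes mean age, which is meaningful only because age was collected on an interval scale.

```python
# Hypothetical sketch of the rectangular data file in Table 1.
# Column names and the age-range check are illustrative assumptions.
import csv
import io
import statistics

# A "code book" shared by the whole research team, as the table note advises
CODE_BOOK = {
    "sex": {0: "female", 1: "male"},             # coding shown in Table 1
    "treatment_group": {0: "control", 1: "treatment"},
}

# The four Table 1 rows as a delimited text file (one row per case)
raw = """case_id,treatment_group,recruiting_site,start_date,sex,age,imaging_result,pathology_report
1001,1,23,10212000,1,45,3,0
1002,0,15,10252000,0,32,1,1
1003,0,7,12032000,1,56,5,0
1004,1,12,01052001,1,28,3,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Validate numeric codes against the code book and a plausible age range
for row in rows:
    assert int(row["sex"]) in CODE_BOOK["sex"], row
    assert int(row["treatment_group"]) in CODE_BOOK["treatment_group"], row
    assert 0 < int(row["age"]) < 120, row

# Interval-scale age supports a mean; grouped (ordinal) age would not
mean_age = statistics.mean(int(r["age"]) for r in rows)
print(f"n = {len(rows)}, mean age = {mean_age:.2f}")
```

Had age been collected only in the ordinal groups discussed earlier (<35, 36–55, ≥56), the final line would have no meaningful equivalent, which is the point the statistical analysis plan section makes about scale and precision.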



technology shifts to more internet-based collection efforts [17] and alternative software collection systems [18]. We are confident that, regardless of advances in technology and the potential for increased automation in data collection, the basics will remain important.

Data collection requires thoughtful preparation and consistent implementation. To be successful, all aspects of data collection must be focused on the goal of obtaining substantively important data that are consistent, accurate, and unbiased. Data collection begins with a clear research question and is followed by careful attention to identifying data needs, anticipating missing or incorrect data, planning statistical analyses, designing a data collection system, establishing quality control, and planning for data entry. Considerable misspent effort can be avoided if the principal investigators, data managers, and statisticians work together early in the design of a data collection effort.

We have presented elements of a data collection checklist that should be addressed in most, if not all, clinical research. This list is not comprehensive; much will depend on the specifics of a particular study, but recognition of the seven primary issues can dramatically improve the quality of research in radiology.

References

1. Feigal D, Black D, Grady D, et al. Planning for data management and analysis. In: Hulley SB, Cummings SR, eds. Designing clinical research: an epidemiologic approach. Baltimore: Williams & Wilkins, 1988:159–171
2. Friedman LM, Furberg CD, DeMets DL. Data collection and quality control. In: Fundamentals of clinical trials. New York: Springer, 1998:156–169
3. Altman DG. Statistics and ethics in medical research: collecting and screening data. BMJ 1980;281:1399–1401
4. Crombie IK, Davies HTO. Issues in data collection. In: Research in health care: design, conduct and interpretation of health services research. West Sussex, England: Wiley, 1996:199–222
5. Goldin J, Sayre JW. A guide to clinical epidemiology for radiologists. I. Study design and research methods. Clin Radiol 1996;51:313–316
6. Grady KE, Wallston BS. Research in health care settings. Sage 1998;14:84–100
7. Office for Human Research Protections. Regulations. Department of Health and Human Services Web site. Available at: http://ohrp.osophs.dhhs.gov. Accessed April 18, 2001
8. Department of Health and Human Services Commission on Research Integrity. Integrity and misconduct in research. Washington: United States Printing Office, 1995. Publication no. 1996-746-425
9. Hackers from abroad obtain data on Washington patients. New York Times, Dec 8, 2000. Available at: www.nytimes.com. Accessed December 8, 2000
10. Meyerson LJ, Wiens BL, LaVange LM, et al. Quality control of oncology clinical trials. Hematol Oncol Clin North Am 2000;4:953–971
11. McManus B. A move to electronic patient records in the community: a qualitative case study of a clinical data collection system problems caused by inattention to users and human error. Top Health Inf Manage 2000;20:23–37
12. Fisher LD, Van Belle G. Data collection: design of forms. In: Biostatistics: a methodology for the health sciences. New York: Wiley, 1993:24–34
13. Valk PE. Clinical trials of cost effectiveness in technology evaluation. Q J Nucl Med 2000;44:197–203
14. Kane RL. Miscellaneous observations about outcomes research: practical advice. In: Understanding health care outcomes research. Gaithersburg, MD: Aspen, 1997:243–255
15. Begg CB. Biases in the assessment of diagnostic tests. Stat Med 1987;6:411–423
16. Knatterud GL. Comment. Control Clin Trials 1996;17:285–293
17. Wright S, Neill K. Using the World Wide Web for research data collection. Clin Excell Nurse Pract 1999;3:362–365
18. eDict Systems, Inc. eDict Systems, Inc. Web site. Available at: http://www.edictation.com. Accessed April 18, 2001

The reader’s attention is directed to earlier articles in the Clinical Research series: Introduction,
which appeared in the February 2001 issue; Framework, April 2001; and Protocol, June 2001.

AJR:177, October 2001 759


Crewson and Applegate

APPENDIX 1: Glossary of Terms


Case report form A standardized form used to collect and organize data for analysis.

Dependent variable A measure not under the control of the researcher that reflects responses caused by
variations in another measure (the independent variable).

Descriptive statistic A statistic that classifies and summarizes sample data.

Independent variable A measure that can take on different values that are subject to manipulation by the re-
searcher.

Inferential statistic A statistic that uses characteristics of a random sample along with measures of sam-
pling error to predict the true values in a larger population.

Institutional review board An independent group of reviewers responsible for determining if the appropriate clin-
ical, legal, and ethical safeguards have been incorporated into a study.

Interpretation bias An error in data collection that occurs when knowledge of the results of one test affects
the interpretation of a second test.

Interval data Objects classified by type or characteristic, with logical order and equal differences
between levels of data.

Measurement scale A reflection of how well a variable or concept can be measured. Generally categorized
in order of precision as nominal, ordinal, interval, and ratio data.

Nominal data Objects classified by type or characteristic.

Ordinal data Objects classified by type or characteristic with some logical order.

Patient randomization Assignment to a treatment group that is independent of the person recruiting the pa-
tient and the patient's characteristics.

Precision The degree of accuracy used in measuring a variable.

Reliability The extent to which a measure obtains similar results over repeated trials.

Research question A question that defines the purpose of the study by clearly identifying the relation-
ship(s) the researcher intends to investigate.

Response bias Errors in data collection caused by differing patterns and completeness of data collec-
tion that are dominated by a specific subgroup within the sample.

Response variable The measure not controlled in an experiment. Commonly known as the dependent
variable.

Unit of analysis The object under study, which could be patients, radiologists, images, institutions, etc.

Validity The extent to which a measure accurately represents an abstract concept such as the
presence of disease.

Variable A characteristic that can take on different values from one observation to another.
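The glossary's distinction between a descriptive statistic (summarizes the sample itself) and an inferential statistic (uses sampling error to estimate a population value) can be illustrated with a short sketch. The lesion-diameter data below are simulated, not taken from any study in this series:

```python
import math
import random

random.seed(7)
# Hypothetical data: lesion diameters (mm) in a random sample of 40 patients.
sample = [random.gauss(12.0, 3.0) for _ in range(40)]

# Descriptive statistic: classifies and summarizes the sample data.
mean = sum(sample) / len(sample)

# Inferential statistic: combines the sample with a measure of sampling
# error (the standard error) to estimate the true mean in the population.
sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (len(sample) - 1))
se = sd / math.sqrt(len(sample))
ci_low, ci_high = mean - 1.96 * se, mean + 1.96 * se  # approximate 95% CI

print(f"sample mean {mean:.1f} mm, 95% CI ({ci_low:.1f}, {ci_high:.1f})")
```

The sample mean describes only the 40 patients measured; the confidence interval is the inferential step, a statement about the larger population the sample was drawn from.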



Fundamentals of Clinical Research for Radiologists

APPENDIX 2: Data Collection Checklist


Determine primary research question and key end point (the dependent variable)
• Determine primary and secondary research questions and key end points
• Specify target population and sample selection criteria

Identify data needed to measure end points and provide statistical controls
• Identify the unit of analysis (patients, procedures, images, etc.)
• Determine scale and precision needed for each data element
• Identify collection sequence (pre- and postintervention, follow-up, etc.)

Obtain institutional review board approval
• Informed consent form
• Patient confidentiality
• Identify potential risks
• Adverse event monitoring

Create statistical analysis plan
• Establish statistical methods used for each research hypothesis
• Create tables used for reporting results

Design data collection system
• Specify data sources (patients, physicians, records)
• Design case report forms to collect data
• Pilot test case report forms
• Review for systematic bias
• Develop a case numbering system for data entry and record management
• Establish system for securing data forms and maintaining confidentiality

Establish quality control
• Establish a data cleaning procedure and assign responsibilities
• Establish acceptable data ranges
• Create a timeline for quality control
• Require complete entries (removes doubt about reason for missing data)

Organize data entry
• Determine data format and design electronic data file
• Develop coding and data entry guidelines
• Set data checking procedures
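As a concrete sketch of the "establish acceptable data ranges" and "require complete entries" steps in the checklist, a minimal validation pass over case report form records might look like the following. The field names and ranges are hypothetical, invented for illustration:

```python
# Acceptable ranges for each data element (hypothetical values).
RANGES = {"age": (18, 100), "tumor_size_mm": (1, 200)}
REQUIRED = ("case_id", "age", "tumor_size_mm")

def check_record(record):
    """Return a list of quality-control problems found in one CRF record."""
    problems = []
    for field in REQUIRED:                      # require complete entries
        if record.get(field) in (None, ""):
            problems.append(f"{field}: missing")
    for field, (lo, hi) in RANGES.items():      # enforce acceptable ranges
        value = record.get(field)
        if isinstance(value, (int, float)) and not lo <= value <= hi:
            problems.append(f"{field}: {value} outside {lo}-{hi}")
    return problems

records = [
    {"case_id": "001", "age": 54, "tumor_size_mm": 23},
    {"case_id": "002", "age": 340, "tumor_size_mm": None},  # entry errors
]
for r in records:
    print(r["case_id"], check_record(r) or "OK")
```

Running checks like these as data arrive, rather than at analysis time, is what the checklist means by a data cleaning procedure with assigned responsibilities.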



Fundamentals of Clinical Research for Radiologists

Ella A. Kazerooni¹

Population and Sample

Received April 6, 2001; accepted after revision April 24, 2001.
Series editors: Craig A. Beam, C. Craig Blackmore, Stephen J. Karlik, and Caroline Reinhold.
This is the fifth in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).
Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.
¹Department of Radiology, 2910 Taubman Center, University of Michigan Medical Center, 1500 E. Medical Center Dr., Ann Arbor, MI 48109-0326. Address correspondence to E. A. Kazerooni.
AJR 2001;177:993–999
0361–803X/01/1775–993
© American Roentgen Ray Society

The design of clinical research begins with the formulation of a research question. As radiologists, we ask many questions about the diagnostic imaging tests we perform and interpret, particularly as new tests are introduced. Can we see a disease on an imaging test at all (technical efficacy)? What are the imaging findings of that disease (description)? Can these findings be used to distinguish between the disease in question and the condition of no disease (accuracy) or distinguish between different diseases (discrimination)? Is a newly introduced imaging test as good as or better than existing tests (comparison)? Can the test be performed in a technically adequate manner in most clinical circumstances (technical reproducibility)? Will the same radiologist interpreting an imaging study today and the same study again next month come to the same conclusion (intraobserver agreement), and will a group of radiologists of varying expertise interpret the same study the same way (interobserver agreement)? What is patient preference when given the option of two or more competing tests? How cost-effective is the test? How does the test affect treatment outcome?

Substantial research questions deal with matters of vital relevance to important groups, or populations of individuals. However, important populations are generally large and, because of numerous practicalities (economy, time, and ethics), researchers often find they cannot afford to study all members of interesting populations. The time-honored scientific solution to this problem is to draw a representative subset, or sample, from the population and to base conclusions about the population on conclusions drawn from the sample. Statistical science is then used to assess and manage the uncertainties inherent in this process of scientific inference.

The goal of this article is to review the distinction made by modern scientific thought between population and sample, and to review considerations applicable to the identification and selection of population and sample in clinical radiology research.

Conventional science distinguishes three groups of individuals (Fig. 1). The goal of the series that includes this article is to bring clinical research in radiology more in line with mainstream medical research. Researchers in radiology should therefore adhere to the modern concepts of target population, study population, and sample when designing and writing about their research. Introductory statistical texts serve to codify current concepts in mainstream scientific thinking. The following excerpt, representative of many, is taken from one such widely used text [1].

    We must also carefully distinguish between the TARGET POPULATION and the STUDY POPULATION. The target population is the whole group of [individuals] to which we are interested in applying our conclusions. The study population, on the other hand, is the group of [individuals] to which we can legitimately apply our conclusions. Unfortunately the target population is not always readily accessible and we can only study that part of it that is available. If, for example, we are conducting a telephone interview…we do not have access to those individuals without a telephone.
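The nesting of target population, study population, and sample in the excerpt's telephone example can be sketched in a small simulation. The population size and telephone coverage below are invented for illustration:

```python
import random

random.seed(1)
# Target population: everyone to whom we want the conclusions to apply.
target = [{"id": i, "has_phone": random.random() < 0.9} for i in range(100_000)]

# Study population: the accessible part of the target population
# (a telephone survey cannot reach individuals without a telephone).
study = [p for p in target if p["has_phone"]]

# Sample: the subset of the study population actually measured.
sample = random.sample(study, 500)

print(len(target), len(study), len(sample))  # sample within study within target
```

Conclusions drawn from the 500 measured individuals apply directly only to the study population; whether they extend to the full target population depends on how the people without telephones differ from those with them.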

AJR:177, November 2001 993


Kazerooni

Further on in the same text, the authors identify "sample" [1]:

    There are many ways to collect information about the study population. One way is to conduct a complete CENSUS, by collecting data for every [individual] in it.… A more practical approach is to study some fraction, or SAMPLE, of the population.

Before selecting a sample, the investigator first must determine whether a need really exists for the information that will come from the investigation. The question being asked is intimately related to the selection of a sample that can provide the answer, and to the size of the sample needed to answer the question. The sample composition impacts the generalizability of the results to the study population; the composition of the study population impacts further generalization to the target population. The biases that might be introduced in the selection of the sample impact the confidence in the conclusions that can be drawn from a research study. In discussing the sample necessary to answer different questions, examples have been taken from this author's subspecialty of thoracic radiology, particularly the use of CT pulmonary angiography for the diagnosis of acute pulmonary embolism and lung cancer.

Definition of Sample

The sample is described thoroughly in terms of clinical and demographic characteristics in the methods section of a research article so that others can draw conclusions, apply the results, and compare one investigation with another. It is not the target population, but rather a group of patients or individuals who are actually studied. The target population consists of all the individuals in the world, or in the United States, with the same characteristics as the sample to which we would like to apply the conclusions of a study. Because it is unrealistic to perform research on all individuals on earth or in the United States or in one state, we settle on a subset, or a sample, with defined inclusion and exclusion criteria. However, the results drawn from the investigation of the sample are interpreted and applied directly only to the study population. For example, to evaluate the accuracy of CT and MR imaging for lung cancer staging, it is not possible to perform CT and MR imaging on all patients diagnosed with lung cancer in the United States. The Radiologic Diagnostic Oncology Group [2] reported the accuracy of CT and MR imaging in 170 patients with "known or suspected" non–small cell lung cancer who were "considered to be surgical candidates on the basis of general health and pulmonary function." The sample was the 170 patients, and the target population was all patients with known or suspected lung cancer who were surgical candidates in the United States. A third group must be defined, however: the study population. This population includes the sample and all other patients with the same characteristics as the sample who did not participate in the study, but are in the same geographic location during the same time period of the study. For example, in the Radiologic Diagnostic Oncology Group study of 170 patients, 250 patients in total met eligibility criteria. The study population includes those 80 patients who were excluded for various reasons. Some patients might have declined to be studied, others might have dropped out after enrollment. How they differ from those who agreed to participate might introduce bias, which is discussed later.

If a group of patients in clinical practice meets the same inclusion and exclusion criteria as the sample, then we apply the conclusions drawn from the sample to these patients from the study population with confidence. The more a patient differs from the sample, the more likely it is that the results from the sample do not apply to this patient.

Can a Disease Be Detected on an Imaging Test, and What Does It Look Like?

If the intended purpose of proposed research is to introduce a new concept to the literature, then a sample of one or a few might be sufficient. This approach might be useful when a new technology is applied to a disease or clinical circumstance, or when the imaging findings of a specific disease are being described. This type of research is called descriptive research, and it is used in most of the published radiology articles [3–6]. Descriptive research is the lowest on the hierarchy of studies at providing information that can be used to evaluate the efficacy of a diagnostic test in actual clinical practice [7], but for rarely occurring diseases it might be difficult to do anything more. However, these studies are a necessary first step along the way to evaluating efficacy. They are the easiest to perform, use the least amount of resources, and in the circumstance of a single case report, are usually the hardest to publish. Without knowing what a disease looks like, the next step—determining whether a test can distinguish between disease or no disease, can discriminate between diseases, and, if so, how accurately and reproducibly—cannot be done.

Fig. 1.—Graphic shows relationships among target population, study population, and sample. Conventional science distinguishes three groups of individuals. Target population is population of ultimate clinical interest. But, because of practicalities, entire target population often cannot be studied. Study population is subset of target population that can be studied. Samples are subsets of study populations used in clinical research because often not every member of study population can be measured.

For example, in the early to mid 1980s, several groups of researchers reported on CT and pulmonary embolism [8–13]. Those articles




were case reports and small case series that for the first time documented that pulmonary embolism could be seen on IV contrast-enhanced CT. Although this simple concept might appear obvious to someone looking at the CT technology of today, it was not apparent before that time. The purpose of these reports by several investigators was to confirm the observation and to generate a database of knowledge that could lead to the generation of more complex scientific hypotheses. The early observations did not show the technical limitations of the technique or reveal the parameters necessary to optimize the technique. They did not show the accuracy of CT compared with a known reference standard such as conventional pulmonary angiography, and they did not show the accuracy of CT compared with other diagnostic tests, such as ventilation–perfusion scintigraphy alone or in combination with lower extremity sonography. They did not show whether observers of varying expertise could agree on the diagnosis reproducibly or evaluate patient preference for one diagnostic test or another. These observations were simply the first step in a series of steps that need to occur before it can be determined if and what the role of a new technology is in medical practice.

Selection Bias and How to Select an Unbiased Population

When looking for a population of patients with a specific disease for which the findings of that disease are to be described, or to compare the accuracy of one test against another, it might seem straightforward to generate a list of all patients with the disease who have undergone the test or tests of interest over a specified period of time. However, who is chosen impacts to whom the results can be generalized. Many times in descriptive series a statement is made in the methods section that all patients with a specific disease imaged with a specific test formed the sample. Or, when comparing one test against another, such as CT versus MR imaging, all patients who underwent CT were compared against all patients who underwent MR imaging. What does this really mean? It is important that the population studied is thoroughly described, so that readers can compare the results of one study against another, particularly when results appear to be in conflict. Several biases can be introduced; the major issues of concern are sampling bias, the exclusion of patients, the use of a retrospective sample versus a prospectively collected sample, consecutive versus nonconsecutive patient enrollment, and selection based on the availability of imaging rather than the clinical presentation or clinical question.

Sampling Bias

The best sample is one that has the same characteristics as the study population to which the investigator wishes the results to be applied. The choice of a control group might introduce bias. A control group made up of normal volunteers recruited from a newspaper advertisement or a notice on a bulletin board is likely to be healthier than disease-free patients being seen in a medical clinic, which will make a diagnostic test appear more specific [14]. For example, if the intent is to investigate the diagnostic accuracy of a test, such as positron emission tomography, to distinguish between lung cancer and no lung cancer, the appropriate group to study is all patients with suspected lung cancer, not patients with lung cancer and healthy volunteers. In actual clinical practice, the diagnostic test would not be applied to normal healthy volunteers but instead to patients with, for example, a solitary nodule detected on a chest radiograph, some of whom will have lung cancer and some of whom will not.

No matter what population is studied, it is important to thoroughly describe them. It is equally important to describe the sample. Although age and sex are usually specified, other factors, such as racial mix, inner city versus rural setting, or type of medical center in which the investigation was performed, often are not. Diseases might look different in populations of different ethnic backgrounds, and therefore diagnostic tests might perform differently. Patients referred to a tertiary academic medical center might have more severe disease than patients treated for the same disease in a community hospital. This factor might make a diagnostic test appear to be more sensitive than it is in actual community practice, because more severe disease is generally easier to detect [14]. It is also important to report comorbidities. For example, the accuracy of CT pulmonary angiography for pulmonary embolism might be different in outpatients, who in general are less sick and more likely to be able to hold their breath for a CT examination, than in hospitalized patients, particularly intensive care unit patients, who are more likely to have lung disease. In this example, reporting the frequency of pleural effusions, lung abnormalities, pulmonary function test results, and the percentage of patients who are ventilator-dependent might be crucial to understanding the population studied and how the results could be applied in clinical practice.

Exclusions and Omission of Uninterpretable Results

As important as it is to describe who was studied, it is also important to describe patients who were excluded from the study or who declined to participate, because they might be different from the patients actually studied [15]. Some exclusions are random: for example, an optical disk on which a CT scan of a patient was stored is corrupted and the hard-copy images for that case are lost, or a patient died an unrelated death as a result of an airplane crash. Other exclusions are not random, and might introduce bias. For example, if patients with early stage lung cancer manifesting predominantly as a solitary pulmonary nodule declined to participate in a CT study designed to evaluate lung cancer staging, the sensitivity of CT staging might be artificially high and the population studied might be biased to patients with relatively obvious metastatic disease. On the other hand, if patients with advanced metastatic lung cancer declined to participate in the study because they felt too sick, then the sensitivity of CT staging might be artificially low because the patients with the most obvious disease were not included. For these reasons, it is important to describe the patients studied as well as the patients who were not studied, and to compare them to determine whether inherent differences exist.

Consider the Radiologic Diagnostic Oncology Group lung cancer staging study [2] in which 80 of the 250 eligible patients were excluded from the analysis. The report states that 43 of these patients did not undergo a surgical staging procedure, and "20 of these were considered to have extensive disease on the basis of imaging studies (six of these had T3 or T4 lesions)." Therefore, six (7.5%) of 80 patients excluded had T3 or T4 lesions, compared with 48 of the 170 studied, or 28% [2]. In general, the higher the T level, the more likely that metastatic lymph nodes are present and that these lymph nodes are larger in size and greater in number than for lower level T lesions, and therefore easier to identify. If the sample is skewed toward patients with more severe disease, then the sensitivity might be overestimated. On the other hand, for the other 14 of 20 excluded for extensive disease, it is not stated in the published report what the extensive disease was. It is logical to think it might have been metastatic disease or M1 disease because patients with all levels of nodal or N disease were reported. If this is correct, then 14 (17.5%) of 80 excluded patients had metastatic disease. Because it is more likely




that patients with metastatic disease have larger lymph nodes of greater size than patients without metastatic disease, selecting out more obvious cases of lymph node metastases might artificially reduce the reported sensitivity for lymph node staging compared with a group of all patients with known or suspected lung cancer selected to undergo imaging. So within the same study there are reasons to think that the reported sensitivity of CT and MR imaging for staging the lymph nodes is exaggerated and underestimated. The more thoroughly the sample and the excluded patients are defined, the easier it is to know whether they are similar or dissimilar and how that might impact these reported measures of test performance.

Omitting the results of studies that are technically inadequate and therefore uninterpretable, or including in a study only patients who can cooperate sufficiently to produce a technically optimal diagnostic test can lead to an overestimate of the test's sensitivity. For example, one cause of suboptimal-quality CT pulmonary angiography for acute pulmonary embolism is respiratory motion, because many patients with suspected pulmonary embolism are short of breath. If the sample is selected using clinical and demographic characteristics, and then the examinations of suboptimal quality are excluded from the final analysis, the reported sensitivity will be higher than if these patients were included in the analysis as cases in which no pulmonary embolism was detected on these studies (i.e., as negatives).

Using another CT pulmonary angiography example, Remy-Jardin et al. [16] compared the findings in 20 patients who underwent pulmonary angiography studies using 3-mm collimation, pitch of 1.7, and 1.0 sec per rotation with findings in 20 patients who underwent CT pulmonary angiography studies using 2-mm collimation, pitch of 2, and 0.75 sec per rotation. Remy-Jardin et al. stated the purpose of their study was to "analyze the influence of collimation on identification of segmental and subsegmental pulmonary arteries." The frequency of arteries that were sufficiently well seen to be analyzable for emboli was reported for both groups, with statistically significantly more segmental and subsegmental arteries seen with the thinner collimation protocol. When the sample is scrutinized, the scans included in the study had to be "technically acceptable," with strict inspiratory apnea and good or excellent arterial contrast opacification. Patients with prior lung surgery, lung distortion, or parenchymal infiltration on CT were excluded. Thirty-five patients were evaluated for suspected pulmonary embolism, all of whom had negative findings for pulmonary embolism on CT pulmonary angiography; the other five patients (12.5%) were not scanned because of suspected pulmonary embolism. In other words, the CT scans were much more ideal than they would be in a consecutive group of patients being scanned for pulmonary embolism, who are commonly short of breath and might have lung parenchymal or pleural abnormalities, or alterations in cardiac function that might reduce the technical adequacy of the study. Although this study of collimation showed that with thinner collimation more small vessels were well seen, it is unclear whether this finding would translate to a more realistic clinical population.

Retrospective Versus Prospective Selection

When patients are selected retrospectively, it is important to know why they were selected for imaging. Rather than representing all patients with a suspected disease or all patients in a specific clinical circumstance who presented for evaluation, it is more likely that patients might have been sent for imaging for clinical reasons that make them different than if the diagnostic test had been applied to all patients with the same disease or symptoms. Biases will be introduced by such patient selection that might overestimate the value of the diagnostic test being studied or the frequency with which specific abnormal findings are reported. When looking at pulmonary embolism, the sensitivity of CT pulmonary angiography for small emboli has been questioned, leading investigators to look at the frequency with which isolated subsegmental or smaller pulmonary embolisms occur. Reported percentages have ranged broadly from 4% to 36% [17–20]. In one study, consecutive patients undergoing conventional angiography were studied, and 30% were found to have emboli in only subsegmental or smaller pulmonary arteries [20]. As the methods stated, these were consecutive patients undergoing pulmonary angiography, not consecutive patients with suspected pulmonary embolism. In fact, Oser et al. [20] stated in the discussion of their publication that

    … the vast majority of our patients had intermediate-probability lung scans; thus, the patients with a larger embolic burden, namely, those with high-probability scans, were potentially excluded. This selection bias is difficult to avoid in a retrospective series, as it reflects the hospital referral pattern.

With regard to CT and pulmonary embolism, in order to know the sensitivity of CT pulmonary angiography for pulmonary embolism in the general population of patients presenting with suspected pulmonary embolism, a prospective investigation of all patients with suspected pulmonary embolism is necessary, using a reference standard such as conventional pulmonary angiography. The goal should be to prospectively recruit all patients with suspected pulmonary embolism and have all patients undergo the test under evaluation (CT pulmonary angiography) and the reference test (conventional angiography). Consider the impact of retrospective selection of the sample on diagnostic accuracy in the following scenarios. If all patients undergoing both CT pulmonary angiography and conventional pulmonary angiography over the previous 2-year period formed the sample, the reasons that patients underwent both tests, and not just CT pulmonary angiography, impact sensitivity. If a large proportion of the conventional angiograms were obtained because of inconclusive findings or a technically poor CT pulmonary angiogram, then the sensitivity of CT pulmonary angiography will appear artificially low compared with sensitivity in the general population. If a normal CT pulmonary angiography is the predominant reason for obtaining conventional angiograms, the sensitivity of CT pulmonary angiography will again be low. In this case, the frequency of subsegmental emboli found at angiography will also be higher than would be found in the general population of patients with pulmonary embolism because patients with larger and more obvious emboli will not have undergone conventional angiography.

Which physicians accept and begin to use a new imaging test might also bias the results. For example, if physicians in the emergency department began using CT pulmonary angiography before most of the physicians taking care of inpatients, then the sensitivity of CT pulmonary angiography might be high, but would be biased by the type of patients that are seen in the emergency department, who in general might be healthier, younger, able to hold their breath better, or have less lung disease than hospitalized patients. On the other hand, if critical care medicine physicians accept CT pulmonary angiography earlier for intensive care unit patients, the sensitivity of CT pulmonary angiography might appear low because of the extensive parenchymal consolidation and pleural effusions that are often present in this population of patients who are often ventilator-dependent. In this way, the spectrum of disease or the case mix in the sample impacts the measured accuracy of the diagnostic test in question. This point reinforces the need to thoroughly describe the patient population studied.

Retrospective studies also suffer from recall bias. Suppose an investigator wants to determine the severity of dyspnea in patients with suspected pulmonary embolism, hypothesizing that patients with more severe dyspnea have a higher frequency of pulmonary embolism than patients with lesser degrees of dyspnea or no dyspnea at all. The investigator might be approaching this as a way to evaluate the likelihood of a patient's having pulmonary embolism and thus to triage patients to a diagnostic test within 1 hr versus within 4–6 hr, given the available imaging facilities. If an investigator questions all patients evaluated over the past year for suspected pulmonary embolism about their dyspnea, it is likely that the patients who were […]

[…]tention hours after the onset of the acute event, whereas patients coming during the day might have had symptoms of shorter duration. Because the time from onset of symptoms is critical to outcome after a myocardial infarction, patients presenting during the day might have a better outcome than patients presenting at night, independent of any therapeutic intervention.

Reference Standard

The choice of a reference standard impacts measurements of test accuracy. In contrast to the ideal scenario for evaluating the accuracy of CT pulmonary angiography described in the previous section, a methods section might read: "All patients with pulmonary embolism confirmed at autopsy who had undergone CT pulmonary angiography formed the sample." In […]

[…] To compare the accuracy of CT pulmonary angiography and conventional angiography, these investigators instilled colored methacrylate beads into the pulmonary artery circulation of pigs, with a methacrylate cast of the pulmonary arteries used as the reference standard. These researchers found no statistically significant difference between CT pulmonary angiography and conventional angiography for the detection of emboli. However, if conventional angiography were used as the reference standard to which 1-mm CT pulmonary angiography was compared, conventional angiography would, by definition as the reference test, be 100% sensitive with a 100% positive predictive value, whereas CT pulmonary angiography would be considered only 76% sensitive with a positive predictive value of only 86%. If the sensitivity of a test is in question, surrogate measurements might be used to
diagnosed with pulmonary embolism and hos- this case, the sensitivity of CT pulmonary an- support the value of a negative test, such as pa-
pitalized for treatment will remember their dys- giography might be higher than in the general tient outcome. For CT pulmonary angiography,
pnea more vividly and rate it as more severe population because patients dying from pulmo- most investigators have looked at series of pa-
than patients not diagnosed with pulmonary nary embolism might have larger emboli than tients gathered retrospectively with negative
embolism who were sent home. This would ex- patients not dying from pulmonary embolism. findings for pulmonary embolism on CT pulmo-
aggerate the difference in reported dyspnea in Another problem is commonly referred to nary angiography, and looked at the incidence of
the two groups, compared with what would be as “workup bias” [21]. Whenever the reference pulmonary embolism over the next 3–12
seen if all of the patients were asked about dysp- test is selectively applied only to patients with months. These studies have shown that pulmo-
nea before undergoing any diagnostic test for a positive result on the test in question—for nary embolism occurs with the same frequency
pulmonary embolism and would thereby in- example, only patients with a positive CT pul- after negative findings on CT pulmonary an-
crease the likelihood that the investigator’s hy- monary angiography—the reported sensitivity giography as after negative findings on conven-
pothesis would be proven correct on analysis. of CT pulmonary angiography will be artifi- tional angiography [29, 30].
cially high at 100%, whereas the specificity
Consecutive Versus Nonconsecutive will be artificially low. Imaging-Based Selection
Selection When a new technology is compared with ac- It is often convenient to select patients who
If patients are selected in a nonconsecutive cepted reference tests or gold standards, the ac- have undergone an imaging test, or patients
manner, they might be inherently different curacy of the reference test is often called into who are going to be sent for imaging, to form a
from a population of all patients who meet in- question [22–27]. In the example of CT pulmo- sample. This is referred to as imaging-based
clusion criteria for a study. Suppose that the nary angiography, the validity of conventional selection. However, patients who undergo im-
strategy were to recruit only the first patient pulmonary angiography has been questioned. aging might not be representative of all pa-
seen each day who met the inclusion criteria for Several studies have reported poor interobserver tients with a specific diagnosis or symptom.
the study. It is possible that patients who are agreement as to the presence or absence of em- Consider describing the appearance of lung
able to come for a 7:00 A.M. clinic appointment boli in subsegmental pulmonary arteries on cancer on MR imaging. Investigators could
are different from patients who come later in conventional angiography. The Prospective In- generate a list of all patients at their facility
the day. Perhaps they are less sick, resulting in vestigation of Pulmonary Embolism Diagnosis who underwent thoracic MR imaging in the
a bias toward milder disease. Suppose that the investigators (PIOPED) [28] found only 66% past or will be undergoing MR imaging over
strategy were to recruit only those patients agreement among observers for isolated subseg- the next year, who have a diagnosis of lung
meeting the inclusion criteria who are seen mental emboli, compared with 98% at the lobar cancer. A fairly high proportion of these pa-
Monday to Friday between 8:00 A.M. and 5:00 level and 90% at the segmental artery level. Sim- tients will likely have masses that abut or in-
P.M. If the study were looking at lung cancer ilarly, Diffin et al. [17] reported interobserver vade the mediastinum. This does not mean that
staging accuracy, there might be little, if any, agreement of only 45% for isolated subsegmen- this proportion of all patients presenting with
bias. However, in other circumstances, the pa- tal emboli at conventional angiography. If ob- lung cancer have mediastinal invasion, be-
tients might be inherently different from pa- servers cannot agree on the gold standard, how cause the patients undergoing MR imaging for
tients presenting to the emergency department can the new test, CT pulmonary angiography, be lung cancer are usually preselected because of
in the evening with the same symptom com- compared with it? This problem might lead in- a suspicion of mediastinal invasion on CT, and
plex. For example, if the study involved sus- vestigators to look for a new sample population therefore the high incidence should not be sur-
pected myocardial infarction, the patients and apply a new gold standard. To do so might prising. To know what the appearance of lung
coming to the emergency department in the require an animal study with autopsy confirma- cancer is on MR imaging or to determine the
evening after a day of work might have had tion as the reference standard. For CT pulmo- accuracy with which MR imaging can detect
chest pain all day long and sought medical at- nary angiography, Baile et al. [27] did just that. lung cancer requires that all consecutive pa-

AJR:177, November 2001 997


Kazerooni

tients with a diagnosis of lung cancer over a easiest patients for CT to evaluate, were not Conclusion
specified period of time undergo MR imaging. studied with CT. Therefore, it is likely that This article has reviewed the current con-
Although this example might seen fairly obvi- 36% is an overestimate of the frequency cepts of target population, study population,
ous, the literature is full of examples in which with which isolated subsegmental pulmo- and sample. These terms need to be used ap-
this type of selection bias impacts study re- nary embolism occurs. The latter study was propriately in the design, execution, and report-
sults, although the impact on the results might the PIOPED study [18, 28], in which pa- ing of clinical research in radiology. The article
be less obvious than in the example and not tients with suspected pulmonary embolism also has discussed considerations for the defini-
initially apparent. were prospectively enrolled at multiple medi- tion and selection of these entities. Other con-
cal centers, and all patients underwent venti- siderations, such as randomization, statistical
Generalizability lation–perfusion scannning and conventional power, and sample size, that are relevant specif-
Who was studied impacts to whom the re- pulmonary angiography. ically to the selection of sample, will be the
sults can be applied. If all patients presenting The results described by Goodman et al. [19] subject of future articles in this series.
with suspected pulmonary embolism undergo a can be generalized only to patients with an un-
diagnostic test, the results will be different than resolved clinical suspicion for pulmonary em-
if only patients with acute right heart failure bolism after ventilation–perfusion scanning
References
and suspected massive pulmonary embolism who underwent CT, as the title of that investiga-
1. Elston R, Johnson W. Populations, samples and
are studied, or if patients who have an incon- tion states clearly. The results can also be gener-
study design. In: Essentials of biostatistics. 2nd
clusive result from another diagnostic test, such alized only to patients undergoing CT with the ed. Philadelphia: Davis, 1994:15–16
as a ventilation–perfusion scan, are studied. technique that was reported (5-mm collimation, 2. Webb WR, Gatsonis C, Zerhouni EA, et al. CT
Similarly, how the test performs on inpatients pitch of 1:1, covering 12 cm of the thorax, and and MR imaging in staging non-small cell bron-
or intensive care unit patients might be different viewed on hard-copy film). Imaging technology chogenic carcinoma: report of the Radiologic Di-
from how it performs in outpatients or patients rapidly evolves. Several researchers after Good- agnostic Oncology Group. Radiology 1991;178:
presenting to an emergency department, who man et al. have reported on CT pulmonary an- 705–713
3. Blackmore CC, Black WC, Jarvik JG, Langlotz
are less likely to have coexisting lung disease giography at 3-mm collimation [32–34]. The CP. A critical synopsis of the diagnostic and
or abnormal chest radiographic findings. In ability to perform multidetector CT pulmonary screening radiology outcomes literature. Acad
selecting a population to study for an investiga- angiography using 1.25-mm collimation of the Radiol 1999;6[suppl 1]:S8–S18
tion, it is important to consider to whom the entire thorax is now possible, and interpretation 4. Hillman BJ. Research in radiology departments.
information derived from that investigation can on workstations has been shown to improve de- Invest Radiol 1993;28[suppl 2]:S44–S48
be applied. tection of pulmonary embolism compared with 5. Applegate KE. Study design: pros and cons. In:
2000 annual meeting scientific session. Oak
For example, recently the prevalence of iso- film-based interpretation [35]. However, the
Brook, IL: Society of Health Services Research
lated subsegmental pulmonary embolism has published literature lags behind what the tech- in Radiology, 2000
been debated as part of the question of how ac- nology of today is capable of. As investigators 6. Holman BL. The research that radiologists do:
curate CT pulmonary angiography needs to be plan to study a new technology, they should perspective based on a survey of the literature.
for the detection of subsegmental pulmonary consider ways to recruit a larger number of pa- Radiology 1990;176:329–332
embolism. If isolated subsegmental pulmonary tients more quickly to answer the question they 7. Green SB, Byar DP. Using observational data
from registries to compare treatments: the fallacy
embolism rarely occurs, then the technology propose before the technology is outdated [36].
of omnimetrics. Stat Med 1984;3:361–373
might not need to be accurate for vessels of this Several studies have reported the findings of 8. Godwin JD, Webb WR, Gamsu G, Ovenfors CO.
size. However, if isolated subsegmental emboli pulmonary embolism detected incidentally on Computed tomography of pulmonary embolism.
are commonly seen, then the technology might CT scans obtained for other reasons [13, 35, AJR 1980;135:691–695
need to be accurate. In one study, isolated sub- 37–39]. It would be incorrect to draw a conclu- 9. Sinner WN. Computed tomography of pulmonary
segmental pulmonary embolism was reported sion that the anatomic distribution of pulmo- thromboembolism. Eur J Radiol 1982;2:8–13
10. Ovenfors CO, Godwin JD, Brito AC. Diagnosis of
to occur in 36% of patients diagnosed with pul- nary emboli in these patients is the same as in
peripheral pulmonary emboli by computed tomogra-
monary embolism [19]. In another study, iso- a population of patients presenting with clini- phy in the living dog. Radiology 1981;141:519–523
lated subsegmental pulmonary embolism was cal signs or symptoms of pulmonary em- 11. Cholankeril JV, Ketyer S, Ramamurti S, Millman
reported to occur in only 6% of patients diag- bolism. In one series of nine patients, no AE. Pulmonary embolism demonstrated by com-
nosed with pulmonary embolism [18, 31]. incidentally detected emboli were seen beyond puterized tomography. J Comput Assist Tomogr
Which more realistically represents a popula- the segmental arteries [39]. This result does 1982;6:135–139
tion of all patients with suspected pulmonary not mean that subsegmental pulmonary embo- 12. Breatnach E, Stanley RJ. CT diagnosis of seg-
mental pulmonary artery embolus. J Comput As-
embolism? The former study was performed to lism does not occur as an incidental finding. sist Tomogr 1984;8:762–764
prospectively compare helical CT with pulmo- The CT scans in this study might have been 13. Allen BT, Day DL, Dehner LP. CT demonstration
nary angiography for the detection of pulmo- done with protocols used for general thoracic of asymptomatic pulmonary emboli after bone
nary embolism in patients with an unresolved CT, rather than using a thin-section, rapid marrow transplantation: case report. Pediatr Ra-
clinical and ventilation–perfusion scan diagno- IV–contrast injection protocol CT, or the re- diol 1987;17:65–67
sis of pulmonary embolism. Patients with searchers may not have used a workstation 14. Browner WS, Newman TB, Cummings SR. Design-
ing a new study. III. Diagnostic tests. In: Hulley SB,
either a normal perfusion scan or a high-proba- for interpretation—both factors that improve
Cummings SR, eds. Designing clinical research.
bility scan, the two groups for whom no pul- the accuracy of CT pulmonary angiography Baltimore: Williams & Wilkins, 1988:87–97
monary embolism and definite pulmonary for pulmonary embolism, particularly for 15. Hulley SB, Gove S, Browner WS, Cummings SR.
embolism were diagnosed, and perhaps the small arteries. Choosing the study subjects: specification and

998 AJR:177, November 2001


Fundamentals of Clinical Research for Radiologists

sampling. In: Hulley SB, Cummings SR, eds. Designing clinical research. Baltimore: Williams & Wilkins, 1988:10–30
16. Remy-Jardin M, Remy J, Artaud D, Deschildre F, Duhamel A. Peripheral pulmonary arteries: optimization of the spiral CT acquisition protocol. Radiology 1997;204:157–163
17. Diffin DC, Leyendecker JR, Johnson SP, Zucker RJ, Grebe PJ. Effect of anatomic distribution of pulmonary emboli on interobserver agreement in the interpretation of pulmonary angiography. AJR 1998;171:1085–1089
18. Stein PD, Henry JW. Prevalence of acute pulmonary embolism in central and subsegmental pulmonary arteries and relation to probability interpretation of ventilation/perfusion lung scans. Chest 1997;111:1246–1248
19. Goodman LR, Curtin JJ, Mewissen MW, et al. Detection of pulmonary embolism in patients with unresolved clinical and scintigraphic diagnosis: helical CT versus angiography. AJR 1995;164:1369–1374
20. Oser RF, Zuckerman DA, Gutierrez FR, Brink JA. Anatomic distribution of pulmonary emboli at pulmonary angiography: implications for cross-sectional imaging. Radiology 1996;199:31–35
21. Begg CB, McNeil BJ. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988;167:565–569
22. Chugh SK. Stress echo training: need for a better gold standard—the invasive viewpoint. Eur Heart J 2000;21:859–860
23. Shah A, Wagner GS, Granger CB, et al. Prognostic implications of TIMI flow grade in the infarct related artery compared with continuous 12-lead ST-segment resolution analysis: reexamining the “gold standard” for myocardial reperfusion assessment. J Am Coll Cardiol 2000;35:666–672
24. Koretz RL. Prospective randomized controlled trials: when the gold in the gold standard isn’t pure. (commentary) J Parenter Enteral Nutr 2000;24:5–6
25. Rolfe MW, Solomon DA. Lower extremity venography: still the gold standard. (editorial) Chest 1999;116:853–854
26. Kalodiki E, Nicolaides AN, Al-Kutoubi A, Cunningham DA, Mandalia S. How “gold” is the standard? interobservers’ variation on venograms. Int Angiol 1998;17:83–88
27. Baile EM, King GG, Muller NL, et al. Spiral computed tomography is comparable to angiography for the diagnosis of pulmonary embolism. Am J Respir Crit Care Med 2000;161:1010–1015
28. The PIOPED Investigators. Value of the ventilation/perfusion scan in acute pulmonary embolism: results of the prospective investigation of pulmonary embolism diagnosis (PIOPED). JAMA 1990;263:2753–2759
29. Goodman LR, Lipchik RJ, Kuzo RS, Liu Y, McAuliffe TL, O’Brien DJ. Subsequent pulmonary embolism: risk after a negative helical CT pulmonary angiogram—prospective comparison with scintigraphy. Radiology 2000;215:535–542
30. Garg K, Sieler H, Welsh CH, Johnston RJ, Russ PD. Clinical validity of helical CT being interpreted as negative for pulmonary embolism: implications for patient treatment. AJR 1999;172:1627–1631
31. Worsley DF, Alavi A. Comprehensive analysis of the results of the PIOPED study: prospective investigation of pulmonary embolism diagnosis study. J Nucl Med 1995;36:2380–2387
32. Garg K, Welsh CH, Feyerabend AJ, et al. Pulmonary embolism: diagnosis with spiral CT and ventilation-perfusion scanning—correlation with pulmonary angiographic results or clinical outcome. Radiology 1998;208:201–208
33. Mayo JR, Remy-Jardin M, Muller NL, et al. Pulmonary embolism: prospective comparison of spiral CT with ventilation-perfusion scintigraphy. Radiology 1997;205:447–452
34. Remy-Jardin M, Remy J, Deschildre F, et al. Diagnosis of pulmonary embolism with spiral CT: comparison with pulmonary angiography and scintigraphy. Radiology 1996;200:699–706
35. Gosselin MV, Rubin GD, Leung AN, Huang J, Rizk NW. Unsuspected pulmonary embolism: prospective detection on routine helical CT scans. Radiology 1998;208:209–215
36. Baum RA, Rutter CM, Sunshine JH, et al. Multicenter trial to evaluate vascular magnetic resonance angiography of the lower extremity: American College of Radiology Rapid Technology Assessment Group. JAMA 1995;274:875–880
37. Verschakelen JA, Vanwijck E, Bogaert J, Baert AL. Detection of unsuspected central pulmonary embolism with conventional contrast-enhanced CT. Radiology 1993;188:847–850
38. Winston CB, Wechsler RJ, Salazar AM, Kurtz AB, Spirn PW. Incidental pulmonary emboli detected at helical CT: effect on patient care. Radiology 1996;201:23–27
39. Romano WM, Cascade PN, Korobkin MT, Quint LE, Francis IR. Implications of unsuspected pulmonary embolism detected by computed tomography. Can Assoc Radiol J 1995;46:363–367

The reader’s attention is directed to the earlier articles in the Clinical Research Series: Introduction,
which appeared in the February 2001 issue; Framework, April 2001; Protocol, June 2001; and Data Collec-
tion, October 2001.
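The reference-standard effect described above for CT pulmonary angiography (a test judged comparable to conventional angiography against an independent reference, yet appearing only 76% sensitive with an 86% positive predictive value once conventional angiography itself is treated as ground truth) is easy to reproduce numerically. The 2 x 2 counts below are hypothetical, chosen only so that the quoted percentages fall out; they are not data from Baile et al. [27] or any other cited study.

```python
# Sketch of how the choice of reference standard drives apparent accuracy.
# The counts are hypothetical, picked only to reproduce the percentages
# quoted in the text (76% sensitivity, 86% positive predictive value).

def sensitivity(tp, fn):
    """Fraction of reference-positive cases that the test calls positive."""
    return tp / (tp + fn)

def positive_predictive_value(tp, fp):
    """Fraction of test-positive cases that are reference-positive."""
    return tp / (tp + fp)

# CT pulmonary angiography scored against conventional angiography as the
# reference standard (hypothetical counts):
tp, fp, fn = 38, 6, 12

print(sensitivity(tp, fn))                          # 0.76
print(round(positive_predictive_value(tp, fp), 2))  # 0.86

# Scored against itself, the reference test is perfect by definition:
print(sensitivity(tp=50, fn=0))                     # 1.0
```

The same arithmetic shows why workup bias inflates sensitivity: if the reference test is applied only to test-positive patients, fn is never observed and the computed sensitivity is forced to 100%.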



Fundamentals of Clinical Research
for Radiologists
Craig A. Beam 1
Statistically Engineering the Study
for Success
scientific study is a dynamic en- Minimizing Bias

A deavor the outcome of which can


never be wholly determined in
advance. However, over years of experience,
Statistical Meaning of the Word “Bias”
As with many other words, the word “bias”
is interpreted differently by different individu-
the art and science of engineering a scientific
als. However, statistical science has a definite
study have evolved so that the savvy investi-
and precise meaning for this word, and be-
gator can dictate the limits of risk and the
cause statistical science provides the founda-
likelihood of outcomes from this dynamic
tion of modern experimental design, it is this
process of discovery. This particular form of
interpretation that must be addressed by suc-
art and science is commonly referred to as
cessful scientific studies in clinical radiology.
“experimental design.”
Statistically, bias is a property of averages.
When reading the scientific literature or
A statistical measure is said to be biased if,
designing studies, every clinical radiologist
on average, it does not equal what it is in-
should be aware of and concerned about
tended to estimate. To say that a study is bi-
three main considerations of modern experi-
ased is to say that it was conducted in such a
mental design that apply to research in clini-
fashion that, on average, the measurements
cal radiology (Fig. 1). The first consideration
from the study are biased.
is the extent to which the findings of the
study might mislead (“bias”). Another con-
Received November 15, 2001; accepted after revision sideration is the ability of the study to reveal What Is the Weight of a 1-Oz Marble?
January 22, 2002.
something important (“power”). The final Suppose that a group of researchers had a
Series editors: Craig A. Beam, C. Craig Blackmore, consideration is the desire to create useful in- reliable spring scale with which to measure
Stephen J. Karlik, and Caroline Reinhold.
formation (“precision”) from the research. the weight of marbles. Reliable means that
This is the sixth in the series designed by the American
College of Radiology (ACR), the Canadian Association of
The deceptively simple statistical concept of the researchers generally get the same value
Radiologists, and the American Journal of Roentgenology. the “average” will be shown to be central to each time they weigh the same marble. Now
The series, which will ultimately comprise 22 articles, is many of these considerations. suppose that the researchers have a marble
designed to progressively educate radiologists in the
methodologies of rigorous clinical research, from the most In this article, I will review these three key that they know weighs exactly 1 oz and thus
basic principles to a level of considerable sophistication. considerations, each of which needs to be ad- that marble becomes the gold standard. They
The articles are intended to complement interactive equately appreciated and addressed by inves- weigh this marble five times and get the fol-
software that permits the user to work with what he or she
has learned, which is available on the ACR Web site tigators seeking to design a successful lowing values: 1.1, 1.2, 1.2, 1.1, and 1.1 oz.
(www.acr.org). diagnostic radiology study. Because success- The values are always slightly more than the
Project coordinator: Bruce J. Hillman, Chair, ACR fully engineering the scientific study requires marble’s true weight of 1.0 oz. Sometimes
Commission on Research and Technology Assessment. drawing heavily on both clinical and statisti- the “error” is 0.1 oz, and other times it is 0.2
1
Department of Radiology, Medical College of Wisconsin, cal sciences, interdisciplinary collaboration oz. The average of these errors is 0.14, and
8701 Watertown Plank Rd., Milwaukee, WI 53226. Address should be encouraged and nurtured. In this so, on average, the scale errs by 0.14 oz.
correspondence to C. A. Beam.
way, research in clinical radiology will ma- Statistically, this measurement would be
AJR 2002;179:47–52
ture into a modern scientific discipline. Moti- described as biased: it tends to overestimate
0361–803X/02/1791–47 vating such collaborations is a goal of this true weight by 0.14 oz. Knowing this bias,
© American Roentgen Ray Society series of articles. the researchers could correct the scale by ad-

AJR:179, July 2002 47


Beam

vising users to always subtract 0.14 from the By randomly sampling, we follow a proce- dices of the investigator and of the patient do
reading. Then, although individual measure- dure that guarantees that every sample has the not influence treatment allocation.”
ments may be a little off, on average, the us- same chance of being selected for our study. If Randomization, in fact, has become the
ers will get the correct value. Thus, the we decided to do a study with a sample of 100 gold standard for the clinical trial. For exam-
corrected measuring device would be said to randomly selected subjects from our study ple, popular guidelines for evaluating the
be “unbiased” for the weight of marbles. population, we would have to follow a method quality of research are based on the assump-
The previous case is an example of mea- of sampling so that every possible sample of tion that the controlled randomized trial is
surement bias. Studies can be affected by 100 subjects would be equally likely to be se- the epitome of study design. Some scientists
other biases as well. Studies of diagnostic lected. How does random sampling ensure that advise using randomization simply because
technologies have their own special biases our results will not be biased? The answer to it is a good strategy for success in publica-
[1] with which the reader of the literature of this question requires the logic of statistical tion: “Without proper randomization, the in-
diagnostic radiology should be familiar. science. However, an intuitive answer is that vestigator is immediately on the defensive
These specific biases will be the subject of a measures that are simple averages will be un- and increases his vulnerability to the critical
subsequent article on the clinical assessment biased for the population average when the onslaught of his peers.” [3].
of diagnostic technologies in this series. For measures are based on random samples. What actually is randomization? First, let us
the present discussion, however, I will focus Are simple averages relevant to clinical specify what it is not. Colton [3] admonishes:
on two biases that affect every type of clini- radiology research? Thankfully, the answer
cal study. These biases come about by the in many cases is yes. Many published clini- It is worthwhile to point out that one
way subjects are selected for, and participate cal studies report means (which, of course, should not confuse randomization…
in, a study. are averages) of measurements. Measures of with haphazard assignment…. The pat-
diagnostic accuracy such as sensitivity and tern of assignment to treatment may
Selection Bias specificity, which are frequently reported, are appear to be haphazard, but this arises
The article in this series by Ella Kazerooni averages as well. Other commonly used from the haphazard nature with which
[2], “Population and Sample,” makes it very measures in clinical radiology are not simple digits appear in a table of random num-
clear that subjects selected for a study must averages but do enjoy the property of being bers, and not the haphazard whim of the
be representative of some clinically relevant unbiased when based on random samples. investigator in allocating patients.
population. One of several statistical motiva- Examples of these are the slope in linear re-
tions for this notion has to do with bias: We gression and the nonparametric receiver op- Randomization is an objective process
want the measures from our study to reflect erating characteristic curve area. that takes group assignment out of the hands
the value of the measures in the general pop- of humans and gives the responsibility to the
ulation. We do not want to be off the mark, Participation Bias random number generator. Once the human
so to speak. To accomplish this objective, we When conducting research that compares factor in group assignment is eliminated, we
must have a sample that in some way reflects groups of subjects, care should be taken to en- can make the important assertion that the
the population being studied. sure that the group assignments are free of process of allocation was unbiased. The sta-
Recalling that the statistical meaning of bias bias. In other words, the way in which subjects tistical significance of this step is that each
involves averages, we can restate our consider- participate should not bias the findings of the possible allocation had an equal chance of
ation as seeking to sample from the study pop- study. The mechanism by which this bias is occurring so that, on average, the findings
ulation in such a way that our measurements typically eliminated is randomization. In the from the study are not affected by the way
on average equal the value in the population. valuable reference book Statistics in Medicine, the subjects participated in the study.
Luckily, this goal can be accomplished by the Theodore Colton [3] writes, “Randomization It is widely held that randomization “aver-
well-known mechanism of random sampling. ensures that the personal judgment and preju- ages out” the effect of influencing factors

Contrast-to-Noise Ratios
(CNRs) for Six Subjects
TABLE 1 Assigned to Unenhanced or Contrast-to-Noise Ratio
Enhanced MR Imaging (CNR) Data Grouped by
Groups Using Randomization TABLE 2 Presence of Cirrhosis in
Enhanced and Unenhanced
CNR in CNR in
Imaging Groups
Bias Subjects Unenhanced Enhanced
Group Group Presence of
CNR in CNR in
Cirrhosis in
1 and 2 8 9 Unenhanced Enhanced
Subjects
3 and 4 13 Group Group
7 (n = 6)
Power Precision
5 and 6 15 20 Yes (n = 3) 8 9 and 7
Means 12 12 No (n = 3) 13 and 15 20
Note.—CNRs given in second and third columns apply to
Fig. 1.—Diagram illustrates the three elements of subjects retrospectively.
study design.

48 AJR:179, July 2002


Fundamentals of Clinical Research for Radiologists

that are unknown to the investigator. This tenet is true and provides another example of how the concept of the average is fundamental to our modern understanding of experimental design. However, the benefit of randomization is realized only if the averaging is performed across the many different ways of allocating subjects to treatment. In any one study, which can have only one such allocation, an imbalance of factors could influence the findings. Randomization does not guarantee an equitable allocation in any particular study; its benefits accrue as we consider the process of averaging across studies.

Consider the following study: Six subjects are selected for a clinical study of gadolinium enhancement of breath-hold T2-weighted MR imaging of hepatic lesions. Suppose that enhancement will be measured with a contrast-to-noise ratio (CNR) determined by dividing the difference between the lesion and liver signal intensities by the standard deviation of the background noise. Now suppose that the researchers wish to compare the CNR in the unenhanced section of the liver with the CNR in the enhanced section of the liver. However, the institutional review board requires the use of separate groups of subjects. Therefore, the subjects must be assigned to one of two "treatment" groups. How should the assignments be made?

If the investigators were to use randomization in this study, they would have to apply a mechanism that would give each possible allocation of three subjects the same chance to be in the enhanced MR imaging group. Note that randomization does not mean assigning individuals to treatments according to no discernible plan or pattern. For example, randomization would not occur if the first three patients who showed up at clinic were assigned to the gadolinium-enhanced imaging group and the next three to the unenhanced imaging group. That is not randomization because the researchers have not ensured that every allocation of three individuals to the enhanced imaging group was equally likely. The researchers cannot feign ignorance either. Perhaps those three individuals who were assigned to the enhanced imaging group always show up early in the morning, and so the others would never have a chance to be in the group that undergoes enhanced MR imaging.

In sum, to say that subjects were randomly assigned to treatments is to say that complete control had been exercised over the allocation mechanism in a quite definite way.

Randomization controls the bias of allocating individuals to treatments by the same averaging seen with random sampling. To say that randomization averages out the influence of unknown effects is to say that, on average, the values resulting from a study will equal the average of the values resulting from every possible experimental allocation of subjects to treatments.

Suppose that randomization was followed, and the data in Table 1 were observed. One would probably conclude from this study that the use of gadolinium does not improve the CNR because the mean CNRs of the two treatment groups are equal. Because randomization was used, researchers would trust that any effects that might have biased the findings have been averaged out. "Trust" is the operative word: Randomization does not guarantee that the group allocation actually realized in this particular instance was equal with respect to characteristics that might be important. Randomization is only a property of averages. Any one particular randomization can, by chance, lead to severe disparities between the two groups in some characteristic.

Actually, the principal investigator of this supposed study was wise enough to design into it the collection of extra information about the subjects. One extra (or concomitant) variable measured was whether the subject had cirrhosis of the liver (determined independently of the measurement of the CNR). Table 2 presents the raw data from this study categorized by the treatment received (i.e., enhanced or unenhanced MR imaging) and by the presence of cirrhosis in the six subjects.

Examination of this table shows that three of the subjects selected for the study had cirrhosis and that two of these subjects were assigned by the process of randomization to the treatment (gadolinium-enhanced MR imaging) group. Conversely, two of the subjects without cirrhosis were assigned to the "control" (unenhanced MR imaging) group. Obviously, the occurrence of cirrhosis was not equally represented in the two groups. Did randomization fail? No. The allocation used in this study was just one possible allocation of the six subjects to the two treatment groups. There are, in fact, 20 different ways to assign these six subjects to the two groups. The investigators used a method that picked one of these assignments at random—that is, in a way that each assignment was equally likely (one in 20) to be picked. Thus, they randomly assigned subjects to the groups. This time, randomization just happened by chance to come up with the assignment of two subjects with cirrhosis to the treatment group and two subjects without cirrhosis to the control group.

The investigators are concerned because they believe that the presence of cirrhosis is likely to have dampened the enhancement of the gadolinium. What can they do? They consult their statistician, who then generates Table 3. From this analysis, it becomes obvious that there is no benefit for subjects with cirrhosis but a big benefit for other subjects.

The need to be cautious with the results from even the most carefully planned randomized trial is appreciated by experienced researchers. Colton [3], for example, observes:

    Randomization achieves a balance in the long run. However, with a small series of patients, randomization may not always produce groups that are alike in every respect…. [A]s a general rule, a report of a clinical trial should include among its first tables one in which the treatment and control groups are compared on the several important characteristics relating to the disease under study.

In sum, the gadolinium-enhanced imaging example shows that successful study design requires collection of data that could plausibly influence the outcome of the study, good statistical methods by which to adjust the outcomes for these concomitant variables, and randomization of subjects to average out the possible influence of unrecognized factors.

TABLE 3  Mean Contrast-to-Noise Ratios (CNRs) of the Unenhanced and Enhanced Imaging Groups Controlling for Cirrhosis

Cirrhosis Present    Mean CNR in Unenhanced Group    Mean CNR in Enhanced Group    Difference in Means
Yes                              8                               8                     8 – 8 = 0
No                              14                              20                    20 – 14 = 6

Power in Comparisons

A successful study finds something. If a study does not find something, then the re-


Fig. 2.—Graph plots relationship between study design and power comparing two designs commonly used in diagnostic test evaluation: independent groups and paired groups. Paired groups study design is shown as requiring fewer subjects than independent groups design for any desired power in study. ◆ = independent groups study,  = minimal disagreement in paired groups study, ▲ = maximal disagreement in paired groups study. (x-axis: Desired Study Power (%); y-axis: Total Sample Size.)

searchers in a successful study have the confidence to say that if there had been something to find, they probably would have found it. The ability of a study to detect a specific difference among study groups is its power. The logical expectation is that the power of any study is greater when measuring greater differences. For example, collecting data to show that two imaging modalities differ by 50% in their sensitivities should be easier than collecting data to show that they only differ by 1%.

To be clinically useful, a successful study must have the power to detect the smallest difference that is deemed clinically important. If a difference in sensitivity as small as 1% leads to clinically important differences in patient outcome, we then are required to design a study that has adequate power to detect a difference as small as 1% in the sensitivities of the two modalities. If, however, our study was able to detect only a larger difference—for example, 20%—and gave negative results, we could not say with confidence that no clinically significant difference exists between the modalities. The difference might, indeed, lie between 10% and 20%, a range we consider clinically important. We would have to regard our study as unsuccessful.

Statistically, power is expressed as the probability of rejecting the hypothesis of no difference (the "null" hypothesis) when, in fact, a specific, clinically important difference does exist. The concept of power depends on specification of hypotheses and definition of a specific, clinically important difference. To assess the power of a study, it is not enough to say that the sensitivity of the new test is greater than that of the standard. A definite value for this difference must be specified.

Two important aspects of study design determine the power of a study. One is sample size, and the other is the design itself. A successful study is one that has sufficient power to detect the smallest clinically significant difference. The sample size that ensures this power is thus a requirement for the successful study. Determination of the sample size is the purview of statistical science, and so the required sample size for a study is often the contribution of the collaborating statistician. However, determination of sample size and power also requires specification of the smallest clinically important difference for the problem at hand. This determination is the purview of clinical medicine. Thus, statistically engineering the study for power should be a collaborative undertaking between clinical and statistical scientists.

Although the role of sample size and power is well known in medical circles, I do not think the role of experimental design and power is as well appreciated. The graph in Figure 2 illustrates the importance of the relationship. This graph depicts sample size requirements for two basic study design types that one might consider when comparing the diagnostic accuracies of two modalities.

Our scenario is that a clinical radiologist seeks to compare the sensitivity of a new diagnostic modality against that of an established modality. Based on her understanding of the medical literature, and of the costs and benefits to her patients in testing for this particular condition, the clinical researcher has determined that the smallest clinically relevant difference in sensitivities for this diagnostic problem is 5%.

The two basic study designs for this sort of clinical trial are the "independent groups" design and the "paired groups" design. The independent groups design specifies that the assignment of each of the study's subjects to

Fig. 3.—Graph shows computer-simulated sampling of 100 confidence interval (CI) point estimates of test sensitivity to illustrate term "95% CI." For 100 samples of subjects from large population, the sensitivity and 95% confidence interval are plotted in order. Horizontal line at 70% represents true sensitivity. Point estimates (◆) fall around true value. Results from some samples overestimate and some underestimate. Approximately 11 of 100 simulated point estimates appear to be exactly correct. Bars around each point represent associated 95% CI. In estimates in which bars overlap horizontal line (true sensitivity), CI contains true value of quantity being estimated. In estimates in which bars do not overlap line, CI failed to capture true value. (Intervals that failed to capture true value are represented by ▲.) Of 100 CIs randomly generated, five failed to capture true value and 95 did capture it. In large series of such intervals, CIs will give range that captures true value in 95% of cases. (x-axis: 100 Confidence Intervals; y-axis: Estimated Sensitivity (%).)
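The sample-size-versus-power relationship plotted in Figure 2 cannot be reproduced exactly here, but the underlying idea—power as the probability of rejecting the null hypothesis when a real difference exists—can be approximated by simulation. The sketch below is a stdlib-only illustration, not the authors' computation; the two-proportion z-test and the assumed 80% and 70% sensitivities are choices made purely for this example:

```python
import math
import random

def simulated_power(sens_new, sens_std, n_per_group, trials=2000, seed=1):
    """Monte Carlo power of an independent-groups comparison of two
    sensitivities, using a two-sided two-proportion z-test (alpha = 0.05)."""
    rng = random.Random(seed)
    z_crit = 1.96
    rejections = 0
    for _ in range(trials):
        # Simulate the number of true-positive readings in each group.
        x1 = sum(rng.random() < sens_new for _ in range(n_per_group))
        x2 = sum(rng.random() < sens_std for _ in range(n_per_group))
        p1, p2 = x1 / n_per_group, x2 / n_per_group
        pooled = (x1 + x2) / (2 * n_per_group)
        se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
        if se > 0 and abs(p1 - p2) / se > z_crit:
            rejections += 1
    return rejections / trials

# Power rises with sample size for the same true difference in sensitivity.
print(simulated_power(0.80, 0.70, n_per_group=50))
print(simulated_power(0.80, 0.70, n_per_group=400))
```

A paired design can be simulated analogously with a McNemar-style test on within-subject disagreements; as the text notes, it generally reaches the same power with fewer subjects.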


one of two groups should be randomized. One group will be imaged using the reference modality, and the other group will be imaged using the new modality. In the paired group design, each of the subjects is imaged using both of the modalities being studied. Preferably, the interpretation of each modality is done independently of the result of the other modality, and the order in which the subjects are imaged with each modality is also randomized.

Figure 2 shows the total sample size required to achieve various levels of statistical power for the two designs. In fact, there are two sets of points for the paired groups design because the power of this design also depends on the extent to which the two modalities disagree (i.e., the proportion of patients for whom one modality is positive and the proportion for whom one is negative and vice versa). One set of points shows sample size required when the disagreement between the modalities is minimal, and the other set shows the power of the study when the modalities disagree as much as possible. (More details about these considerations and computations can be found in an earlier article that I wrote for AJR [4].)

Figure 2 provides confirmation of the intuitive realization that greater power in a study requires a larger total sample size, or, conversely, the intuitive realization that a larger sample size means greater power. This relationship between power and sample size is true regardless of which study design is chosen.

However, note that the paired groups design requires a smaller total sample size for any power we may wish to achieve. For example, to achieve 90% power requires approximately 200 subjects with the independent groups design but only approximately 50 subjects when the paired groups design is used and the measures of the two modalities under study have the lowest level of disagreement possible. Even in the worst-case scenario, in which there is maximal disagreement between the modalities, the paired groups design requires only approximately 80 subjects.

The previous example illustrates that study design can greatly increase power for a given sample size or, conversely, that the study design can reduce the sample size needed to obtain a specific power. Successful studies are ones that achieve the desired power economically. Knowledge of study design is, therefore, essential to the engineering of powerful and economical scientific studies.

Precision in Estimation

In recent years, the trend in the medical literature has been to pay less attention to tests of hypotheses and give more attention to estimations. In fact, several authors have debated the issue. In the context of diagnostic radiology, the seminal article by James Hanley [5] "The Place of Statistical Methods in Radiology (and in the Bigger Picture)" is worthy of special attention. In that article, Hanley notes:

    The biggest objection to a statistical test is that it answers with a "yes" or a "no" an overly simplistic question: Is there some difference? The emphasis on significant differences…distracts from the real (issue), which is how big is the difference….

Given this recent trend, the design of the modern study in clinical radiology research must ensure success in estimation. The phrase "success in estimation" means that, statistically, the study has been designed to achieve sufficient precision in estimation with a desired level of confidence. Understanding how to design a study to be successful in estimation requires, then, an understanding of the statistical concepts of precision and confidence.

To estimate the sensitivity of a new diagnostic technology, we would do well to follow the direction given by Kazerooni in "Population and Sample" [2] and perform the test on a random sample from the study population. Because our sample is, of necessity, not the complete population of interest, we would expect imprecision in our estimate of sensitivity from this one sample. Being scientifically sophisticated, we are not satisfied in reporting only the estimated sensitivity but also want to assess the probable error in our estimate. The standard way to both report an estimate and provide an assessment of probable error is through the use of statistical confidence intervals (CIs).

A statistical CI of a quantity is a range of values along with a statement of the level of confidence. Usually, the CI accompanies a single value (or point) estimate of the quantity. For example, if the sample previously discussed yielded an estimated sensitivity of 75% with an accompanying 95% CI for values ranging from 67% to 83%, how should we interpret these values?

The observed sensitivity is 75%, so that is our point estimate. However, we estimate the value might be within the range of values from 67% to 83% with 95% confidence. The adjective "confidence" in the phrase "confidence interval" is not an assertion of personal belief. The term has an explicit statistical meaning that, not surprisingly, is related to the long-term process of sampling. To say that the interval is a 95% CI means that the interval was formed by a statistical method in such a way that if a large number of random samples were taken from the study population and an interval were computed for each sample, 95% of these intervals would contain the true value of the sensitivity of the test. Figure 3 is a graph depicting this concept using a computer-simulated experiment.

Another important feature of a confidence interval is its width. Wide confidence intervals are less informative than narrow ones. For example, to say that the sensitivity of a test falls between 68% and 72% is much more informative than saying the sensitivity falls somewhere between 0% and 100%.

The width of a confidence interval is its precision. Successful studies provide precise estimates. Therefore, engineering the successful study requires first specifying the precision the investigators wish to obtain. As

Fig. 4.—Graph depicts relationship between confidence interval (CI) precision, or width, and confidence level. Bars represent CIs (estimated sensitivity) and ◆ represents true sensitivity. As confidence level decreases, precision increases. (x-axis: Estimated Sensitivity (%); y-axis: Confidence Level (%).)


in considering power, specification of precision should in some way reflect a clinically relevant definition of precision. For example, if the researchers want to estimate sensitivity, it might be relevant clinically to require the precision of estimation to be within 5% of the true value if the researchers conclude that sensitivities this similar are virtually equal for clinical purposes.

Having specified the precision to be achieved by the study, the researchers have basically two design considerations by which to achieve this goal. One consideration is sample size. As expected, the precision of a confidence interval increases (i.e., its width decreases) with a larger sample size. Therefore, when designing a study for estimation, one must select a sample size large enough to achieve a desired precision in the confidence intervals.

Another way by which to achieve greater precision is by manipulating the level of confidence. Although the standard, by and large, for CIs is the 95% level, there is nothing sacred about this number. Other levels of confidence could be considered. The problem is, however, justifying this break from tradition. Figure 4 shows the impact of changing the level of confidence on the precision of CIs based on the same sample size. Precision (interval width) is greatest for smallest confidence. In other words, precision and level of confidence exist in a trade-off relationship. Precision can be increased by decreasing confidence. In most cases, choosing precision at the expense of confidence will probably not be an acceptable trade-off. To alter the confidence level, one has to argue effectively that not following the status quo 95% level was appropriate. Generally, however, people set the level at 95% and find the sample size required to obtain adequate precision in estimation.

In this article, I have reviewed some of the key considerations in modern experimental design as they apply to diagnostic radiology. Each of these considerations—bias, power, and precision—should be addressed by investigators who want to design a successful study in diagnostic radiology. Because engineering a successful scientific study requires the expertise of both the clinical and statistical sciences, collaboration between these disciplines should be nurtured. In this way, research in clinical radiology will mature into a modern scientific discipline.

References
1. Begg CB. Assessment of radiologic tests: control of bias and other design considerations. Radiology 1988;167:565–569
2. Kazerooni E. Population and sample. AJR 2001;177:995–999
3. Colton T. Statistics in medicine. Boston: Little, Brown, 1974
4. Beam CA. Strategies for improving power in diagnostic radiology research. AJR 1992;159:631–638
5. Hanley JA. The place of statistical methods in radiology (and in the bigger picture). Invest Radiol 1989;24:10–16

The reader’s attention is directed to the earlier articles in the Clinical Research Series: Introduction, which ap-
peared in the February 2001 issue; Framework, April 2001; Protocol, June 2001; Data Collection, October 2001;
and Population and Sample, November 2001.



Fundamentals of Clinical Research
for Radiologists

Screening for Preclinical Disease: Test and Disease Characteristics

Cheryl R. Herman1
Harmindar K. Gill
John Eng
Laurie L. Fajardo

Received March 20, 2002; accepted after revision April 22, 2002.

This is the seventh in a series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

1 All authors: The Russell Morgan Department of Radiology and Radiological Sciences, JHOC Rm. 4155, P. O. Box 0814, Johns Hopkins Medical Institutions, 601 N. Caroline St., Baltimore, MD 21287. Address correspondence to L. L. Fajardo.

AJR 2002;179:825–831

0361–803X/02/1794–825

© American Roentgen Ray Society

Screening is the application of a test to detect a potential disease or condition in an individual who has no known signs or symptoms of that disease or condition [1]. In general, screening has two major objectives. One is the early detection of disease at a point when treatment is more effective, less expensive, or both. Here, the implicit assumption underlying the concept of screening is that early detection—before the development of symptoms—will lead to a more favorable prognosis because intervention initiated before the disease is clinically manifested will be more effective than treatment provided at a later stage of the disease [2, 3]. The second objective in screening is to identify risk factors that render an individual at a higher than average risk for developing a disease, with the goal of modifying the risk factors to prevent or minimize the disease [4–6]. The application of imaging examinations for disease screening is most often based on the first objective.

Although medical imaging is used in the diagnosis of most human ailments, mammography is the only diagnostic imaging examination currently in widespread use as a screening tool [7]. Multidetector CT is being evaluated as a means of detecting early-stage lung carcinoma [8, 9] and colorectal adenomatous polyps [10, 11], but it is not yet an accepted routine screening examination. Indeed, the concept of disease screening, including its appropriateness and evaluation, is not as straightforward as it may first appear. Even the basic assumption that early treatment will improve prognosis may not be true in all circumstances. Moreover, even if this assumption is justifiable for a particular condition, the risks or costs that are associated with any screening test (and any consequent "induced" procedures) must be weighed against the benefits. Thus, any new application of an imaging procedure to screen for disease should be considered an unproven method of disease control until its risks, benefits, and costs have been rigorously evaluated. Ideally, such evaluations should be completed before widespread use of the procedure for disease screening is undertaken or recommended [12].

Making and evaluating recommendations on the use of imaging studies for disease screening is one of the more difficult problems in medical imaging and clinical medicine. This article will discuss the use of screening tests for detecting early disease or for detecting risk factors for developing disease. Consideration will be given to the appropriateness criteria for two major elements of health screening programs: the condition or disease for which screening is being performed and the screening test itself. Within the context of these two elements, potential biases in the evaluation of screening programs and other critical issues in the evaluation of screening programs will be presented.

Appropriateness Criteria: The Disease or Condition Being Screened

To be appropriate for screening, a disease should be serious, and the preclinical phase of the disease (Appendix 1) should have a high prevalence among the population targeted for screening. Furthermore, screening initiated before a critical point in the natural history of the disease should result in treatment being initiated before the onset of symptoms (Fig. 1). This treatment should be more beneficial in reducing morbidity or mortality than treatment given after symptoms develop. Finally, the
AJR:179, October 2002 825



Fig. 1.—Diagram shows natural history of disease. Progression from biologic onset of disease to death is divided into preclinical and clinical phases. Detectable preclinical phase of disease is period during which screening tests are applied to detect a condition early in its natural history, before onset of symptoms. (Timeline points: A = biologic onset of disease; B = disease detectable via screening; Dx = preclinical disease detected via screening; CP = critical point; S = symptoms develop; D = death. Lead time is the interval indicated between Dx and S.)

screening for the disease should not result in a significant incidence of pseudodisease.

Substantial Morbidity or Mortality If Untreated

The criterion of seriousness relates primarily to issues of both cost-effectiveness and ethics. The elimination or amelioration of adverse health consequences must justify resource expenditures on radiologic imaging for disease screening. Likewise, the consequences of failing to detect and treat the disease early must be sufficiently grave to ethically warrant exposing individuals to the risks (e.g., radiation exposure or false-positive diagnosis) and discomforts of the screening procedure itself. Life-threatening conditions, such as heart disease and cancer, and those known to have serious and irreversible consequences, such as congenital hypothyroidism and phenylketonuria, clearly meet the criterion of seriousness. On the other hand, medical imaging tests should be thoroughly evaluated for risks and benefits before being used to screen for certain asymptomatic conditions, such as gallstones. Although asymptomatic gallstones are fairly prevalent, rarely are they life-threatening and, in fact, the condition may never become symptomatic.

High Preclinical Prevalence

For a screening test to be effective, it must reveal a sufficient number of preclinical disease cases to justify the testing costs. Thus, the prevalence of preclinical disease must be high in the population for which screening is recommended. Targeting high-risk populations can increase the prevalence of the detectable preclinical phase of the disease and thus the number of cases detected on screening. This strategy will likely be applied to the emerging approaches to lung cancer screening using multidetector CT. Exceptions to the criterion concerning high prevalence of the detectable preclinical disease should be made if screening for rare conditions can be accomplished using tests that are accurate, inexpensive, and noninvasive. Although phenylketonuria occurs in only one of 15,000 neonates, widespread screening is justified by the effectiveness and low cost of the test and by the serious public health consequences of not detecting the disease in its preclinical phase.

Existence of a Critical Point and Appropriate Therapy

Screening tests are only effective if the condition or disease has a critical point (point CP in Fig. 1) so that treatment instituted before the critical point is more efficacious than treatment provided later. In the case of screening for preclinical neoplastic conditions, the critical point coincides with the onset of metastasis [12]. Thus, the critical point must occur during the detectable preclinical phase of the disease because screening is ineffective (and, indeed, unnecessary) after the onset of symptoms (i.e., during the clinical phase of the disease). If the critical point occurs soon after the onset of the detectable preclinical phase, screening may be too late to be useful. Conversely, screening may also be less effective early in the onset of the detectable preclinical phase if lesions are extremely small and are just at the threshold of detectability.

For screening to improve patient outcomes, an effective treatment for the disease must be available. A critical question in evaluating the importance of screening for a condition is whether treatment of the preclinical disease detected on screening is more effective than intervention initiated after the disease becomes symptomatic. Here, the natural history of the disease should be carefully considered. Figure 1 illustrates that the natural history of disease can be divided into preclinical and clinical phases. The preclinical phase is the period from the biologic onset of disease to the onset of clinical manifestations of the disease. During this phase, the condition is asymptomatic but detectable on a screening test. The detectable preclinical phase of disease is defined as the interval between the point at which the disease can be detected on screening (point B in Fig. 1) and the point at which symptoms develop [13] (point S in Fig. 1).

For screening to be beneficial, treatment initiated during the detectable preclinical phase must result in a better prognosis than therapy given after symptoms develop. For example, some subtypes of breast cancer develop for 3–8 years before becoming palpable at routine clinical breast examinations. During this stage, nonpalpable breast carcinomas may be detected on mammography. Many of these carcinomas are confined to the breast and are not associated with lymph node metastasis. Diagnosing and treating breast cancer during the preclinical phase result in a higher percentage of the cases remaining noninvasive (i.e., ductal carcinoma in situ), a lower percentage of cases of axillary lymph node metastasis, and a better 5-year patient survival rate than when breast cancer is diagnosed during the clinical phase [14].

Conversely, if early treatment engenders no difference in the patient's prognosis or health outcome, then the application of a screening test is neither necessary nor effective. For example, screening for lung carcinoma with chest radiography has historically been discouraged because the disease has a poor prognosis regardless of the phase during which treatment is initiated. Similarly, little justification exists in screening for conditions that are completely curable during the clinical phase of their natural history.

Low Incidence of Pseudodisease

A pseudodisease is a disease that does not require treatment because it does not affect patients' length or quality of life in a significant way. Screening for a disease will be ineffective if the screening test reveals substantial pseudo-


disease. Two sources of pseudodisease have been described [12, 15]. A type I pseudodisease is a condition that is diagnosed via a screening test and does not progress to symptomatic disease; it may even regress over time. This is a recognized phenomenon in screening for breast carcinoma; not all cases of ductal carcinoma in situ progress to invasive or metastatic disease [16, 17]. A type II pseudodisease is an indolent, slowly progressive disease found in conditions with long detectable preclinical phases or among patients with short life expectancies who may die from other causes [12]. This latter type of pseudodisease has been described in prostate carcinoma. Although the prevalence of clinically apparent prostate carcinoma in men aged 60–70 years is only about 1% [18], more than 40% of men in their 60s who have normal findings at rectal examinations have histologic evidence of disease [19] when prostate tissue is removed during cystectomy performed for bladder cancer. Because patients with pseudodisease do not die from the disease for which screen-

Fig. 2.—Diagram of 2 × 2 matrix illustrates test outcomes and test accuracy for individuals with and without disease. Disease + = disease present, disease – = disease absent, a = number of true-positive results, b = number of false-positive results, c = number of false-negative results, and d = number of true-negative results.

its simplest form, the assessment of the accuracy of a diagnostic technology involves two dichotomies: disease that is present (+) or absent (–) and test results that are positive (+) or negative (–). A 2 × 2 matrix (Fig. 2) is frequently used to illustrate the four outcome combinations in which n, the total number of test results examined, is expressed by the equation n = a + b + c + d. Two of the counts, a and d, correspond to correct test results (true-positive and true-negative, respectively), whereas b is the number of false-positive results and c is the number of false-negative results.

Because the counts for the four outcomes are highly dependent on the sample size, it is customary to express them as rates. For example, a / (a + c) is equal to the proportion of individuals who have the disease and who have positive test results, or the rate of true-positives, also known as the sensitivity of the test; d / (b + d) is equal to the proportion of individuals who do not have the disease and who have negative test results, or the rate of true-negatives, also known as the specificity of the test; c / (a + c) is equal to the

classifying healthy persons as having the disease (false-positives). In general, sensitivity should be increased at the expense of specificity if the consequences of missing preclin-
ing is performed, the survival of these pa- proportion of individuals who have the disease but ical disease are great, such as when the
tients is erroneously attributed to early have falsely negative test results, or the rate of disease is serious, detectable during its pre-
treatment. If adjustments are not made for the false-negatives; and b / (b + d) is equal to the pro- clinical phase, and curable. Conversely, high
detection of pseudodisease in a screening portion of individuals who do not have the disease specificity is desirable when the costs or
program, an overdiagnosis bias occurs [12]. but who have falsely positive test results, or the rate risks associated with further diagnostic tests
For both types of pseudodisease, a screening of false-positives. Thus, sensitivity is the probabil- (i.e., surgical biopsy) are substantial. In this
test with positive results may cause the pa- ity of an individual having positive test results circumstance, ethics require that the
tients to undergo unnecessary tests and ther- when the disease is truly present, and specificity is screened population be informed that a nega-
apy. For these reasons, screening for the probability of an individual having negative tive result on the screening test does not ab-
conditions with a high frequency of pseudo- test results when the disease is truly absent. solutely guarantee that the disease is not
disease is not cost-effective. The usefulness of a screening test is evalu- present, only that the likelihood of having
ated by its positive and negative predictive the disease is low.
values. The predictive value of a negative test One way to address the problem of the
Appropriateness Criteria: (d / [c + d ]) is the probability that a patient trade-off between the sensitivity and specific-
The Screening Test with a negative result on the diagnostic test truly ity is by administering several screening tests
A successful disease-screening program re- does not have the disease for which the screen- in parallel or sequentially. The former involves
quires not only that the disease have character- ing was conducted. Conversely, the predictive performing all the screening tests at the same
istics appropriate for screening but also that a value of a positive test (a / [a + b]) is the proba- time and considering individuals with positive
valid screening test be available. Ideally, the bility that a patient with a positive result on the results on any of the tests to be true-positive
test should be widely accessible, simple to ad- screening test truly has the disease for which the cases. This approach gives greater sensitivity
minister, inexpensive, and associated with screening was conducted. The positive and neg- than that achievable by performing each test
minimal discomfort and morbidity to the pop- ative predictive values of a test are dependent on alone because the condition is less likely to be
ulation screened. Moreover, the screening test the prevalence of the disease. missed; however, the approach lowers speci-
results must be valid and reproducible. Finally, As the sensitivity of a screening test in- ficity because false-positive diagnoses are also
as discussed earlier, the test should be able to creases, the number of individuals with pre- more likely. When screening tests are adminis-
reveal the detectable preclinical phase of the clinical disease not diagnosed by the test tered sequentially, an initial screening test is
disease accurately before the critical point of decreases. A highly specific test has a low performed, and only those individuals with
the disease. percentage of healthy individuals who are positive test results undergo an additional
misclassified as having positive test results. screening procedure. Generally, sequential
Test Accuracy Decisions regarding specific criteria for ac- testing results in higher specificity than that
A screening test is 100% accurate if it can be ceptable levels of sensitivity and specificity achievable with a single test because positive
used to correctly classify individuals having for a given preclinical disease involve weigh- results on a series of tests are more likely to
preclinical disease as test-positive and those ing the consequences of leaving cases unde- represent a true-positive finding. This method,
without preclinical disease as test-negative. In tected (false-negatives) against erroneously however, also lowers sensitivity.
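The rates defined above follow directly from the four cell counts of the 2 × 2 matrix. A minimal sketch in Python (the function name and the example counts are illustrative, not taken from the article):

```python
def screening_metrics(a, b, c, d):
    """Accuracy and predictive values from the 2 x 2 matrix in Figure 2.

    a = true-positives, b = false-positives,
    c = false-negatives, d = true-negatives.
    """
    return {
        "sensitivity": a / (a + c),  # P(test + | disease +)
        "specificity": d / (b + d),  # P(test - | disease -)
        "ppv": a / (a + b),          # P(disease + | test +)
        "npv": d / (c + d),          # P(disease - | test -)
        "prevalence": (a + c) / (a + b + c + d),
    }

# The same test (90% sensitivity, 95% specificity) in 10,000 people:
high = screening_metrics(a=450, b=475, c=50, d=9025)  # 5% prevalence
low = screening_metrics(a=90, b=495, c=10, d=9405)    # 1% prevalence

# Sensitivity and specificity are unchanged, but the positive
# predictive value falls from roughly 0.49 to roughly 0.15 as
# prevalence drops -- the prevalence dependence noted in the text.
```

This makes concrete why predictive values, unlike sensitivity and specificity, cannot be quoted without stating the prevalence of the screened population.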



Herman et al.
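If one is willing to assume that the component tests err independently, the net sensitivity and specificity of the parallel and sequential strategies just described can be written down explicitly. A sketch (the independence assumption and the function names are ours, not the article's):

```python
def parallel(se1, sp1, se2, sp2):
    """Net accuracy of two tests read in parallel (call the case
    positive if EITHER test is positive), assuming independent errors."""
    se = 1 - (1 - se1) * (1 - se2)  # missed only if both tests miss
    sp = sp1 * sp2                  # negative only if both are negative
    return se, sp

def sequential(se1, sp1, se2, sp2):
    """Net accuracy when the second test is applied only to positives
    from the first (call the case positive if BOTH are positive)."""
    se = se1 * se2                   # detected only if both tests detect
    sp = 1 - (1 - sp1) * (1 - sp2)   # false-positive only if both err
    return se, sp

# Two tests, each 80% sensitive and 90% specific:
# parallel   -> se = 0.96, sp = 0.81 (sensitivity up, specificity down)
# sequential -> se = 0.64, sp = 0.99 (specificity up, sensitivity down)
```

With two moderately accurate tests, the parallel rule trades specificity for sensitivity and the sequential rule does the reverse, exactly the qualitative behavior described in the text.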

Test Reproducibility
Any test being considered for use in a screening program must have reproducible results. For imaging tests, four important sources of variability can affect the reproducibility of results. The first relates to a biologic variation that might affect the performance of the test (e.g., patient size or cardiac motion). The second relates to the reproducibility of the test itself (e.g., patient positioning or film processing in the acquisition and production of mammograms). Third, intraobserver variability refers to differences in the way the same radiologist interprets a specific screening test at different times. Finally, interobserver variability refers to inconsistencies attributable to differences in the way different radiologists interpret the same screening examination. Interobserver variability is minimized if the interpretation criteria and end points are defined and quantifiable and is greater if the criteria are vague and subjective. Both intra- and interobserver variabilities have been reported [20–22] in the interpretation of screening mammograms, description of specific lesions, and recommendations for follow-up examinations, using the American College of Radiology Breast Imaging Reporting and Data System (BI-RADS) [23].

A common but flawed approach to measuring the accuracy of a potential screening test is to extrapolate data on tests performed in populations with symptomatic disease to screening populations [13]. However, using an asymptomatic population involves testing many subjects to identify a group with disease and following up those subjects to ascertain the true disease status. Both positive and negative test results in the subjects should be verified by acceptable methods such as histopathology and clinical or imaging follow-up. With respect to the latter, a follow-up period of sufficient length is critical. If the follow-up period is too short, false-negative cases may be missed; if it is too long, new cases of disease (e.g., "interval cancer") may be inaccurately classified as false-negatives.

Test Safety, Availability, and Cost-Effectiveness
Because screening tests are performed on asymptomatic individuals—most of whom are healthy and do not have preclinical disease—the tests must not be associated with significant morbidity or mortality. Even a minor side effect or adverse consequence to the screened population will likely offset the benefits of screening [12]. Radiation dose and the likelihood that the screening test itself may induce malignancies are frequently considered adverse consequences of screening tests involving imaging [24–26]. Other sources of morbidity that affect an individual's decision to undergo or forgo screening include the discomfort associated with the test (e.g., compression with screening mammography or bowel preparation for screening barium enema examinations).

The screening test should be accessible to the population for whom it is indicated. Screening cannot be effective if the screening test is available only at large medical centers. Likewise, if the examination is costly, insurers may choose not to provide screening coverage, and patients may be unwilling or unable to pay for the test out of pocket.

Evaluating the Effectiveness of Screening
Evaluations of the effectiveness of a screening program should be based on outcomes and measures reflecting the impact of the program on the course of the disease. Here, the critical outcomes of interest are the assurance that the screened and unscreened populations are comparable, the estimates of lead-time and length-time biases, a comparison of cause-specific mortality rates between the screened and unscreened groups, and the measurement of relative and absolute risks.

Comparability of the Screened and Unscreened Groups
In determining the efficacy of a screening test, the screened and unscreened groups must be comparable with regard to all factors affecting the end point under evaluation, with the exception of the screening experience. In this regard, patient recruitment and self-selection bias (volunteer bias) should be taken into account. People who choose to participate in a screening program are likely to differ from those who do not volunteer in several ways that may affect survival [27, 28]. Volunteers tend to have better health and lower mortality rates than the general population and are more likely to adhere to prescribed medical regimens. Consequently, an observational study design comparing mortality rates of screened and unscreened groups is likely to show that those who volunteer to undergo screening have lower mortality rates, regardless of any effect of screening. On the other hand, those who volunteer for screening programs may represent the "worried well," or asymptomatic individuals who are at higher risk of developing the disease because of medical or family history or lifestyle factors. Such individuals might have an increased risk of mortality regardless of the efficacy of the screening program. Thus, the direction of potential patient selection bias may be difficult to predict and the magnitude of such effects even more difficult to quantify. Randomization schemes are used to overcome self-selection bias in studies evaluating potential screening tests by assigning individuals to screened and unscreened study groups after they agree to participate in the study.

Lead-Time and Length-Time Biases
Showing the benefit of treatment initiated during the preclinical phase of a disease is surprisingly difficult. Two widely recognized problems that arise when the benefits of screening are evaluated by comparing screened to unscreened populations are lead-time bias and length-time bias.

Lead time is the interval between the diagnosis of a disease at screening and the time at which it would have been detected via the onset of clinical symptoms [29]. Lead time, therefore, is the amount of time that the diagnosis was advanced as a result of screening (Figs. 1 and 3). Because screening is applied to asymptomatic individuals, every case of disease detected at screening has had its time of diagnosis advanced. Whether that lead time is a matter of days, months, or years varies by disease, individual, and screening procedure. For a disease that progresses rapidly from the preclinical to the clinical phase, less lead time will be gained from screening than for a disease that develops slowly and has a longer preclinical phase.

Fig. 3.—Diagram depicts how lead-time bias can result in apparent increase in survival attributable to screening. Shown are hypothetical case histories of two women with breast cancer. Screening appears to be beneficial when, in fact, it only pushed time of diagnosis forward. Timeline (age in years): 35, biologic onset of breast cancer in woman A and woman B; 44, breast cancer could have been detected on mammography in woman A and woman B; 47, woman A has cancer diagnosed on screening mammography and lives 6 years after diagnosis; 50, woman B has cancer diagnosed at onset of symptoms and lives 3 years after diagnosis; 53, woman A and woman B both die from breast cancer.

Lead time also varies with how soon the screening test is performed after the preclinical disease becomes detectable. For screened patients, cause-specific survival is measured as the length of time from disease detection on the screening test to death from the disease. For patients not screened, cause-specific survival is measured as the length of time from clinical diagnosis to death from the disease. For example, Figure 3 illustrates the hypothetical histories of two women with breast cancer. We assumed that the age of both women at the biologic onset of disease was 35 years and that the disease was detectable on screening when the women were 44 years old. One woman (A) was screened at age 47, and her breast cancer was detected at that time. The other woman (B) did not undergo screening mammography; her breast cancer was diagnosed when she was 50 after she discovered a lump in her breast. Both women died at the age of 53. Because woman A survived 3 years longer after detection of breast cancer than woman B, screening appears to be beneficial when, in fact, it only pushed the time of diagnosis forward.
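The arithmetic behind the two case histories makes the bias concrete; a minimal sketch using the ages from Figure 3:

```python
# Ages (in years) from the hypothetical case histories in Figure 3.
age_at_death = 53
dx_woman_a = 47   # woman A: diagnosed on screening mammography
dx_woman_b = 50   # woman B: diagnosed at onset of symptoms

survival_a = age_at_death - dx_woman_a   # 6 years after diagnosis
survival_b = age_at_death - dx_woman_b   # 3 years after diagnosis
lead_time = dx_woman_b - dx_woman_a      # diagnosis advanced by 3 years

# Woman A's apparent 3-year survival advantage is exactly the lead
# time: both women die at 53, so early detection changed the date of
# diagnosis but not the course of the disease.
assert survival_a - survival_b == lead_time
```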

This phenomenon is commonly referred to as lead-time bias [30–36]. If an estimate of lead time is not taken into account when comparing mortality between screened versus unscreened populations, survival will be erroneously overestimated for the screening-detected cases simply because the diagnosis was made earlier in the natural history of the disease. A second way to account for the effect of lead time on the efficacy of a screening program is to compare the age-specific death rates in the screened and unscreened groups rather than the length of survival from diagnosis to death.

Length-time bias refers to the overrepresentation among screening-detected cases of those diseases with long preclinical phases and thus more favorable prognoses. Diseases with a long preclinical phase are more readily detected on screening tests than are the more rapidly progressing diseases with shorter preclinical phases. An assumption underlying the concept of length-time bias is that diseases with long preclinical phases are more indolent and would have more favorable prognoses, regardless of any effect of the screening program itself. Thus, length-time bias could lead to an erroneous conclusion that screening is beneficial when, in fact, observed differences in mortality rates resulted merely from detection of cases of less rapidly fatal diseases, whereas cases of diseases that are more rapidly fatal were diagnosed after symptoms developed. Length-time bias is difficult to quantify. Its effect is greatest for cases detected at the initial screening; thus, one method of controlling for length-time bias is to compare cases detected at a subsequent screening (i.e., after the initial screening) to those detected clinically (when the patient develops symptoms).

Comparison of Cause-Specific and All-Cause Mortality Rates
The most definitive measure of the efficacy of the screening program is a comparison of the cause-specific mortality rates of those whose disease was diagnosed on screening and those whose diagnosis was made after the development of symptoms. Because the target disease causes only a small proportion of deaths in a screening-eligible population, a statistically precise estimate of differences in mortality rates or a statistically significant effect of screening on all-cause mortality rates can rarely be shown. However, evaluating the all-cause mortality rates may help to ensure that a major harm or benefit is not being missed. An all-cause mortality rate is all-inclusive and provides data relevant to the question of whether other risks are somehow changed along the continuum of the application of the screening test, the diagnosis of a disease, and the treatment. Second, an all-cause mortality rate provides an important perspective on the magnitude of benefit from screening. It puts cause-specific mortality reduction in the context of other competing risks and thus permits an estimate of the overall benefit to be reasonably expected by a particular individual who undergoes a screening evaluation [35].

Absolute Risk Versus Relative Risk
The effectiveness of screening can be expressed in terms of the relative risk, which is the ratio of the cause-specific mortality rate in the study group to that in the control group, or the relative risk reduction, which is 1 minus this ratio. Although calculations of relative risk are valid, they can be misleading because they convey no information about an individual's baseline risk. The absolute risk reduction is increasingly recognized as a more appropriate measure of the effectiveness of screening interventions [37]. Absolute risk reduction is expressed as the product of risk and relative risk reduction. For example, suppose a screening-eligible individual has a 2% probability of dying of a particular disease over the next 20 years. If the relative risk reduction from screening is 50%, the absolute risk reduction is 1%. Reporting absolute risk reduction is especially appropriate for screening because the overall risk to be averted is usually small. The absolute risk reduction puts the potential benefit in a proper perspective so that an individual or his or her health care provider can weigh it against the potential side effects and costs. The reciprocal of the absolute risk reduction is the number of individuals who must be screened to prevent one death or adverse event. In our example, this number is 100, or 1/0.01. The perception of the absolute risk reduction from screening may be significantly affected by the detection of a pseudodisease that, as discussed previously, falsely increases the perceived risk of developing the disease and the perceived effectiveness of earlier treatment.

Study Designs for Evaluation of Screening Tests
Many epidemiologic design strategies are used to evaluate the efficacy of screening tests, including correlational studies, observational studies, and randomized trials. Correlational studies are used to examine trends in disease rates relative to screening frequencies in a population or to compare the relationship between frequencies of screening and disease rates for different populations. Such descriptive studies are useful in suggesting a relationship between screening and a decline in the morbidity or mortality rate. However, correlational studies have inherent limitations. First, because information from such studies concerns populations rather than individuals, it is not possible to establish


that those experiencing the decreased mortality rate are in fact the same persons who received the screening tests. Moreover, such studies do not allow control of potential confounding factors, such as socioeconomic status. Finally, the measure of screening frequency used is usually an average value for the population, so identifying the optimal screening strategy for an individual is impossible. Thus, correlational studies can suggest the possibility of a benefit from a screening test, but they cannot test that hypothesis.

Observational analytic studies, both case-control and cohort, are also used to evaluate the efficacy of screening programs. In the case-control design, individuals with and without the disease are compared with respect to their prior exposure to the screening test. As with any case-control study, the definition and selection of the cases and controls are of critical importance to the validity of the findings [38, 39]. In a cohort study, the case-fatality rate of those who chose to be screened is compared with the case-fatality rate among those whose diagnoses followed the onset of symptoms. Interpretation of the results of cohort studies requires consideration of the potential effects of the self-selection of participants as well as lead-time and length-time biases [40].

Because the chief threat to validity is that screened and unscreened cases cannot be compared, the optimal assessment of the efficacy of a screening program derives from randomized trials. If the sample size is sufficiently large, the process of randomization controls any potential confounding variables. Patient self-selection or volunteer bias, a problem when comparing screened and unscreened groups in observational studies, does not influence the validity of randomized trials: after a group of volunteers agrees to participate in the study, individuals who are to undergo screening are chosen at random from the group by the investigators. Adjusting for the average lead time can eliminate lead-time bias in comparisons of survival rates of patients whose disease was detected via screening versus those whose disease was detected clinically or, preferably, in comparisons of the age-specific mortality rates for the screened and the unscreened groups. Trials can also address the potential for length-time bias by comparing the mortality experience of the groups after repeated screenings.

In the United States, few randomized trials have evaluated programs that use imaging to screen for preclinical disease. The Health Insurance Plan Breast Cancer Screening Project [41] was a randomized trial conducted to evaluate whether periodic breast cancer screening with mammography and physical examination would result in reduced breast cancer mortality rates among women aged 40–64 years. After 9 years of follow-up, an overall statistically significant reduction in breast cancer mortality was found among women who were offered screening compared with women who were assigned to usual medical care.

Although randomized trials provide the best and most valid data on the efficacy of screening programs, a fair amount of evidence on screening programs has come from nonexperimental study designs. Cost, feasibility, and ethical concerns can make randomized trials controversial. As radiologic screening for disease becomes more common, consideration of new evaluation methodologies to determine costs and benefits may be needed. The challenge for the future is to better identify which screening tests are appropriate for which populations. Emerging quantitative techniques of eliciting patient preferences [42] and of analyzing benefits, harms, and costs over time [43, 44] may help radiology meet this challenge.

References
1. Eddy DM. How to think about screening. In: Eddy DM, ed. Common screening tests, 1st ed. Philadelphia: American College of Physicians, 1991:1–21
2. Eddy DM. Screening for cervical cancer. In: Eddy DM, ed. Common screening tests, 1st ed. Philadelphia: American College of Physicians, 1991:255–285
3. Eddy DM. Screening for breast cancer. In: Eddy DM, ed. Common screening tests, 1st ed. Philadelphia: American College of Physicians, 1991:229–254
4. Garber AM, Sox HC Jr, Littenberg B. Screening asymptomatic adults for cardiac risk factors: the serum cholesterol level. Ann Intern Med 1989;110:622–639
5. Sox HC Jr, Garber AM, Littenberg B. The resting electrocardiogram as a screening test: a clinical analysis. Ann Intern Med 1989;111:489–502
6. Sox HC Jr, Littenberg B, Garber AM. The role of exercise testing in screening for coronary artery disease. Ann Intern Med 1989;110:456–469
7. Smith RA, Mettlin CJ, Johnston DK, Eyre H. American Cancer Society guidelines for the early detection of cancer. CA Cancer J Clin 2000;50:34–49
8. Henschke CI, McCauley DI, Yankelevitz DF, et al. Early lung cancer action project: overall design and findings from baseline screening. Lancet 1999;354:99–105
9. Kaneko M, Eguchi K, Ohmatsu H, et al. Peripheral lung cancer: screening and detection with low-dose spiral CT versus radiography. Radiology 1996;201:798–802
10. Winawer SJ, Fletcher RH, Miller L, et al. Colorectal cancer screening: clinical guidelines and rationale. Gastroenterology 1997;112:594–642
11. Frazier AL, Colditz GA, Fuchs CS, Kuntz KM. Cost-effectiveness of screening for colorectal cancer in the general population. JAMA 2000;284:1954–1961
12. Black WC, Welch HG. Screening for disease. AJR 1997;168:3–11
13. Cole P, Morrison AS. Basic issues in population screening for cancer. J Natl Cancer Inst 1980;64:1263–1272
14. Bassett LW, Lui TH, Giuliano AE, Gold RH. Prevalence of carcinoma in palpable vs. impalpable, mammographically detected lesions. AJR 1991;151:21–24
15. Morrison AS. Screening in chronic disease. New York: Oxford Univ. Press, 1992:125–127
16. Page DL, Dupont WD, Rogers LW, Landenberger M. Intraductal carcinoma of the breast: follow-up after biopsy only. Cancer 1982;49:751–758
17. Rosen PP, Braun DW Jr, Kinne DE. The clinical significance of pre-invasive breast carcinoma. Cancer 1980;46:919–925
18. Feldman AR, Kessler L, Myers MH, Naughton MD. The prevalence of cancer: estimates based on the Connecticut Tumor Registry. N Engl J Med 1986;315:1394–1397
19. Montie JE, Wood DP Jr, Pontes E, Boyett JM, Levin HS. Adenocarcinoma of the prostate in cytoprostatectomy specimens removed for bladder cancer. Cancer 1989;63:381–385
20. Kerlikowske K, Grady D, Barclay J, Frankel SD, Ominsky SH, et al. Variability and accuracy in mammographic interpretation using the American College of Radiology Breast Imaging Reporting and Data System. J Natl Cancer Inst 1998;90:1801–1809
21. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1493–1499
22. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by U.S. radiologists: findings from a national sample. Arch Intern Med 1996;156:209–213
23. American College of Radiology. Breast imaging reporting and data system (BI-RADS), 3rd ed. Reston, VA: American College of Radiology, 1998
24. Dixon AK, Dendy P. Spiral CT: how much does radiation dose matter? Lancet 1998;352:1082–1083
25. Faulkner K, Moores BM. Radiation dose and somatic risk from computed tomography. Acta Radiol 1987;28:483–488
26. Mossman KL. Analysis of risk in computerized tomography and other diagnostic radiology procedures. Comput Radiol 1982;6:251–256
27. Greenlick MR, Bailey JW, Wild J, Grover J. Characteristics of men most likely to respond to an invitation to be screened. Am J Public Health 1979;69:1011–1016
28. Wilhelmsen L, Ljungberg S, Wedel H, Werko L. A comparison between participants and non-participants in a primary preventive trial. J Chronic Dis 1976;29:331–339
29. Sackett DL, Haynes RB, Tugwell P. Clinical epidemiology: a basic science for clinical medicine. Boston: Little Brown, 1985:172–176
30. Hutchison GB, Shapiro S. Lead time gained by diagnostic screening for breast cancer. J Natl Cancer Inst 1968;41:665–681
31. Morrison AS. The effects of early treatment, lead time, and length bias on the mortality experienced by cases detected by screening. Int J Epidemiol 1982;11:261–267
32. Shapiro S, Goldberg JD, Hutchison GB. Lead time in breast cancer detection and implications for periodicity of screening. Am J Epidemiol 1974;100:357–366
33. Prorok PC. The theory of periodic screening. I. Lead time and proportion detected. Adv Appl Prob 1976;8:127–143
34. Prorok PC. The theory of periodic screening. II. Doubly bounded recurrence times and mean lead time and detection probability estimation. Adv Appl Prob 1976;8:460–476
35. Black WC, Welch HG. Advances in diagnostic imaging and overestimation of disease prevalence and the benefits of therapy. N Engl J Med 1993;328:1237–1243
36. Shwartz M. Estimates of lead time and length time bias in a breast cancer screening program. Cancer 1980;46:844–851
37. Laupacis A, Sackett DL, Roberts RS. An assessment of clinically useful measures of the consequences of treatment. N Engl J Med 1988;318:1728–1733
38. Morrison AS. Case definition in case-control studies of the efficacy of screening. Am J Epidemiol 1982;115:6–8
39. Weiss NS. Control definition in case-control studies of the efficacy of screening and diagnostic testing. Am J Epidemiol 1983;116:457–460
40. Morrison AS. The effects of early treatment, lead time, and length bias on the mortality experienced by cases detected by screening. Int J Epidemiol 1982;11:261–267
41. Shapiro S. Evidence on screening for breast cancer from a randomized trial. Cancer 1977;39[suppl 6]:2772–2782
42. Nease RF, Tsai R, Hynes LH, Littenberg B. Automated utility assessment of global health. Qual Life Res 1996;5:175–182
43. De Koning HJ, Ineveld BM, van Oortmarssen GJ, et al. Breast cancer screening and cost effectiveness: policy alternatives, quality of life considerations, and the possible impact of uncertain factors. Int J Cancer 1991;49:531–537
44. Black WC, Welch HG. A Markov model of early diagnosis. Acad Radiol 1996;3[suppl 1]:S10–S12

APPENDIX 1. Screening for Preclinical Disease: Glossary of Terms

Screening—The application of a test to detect a potential disease or condition in an individual who has no known signs or symptoms of that disease or condition.

Preclinical phase of disease—The period of time from the biologic onset of disease to the onset of clinical manifestations of the disease.

Sensitivity—The probability of having a positive test result when the disease is truly present.

Specificity—The probability of having a negative test result when the disease is truly absent.

Lead time—The interval between the diagnosis of a disease at screening and the time it would have been detected via the onset of clinical symptoms.

Length-time bias—The overrepresentation among screening-detected cases of diseases with long preclinical phases and thus more favorable prognoses.

Relative risk—The ratio of the incidence rate of a disease among individuals exposed to a particular risk factor to the incidence rate among unexposed individuals.

Correlational study—A study conducted to examine trends in disease rates relative to screening frequencies in a population or to compare the relationship between frequencies of screening and disease rates for different populations.

Cohort study—A comparative study of two or more groups that differ according to their exposure to a risk factor or other characteristic (such as whether or not they have undergone screening). The groups are then followed up prospectively to assess the incidence of a disease or other outcome hypothesized to be associated with the risk factor or characteristic.

Efficacy—The magnitude of the beneficial effect produced by a specific intervention or procedure under ideal conditions. It is ideally determined by a randomized controlled trial.

Recommended reference for standard epidemiologic terms: Last JM, ed. Dictionary of epidemiology, 4th ed. New York: Oxford Univ. Press, 2001.

AJR:179, October 2002 831


Fundamentals of Clinical Research
for Radiologists

Stephen J. Karlik1

Exploring and Summarizing Radiologic Data

Received June 13, 2002; accepted after revision July 9, 2002.

Series editors: Craig A. Beam, C. Craig Blackmore, Stephen J. Karlik, and Caroline Reinhold.

This is the eighth in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project Coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

1Diagnostic Radiology and Nuclear Medicine, Rm. 2MR21, University of Western Ontario, London Health Sciences Center-University Campus, 339 Windermere Rd., London, Ontario, Canada N6A 5A5. Address correspondence to S. J. Karlik.

AJR 2003;180:47–54
0361–803X/03/1801–47
© American Roentgen Ray Society

In this series, we have been learning about the use of statistics to plan, execute, and analyze our research. This module is designed to help define and categorize data into conventional measures for display and analysis. Display, or visualization, of the data is an important concept and one that is at the root of our understanding of various types of data. Before addressing which types of graphs, presentations, or analyses are useful and appropriate, we need to define exactly what type of data to analyze. In our studies, we choose different variables with which to collect the data, and these can be divided into two primary types: quantity and category. The quantity types are continuous (measuring) and discrete (counting), and the category types are nominal (named) and ordinal (ordered). The following sections define and give examples of each.

Quantitative Variables

Continuous Data

Continuous data are probably the least frequently reported in the radiology literature because our work has traditionally been one of dichotomous interpretation: either an imaging study successfully distinguishes an abnormal from a normal finding or it does not. Continuous data are those in which the data of interest exist in a quantifiable range of values and can take any conceivable value in that range. The degree of precision is based on the technology used for measurement. Some examples are blood pressure (mm Hg), size of a tumor (cm), serum cholesterol (µg/mL), length of an MR imaging sequence (sec), and amount of contrast material (mL). Each of these variables can have a wide range of values whose precision of measurement can vary significantly. Another way to think about continuous data is that a possible value between two other values always exists. An example would be a patient with a systolic blood pressure of 111.5 mm Hg that lies between two other patients with pressures of 111.2 and 111.9 mm Hg.

Discrete Data

A discrete variable is characterized by having only certain values (usually integers). For example, a patient can have only a whole integer representing the number of breast tumors. There are never cases of "2.7 tumors detected on a mammogram" (although a group of patients might have a mean of 2.7 tumors). Another example might be, "The study used eight radiographs for archiving the images." It seems obvious that we use only a whole radiograph, not 7.5 or 8.5. The distinction may not always be so apparent: consider the WBC count. Because one counts the number of cells per cubic millimeter, the data appear (e.g., 33 cells/mm3) like a ratio scale (which is discussed in the next section of this article). Because there are never partial cells, the data are defined as discrete.

Comparing Ratio and Interval Scales

Ratio scales of measurement have a constant interval size and a true zero point. If one patient has a 6-cm kidney tumor and a second has a 3-cm tumor, then we can state that the second tumor is half as large as the first. Ratio scales also include capacities (mL), volumes (cm3), rates (mL/min), weights (kg), and lengths of time (min).

AJR:180, January 2003 47
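The four-way taxonomy above maps naturally onto code. The sketch below is illustrative only (the dictionary and helper functions are mine, not the article's); it tags each of the article's example measurements with its type so that later display or analysis code could branch on it:

```python
# Measurement types for the article's example variables (illustrative mapping)
VARIABLE_TYPES = {
    "systolic_blood_pressure_mmHg": "continuous",  # any value in a range
    "breast_tumor_count": "discrete",              # whole numbers only
    "breast_tumor_class": "nominal",               # named, unordered categories
    "image_interpretability": "ordinal",           # ordered grades: poor < moderate < excellent
}

def is_quantitative(name):
    """Continuous and discrete variables are the two quantitative types."""
    return VARIABLE_TYPES[name] in ("continuous", "discrete")

def is_categoric(name):
    """Nominal and ordinal variables are the two categoric types."""
    return VARIABLE_TYPES[name] in ("nominal", "ordinal")
```

Such a tag is what decides, later in the article, whether a mean is meaningful (quantitative data) or whether a frequency table is the right summary (categoric data).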


Interval scale data are those derived from a measurement scale that possesses a uniform interval but no true zero point, as, for example, the centigrade temperature scale (degrees Celsius). Although the difference between 20°C and 25°C is the same as between 5°C and 10°C, 50°C cannot be considered twice as hot as 25°C because the zero point is arbitrary. The kelvin temperature scale, by contrast, is a ratio scale, because its zero value is real at absolute zero.

Categoric Variables

Nominal Data

Nominal variables often describe characteristics, such as male and female, and are commonly used in radiologic studies. Nominal scales name the values of the nominal variable. For example, a breast tumor could be classified as benign, malignant, or containing calcifications.

Ordinal Data

This type of data deals with comparisons that are relative rather than quantitative. Thus, the data consist of an ordering or a ranking of measurements. When one orders the findings, the scale becomes ordinal, even if the steps in the order are different. An example is the Kurtzke [1] expanded disability scale (0–10) for the neurologic assessment of patients with multiple sclerosis. In this widely used scale, a worsening in patient status of one unit from 1 (minor signs) to 2 (elevated thresholds) is dramatically different from a worsening from 6 (walks with assistance) to 7 (wheelchair bound). A common form used in radiology is to classify image interpretability as poor, moderate, or excellent and perhaps grade it as 1, 2, and 3.

It is also possible to have exactly the same original data portrayed as several different data types. Using an example of examination marks, we can have raw marks of 97, 75, 68, and 51 (discrete data) that can be expressed as the grades A, B, C, and D (ordinal data) or pass, pass, pass, and fail (nominal data). Although this latter example appears trivial, this exact type of data reduction is common in radiology, in which a complex data set is reduced to presence or absence to facilitate the common 2 × 2 chi-square analysis of diagnostic accuracy. The problem with data reduction is that it can result in a loss of information.

Plotting Methods

Let us take the different types of measurement in turn and examine exploring, summarizing, and presenting each type.

Continuous data have no discrete divisions between elements apart from those imposed by our measuring technique. Some examples are time, the size of a tumor, and blood pressure. Table 1 lists the time taken for a bolus injection of radiographic contrast material to reach a maximum in the kidney, with a range of 8–28 sec for 30 patients. These data are raw in the sense that they are unadulterated, unmodified, and untransformed. Time is a continuous measurement, which can take any value whatsoever, but the precision of its measurement depends on our measurement tool (wall clock vs stopwatch accurate to a millisecond). By the established rules of science, a reported time of 21 sec actually represents all times from 20.5 sec up to and including 21.4 sec. The next section illustrates a variety of ways of exploring these data.

TABLE 1. Contrast Agent Transit Time for Maximal Renal Enhancement

Patient No.   Transit Time (sec)
 1            16
 2            27
 3            19
 4            24
 5            28
 6            17
 7            13
 8             8
 9            15
10            23
11            16
12            13
13            18
14            13
15            21
16            15
17            15
18            21
19            14
20            21
21             9
22            14
23            18
24            17
25            17
26            16
27            18
28            12
29            13
30            16

A preliminary and easy way to look at continuous (raw) data is the "stem-and-leaf" plot. Although likely unfamiliar to the radiologist, it is easy to construct without computerized graphing packages and shows the distribution of the data in a rudimentary way. The common "stem" is along the left for each decade (0 for units = 0–9, 1 for teens = 10–19, and 2 for twenties = 20–29), and the different values are sorted by increasing value in the second column (Table 2). Most values are in the decade from 10–19, and no values exceed 28 sec. This plot style would clearly identify a highly unusual value (84 sec) in a large number of points—for example, if a value of 8 in the stem and 4 in the leaf were seen. Although a stem-and-leaf plot allows an easy appreciation of a data set, the details of the distribution are missing.

TABLE 2. "Stem-and-Leaf" Plot of Contrast Agent Times

Stem   Leaf
0      8, 9
1      2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9
2      1, 1, 1, 3, 3, 4, 7, 8

To obtain a more detailed examination of our example of enhancement time, we created a dot plot that shows the frequency of occurrence of each individual data value (Fig. 1). In a way analogous to the stem-and-leaf plot, the dot plot uses a stem value for each unique time value in the whole set. Then, a single dot is plotted for each occurrence of that value in the data set—for our example, one dot each for 8 and 9 sec and four dots for 16 sec. Although possibly also unfamiliar, this method is another way to picture the raw data and is analogous to a histogram for each time point. In this type of plot, each data point simultaneously shows the actual value, occupies space, and represents one counting unit. Compared with the stem-and-leaf visual, the dot plot permits a more detailed appreciation of the variability in the data and is close to a histogram (albeit one that has been stood on its end).
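The stem-and-leaf construction just described takes only a few lines of code. The sketch below is illustrative (the data list is transcribed from Table 2; the per-patient order of Table 1 does not matter for this display):

```python
# Transit times (sec) for the 30 patients, transcribed from Table 2
times = [8, 9, 12, 13, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 16,
         17, 17, 17, 18, 18, 18, 19, 21, 21, 21, 23, 23, 24, 27, 28]

def stem_and_leaf(values):
    """Map each tens digit (stem) to the sorted list of units digits (leaves)."""
    plot = {}
    for v in sorted(values):
        plot.setdefault(v // 10, []).append(v % 10)
    return plot

for stem, leaves in sorted(stem_and_leaf(times).items()):
    print(stem, "|", ", ".join(str(leaf) for leaf in leaves))
# 0 | 8, 9
# 1 | 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9
# 2 | 1, 1, 1, 3, 3, 4, 7, 8
```

The printed rows reproduce Table 2 exactly, which is the point of the display: the raw values are still readable, yet the shape of the distribution is already visible.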



Fig. 1.—Dot plot of transit time data (found in Table 1); each asterisk represents one actual occurrence of the specific time (sec) beside it.

Time (sec)   Occurrences
    8        *
    9        *
   12        *
   13        ***
   14        **
   15        ***
   16        ****
   17        ***
   18        ***
   19        *
   21        ***
   23        **
   24        *
   27        *
   28        *

Fig. 2.—Conventional frequency histogram shows all raw data for transit time (found in Table 1).

The data set organized as a conventional histogram shows the frequency of each data value as a bar (Fig. 2). When the data are scattered or the data intervals are too numerous, it is customary to reduce the number of intervals, remembering that there should be enough intervals, or bins, to show any relevant pattern. Because the data in Table 1 consist of 30 values, one published rule is to use approximately √n intervals, where n is the total number of values [2]. With an n of 30, √n is 5.5, so five or six intervals are appropriate. If we choose six intervals, the resulting histogram shows a maximum in the 15- to 17-sec interval (Fig. 3). Although this reduction of the data by decreasing the number of intervals loses some of the detail of the exact measurements seen in Figure 2, the essential character of the data is illustrated, in that the maximal enhancement time values are identified in the 12- to 21-sec area in the intervals containing 12–14, 15–17, and 18–21 sec.

Fig. 3.—Frequency histogram (for data found in Table 1) in which the number of bins has been decreased to approximately √n.

If the transit time variability is important, then you might prefer Figure 2. Conversely, if showing the typical time were your goal (say, to choose an optimal imaging time), then the expression of the data in Figure 3 would be appropriate. Neither choice is artificial; each emphasizes a different aspect of the data.
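The binning arithmetic can be reproduced in a few lines. In the sketch below, the interval edges are the ones used for Figure 3, and the data list is transcribed from the stem-and-leaf plot (Table 2); the running total anticipates the cumulative frequency table discussed next:

```python
import math
from itertools import accumulate

# Transit times (sec), transcribed from the stem-and-leaf plot (Table 2)
times = [8, 9, 12, 13, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 16,
         17, 17, 17, 18, 18, 18, 19, 21, 21, 21, 23, 23, 24, 27, 28]

n = len(times)              # 30
k = round(math.sqrt(n))     # sqrt(30) is about 5.5, so five or six bins

# The six (unequal-width) intervals used in Figure 3
bins = [(8, 11), (12, 14), (15, 17), (18, 21), (22, 24), (25, 28)]
counts = [sum(lo <= t <= hi for t in times) for lo, hi in bins]
cum = list(accumulate(counts))  # running total across the bins

print(counts)  # [2, 6, 10, 7, 3, 2]
print(cum)     # [2, 8, 18, 25, 28, 30]
```

Note that `round(math.sqrt(30))` gives 5; the article rounds up to six bins, a legitimate judgment call since the rule is only a guideline for choosing enough bins to show the pattern.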
Having plotted our data and appreciated its distribution, we must determine three primary attributes: the center, the dispersion, and the symmetry of the data distribution.

Measurements of Central Tendency

The central tendency is the tendency of the observations to accumulate at a particular value or in a particular category. The three ways of describing this phenomenon are the mean, the median, and the mode.

The most widely used measure of central tendency is the familiar mean, calculated by simply adding all values in the data set and dividing the sum by the number of samples. This procedure yields a mean value of 17.2 sec for our time data. The mean is applicable only to ratio or interval scale data.

Another way to look at these data is to make a cumulative frequency diagram. We first convert the frequency histogram (Fig. 3) to a cumulative frequency table (Table 3) and then plot Table 3 as the final cumulative frequency diagram (Fig. 4). The conversion starts by listing the number of occurrences for each interval under the interval values in Table 3. Then we calculate the cumulative frequency for each interval as the frequency of that interval plus the frequencies of all lower intervals. For example, the cumulative frequency for the interval 18–21 sec is the actual frequency (seven occurrences) plus the total frequency in all smaller intervals (n = 18), to yield 25. It is also possible to convert the raw frequency histogram to the cumulative frequency diagram in an entirely analogous way using the individual data values rather than the intervals.

TABLE 3. Cumulative Frequency of Contrast Agent Transit Times

Interval (sec)          8–11   12–14   15–17   18–21   22–24   25–28
Frequency                2      6       10      7       3       2
Cumulative frequency     2      8       18      25      28      30

Fig. 4.—Graph shows distribution of enhancement data converted to cumulative data (see Table 3). Conversion from histogram format permits easy visualization of quartiles; Q = third quartile, M = median, q = first quartile.

The cumulative frequency diagram provides the investigator with an opportunity to visualize three important measures of the data: the first quartile (q), the median value (M, or second quartile), and the third quartile (Q). The median is the middle value of the data set. Because there is an even number of observations (n = 30), we take the 15th and 16th values from a list of the data in increasing order (16 and 17 sec here) and take their average, which is 16.5 sec. The median divides the data into two equal parts (by the number of observations); the quartiles divide each of these halves into two, for four parts in total. The values of q, M, and Q can help to show whether the data are symmetric in the interquartile range, which happens if the M–q and Q–M ranges are approximately equal. This determination of interquartile ranges is our first introduction to measures that characterize the dispersion, or spread, of the observed data.

Imagine that the histogram illustrated in Figure 3 could be physically weighed instead of occupying some space in a plot. The mean can be conceptually thought of as dividing the histogram into two equal parts by weight, whereas the median is simply the middle measurement in the data set. The median also expresses less information than the mean because the median is based on the rank of the individual data values (not the actual values). When the data set has many values that are low or high compared with the average, the median is less sensitive to these values and may be a preferable way to describe the central tendency. Thus, the median is insensitive to the data extremes. In our example, we could see this insensitivity by exchanging the highest value in our set (28 sec) for a larger data point (100 sec). Although the median value would remain the same (16.5 sec), the new mean is 19.6 sec. Thus, the median retains its ability to identify a value consistent with the spirit of the data, whereas the mean has been inflated by the extreme value.

In addition to the quartile divisions mentioned previously, the distribution can also be divided into other parts, such as percentiles (100 parts). A representative example of this division is the lethal dose 50 (LD50) of pharmacologic studies. The LD50 is the dose at which 50% of the experimental animals died: the 50th percentile of lethal doses, or the median lethal dose. Similarly, q (first quartile) is the 25th percentile and Q (third quartile) is the 75th percentile.

A useful way to depict this type of data is the box-and-whiskers plot (Fig. 5), which is effective in summarizing the properties of a data set. The bottom and top of the box are the 25th and 75th percentiles (q and Q in Fig. 4), the line in the box is the median value (M), and the "whiskers" (looking like error bars) extend to the 10th and 90th percentiles.

Fig. 5.—Box-and-whiskers plot (for interval data found in Table 1) shows median and percentiles as marked. Compare this graph with Figure 4, which expresses the same data with quartiles.

The mode is another term used to describe the central tendency of a data set. The mode is defined as the most frequently occurring measurement, which is 16 sec for our enhancement data. It is possible for a data set to have more than one mode; hence the descriptor "bimodal" for a distribution of data having two modes, or two peaks, on a plot.

Measurements of Dispersion

As seen in Figure 2, our enhancement maxima do not all occur at the same time but are spread over a substantial range (8–28 sec). We can express this dispersion, or nonuniformity, in the data exactly. The most commonly used measure of dispersion for a single sample of continuous data is the SD, and, like the mean, the SD takes all the data into account. The SD is a statistical measure that expresses the average amount by which the data values in the set deviate from the mean value: the smaller the differences, the smaller the deviations, and the smaller the SD (and vice versa). For our data set, the mean is 17.2 sec with an SD of 4.7 sec.

If we can assume that the data we collect are normally distributed, the SD has some useful interpretations. For example, 68% of all observations will lie within ±1 SD of the mean value, 95% of the data lie within ±2 SDs, and 99.7% lie within ±3 SDs of the mean. Hence, the SD is approximately one sixth of the total data range for a normal distribution.

The mean and SD of a normally distributed data set tell us about its internal structure or internal proportions. Another term that is often seen is the standard error: the SD of the means of many samples drawn from the same population. The standard error depends on the sample SD, the number of samples, and the proportion of the population in the sample. These three statistical measures—mean, SD, and standard error—are used to determine whether two experimentally determined samples are from different populations. When we compare samples, we are applying a test of significance. "Statistically significant" may not equate to "interesting" or "important."

Ordinal Data

Tables are effective for the presentation of ordinal data. Table 4 illustrates an example of the reporting of vessel conspicuity for different visualization techniques: digital subtraction angiography (DSA), contrast-enhanced time-of-flight MR angiography, three-dimensional (3D) time-of-flight MR angiography, and dynamic MR angiography. The ordinal scale runs from partial visibility to excellent visibility in four steps, represented in the table by "(+)" to "+++" in an intuitively obvious way.

TABLE 4. Comparison of Vessels Revealed on Digital Subtraction Angiography (DSA) and MR Imaging Techniques

Patient No.   DSA   Contrast-Enhanced          3D Time-of-Flight   Dynamic MR
                    Time-of-Flight Angiography Angiography         Angiography
1             ++    +++                        +                   +++
2             (+)   ++                         ++                  ++
3             +++   +++                        +++                 +++
4             ++    ++                         (+)                 ++
5             +++   ++                         +++                 +++

Note.—3D = three-dimensional, (+) = partial visibility, + = fair visibility, ++ = moderate visibility, +++ = excellent visibility.

Proportions and Rates

Proportions and rates are descriptive parameters for a population that can be estimated from a sample. A rate is the occurrence of a particular event in a sample and is given as a percentage. Table 5 shows an example in which the number of events (and the rate as a percentage) is listed for four possible categories of neurologic outcome after stenting of the carotid artery. "Proportion" is a descriptor that is applicable to categoric data. A stacked bar chart permits visualization of the proportions of three measures in three different patient groups (Fig. 6). In radiology, we frequently use a common statistical test (chi-square) to determine whether the rate or proportion of observations differs in two or more populations.

TABLE 5. Complications Associated with Stenting of the Carotid Artery

Neurologic Outcome     Ipsilateral Hemisphere (n = 156)   Contralateral Hemisphere (n = 88)
Major stroke            2 (1.2)                            2 (2.2)
Minor stroke           18 (11.5)                          13 (14.8)
Neurologic death        1 (0.6)                            0 (0)
Nonneurologic death     3 (1.9)                            5 (5.7)
Total events           24 (15)                            20 (23)

Note.—Data are numbers (%) of patients.

Fig. 6.—Bar chart shows the proportion of patients in three treatment groups found to have no change in prostate size (black bar), enlargement (white bar), or a decrease in prostate size (gray bar). Note the proportion of patients in each classification in each of the three differently sized groups.

Relationship Between Two Variables

At times, we take two simultaneous measurements of our study population to determine whether a relationship exists. In some instances, the measurements are taken to establish a pattern in the data (e.g., body weight and X-ray attenuation) or to search for an easy-to-measure surrogate marker for a hard-to-measure value (e.g., measuring the amount of iodinated contrast agent in a solution using its optical absorbance).
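The central-tendency and dispersion figures quoted for the transit-time data (mean 17.2 sec, median 16.5 sec, mode 16 sec, SD 4.7 sec) can be checked with the Python standard library. A sketch; the data list is transcribed from the stem-and-leaf plot (Table 2), and the last lines repeat the outlier experiment from the text:

```python
import statistics

# Transit times (sec), transcribed from the stem-and-leaf plot (Table 2)
times = [8, 9, 12, 13, 13, 13, 14, 14, 15, 15, 15, 16, 16, 16, 16,
         17, 17, 17, 18, 18, 18, 19, 21, 21, 21, 23, 23, 24, 27, 28]

mean_sec = statistics.mean(times)      # 17.23..., reported as 17.2
median_sec = statistics.median(times)  # mean of 15th and 16th values: 16.5
mode_sec = statistics.mode(times)      # 16 occurs four times
sd_sec = statistics.stdev(times)       # sample SD, 4.72..., reported as 4.7

# Swap the maximum (28 sec) for an extreme 100 sec: the mean jumps to
# about 19.6 sec, but the median is unchanged at 16.5 sec.
outliered = sorted(times)[:-1] + [100]
print(statistics.mean(outliered), statistics.median(outliered))
```

This is exactly the robustness argument the text makes: the mean tracks the extreme value, the rank-based median does not.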
Fig. 7.—Scatterplots for six data sets show different data distributions.
A–F, Pearson's product moment correlation coefficients for the data sets are as follows: 0.864 (A), 0.991 (B), –0.992 (C), –0.549 (D), 0.078 (E), and 0.247 (F).
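The Pearson coefficients quoted in the caption come from a simple formula. The sketch below is my own implementation of the standard product moment calculation (not code from the article); it also reproduces the pitfall illustrated by panel E: a perfectly systematic but nonlinear (parabolic) relationship can yield an r of zero. For ordinal data, Spearman's rank correlation is the same calculation applied to the ranks of the values.

```python
import math

def pearson_r(x, y):
    """Pearson's product moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A straight line gives r = 1 (to floating-point precision).
xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2 * v + 1 for v in xs]))   # ~1.0

# A parabola symmetric about its vertex: a strong relationship, yet r = 0.
px = [-3, -2, -1, 0, 1, 2, 3]
print(pearson_r(px, [v * v for v in px]))       # 0.0
```

The second example is why the article insists on plotting a scatterplot before computing r: the coefficient measures closeness to linearity only.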



A scatterplot is the first step in examining the relationship between two sets of measures. The correlation coefficient (r) measures how close the relationship between two measurements is to linearity. The maximal values for r are 1 and –1, and the two variables can be positively or negatively correlated. If the two variables show a nonlinear relationship (e.g., parabolic), then r can equal zero even though a strong relationship exists. The two calculations for the correlation coefficient are Pearson's product moment correlation for normal data and Spearman's rank correlation for ordinal data.

When a correlation coefficient is used, three steps should be adhered to: first, plot the raw data in a scatterplot; second, observe whether a relationship exists between the variables; and third, if the data suggest a linear, but not a curvilinear, relationship, calculate r. The problem with correlation calculations is that correlation can be confused with causality, and caution should be used in such an interpretation. The possibility of an indirect relationship, via a third and unmeasured variable, should be eliminated; it is up to the scientist to show that such third variables have no effect on the observed correlation. Another caution is that Pearson's correlation coefficient is only dependable when the two compared variables are normally distributed, because an outlier point can dominate the correlation.

In interpreting the strength of a correlation coefficient, we found no common consensus on the scale descriptors. A useful published example of descriptors might be: 0.0–0.2, very weak or negligible; 0.2–0.4, weak or low; 0.4–0.7, moderate; 0.7–0.9, strong, high, or marked; and 0.9–1.0, very strong or very high [3].

Plotting data sets in scatterplots (Fig. 7) permits us to visually evaluate the data, and we can predict the outcome of an analysis of the correlation coefficients. The data in Figure 7A would have a good correlation, which is supported by a Pearson's test yielding an r value of 0.864 (strong correlation). Figures 7B and 7C are obviously linear and have r values of 0.991 and –0.992 (very strong correlation). Figure 7D is somewhat ambiguous; however, r is equal to –0.549, and thus a moderate correlation exists. The data in Figure 7E are clearly related, but because the relationship is nonlinear, r is equal to 0.078. Even Figure 7F has a higher correlation coefficient (r = 0.247) than Figure 7E. A look at the correlation values alone would suggest that the data in Figure 7E have no relationship, whereas the data have an interesting one that is immediately visible in the scatterplot.

When the scatterplot of the data for two variables looks as though a linear relationship exists, it is tempting to describe the relationship as linear and calculate the relationship between the variables using linear regression. This approach compares a dependent variable (y) in relation to an independent variable (x), which yields the familiar y = mx + b, where m is the slope of the line and b is the y-intercept (the value of y when x = 0). Our hypothetic example shows the plot of raw data, regression line, and 95% confidence limits (Fig. 8). The difference between correlation and regression is that in a correlation, neither variable is fixed, whereas in regression, one measurement is a variable (y) that depends on the other (x). Often, the value of x is assumed to be fixed, capable of observation without error, and normally distributed. Should there be no logical argument to define one variable as dependent and the other as independent, the solution is to use a calculation of correlation and avoid the concept of dependence altogether. The importance of confidence limits should not be underestimated, either here with regression [4] or elsewhere with statements of sensitivity and specificity [5] or proportions and rates. For example, if we claim no side effects from contrast injections in 20 patients (rate = 0%), the upper 95% confidence limit of the rate of occurrence is actually 19%.

Fig. 8.—Graph shows hypothetic data set (●) with linear regression (solid line) and 95% confidence intervals (dashed lines) plotted. Note that the confidence intervals permit appreciation of the strength of the regression. r2 = 0.927, slope (m) = 1.28, and x-intercept = –0.286.

TABLE 6. Sample Contingency Table

Diagnostic Test Result   Target Disorder Present   Target Disorder Absent   Total
Positive                 653 (A)                   127 (B)                  780 (A + B)
Negative                 77 (C)                    1400 (D)                 1477 (C + D)
Total                    730 (A + C)               1527 (B + D)             2257 (A + B + C + D)

Note.—Sample contingency table summarizes the numbers of patients with and without a target disorder who test positive or negative on a single diagnostic test.

Sensitivity and Specificity

Sensitivity and specificity are ratios fundamental to the radiology discipline. They relate the ability of an imaging technique to reveal disease when present (sensitivity) and



Karlik

to rule out disease when absent (specificity). regardless of the diagnostic test [7]? Both dispersion were shown. The relationship be-
The numbers are generated using the familiar negative and positive predictive values can tween two variables was determined by the cor-
2 × 2 table, which we have seen previously in also be calculated from the 2 × 2 table, as relation coefficient with a consideration of the
this series [6], for proportions used to com- well as prevalence, pre- and posttest odds, caveat that correlation should not be confused
pare diagnostic determination (presence or likelihood ratios, and posttest probability. with causality. The familiar 2 × 2 contingency
absence of disease) with a standard of refer- Usually, these statistical measurements are table and derived values were explored. Identi-
ence. The better the latter (e.g., surgical con- portrayed in simple tables or in the text of an fying variable types and choosing their appro-
…firmation), the more valuable and accurate the diagnostic measurement will be. Although the analysis of a 2 × 2 contingency table has been shown previously in this series of articles, we will use the example in Table 6 to calculate these values. Sensitivity is a / (a + c), which is equal to 653/730 or 89%; specificity is d / (b + d), which is equal to 1400/1537 or 91%. Missing from most reports in the radiology literature is the confidence interval based on the binomial theorem [7]. There are a few key questions to consider when evaluating sensitivity and specificity values: Was there an independent and blind comparison with the standard of reference? Was the diagnostic test evaluated in a group of patients appropriate to the target population? Was the standard of reference applied […]

[…] article. It is useful to show all 2 × 2 contingency tables because it is then possible for the reader to calculate all these values. Even when the 2 × 2 is expanded into a receiver operating characteristic analysis (to be described later in the series), the relevant measure (usually area under the curve) can be expressed in table format with the appropriate confidence intervals.

Summary

The purpose of this article was to define the different variables that radiologists routinely use to describe their data. Categoric and continuous data types were identified, and suitable graphs and tables were shown to depict the findings in an informative and succinct manner. Continuous data and measures of central tendency and […] appropriate displays should be a more straightforward task after studying these examples.

References
1. Kurtzke JF. On the evaluation of disability in multiple sclerosis. Neurology 1998;50:1961–1970
2. Clarke GM. Statistics and experimental design. London: Edward Arnold, 1994:7
3. Rowntree D. Statistics without tears. London: Penguin, 1991:170
4. Glanz SA. Primer of biostatistics. New York: McGraw-Hill, 1992:211
5. Harper R, Reeves B. Reporting of precision of estimates for diagnostic accuracy: a review. BMJ 1999;318:1322–1323
6. Jarvik JG. The research framework. AJR 2001;176:873–878
7. Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-based medicine. New York: Churchill Livingstone, 1997:118–128
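The sensitivity and specificity arithmetic above, plus the confidence intervals the text notes are usually missing, can be scripted directly. A minimal sketch: the cell counts are those implied by the quoted fractions (Table 6 itself is not reproduced here), and the interval is a simple normal approximation to the binomial, one of several options.

```python
from math import sqrt

def proportion_ci(k, n, z=1.96):
    """Normal-approximation 95% CI for a binomial proportion k/n."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half), min(1.0, p + half)

# Counts implied by the fractions quoted in the text
a, c = 653, 77     # true positives, false negatives (a + c = 730)
b, d = 137, 1400   # false positives, true negatives (b + d = 1537)

sens, sens_lo, sens_hi = proportion_ci(a, a + c)
spec, spec_lo, spec_hi = proportion_ci(d, b + d)
print(f"sensitivity = {sens:.3f} (95% CI {sens_lo:.3f}-{sens_hi:.3f})")
print(f"specificity = {spec:.3f} (95% CI {spec_lo:.3f}-{spec_hi:.3f})")
```

Reporting the interval alongside the point estimate is exactly the practice the article recommends; for small cell counts an exact or Wilson interval would be preferable to this approximation.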

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series: Introduction, which appeared in the February 2001 issue; Framework, April 2001; Protocol, June 2001; Data Collection, October 2001; Population and Sample, November 2001; Statistically Engineering the Study for Success, July 2002; and Screening for Preclinical Disease: Test and Disease Characteristics, October 2002.

54 AJR:180, January 2003


Fundamentals of Clinical Research
for Radiologists

Stephen J. Karlik 1

Visualizing Radiologic Data

Received June 13, 2002; accepted after revision July 9, 2002.

Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the ninth in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

1 Department of Diagnostic Radiology and Nuclear Medicine, Rm. 2MR21, University of Western Ontario, London Health Sciences Center–University Campus, 339 Windermere Rd., London, Ontario N6A 5A5, Canada. Address correspondence to S. J. Karlik.

AJR 2003;180:607–619
0361–803X/03/1803–607
© American Roentgen Ray Society

It should come as no surprise to radiologists, who earn their living by analysis of visual information, that the analysis and presentation of scientific data should also have a significant visual component. Not only does the visual presentation enhance the clarity of the data, whether for presentation or for publication, but in fundamental ways it also assists in our understanding of it. In fact, modern data graphics should be considered instruments for reasoning about quantitative information. Sometimes the best way to understand, describe, or summarize numeric data is to look at a picture of it. In consideration of statistical graphics, one publication rises above all others: The Visual Display of Quantitative Information, by E. R. Tufte [1]. Far from being a cold and clinical tome, as the title suggests, this is an entertaining and approachable work from which we can take important concepts and apply them to the expression of radiologic data.

The display of data in graphs, charts, and diagrams has a specific aim: to discover any patterns in the data. Although this data display is particularly useful in continuous data (described in "Exploring and Summarizing Radiologic Data" in this series [2]), this article will use examples to illustrate those instances in which other graphic styles can help us conceptualize the phenomena underlying our observations. When data are prepared for publication, it is customary to choose the most relevant and smallest number of illustrations or radiographic images to describe the findings. Statistical graphs can assist in this process by revealing and concentrating large amounts of data into a manageable size for portrayal. Graphic excellence has certain properties: clarity, precision, efficiency, consideration of several variables simultaneously, and honesty in revealing the data [1].

Graphic Integrity

This article is an examination of a series of figures from the recent radiology literature, with careful attention to the basis of graphic integrity as outlined by Tufte [1]. Graphic integrity includes using the physical size of numbers or symbols in proportion to the actual values; showing data variation—not design variation; using clear and unambiguous labeling; not quoting data out of context; and avoiding having the number of graphic dimensions exceed the number of dimensions in the data [1]. Other key definitions and concepts brought up by Tufte are illustrated in the figures. Because these figures are reprinted to illustrate points about graphic design, none was changed to conform to the American Journal of Roentgenology style for figures.

One fundamental concept in judging graphic competence is that of "data ink," in which the data–ink ratio equals the ink used for data (data ink) divided by the total ink used in the graph [1]. Therefore, background grids, three-dimensional pictures, shading, and hyperactive bar fills are unproductive ink, diluting the data–ink ratio. For clarity, then, nondata ink should be erased. The overall principles to optimize the data–ink ratio include showing the data, maximizing the data–ink ratio, erasing nondata ink, and erasing redundant data ink [1]. Some of the radiologic examples in this article pertain to the issue of data ink.

Furthermore, there are annoying charts and graphs that substitute graphic variation for data variation. One type of colorfully named "chartjunk" [1] is the moiré optical effect

AJR:180, March 2003 607


Fig. 1.—Example of presentation with high data–ink ratio. (Reprinted from [3]) A and B, Multivariable graphs show control MR imaging–determined parotid gland size for male (A) and female (B) patients. Each patient data point represents parotid gland size, age, and patient condition. Parotid gland size increased in patients with hyperlipidemia (■) but not Sjögren's syndrome (▲). Mean values ± two standard deviations are plotted (containing 95% of data) to provide visualization of spread of control data versus patient values.

caused by closely spaced lines. You have seen this effect on the television screen (particularly in striped clothing), and now, with the promulgation of computerized graphing programs, it is becoming more common in research reports. Although background grids can assist in the reading of a complex data set, de-enhancing the grid to a lighter shade of gray may help to minimize the optical assault. Most of the data ink should be devoted to data variation. Following this premise enhances the efficiency of communication. In the design of statistical graphs, the ability to portray complexity, structure, and density of data should always be considered.

The Good, the Bad, and the Ugly

This article reviews 21 figures taken from the recent radiologic literature to examine how these graphic presentation issues have been dealt with. Figures 1 and 2 are examples of data-intense multivariable graphs in which a substantial amount of data is concentrated in a format that permits visualization of data variation in patients (Fig. 1) and temporal relationships (Fig. 2). Figure 1 has a high data–ink ratio and allows the reader to easily comprehend the control-versus-patient differences in the MR imaging determination of parotid gland size. Figure 2, although containing a background grid that dilutes the data–ink ratio somewhat, is effective in coordinating the temporal events associated with contrast enhancement.

Fig. 2.—Example of graph with high data–ink ratio that portrays related data in one presentation. Multivariable graph depicts attenuation versus time for several tissues after contrast injection. Conspicuity and attenuation of liver, tumor (✱), aorta (▲), and portal vein (■) are plotted. Phases of hepatic enhancement are also illustrated. (Reprinted from [4])
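Tufte's advice to erase nondata ink translates directly into plotting code. A minimal matplotlib sketch, with invented conspicuity scores (the format names and values are illustrative, not taken from any figure in this article):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Invented example data: mean lesion conspicuity for four display formats
formats = ["A", "B", "C", "D"]
scores = [3.1, 3.3, 2.9, 3.2]

fig, ax = plt.subplots()
ax.bar(formats, scores, color="0.4")  # flat gray fill: no 3-D effects or hatching (moire)
ax.grid(False)                        # no background grid
for side in ("top", "right"):         # erase frame lines that carry no data
    ax.spines[side].set_visible(False)
ax.set_ylabel("Mean conspicuity score")
fig.savefig("formats.png", dpi=150)
```

Every line after the `bar` call removes ink rather than adding it, which is the point: the defaults of many plotting programs start well above the ideal data–ink ratio.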



Fig. 3.—Example of figure that successfully illustrates temporal relationship between two dependent variables. Graph shows plotting relationship between two different but related phenomena using two different y axes: displacement (on left) and velocity (on right) for mean through-plane motion of prosthetic valve. This figure has high data–ink ratio, especially with error bars included. Choice for x-axis position is compromised, leading to some obscuring of data values and of x-axis tick labels. (Reprinted with permission from [5])

A previous article in this series discussed graphing two variables to show a relationship in their correlation [2]; the graph style shown in Figure 3 allows the comparison of two characteristics. Two phenomena that are different but related are plotted with displacement on the left y-axis and velocity on the right y-axis. The graphic is data intensive with a high data–ink ratio. Unfortunately, the choice of the x-axis position has obscured some of the x-axis tick labels and modestly confuses the interpretation of the data.

Figure 4 shows box-and-whiskers plots for contrast-to-noise ratios for a variety of different coronary vessel segments. No statistical differences exist, and the plot permits the reader to visualize that result. However, the plot does not give the number observed for each artery segment and contains additional lines of division that are nondata ink and could be erased.

Fig. 4.—Box-and-whiskers plots. Graph shows contrast-to-noise ratio in electron-beam CT coronary angiography for different coronary vessel segments. Bottom and top edges of box are 25th and 75th percentiles, horizontal line represents the median, and error bars delimit extent of 10th and 90th percentiles. No statistical differences were observed, and this type of plot effectively portrays this data variability. LM = left main coronary artery, LAD = left anterior descending coronary artery, LCX = left circumflex coronary artery, RCA = right coronary artery, p = proximal segment, m = middle segment, d = distal segment. (Reprinted from [6])

Figure 5 shows an example of the ubiquitous receiver operating characteristic curve. In this instance, however, the straightforward curve with 10 data points is obscured in a sea of nondata ink, including the background grid, line of unity, extra axis tick marks, and the inserted legend, which is clearly not needed because only one data set is plotted on the graph. All these represent chartjunk and should be eliminated. Compare Figure 5


with Figure 6, in which four curves are plotted and the data–ink ratio is high. It is clear from the curves that no differences were seen for the four display formats and three abnormalities (a–c). These data could be presented in tables because no significant differences were observed.

Figure 7 is our first example of moiré optical vibrations. This complex figure describes the calculated optimum treatment strategy in a two-way sensitivity analysis varying the relative risk of failure after stent placement. The graph shows a decrease in relative risk with the enlarging proportion of patients requiring stent placement after percutaneous transluminal angioplasty. However, no confidence intervals are shown, and the input confidence interval and proportion indicated by the arrows suggest that the three groups may not be differentiated.

Moiré effects (optical noise) are one design fault in Figure 8. Additional chartjunk is seen in the threefold repetition of the type of radiologists and the actual data values sitting atop each bar. An examination of the amount of data actually shown in the figure reveals very few data points considering the amount of ink used to represent them. Also, no significant differences exist in detectability between any measurements for any of the lesion types. A replot of the data values that decreases the repetition and clearly portrays the paucity of data (Fig. 9) still shows a lack of significance.

The bar charts shown in Figure 10 are also dominated by optical effects. The graphs depict the area under the receiver operating characteristic curve for 20 radiologists interpreting from four different displays. This figure occupies a considerable amount of visual real estate to show virtually no significant differences. Although minimal differences are indicated, no correction for multiple comparisons is indicated nor are confidence intervals shown.

The three panels of Figure 11 show mild moiré patterns and background grids. The graphs illustrate the decrease in number of vertebral disks seen with a decrease in radiation dose. However, no statistical tests were indicated. Normally, it is sufficient to plot only the upgoing section of the error bars on the top of each bar. However, in this instance, the error bars are actually the data range (the same as the range whiskers in the box-and-whiskers plot), so this is an unfamiliar hybrid plot.

Figure 12 shows the upward shift in receiver operating characteristic area values observed when a group of radiologists used computer-aided diagnosis. The choice of bar fills does not dominate the visual picture. The shift to higher receiver operating characteristic areas is clearly seen, and the variability of the distribution in performance is intact and interpretable.

Fig. 5.—Example of poor data–ink ratio for receiver operating characteristic curve. Graph shows only 10 data points, which are obscured by tremendous amount of nondata ink, including background grid, tick marks, and line of unity. DAFL = differential air–fluid level. (Reprinted from [7])

Fig. 6.—Example of data that could have been handled in table format. (Reprinted with permission from [8]) A–C, Graphs show findings for reticular (A), small nodular (B), and ground-glass (C) abnormalities in four display formats. Appropriate receiver operating characteristic curves are used, but curves are not significantly different for any abnormalities. Repetition is unproductive. In each graph, it is difficult to discern individual curves and their identification.
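The Figure 5 critique suggests what a spare receiver operating characteristic plot looks like: compute the curve, draw it as a single line, and omit the grid and the one-entry legend. A hypothetical sketch with simulated reader scores (not the data behind Figure 5):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Invented reader scores: diseased cases tend to score higher than normals
scores = np.concatenate([rng.normal(1.0, 1.0, 50), rng.normal(0.0, 1.0, 50)])
truth = np.concatenate([np.ones(50, bool), np.zeros(50, bool)])

# Sweep thresholds from high to low to trace the curve
thresholds = np.sort(np.unique(scores))[::-1]
tpr = [(scores[truth] >= t).mean() for t in thresholds]
fpr = [(scores[~truth] >= t).mean() for t in thresholds]

fig, ax = plt.subplots()
ax.plot(fpr, tpr, "k-")                        # one data set: no legend needed
ax.plot([0, 1], [0, 1], color="0.8", lw=0.8)   # line of unity, de-emphasized if kept at all
ax.set_xlabel("False-positive fraction")
ax.set_ylabel("True-positive fraction")
ax.grid(False)
fig.savefig("roc.png", dpi=150)
```

The faint line of unity is the only nondata ink retained, and even that is debatable; everything else on the axes is the curve itself.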


Fig. 7.—Figure in which data–ink ratio and optical vibrations (moiré effect) are poor. Graph shows complex theoretic analysis of optimal treatment strategy using two-way sensitivity analysis. PTA = percutaneous transluminal angioplasty, SS = selective stent placement, CI = confidence interval. (Reprinted with permission from [9])

Fig. 8.—Example of figure that could have been simplified. Bar chart shows average detectability of lung abnormalities divided into severity for two groups of radiologists and four display methods. The presentation has two principal problems: moiré vibrations (optical noise) and redundancy, with the two groups of radiologists repeated for each degree of abnormality. (Reprinted with permission from [8])

The use of filled versus open bars is an effective method of delineation between groups in Figure 13. However, the graph does contain superfluous background grids, and design variation was chosen over data variation. One of the rules for graphic design suggested by Tufte [1] is that the graph's dimensions should not exceed the data dimension. Here we have a three-dimensional plot of only two-dimensional data. The graph design adds substantial visual ink without adding anything to the interpretation. However, unless the graphs are carefully considered, even one with copious data ink can be confusing.

Figure 14 shows the raw and mean values for a number of measures of FDG uptake for 10 patients. The reader cannot follow the actual values from each patient because too many overlapping symbols appear. Although the mean (the only filled symbol) is easy to pick out, the error bars add to the confusion. Clutter could have been avoided by offsetting the mean and standard deviation plots to the side of the raw data.

Figure 15 shows the alterations in proportion for a group of 20 radiologists interpreting images from two formats. They were given three types of images to view and asked which gave the best processing. No significant differences were reported. Although the bar graphs show interradiologist variability well, much ink is used to show an absence of significant changes between formats. Because all the bars


Fig. 9.—Example of another way data in Figure 8 might have been presented. Plot uses much less data ink without losing portrayal of any raw data. Different symbols are used to represent each radiologist.
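A replot of this kind can be sketched as follows; the readers, methods, and detectability scores are invented for illustration and are not the data behind Figures 8 and 9:

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Invented detectability scores: 2 readers x 4 display methods x 3 severities
methods = ["Screen film", "CR standard", "CR high", "Soft copy"]
severities = ["subtle", "moderate", "obvious"]
scores = {  # scores[reader][severity] -> one value per display method
    "reader 1": {"subtle": [2.1, 2.3, 2.2, 2.0],
                 "moderate": [3.0, 3.1, 3.2, 2.9],
                 "obvious": [3.8, 3.9, 3.9, 3.7]},
    "reader 2": {"subtle": [2.0, 2.2, 2.4, 2.1],
                 "moderate": [2.9, 3.2, 3.1, 3.0],
                 "obvious": [3.7, 3.8, 4.0, 3.8]},
}
markers = {"reader 1": "o", "reader 2": "s"}

fig, axes = plt.subplots(1, 3, sharey=True, figsize=(9, 3))
for ax, sev in zip(axes, severities):
    for reader, by_sev in scores.items():
        # open symbols instead of filled bars: same data, far less ink
        ax.plot(range(len(methods)), by_sev[sev], markers[reader],
                mfc="none", mec="k", label=reader)
    ax.set_xticks(range(len(methods)))
    ax.set_xticklabels(methods, rotation=45, ha="right", fontsize=7)
    ax.set_title(sev, fontsize=9)
axes[0].set_ylabel("Detectability")
axes[0].legend(frameon=False, fontsize=7)
fig.tight_layout()
fig.savefig("dotplot.png", dpi=150)
```

Every raw value survives the replot, yet the bars, fills, and repeated labels are gone; the paucity of data becomes honestly visible.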

Fig. 10.—Example of bar charts dominated by moiré patterns. Illustration of all raw data for many areas from receiver operating characteristic analyses hides fact that multiple comparisons would require additional statistical tests. There is little value in occupying so much visual real estate for not much significant data. (Reprinted with permission from [8])


Fig. 11.—Examples of moiré patterns associated with bars filled with opposing hash lines (each representing a different observer) and effect of including nondata ink (grids). Values for p are not indicated. (Reprinted with permission from [10]) A–C, Bar charts show findings in lung-equivalent (A), heart-equivalent (B), and subdiaphragm-equivalent (C) regions.

add up to unity (one), the black infill for the third proportion is redundant.

Kaplan-Meier curves (Fig. 16) are rarely seen in radiology but are common in clinical studies. These curves are excellent for showing how rapidly a proportion of different populations reaches a predetermined clinical outcome (in this instance, stroke) for two populations divided by sonographic criteria on day 0. The left panel represents less than 50% stenosis and the right panel, greater than 50% stenosis. It is unfortunate that the two panels have different y-axis ranges. Visually, it appears that the patients with nonhypoechoic findings in the greater-than-50% group have about the same number of strokes as both groups in the less-than-50% panel on the left. They appear about equal, however, because of the change in scale between the panels.

Changing the axis to accentuate the differences between groups is also shown in Figure 17. Whereas the left panel has time points for 30, 60, and 90 sec, the center and right panels show the same data for one time point only, and the three scatterplots have different axis ranges. The inclusion of all the raw data is commendable, but no indication of statistical differences is shown. Using one graph could have eliminated repetition, and additional lines could have joined the same tumor at each time point to show whatever trends were found in the temporal evolution of the enhancement.

Figure 18 shows three-dimensional graphs for four spectroscopic measurements and clinical outcome in three groups of patients. Three-dimensional graphs are intuitively difficult to comprehend, and these examples also show moiré effects. The use of three graph dimensions is appropriate to the three data dimensions: proportion, MR-spectroscopy measurement, and clinical outcome. The number of patients whose findings contribute to each of the bars is small, however, so this graph format overstates the value of the data. Also, the lack of confidence intervals allows the graph to appear to tell a definitive story, whereas the variability of the data that would be associated with such low numbers is not illustrated.

Similarly, the three-dimensional bar graphs for MR imaging findings in Figure 19 show an appropriate number of dimensions (three: grade, cohort, and age). No statistical analysis is indicated nor are confidence intervals shown. It appears that the three grades of the three panels increase with age in the whole cohort independent of the group subdivision. A considerable amount of visual real estate is used to illustrate data that have a common pattern. The findings from these three graphs could be summarized in a few sentences in the results section of the text.

An odd combination of two measurements in one graph is seen in Figure 20. The main panel represents the mean and 95% confidence intervals for the loss of cartilage thickness under pressure for 210 min. The inserted panel has a different time axis, although the scale is the same. Perhaps a better way to show these data would be to use the release point at 210 min as the zero point with times


negative before (during compression) and times positive afterward (during decompression). The two y-axes should be either the same or better coordinated.

The final figure of this review, Figure 21, has two panels showing the change in two phenomena as a function of time after angioplasty. The graphs have a good data–ink ratio, and actual measurements for 10 patients are illustrated. Although the overall patterns can be discerned, the mean values (dashed lines) are partially obscured, and the line indicating abnormal values is also a dashed line. The y-axis on panel B has been broken between 6 and 12, and the scale is smaller above the break, giving an emphasis to the lower values. The graphs also lack an indication of the reliability of the measurements and a statistical evaluation of the results.

The critique of the graphs in this article was designed to help the reader understand the principles of good data presentation, in which economy, clarity, and honesty are the essential guides.

Summary

Radiologists should apply to the selection and content of graphics conveying radiologic data the same skills they use in the selection of radiographic images for presentation or publication. This article has reviewed the fundamentals for visual display of quantitative information from radiologic studies. The truth about the data should be shown in an efficient manner and the chartjunk minimized. Clarity and honesty are paramount. Although meeting these criteria seems a valuable goal and an easy task to accomplish, these examples of graphics from the recent literature suggest that we need to scrutinize more carefully. Clarity of graphing leads to clarity of thinking and of presentation.

Fig. 12.—Example of figure that provides value-filled expression of improvement in diagnostic accuracy and leaves variability visible. (Reprinted with permission from [11]) A and B, Bar charts show diagnostic accuracy without (A) and with (B) computer-aided diagnosis (CAD). Bars have muted moiré effect and charts have more pleasing overall appearance compared with those of Figures 8, 10, and 11. Panel B shows that using CAD resulted in increase in diagnostic accuracy for all groups of radiologists.


Fig. 13.—Example of visually effective use of filled versus open bars for comparing distribution of number of cases per channels visualized. Use of three-dimensional bars gives graphic variation but adds no value to depiction of data. Figure also has nondata ink in background. (Reprinted with permission from [12])

Fig. 14.—Example of complicated scatterplot. Figure depicts large amount of information for variety of FDG parameters for 10 patients. It is difficult to follow specific values for individual patients and to discern mean percentage differences (●). Error bars are confusing. (Reprinted with permission from [13])
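The offsetting remedy proposed for Figure 14 — raw points at each tick, with the mean ± standard deviation displaced to one side — might look like this in matplotlib (invented uptake data; the measure names are placeholders, not those in the original figure):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Invented data: percentage difference in an uptake measure for 10 patients x 3 measures
measures = ["measure 1", "measure 2", "measure 3"]
data = [rng.normal(loc, 5.0, 10) for loc in (0.0, 2.0, -1.0)]

fig, ax = plt.subplots()
for i, (name, vals) in enumerate(zip(measures, data)):
    x = np.full(vals.size, i, dtype=float)
    ax.plot(x, vals, "o", mfc="none", mec="k")  # raw values at the tick, open symbols
    ax.errorbar(i + 0.25, vals.mean(), yerr=vals.std(ddof=1),
                fmt="ko", capsize=3)            # mean +/- SD offset to the side
ax.set_xticks(range(len(measures)))
ax.set_xticklabels(measures)
ax.set_ylabel("Difference (%)")
fig.savefig("offset_means.png", dpi=150)
```

Because the summary symbol no longer sits on top of the raw points, both the individual patients and the central tendency stay legible.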


Fig. 15.—Example of interesting use of data ink to show proportions for two variables and 20 observers, with change in display parameter. No difference exists in discrimination between modalities; therefore, much ink is used to show no differences. (Reprinted with permission from [8]) A and B, Bar charts show differences in observer interpretation of nonzooming (A) and twofold zooming (B) soft-copy displays.

Fig. 16.—Example of Kaplan-Meier survival graphs. (Reprinted with permission from [14]) A and B, Graphs illustrate proportion of individuals who remain without stroke divided by degree of stenosis of less than 50% (A) and greater than 50% (B). Each group is further divided by nonhypoechoic and hypoechoic findings. Although patients with nonhypoechoic findings in B have higher occurrence of strokes than those of both groups in A, difference in y-axis range in B makes proportions appear nearly identical.
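A Kaplan-Meier pair drawn with a shared y-axis avoids the scale problem this caption describes. A sketch with a hand-rolled product-limit estimator and invented follow-up data (it assumes distinct event times; real analyses would use a survival library):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def kaplan_meier(times, events):
    """Product-limit estimate of the survivor function.
    times: follow-up in days; events: 1 = stroke, 0 = censored."""
    times = np.asarray(times, float)
    events = np.asarray(events, int)
    order = np.argsort(times)
    times, events = times[order], events[order]
    at_risk = len(times)
    s, steps = 1.0, [(0.0, 1.0)]
    for t, e in zip(times, events):
        if e:  # each event multiplies survival by (n_at_risk - 1) / n_at_risk
            s *= (at_risk - 1) / at_risk
            steps.append((t, s))
        at_risk -= 1
    return zip(*steps)  # x, y for a step plot

# Invented follow-up data for two plaque groups within one stenosis stratum
grp = {
    "nonhypoechoic": ([120, 300, 460, 700, 900, 1000], [1, 0, 1, 0, 0, 0]),
    "hypoechoic":    ([60, 150, 220, 400, 650, 1000],  [1, 1, 0, 1, 1, 0]),
}

# sharey keeps both panels on one y-axis range, avoiding the Figure 16 problem
fig, axes = plt.subplots(1, 2, sharey=True)
for ax, (name, (t, e)) in zip(axes, grp.items()):
    x, y = kaplan_meier(t, e)
    ax.step(x, y, where="post", color="k")
    ax.set_title(name, fontsize=9)
    ax.set_ylim(0, 1.05)
axes[0].set_ylabel("Proportion free of stroke")
fig.savefig("km.png", dpi=150)
```

With a common scale, equal-looking curves really are equal; the reader never has to mentally rescale one panel against the other.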



Fig. 17.—Three scatterplots showing attenuation of early-enhanced CT images of adenomas and nonadenomas at different times after injection of contrast material. No statistical differences were indicated. (Reprinted with permission from [15]) A–C, Scatterplots show data at different time intervals: 30, 60, and 90 sec (A); 180 sec only (B); and 30 min only (C). Because y-axis scales are changed for each part, this presentation visually suggests that discrimination between groups is noted at 30 min. Parts B and C should have also been plotted with attenuation versus all times of observation to reduce redundancy and nondata ink.

Fig. 18.—Example of complicated three-dimensional bar graphs that are difficult to understand. Moiré effects are present also. (Reprinted with permission from [16]) A–C, Graphs illustrate complex relationships between four measures and clinical outcome for three groups of patients: neonates (A), children (B), and infants (C). Graphs appear to hold substantial amount of information, but close examination reveals that each bar represents few individuals and findings are visually overstated. This combination of moiré effects and complex data presentation makes data difficult to apprehend.


Fig. 19.—Example of material that could have been presented in text or table format because no significant differences were found and data content is minimal. (Reprinted with permission from [17]) A–C, Three-dimensional graphs show grade-scoring changes for subgroups sulcal (A), ventricular (B), and white matter (C) grades and ages. No error bars are shown, and numbers of subjects in each subgroup are not given. CHS = cardiovascular health study, NF = nonblack female, BF = black female, NM = nonblack male, BM = black male.

Fig. 20.—Unusual figure that inserts graph of completely different phenomenon within main (enclosing) graph. Although it is sometimes useful to have different plots using different axes in one figure, this combination is both confusing and potentially misleading. Minimum acceptable figure would have identical time axis, perhaps with release point at which time equals zero. (Reprinted with permission from [18])

Fig. 21.—Examples of graphs in which changes in values for individual patients are almost impossible to follow. A large amount of data ink was used. (Reprinted with permission from [19]) A and B, Graphs illustrate changes before and after angioplasty in two vascular phenomena, ankle–brachial pressure (A) and peak velocity (B). Discerning mean values (thick dashed lines) is difficult. Limits for abnormal values (thin dashed lines) are useful. Y-axis scaling for part B is different below and above axis break, emphasizing lower values. No indication of reliability or statistical tests for measurements are provided, even for individual cases, so we cannot judge whether differences are significant.


References
1. Tufte ER. The visual display of quantitative information. Cheshire, CT: Graphics Press, 1983:51–111
2. Karlik SJ. Exploring and summarizing radiologic data. AJR 2003;180:47–54
3. Izumi M, Hida A, Takagi Y, Kawabe Y, Eguchi K, Takashi N. MR imaging of the salivary glands in Sicca syndrome: comparison of lipid profiles and imaging in patients with hyperlipidemia and patients with Sjögren's syndrome. AJR 2000;175:829–834
4. Kuszyk BS, Bluemke DA, Choti MA, Horton KM, Magee CA, Fishman EK. Contrast-enhanced CT of small hypovascular hepatic tumors: effect of lesion enhancement on conspicuity in rabbits. AJR 2000;174:471–475
5. Kozerke S, Hasenkam JM, Nygaard H, Paulsen PK, Pedersen EM, Boesiger P. Heart-motion-adapted MR velocity mapping of blood velocity distribution downstream of aortic valve prostheses: initial experience. Radiology 2001;218:548–555
6. Chernoff DM, Ritchie CJ, Higgins CB. Evaluation of electron beam CT coronary angiography in healthy subjects. AJR 1997;169:93–99
7. Harlow CL, Stears RLG, Zeligman BE, Archer DG. Diagnosis of bowel obstruction on plain abdominal radiographs: significance of air–fluid levels at different heights in the same loop of bowel. AJR 1993;161:291–295
8. Ishigaki T, Endo T, Ikeda M, et al. Subtle pulmonary disease: detection with computed radiography versus conventional chest radiography. Radiology 1996;201:51–60
9. Bosch JL, Tetteroo E, Mali WP, Hunik MGM. Iliac artery occlusive disease: cost-effectiveness analysis of stent placement versus percutaneous transluminal angioplasty. Radiology 1998;208:641–648
10. Harrell GC, Floyd CE, Johnston GA, Ravin CE. Quality control phantom for digital chest radiography. Radiology 1997;202:111–116
11. McMahon H, Engelmann R, Behlen FM, et al. Computer-aided diagnosis of pulmonary nodules: results of a large-scale observer test. Radiology 1999;213:723–726
12. Goldfarb LR, Alazraki NP, Eshima D, Eshima LA, Herda SC, Halkar RK. Lymphoscintigraphic identification of sentinel lymph nodes: clinical evaluation of 0.22 mm filtration of Tc-99m sulfur colloid. Radiology 1998;208:505–509
13. Minn H, Zasadny KR, Quint LE, Wahl RL. Lung cancer: reproducibility of quantitative measurements for evaluating 2-[F-18]-fluoro-2-deoxy-D-glucose uptake at PET. Radiology 1995;196:167–173
14. Polak JF, Shemanski L, O'Leary DH, et al. Hypoechoic plaque at ultrasound of the carotid artery: an independent risk factor for incident stroke in adults age 65 years or older. Radiology 1998;208:649–654
15. Szolar DH, Kammerhuter F. Quantitative CT evaluation of adrenal gland masses: a step forward in the differentiation between adenomas and non-adenomas? Radiology 1997;202:517–521
16. Holhouser BA, Ashwal S, Luy GY, et al. Proton MR spectroscopy after acute central nervous system injury: outcome prediction in neonates, infants and children. Radiology 1997;202:487–496
17. Chang Yue N, Arnold AM, Lonsteth WT, et al. Sulcal, ventricular and white matter changes at MR imaging in the aging brain: data from the cardiovascular health study. Radiology 1997;202:33–37
18. Rubenstein JD, Kim JK, Henkelman RM. Effects of compression and recovery on bovine articular cartilage: appearance on MR images. Radiology 1996;201:843–850
19. Minar E, Pokrajac B, Ahmadi R, et al. Brachytherapy for prophylaxis of restenosis after long-segment femoropopliteal angioplasty: pilot study. Radiology 1998;208:173–179

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003



Fundamentals of Clinical Research
for Radiologists

Lawrence Joseph 1,2
Caroline Reinhold 3

Introduction to Probability Theory
and Sampling Distributions

Received July 9, 2002; accepted after revision August 1, 2002.

Series editors: Craig A. Beam, C. Craig Blackmore, Stephen Karlik, and Caroline Reinhold.

This is the tenth in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

1 Department of Medicine, Division of Clinical Epidemiology, Montreal General Hospital, 1650 Cedar Ave., Montreal, Quebec, H3G 1A4, Canada. Address correspondence to L. Joseph.

2 Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Ave. W., Montreal, Quebec, H3A 1A2, Canada.

3 Department of Diagnostic Radiology, Montreal General Hospital, McGill University Health Centre, 1650 Cedar Ave., Montreal, Quebec, H3G 1A4, Canada.

AJR 2003;180:917–923
0361–803X/03/1804–917
© American Roentgen Ray Society
AJR:180, April 2003

Statistical inference allows one to draw conclusions about the characteristics of a population on the basis of data collected from a sample of subjects from that population. Almost all the statistical inferences typically seen in the medical literature are based on probability models that connect summary statistics calculated using the observed data to estimates of parameter values in the population. This article will cover the basic principles behind probability theory and examine a few simple probability models that are commonly used, including the binomial, normal, and Poisson distributions. We will then see how sampling distributions are used as the basis for statistical inference and how they are related to simple probability models. Thus, this article forms the foundation for future articles in the series that will present the details of statistical inference in particular clinical situations.

Making medical decisions on the basis of findings from various radiologic diagnostic tools is an everyday occurrence in clinical practice. In radiologic research, one often needs to draw conclusions about the relative performance of one diagnostic tool compared with another for the detection of a given condition of interest. Both of these tasks depend, in large part, on probability theory and its applications. In diagnosis, we are interested in calculating the probability that the condition of interest is present on the basis of results of a radiologic test. This probability depends on how sensitive and specific that test is in diagnosing the condition and on the background rate of the condition in the population. This calculation largely depends on a result from probability called Bayes’ theorem. Similarly, all statistical inferences, whether comparisons of two proportions representing diagnostic accuracies from two instruments or inferences from a more complex model, are based on probabilistic reasoning. Therefore, a thorough understanding of the meaning and proper interpretation of statistical inferences, crucial to daily decision making in a radiology department, depends on an understanding of probability and probability models.

This article is composed of three main parts. We begin with an introduction to probability, including the definitions of probability, the different schools of thought about the interpretation of probabilities, and some simple examples. We continue by defining conditional probabilities and presenting Bayes’ theorem, which is used to manipulate conditional probabilities. The most common simple probability models, including the binomial, normal, and Poisson distributions, are presented next, along with the types of situations in which we would be most likely to use them. Finally, sampling strategies are examined. Armed with these basics of probability and sampling, we conclude with a discussion of how the outcome of interest defines the model parameter on which to focus inferences and how the sampling distribution of the estimator of that parameter enables valid inferences from the data collected in the sample about the population at large.

Probability

Definitions of Probability

Last’s Dictionary of Epidemiology [1] presents two main definitions for probability. The first definition, which represents the view of the frequentist school of statistics, defines the probability of an event as the number of times the event occurs divided by the number of trials in which it could have occurred, n, as n approaches infinity. For example, the probability that a coin will come up heads is 0.5 because, assuming the coin is fair, as the number of trials (flips of the coin) gets larger and larger, the observed proportion will be, on average, closer and closer to 0.5. Similarly, the probability that an intervention for back pain is successful would be defined as the number of times it is observed to be successful in a large (theoretically infinite) number of trials in patients with back pain.

Although this definition has a certain logic, it has some problems. For example, what is the probability that team A will beat team B in their game tonight? Because this is a unique event that will not happen an infinite number of times, the definition cannot be applied. Nevertheless, we often hear statements such as “There is a 60% chance that team A will win tonight.” Similarly, suppose that a new intervention for back pain has just been developed, and a radiologist is debating whether to apply it to his or her next patient. Surely the probability of success of the new intervention compared with the probability of success of the standard procedure for back pain will play a large role in the decision. However, no trials (and certainly not an infinite number of trials) as yet exist on which to define the probability. Although we can conceptualize an infinite number of trials that may occur in the future, this projection does not help in defining a probability for today’s decision. Clearly, this definition is limited, not only because some events can happen only once, but also because one cannot observe an infinite number of like events.

The second definition, often referred to as the Bayesian school, defines the probability of any event occurring as the personal degree of belief that the event will occur. Therefore, if I personally believe that there is a 70% chance that team A will win tonight’s game, then that is my probability for this event. In coin tossing, a Bayesian may assert, on the basis of the physics of the problem and perhaps a number of test flips, that the probability of a coin flip coming up heads should be close to 0.5. Similarly, on the basis of an assessment that may include both previously available data and subjective beliefs about the new technique, a radiologist may assert that the probability that a procedure will be successful is 85%.

The obvious objection to Bayesian probability statements is that they are subjective, and thus different radiologists may state different probabilities for the success rate of the new technique. In general, no single “correct” probability statement may be made about any event, because such statements reflect personal subjective beliefs. Supporters of the Bayesian viewpoint counter that the frequentist definition of probability is difficult to apply in practice and does not pertain to many important situations. Furthermore, the possible lack of agreement as to the correct probability for any given event can be viewed as an advantage, because it will correctly mirror the range of beliefs that may exist about any event that does not have a large amount of data from which to accurately estimate its probability. Hence, having a range of probabilities depending on the personal beliefs of a community of clinicians is a useful reflection of reality. As more data accumulate, Bayesian and frequentist probabilities tend to agree, each essentially converging to the mean of the data. When this occurs, similar inferences will be reached from either viewpoint.

Discussion of these two ways of defining probability may seem to be of little relevance to radiologists but, later in this series, it will become apparent that it has direct implications for the type of statistical analysis to be performed. Different definitions of probability lead to different schools of statistical inference and, most importantly, often to different conclusions based on the same set of data. Any given statistical problem can be approached from either a frequentist or a Bayesian viewpoint, and the choice often depends on the experience of the user more than it does on one or the other approach being more appropriate for a given situation. In general, Bayesian analyses are more informative and allow one to place results into the context of previous results in the area [2], whereas frequentist methods are often easier to carry out, especially with currently available commercial statistical packages. Although most analyses in medical journals currently follow the frequentist definition, the Bayesian school is increasingly present, and it will be important for readers of medical journals to understand both.

The lack of a single definition of probability may be disconcerting, but it is reassuring to know that whichever definition one chooses, the basic rules of probability are the same.

Rules of Probability

Four basic rules of probability exist. These rules are usually expressed more rigorously than is necessary for the purposes of this article, through the use of set theory and probability notation.

The first rule states that, by convention, all probabilities are numbers between 0 and 1. A probability of 0 indicates an impossible event, and a probability of 1 indicates an event certain to happen. Most events of interest have probabilities that fall between these extremes.

The second rule is that events are termed “disjoint” if they have no outcomes in common. For example, the event of a patient having cancer is disjoint from the event of the same patient not having cancer, because both cannot happen simultaneously. On the other hand, the event of cancer is not disjoint from the event that the patient has cancer with metastases, because in both cases the outcome of a cancer is present. If events are disjoint, then the probability that one or the other of these events occurs is given by the sum of the individual probabilities of these events. For example, in looking at an MR image of the liver, if the probability that the diagnosis is a hepatoma is 0.5 (meaning 50%) and the probability of metastases is 0.3, then the probability of either hepatoma or metastases must be 0.8, or 80%.

The third rule is expressed as follows: If one could list the set of all possible disjoint events of an experiment, then the probability that one of these events happens is 1. For example, if a patient is diagnosed according to a 5-point scale in which 1 is defined as no disease; 2, as probably no disease; 3, as uncertain disease status; 4, as probably diseased; and 5, as definitely diseased, then the probability that one of these states is chosen is 1.

The fourth rule states that, if two events are independent (i.e., knowing the outcome of one provides no information concerning the likelihood that the other will occur), then the probability that both events will occur is given by the product of their individual probabilities. Thus, if the probability that findings on an MR image will result in a diagnosis of a malignant tumor is 0.1, and the probability that it will rain today is 0.3 (an independent event, presumably, from the results of the MR imaging), then the probability of a malignant tumor and rain today is 0.1 × 0.3 = 0.03, or 3%.

In summary, probabilities for events always follow these four rules, which are compatible with common sense. Such probability calculations can be useful clinically, for example, in deriving the probability of a certain diagnosis given one or more diagnostic test results. Many probability calculations used in clinical research involve conditional probabilities. These are explained next.
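The four rules can be checked numerically. Below is a minimal sketch in Python using the figures from the examples above (the hepatoma and metastasis probabilities, and the tumor and rain probabilities); the variable names are ours, for illustration only.

```python
# Rule 2 (disjoint events add): hepatoma (0.5) or metastases (0.3) on the liver MR image
p_hepatoma = 0.5
p_metastases = 0.3
p_either = p_hepatoma + p_metastases  # sum applies because the diagnoses are disjoint

# Rule 4 (independent events multiply): malignant tumor (0.1) and rain today (0.3)
p_tumor = 0.1
p_rain = 0.3
p_both = p_tumor * p_rain

# Rule 1: every probability stays between 0 and 1
for p in (p_hepatoma, p_metastases, p_either, p_tumor, p_rain, p_both):
    assert 0.0 <= p <= 1.0

print(round(p_either, 1), round(p_both, 2))  # prints: 0.8 0.03
```

The same additions and multiplications apply to any disjoint or independent pair of events, which is all the binomial calculation later in the article relies on.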


Conditional Probabilities and Bayes’ Theorem

What is the probability that a given patient has endometrial cancer? Clearly, this depends on a number of factors, including age, the presence or absence of postmenopausal bleeding, and others. In addition, our assessment of this probability may drastically change between the time of the patient’s initial clinic visit and the point at which diagnostic test results become known. Thus, the probability of endometrial cancer is conditional on other factors and is not a single constant number by itself. Such probabilities are known as conditional probabilities. Notationally, if unconditional probabilities can be denoted by Pr (cancer), then conditional probabilities can be denoted by Pr (cancer | diagnostic test is positive), read as “the probability of cancer given or conditional on a positive diagnostic test result,” and, similarly, Pr (cancer | diagnostic test is negative), read as “the probability of cancer given a negative diagnostic test result.” These probabilities are highly relevant to radiologic practice and clinical research in radiology.

Because they are a form of probability, conditional probabilities must follow all rules as outlined in the previous section. In addition, however, there is an important result that links conditional probabilities to unconditional probability statements. In general, if we denote one event by A, and a second event by B, then we can write

Pr (A | B) = Pr (A and B) / Pr (B).

In words, the probability that event A occurs, given that we already know that event B has occurred, denoted by Pr (A | B), is given by dividing the unconditional probability that these two events occur together by the unconditional probability that B occurs. Of course, this formula can be algebraically manipulated, so that it must also be true that

Pr (A and B) = Pr (B) × Pr (A | B).

For example, suppose that in a clinic dedicated to evaluating patients with postmenopausal bleeding, endovaginal sonography is often used for the detection of endometrial cancer. Assume that the overall probability of a patient in the clinic having endometrial cancer is 10%. This probability is unconditional; that is, it is calculated from the overall prevalence in the clinic, before any test results are known. Furthermore, suppose that the sensitivity of endovaginal sonography for diagnosing endometrial cancer is 90%. If we let A represent the event that the patient has positive findings on endovaginal sonography, and let B represent the event that a patient in this population has endometrial cancer, then we can summarize the above information as Pr (B) = 0.1 and Pr (A | B) = 0.9. By using the formula described, we can deduce that the probability that a patient in this clinic has both endometrial cancer and positive results on endovaginal sonography is 0.1 × 0.9 = 0.09, or 9%.

In typical clinical situations, we may know the background rate of the disease in question in the population referred to a particular clinic (which may differ from clinic to clinic), and we may have some idea of the sensitivity and specificity of the test. Notice that in the terms used, sensitivity and specificity may be considered conditional probabilities because they provide the probability of testing positive given a subject who truly has the condition of interest (i.e., Pr [A | B], which is the sensitivity), and the probability of not testing positive given the absence of the condition of interest (i.e., the specificity, Pr [not A | not B]). What should a clinician conclude if a patient walks through the door with a “positive” test result in hand? In this case, one would like to know the probability of the patient’s being truly positive for the condition, given that he or she has just had a test with positive findings. Of course, if the diagnostic test is a perfect gold standard, one can simply look at the test result and be confident of the conclusion.

However, most tests do not have perfect sensitivity and specificity, and thus a probability calculation is needed to find the probability of a true-positive, given the positive test result. In our notation, we know the prevalence of the condition in our population, Pr (B), and we know the sensitivity and specificity of our test, given by Pr (A | B) and Pr (not A | not B), but we want to know Pr (B | A), which is the opposite in terms of what is being conditioned on. How does one reverse the conditioning argument, in effect making statements about Pr (B | A) when we only know Pr (A | B)? The answer is to use a general result from probability theory, called Bayes’ theorem, which states

Pr (B | A) = [Pr (B) × Pr (A | B)] / [Pr (B) × Pr (A | B) + Pr (not B) × Pr (A | not B)].

Suppose that the background rate of endometrial cancer seen in patients referred to a particular radiology clinic is 10% and that a diagnostic test is applied that has Pr (A | B) = 90% sensitivity and Pr (not A | not B) = 80% specificity. What is the probability that a patient with positive test results in fact has endometrial cancer? According to Bayes’ theorem, we calculate

Pr (B | A) = [Pr (B) × Pr (A | B)] / [Pr (B) × Pr (A | B) + Pr (not B) × Pr (A | not B)] = (0.1 × 0.9) / (0.1 × 0.9 + 0.9 × 0.2) = 0.33,

or about 33%. In this case, even when a patient has a positive test result, the chances that the disease is present are less than 50%.

Similarly, what is the probability that a subject testing negative has endometrial cancer? Again using Bayes’ theorem,

Pr (B | not A) = [Pr (B) × Pr (not A | B)] / [Pr (B) × Pr (not A | B) + Pr (not B) × Pr (not A | not B)] = (0.1 × 0.1) / (0.1 × 0.1 + 0.9 × 0.8) = 0.014.

Thus, starting from a background rate of 10% (pretest probability), the probability of cancer rises to 33% after a positive diagnosis and falls to approximately 1% after a negative test (posttest probabilities). Thus, Bayes’ theorem allows us to update our probabilities after learning the test result, and it is thus of great usefulness to practicing radiologists. The next module in this series covers Bayes’ theorem and diagnostic tests in more detail.

Probability Models

Rather than working out all problems involving probabilities by first principles using the basic probability rules as we have discussed, it is possible to use shortcuts that have been devised for common situations, leading to probability functions and probability densities. Here we review three of the most common distributions: the binomial, the normal, and the Poisson. Which distribution to use depends on many situation-specific factors, but we provide some general guidelines for the appropriate use of each.
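The pretest-to-posttest updates in the endometrial cancer example above can be reproduced in a few lines of Python. This is a sketch rather than code from the article; the function and argument names are ours.

```python
def posttest_probability(prevalence, sensitivity, specificity, test_positive):
    """Bayes' theorem for a dichotomous test: Pr(disease | test result)."""
    if test_positive:
        true_pos = prevalence * sensitivity                 # Pr(B) x Pr(A | B)
        false_pos = (1 - prevalence) * (1 - specificity)    # Pr(not B) x Pr(A | not B)
        return true_pos / (true_pos + false_pos)
    false_neg = prevalence * (1 - sensitivity)              # Pr(B) x Pr(not A | B)
    true_neg = (1 - prevalence) * specificity               # Pr(not B) x Pr(not A | not B)
    return false_neg / (false_neg + true_neg)

# Endometrial cancer example: prevalence 10%, sensitivity 90%, specificity 80%
p_pos = posttest_probability(0.10, 0.90, 0.80, test_positive=True)
p_neg = posttest_probability(0.10, 0.90, 0.80, test_positive=False)
print(round(p_pos, 2), round(p_neg, 3))  # prints: 0.33 0.014
```

The same function applies to any dichotomous test once its prevalence, sensitivity, and specificity are specified, which is exactly the reversal of conditioning that Bayes’ theorem provides.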


The Binomial Distribution

One of the most commonly used probability functions is the binomial. The binomial distribution allows one to calculate the probability of obtaining a given number of “successes” in a given number of trials, wherein the probability of a success on each trial is assumed to be p. In general, the formula for the binomial probability function is

Pr (x successes in n trials) = [n! / (x! (n – x)!)] × p^x × (1 – p)^(n – x),

where n! is read “n factorial” and is shorthand for

n × (n – 1) × (n – 2) × (n – 3) × . . . × 3 × 2 × 1.

For example, 5! = 5 × 4 × 3 × 2 × 1 = 120, and so on. By convention, 0! = 1. Suppose we wish to calculate the probability of x = 8 successful angioplasty procedures in n = 10 patients with unilateral renal artery stenosis, wherein the probability of a successful angioplasty each time is 70%. From the binomial formula, we can calculate

[10! / (8! 2!)] × 0.7^8 × (1 – 0.7)^2 = 0.2335,

so that there is slightly less than a one-in-four chance of getting eight successful angioplasty procedures in 10 trials. Of course, these days such calculations are usually done by computer, but seeing the formula and calculating a probability using it at least once helps to avoid that “black box” feeling one can often get when using a computer and adds to the understanding of the basic principles behind statistical inference. Similarly, the probability of getting eight or more (that is, eight or nine or 10) successful angioplasty procedures is found by adding three probabilities of the type shown, using the second probability rule because these events are disjoint. As an exercise, one can check that this probability is 0.3828. See Figure 1 for a look at all probabilities for this problem, in which x varies from zero to 10 successes for n = 10 and p = 0.7.

Fig. 1.—Graph shows binomial distribution with sample size of 10 and probability of success p = 0.7.

The binomial distribution has a theoretic mean of n × p, which is a nice intuitive result. For example, if one performs n = 100 trials, and on each trial the probability of success is p = 0.4 or 40%, then one would intuitively expect 100 × 0.4 = 40 successes. The variance, σ², of a binomial distribution is n × p × (1 – p), so that in the example just given it would be 100 × 0.4 × 0.6 = 24. Thus, the SD is

σ = √24 = 4.90,

roughly meaning that although on average one expects approximately 40 successes, one also expects each result to deviate from 40 by an average of approximately five successes.

The binomial distribution can be used any time one has a series of independent trials (different patients in any trial can usually be considered as independent) wherein the probability of success remains the same for each patient. For example, suppose that one has a series of 100 patients, all with known endometrial cancer. If each patient is asked to undergo MR imaging, for example, and if the true sensitivity of this test is 80%, what is the probability that 80 of them will in fact test positive? By plugging p = 0.8, n = 100, and x = 80 into the binomial probability formula as discussed, one finds that this probability is 0.0993, or about 10%. (One would probably want to do this calculation on a computer because 100!, for example, would be a tedious calculation.)

Normal Distribution

Perhaps the most common distribution used in statistical practice is the normal distribution, the familiar bell-shaped curve, as seen in Figure 2. Many clinical measurements follow normal or approximately normal distributions (e.g., tumor sizes). Technically, the curve is traced out by the normal density function

[1 / (√(2π) × σ)] × exp[–(1/2) × (x – µ)² / σ²],

where “exp” denotes the exponential function to the base e = 2.71828. The Greek letter µ is the mean of the normal distribution, set to zero in the standard curve of Figure 2, and the SD is σ, set to 1 in the standard normal curve. Although Figure 2 presents the standard version of the normal curve (µ = 0, σ² = σ = 1), more generally, the mean µ can be any real number and the SD can be any number greater than zero. Changing the mean shifts the curve depicted in Figure 2 to the left or right so that it remains centered at the mean, whereas changing the SD stretches or shrinks the curve around the mean, all while keeping its bell shape. Note that the mean (usual arithmetic average), median (middle value, i.e., the point at which 50% of the area under the curve lies above and below), and mode (most likely value, i.e., the highest point on the curve) of a normal distribution are always the same and equal to µ. Approximately 95% of the area under the curve falls within 2 SDs on either side of the mean, and approximately 68% of the area falls within 1 SD of the mean.
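The binomial numbers above (the angioplasty example) and the 68%/95% normal-area facts can be verified with the Python standard library. A sketch, assuming nothing beyond the built-in math and statistics modules; the helper name is ours.

```python
import math
from statistics import NormalDist

def binomial_pmf(x, n, p):
    """Pr(x successes in n trials) from the binomial formula in the text."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Angioplasty example: n = 10 trials, success probability p = 0.7
p8 = binomial_pmf(8, 10, 0.7)                              # eight successes exactly
tail = sum(binomial_pmf(x, 10, 0.7) for x in (8, 9, 10))   # eight or more successes
# Mean and SD for n = 100, p = 0.4: expect 40 successes, SD = sqrt(npq)
sd = math.sqrt(100 * 0.4 * 0.6)

# Normal curve: area within 1 SD and within 2 SDs of the mean
z = NormalDist()  # standard normal, mu = 0, sigma = 1
within_1sd = z.cdf(1) - z.cdf(-1)
within_2sd = z.cdf(2) - z.cdf(-2)

print(round(p8, 4), round(tail, 4), round(sd, 2),
      round(within_1sd, 3), round(within_2sd, 3))
# prints: 0.2335 0.3828 4.9 0.683 0.954
```

`math.comb` computes the n!/(x!(n – x)!) term directly, so large counts such as n = 100 pose no difficulty, and `NormalDist.cdf` replaces the normal tables mentioned in the text.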


Fig. 2.—Graph shows standard normal distribution with mean µ = 0 and SD σ = 1. Approximately 95% of area under curve falls within 2 SDs on either side of mean, and about 68% of area falls within 1 SD of mean.

The normal density function has been used to represent the distribution of many measures in medicine. For example, tumor size, biparietal diameter, or bone mineral density in a given population may be said to follow a normal distribution with a given mean and SD. It is highly unlikely that any of these or other quantities exactly follow a normal distribution. For instance, none of these quantities can have negative numbers, whereas the range of the normal distribution always includes all negative (and all positive) numbers. Nevertheless, for appropriately chosen mean and SD, the probability of out-of-range numbers will be vanishingly small, so that this may be of little concern in practice. We may say, for example, that tumor size in a given population follows a normal distribution with a mean of 20 mm and an SD of 10 mm, so that the probability of a value less than zero is only approximately 2.5%. In the words of statistician George Box [3], “All models are wrong, but some are useful.”

To calculate probabilities associated with the normal distribution, one must find the area under the normal curve. Because doing so is mathematically difficult, normal tables or a computer program are usually used. For example, the area under the standard normal curve between –1 and 2 is 0.8186, as calculated via normal tables or via a computer package for statistics.

The normal distribution is central to statistical inference for an additional reason. Consider taking a random sample of 500 patients visiting their family physicians for periodic health examinations. If the blood pressure of each patient were recorded and an average were taken, one could use this value as an estimate of the average in the population of all patients who might visit their family physicians for routine checkups. However, if the experiment were repeated, it would be unexpected for the second average of 500 patients to be identical to the first average, although one could expect it to be close.

How these averages vary from one sample to another is given by the central limit theorem, which in its simplest form is explained as follows. Suppose that a population characteristic has true (but possibly unknown) mean µ and standard deviation σ. The distribution of the sample average, x̄, based on a sample of size n, approaches a normal distribution as the sample size grows large, with mean µ and SD σ/√n. As will be explained in future articles, the sample average, x̄, is used to estimate the true (but unknown) population mean µ. The SD about a sample mean, σ/√n, is often called the standard error (SE).

This useful theorem has two immediate consequences. First, it accounts for the popularity of the normal distribution in statistical practice. Even if an underlying distribution in a population is nonnormal (e.g., if it is skewed or binomial), the distribution of the sample average from this population becomes close to normal if the sample size is large enough. Thus, statistical inferences can often be based on the normal distribution, even if the underlying population distribution is nonnormal. Second, the result connects the sample mean to the population mean, forming the basis for much of the statistical inference. In particular, notice that as the sample size n increases, the SD (SE) σ/√n of the sample mean around the true mean decreases so that on average the sample mean x̄ gets closer and closer to µ. We return to this important point later, but first look at our last distribution, the Poisson.

Poisson Distribution

Suppose that we would like to calculate probabilities relating to numbers of cancers over a given period of time in a given population. In principle, we can consider using a binomial distribution because we are talking about numbers of events in a given number of trials. However, the numbers of events may be enormous (number of persons in the population times the number of time periods). Furthermore, we may not even be certain of the denominator but may have some idea of the rate (e.g., per year) of cancers in this population from previous data. In such cases in which we have counts of events through time rather than counts of successes in a given number of trials, we can consider using the Poisson distribution. More precisely, we make the following assumptions:

First, we assume that the probability of an event (e.g., a cancer) is proportional to the time of observation. We can notate this as Pr (cancer occurs in time t) = λ × t, wherein λ is the rate parameter, indicating the event rate in units of events per time. Second, we assume that the time t is small enough that two events cannot occur in time t. For cancer in a population, t may be, for example, 1 min. The event rate λ is assumed to be constant through time (homogeneous Poisson process). Finally, we assume that events (cancers) occur independently.

If all of these assumptions are true, then we can derive the distribution of the number of counts in any given period of time. Let µ = λ × t be the rate multiplied by time, which is the Poisson mean number of events in time t. Then the Poisson distribution is given by

Pr (x events occur in time t) = e^(–µ) × µ^x / x!,

where e = 2.71828 . . . , and x! denotes the factorial of x (the same as in the binomial distribution). Both the mean and the variance of the Poisson distribution are equal to µ. The graph of the Poisson distribution for µ = 10 is given in Figure 3.

Fig. 3.—Chart shows the Poisson distribution with mean µ = 10.

As an example of the use of the Poisson distribution, suppose that the incidence of a certain type of cancer in a given region is 250 cases per year. What is the probability that there will be exactly 135 cancer cases in the next 6 months? Let t = 1 year, then µ = 250 cancers per year. We are interested, however, in t = 0.5, which means that µ = 125 cancers per 6-month period. Using

Summary

The binomial distribution is used for yes/no or success/fail dichotomous variables, the normal distribution is often used for probabilities concerning continuous variables, and the Poisson distribution is used for outcomes arising from counts. These three distributions, of course, are by no means the only ones available, but they are among the most commonly used in practice. Deciding whether they are appropriate in any given situation requires careful consideration of many factors and verification of the assumptions behind each distribution and its use.

This ends our brief tour of the world of probability and probability distributions. Armed with these basics, we are now ready to consider some simple statistical inferences.

Sampling Distributions

So far, we have seen the definitions of probability, the rules probabilities must follow, and three probability distributions. These ideas form the basis for statistical inferences, but how? The key is sampling distributions.

First, we must distinguish sampling distributions from probability distributions and population distributions, which can be explained through an example: Suppose we would like to measure the average tumor size on detection at MR imaging for a certain type of cancer. If we were able to collect the tumor size for all patients with this disease (i.e., a complete census) and create a histogram of these values, then these data would represent the population distribution. The mean of this distribution would represent the true average tumor size in this population.

It is rare, if not impossible, for anyone to perform a complete census, however. One will usually have the opportunity to observe only a subset of the subjects in the target population (i.e., a sample). Suppose that we are able to take a random sample of subjects from this population, of, for example, n = 100 patients.

all, what is the sampling distribution of x̄? Suppose we were to take a second random sample of size 100 and record its mean. It would not likely be exactly 20 mm but perhaps be close to that value, for example, 18 mm. If we repeated this process for a third sample, we might get a mean of 21 mm, and so on. Now imagine the thought experiment in which we would repeat this process an infinite number of times and draw the histogram of these means of 100 subjects. The resulting histogram would represent the sampling distribution of x̄ for this problem.

According to the central limit theorem, the sampling distribution of x̄ is a normal distribution, with mean µ representing the true but unknown mean tumor size (available only if a complete census is taken), and with an SE σ/√n. Therefore, the SE in our example is 10/√100 = 1 mm. So the sampling distribution of x̄ is normal, with unknown mean µ, and SE of 1. Although we do not know the mean of the sampling distribution, we do know, from our facts about the normal distribution, that 95% of all x̄’s sampled in this experiment will be within ±2 × 1 = 2 SEs from µ. Thus, although µ remains unknown, we do expect it to be near x̄ in this sense. Chances are very good that x̄ will be within 2 mm of µ, allowing statements called confidence intervals about µ that we will examine more closely in subsequent articles in this series. If we observed only 10 tumors rather than 100, our SE would have been 10/√10 = 3.2 mm, leading to less accuracy in estimating µ, whereas a sample size of 1000 would lead to an SE of 0.32, leading to increased accuracy compared with a size of 100.

To summarize, population distributions represent the spread of values of the variable of interest across individuals in the target population, whereas sampling distributions show how the estimate of the population mean varies from one sample to the next if the experiment were to be repeated and the mean calculated each time. The sampling distribution connects the estimator, here x̄, to the parameter of interest, here µ, the mean tumor size in the population. Larger
the Poisson distribution, we can calculate In each case, we observe the tumor size and sample sizes lead to more accurate estimation.
record the average value. Suppose this average Similar inferences can be made from ob-
Pr (135 cancers | µ = 125) = value is x = 20 mm, with a SD of σ = 10 mm. servations that are dichotomous using the bi-
We can thus conclude that 20 mm, the average nomial distribution or for count data using
e– 125 125135 value in our sample, is a reasonable (unbiased) the Poisson distribution. Again, these topics
135! point estimate of the average tumor value in are relegated to a future article in this series.
our population, but how accurate is it? How Notice that we had to make various assump-
= 0.0232. does this accuracy vary if we change the sam- tions in the previous discussion—for example,
ple size to only 10 patients? What about if we that the distribution of tumor sizes in the popu-
Therefore, approximately 2.3% of a chance increase it to 1000 patients? lation is approximately normal and, most im-
exists of observing 135 cancers in the next 6 The answer to these questions lies in the portantly, that the subjects are representative of
months. sampling distribution of the estimator, x. First of the population to whom we wish to make infer-

922 AJR:180, April 2003


Probability Theory and Sampling Distributions

ences. The easiest way to ensure representative- though a tumor size of 20 mm may in fact be presented here, countless books explain basic
ness is through random selection, but this may the average in your sample, this estimate is statistical concepts—dozens with a focus on
not be possible in some situations for practical biased if patients with smaller or larger tu- biostatistics. Among them are the works of
reasons. For true random selection to occur, one mors are systematically left out. For example, Armitage and Berry [4], Colton [5], Rosen-
must have a list of all members of the popula- subjects with preclinical symptoms may not berg et al. [6], and Rosner [7].
tion and select subjects to form the study sam- visit your clinic, even if their tumors might
ple by random number generation or another have been detectable on MR imaging, result- References
random process. Lists of all members of the tar- ing in 20 mm being an overestimate of the
1. Last J. A dictionary of epidemiology, 2nd ed. New
get population are rare, however, so that differ- true average tumor size detectable on MR im- York: Oxford University Press, 1988:xx
ent mechanisms of subject selection are often aging in the clinic. Similarly, if patients with 2. Brophy JM, Joseph L. Placing trials in context us-
necessary. Case series, or consecutive patients advanced disease do not visit the clinic be- ing Bayesian analysis: GUSTO revisited by Rev-
in a clinic, may or may not be representative, cause their tumors were clinically detected by erend Bayes. JAMA 1995;273:871–875
depending on the particularities of the selection other means, 20 mm may in fact be an under- 3. Box G. Statistics for experimenters: an introduc-
tion to design, data analysis, and model building.
process. Similarly, convenience samples—tak- estimate of the true average. Selection bias
New York: Wiley, 1978
ing the subjects most easily available—are often should always be kept in mind when reading 4. Armitage P, Berry G. Statistical methods in medi-
not completely representative, because the very the medical literature. cal research, 3rd ed. Oxford: Blackwell Scientific
fact that subjects are easily available often tends Publications, 1994
to make them younger, less sick, and living near Conclusion 5. Colton T. Statistics in medicine. Boston: Little,
the clinic. This brief tour of probability, distributions, Brown, 1974
Because many outcomes of interest may and the roots of statistical inferences barely 6. Rosenberg L, Joseph L, Barkun A. Surgical arith-
metic: epidemiological, statistical and outcome-
differ between, for example, young and old or scratches the surface. Many of these ideas
based approach to surgical practice. Austin, TX:
urban and rural patients, convenience sam- will be amplified in future articles of this se- Landes Biosciences, 2000
ples and often case series are always suspect ries. For the impatient, or those who want 7. Rosner B. Fundamentals of biostatistics. Bel-
in terms of selection bias. In other words, al- more detailed explanations of the concepts mont, CA: Duxbury, 1994:105
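The Poisson probability and the standard errors worked through in the article can be verified numerically. Below is a minimal Python sketch (the helper names `poisson_pmf` and `standard_error` are ours, not from any statistical package); the Poisson term is computed in log space so that 125^135 and 135! do not overflow floating-point arithmetic:

```python
import math

def poisson_pmf(x, mu):
    """Pr(X = x) for a Poisson distribution with mean mu,
    computed in log space so that mu**x and x! do not overflow."""
    return math.exp(-mu + x * math.log(mu) - math.lgamma(x + 1))

def standard_error(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Pr(135 cancers | mu = 125), as in the text
print(round(poisson_pmf(135, 125), 4))  # 0.0232

# SEs of the mean for sigma = 10 mm at n = 100, 10, and 1000
for n in (100, 10, 1000):
    print(n, round(standard_error(10, n), 2))
```

If SciPy is available, `scipy.stats.poisson.pmf(135, 125)` should return the same probability.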

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003

AJR:180, April 2003 923


Fundamentals of Clinical Research
for Radiologists

C. Craig Blackmore 1
Peter Cummings 2

Observational Studies in Radiology

Received May 24, 2004; accepted after revision June 2, 2004.

Supported in part by the Agency for Healthcare Research and Quality grant K08 HS11291-02.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 11th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1 Department of Radiology, Harborview Medical Center and Harborview Injury Prevention and Research Center, University of Washington, Box 359728, 325 Ninth Ave., Seattle, WA 98104. Address correspondence to C. C. Blackmore.

2 Department of Epidemiology, Harborview Medical Center and Harborview Injury Prevention and Research Center, University of Washington, Seattle, WA.

AJR 2004;183:1203–1208

0361–803X/04/1835–1203

© American Roentgen Ray Society

The objectives of this paper are to describe the commonly used observational study designs—cohort and case-control studies—and to illustrate their use in radiology research. We will also discuss the strengths and limitations of observational studies and the basics of data analysis. Comprehensive discussions of observational studies can be found in several epidemiology textbooks [1–4].

An important goal in radiology research is to estimate any causal effect of radiology interventions on patient outcome [5–7]. For example, a number of investigators have studied the effect of mammography screening on mortality due to breast cancer [8–10]. A second goal of radiology research is to provide evidence to guide selection of optimal imaging strategies. Associations between clinical factors and diseases can form the basis of clinical prediction rules that guide development of imaging strategies [11, 12]. For example, mechanism of injury, such as a high-speed motor vehicle crash, is a predictor of cervical spine fracture that can be used to select between CT or radiography to evaluate the cervical spine of trauma patients [13, 14].

The best research design for the investigation of causal relationships is the randomized clinical trial. However, clinical trials require that the investigator control which subjects receive a given treatment or exposure under study. Many circumstances exist in which it is not ethical or feasible to perform a randomized trial. For example, we cannot study the influence of cervical spine imaging on outcome in major trauma patients by randomizing which trauma patients will have their cervical spines imaged and which will not. In general, it may not be appropriate to perform a randomized trial if the exposure cannot be manipulated, if manipulation of the exposure would be unethical, if the time from exposure to outcome is very long and more immediate results are desired, or if the outcome is rare, requiring a prohibitively large and expensive randomized clinical trial. Under these circumstances, observational studies may be the best alternatives. Observational studies, including cohort and case-control studies, are hypothesis-testing analytic studies that do not require manipulation of an exposure [15].

Cohort Studies

The most intuitively understood observational study is a cohort study, in which outcomes of subjects with and without a given exposure are compared. A well-known radiology cohort study is the comparison of high- and low-osmolar contrast media by Bettmann et al. [16]. In that study, the outcomes were adverse events that could be attributed to the contrast media. Outcomes were assessed prospectively, meaning that subjects were identified at the time of exposure (use of contrast material), and then followed up to see if the outcome (adverse reaction) occurred. Bettmann et al. found that use of low-osmolar contrast material was associated with a decreased rate of all adverse reactions. Cohort studies may also be retrospective; exposed and unexposed subjects are identified retrospectively after all outcomes of interest have occurred. Both exposure and outcome are then determined from medical records or some other data source.

In cohort studies, the rate of the outcome for each of the exposure cohorts is measured directly. The groups are often compared using the risk ratio. Using notation from the 2 × 2 contingency table (Table 1), the risk ratio is computed as:

risk ratio = [a / (a + b)] / [c / (c + d)] = p1 / p2,

where p1 is the probability of the outcome in subjects with the exposure and p2 is the probability of outcome in subjects without the exposure.

TABLE 1: Two-by-Two Table for a Cohort Study
Group       Outcome Yes   Outcome No   Total
Exposed     a             b            a + b
Unexposed   c             d            c + d
Risk ratio: [a / (a + b)] / [c / (c + d)]

The risk ratio provides an estimate of the strength of association between the exposure and outcome. Risk ratios may be greater than 1, indicating positive association between outcome and exposure, or less than 1, indicating that a given exposure is associated with a decreased risk of the outcome. Confidence intervals (CIs) for the risk ratio are described in detail elsewhere [1].

The study by Bettmann et al. [16] compared the intraarterial use of low-osmolar contrast material with intraarterial high-osmolar contrast material in diagnostic procedures. When compared with high-osmolar contrast material, low-osmolar contrast material was associated with a lower rate of adverse events, with an unadjusted risk ratio of 0.71 (95% CI, 0.67, 0.75) (Table 2) [16].

TABLE 2: Cohort Study Comparing Reaction Rates Using Low-Osmolar Versus High-Osmolar Intraarterial Contrast Media (Risk Ratio = 0.71)
Group          Reaction    No Reaction   Total
Low-osmolar    [a] 942     [b] 8,482     [a + b] 9,424
High-osmolar   [c] 1,601   [d] 9,833     [c + d] 11,434
Risk ratio: [a / (a + b)] / [c / (c + d)] = (942 / 9,424) / (1,601 / 11,434) = 0.71
Note.—Derived from Bettmann et al. [16].

An advantage of a cohort study is that a single cohort may be used to study multiple outcomes. Bettmann et al. [16] investigated the rate of all adverse events after contrast administration. However, they were also able to investigate the rates of major reactions and minor reactions in the same patients. A disadvantage of a cohort study is that, because subjects are selected on the basis of exposure, usually only a single exposure can be studied.

Case-Control Studies

In case-control studies, subjects are selected on the basis of their outcomes. Cases are those with the outcome being studied, and controls are subjects selected, often at random, from the population from which the cases arose. Exposure is then assessed for both the cases and the controls. Case-control studies may be used to study the impact of an imaging technique on patient outcome. For example, Moss et al. [8] used case-control methods to evaluate the impact of mammography screening on mortality due to breast cancer. Cases were subjects who died from breast cancer, and controls were age-matched women who survived in the Guilford and Stoke region of the United Kingdom. Women invited for breast cancer mammography screening as part of the Trial of Early Detection of Breast Cancer were considered to be exposed. Unexposed subjects were those not invited for screening. Being invited to screening was associated with decreased breast cancer–related mortality.

The analysis of case-control study data can also be illustrated using a 2 × 2 table (Table 3). However, the relevant measure of association is the odds ratio:

odds ratio = (a × d) / (b × c) = [p1 / (1 – p1)] / [p2 / (1 – p2)],

where p is the probability of the outcome, and p / (1 – p) is the odds of the outcome. Like the risk ratio, the statistical significance of the odds ratio can be estimated using the chi-square statistic. Confidence intervals for odds ratios are described elsewhere [1].

TABLE 3: Two-by-Two Table for a Case-Control Study
Group       Case   Control
Exposed     a      b
Unexposed   c      d
Odds ratio: (a / c) / (b / d) = ad / bc

TABLE 4: Cohort Study of Mammography Screening and Mortality due to Breast Cancer (Risk Ratio = 0.74)
Group                   Mortality due to Breast Cancer   Alive, or Death from Other Cause   Total
Offered screening       [a] 51                           [b] 22,647                         [a + b] 22,698
Not offered screening   [c] 147                          [d] 48,324                         [c + d] 48,471
Risk ratio: [a / (a + b)] / [c / (c + d)] = (51 / 22,698) / (147 / 48,471) = 0.74
Note.—Derived from Moss et al. [8].

TABLE 5: Case-Control Study of Mammography Screening and Mortality due to Breast Cancer (Odds Ratio = 0.75)
Group                   Mortality due to Breast Cancer   Alive, or Death from Other Cause
Offered screening       [a] 51                           [b] 312
Not offered screening   [c] 147                          [d] 678
Odds ratio: ad / bc = (51)(678) / (312)(147) = 0.75
Note.—Derived from Moss et al. [8].



The case-control study design has several advantages. Case-control studies are a cost-efficient research design, particularly when the outcome under study is rare. Also, in a case-control study, multiple exposures may be studied from the same data. As an example, CT rather than radiography may be the more cost-effective imaging strategy in trauma patients with a high probability of cervical spine fracture [17]. Therefore, identification of subjects at high probability of fracture can aid appropriate selection of CT versus radiography. Blackmore et al. [14] performed a case-control study to identify factors that were associated with cervical spine fractures. Cases were those with a cervical spine fracture, and controls were randomly selected trauma patients without a cervical fracture. This single set of cases and controls was then used to simultaneously assess any association between cervical spine fracture (outcome) and a host of potential predictors (exposures), including mechanism of injury, presence of associated injuries such as head injury, and clinical findings such as neurologic deficits [14].

A disadvantage of case-control studies is that they yield the odds ratio rather than the risk ratio. The risk ratio from a cohort study has a more intuitive interpretation and is generally preferred, because the risk ratio directly compares the proportion of subjects with the outcome in the exposed group with the proportion of subjects with the outcome in the unexposed group. In case-control studies, the proportion of subjects with the outcome in the exposed and unexposed groups is generally not known, so the analysis is based on the odds of the outcome. Fortunately, when the study outcome is rare in the population from which the cases and controls are drawn, the odds ratio will provide a good approximation of the risk ratio. In cohort studies, the risk ratio is [a / (a + b)] / [c / (c + d)] (Table 1). However, for rare outcomes, the contribution of the subjects with the outcome in the denominators becomes small; that is, a and c are small compared with b and d, respectively. The risk ratio becomes approximately (a / b) / (c / d). This in turn reduces to a × d / b × c, which is equal to the odds ratio derived from the case-control study (Table 3).

The relationship between the odds ratio and the risk ratio, as well as a comparison between case-control and cohort studies, is shown in the breast cancer paper by Moss et al. [8]. In that paper, the authors report on both a case-control study and a cohort study that were performed simultaneously in the same population, in order to compare the two designs. Tables 4 and 5 illustrate the 2 × 2 tables for the two study designs. The risk ratio using the cohort data was [51 / (51 + 22,647)] / [147 / (147 + 48,324)], or 0.74 (95% CI, 0.54, 1.02). The odds ratio using the case-control approach for this study was approximately the same, (51)(678) / (312)(147), or 0.75 (95% CI, 0.52, 1.08). Note that the number of subjects with the outcome of death due to breast cancer was the same for both studies, but the number of subjects without the outcome differed. In the case-control study, several controls were selected for each case. In the cohort study, on the other hand, all the subjects with and without the exposure were included. As a result, there are thousands of subjects in the cohort study, but only several hundred in the case-control study. Because the outcome was rare, the cohort and case-control study results were nearly identical, but many fewer subjects were required under the case-control study design.

When the outcome or disease under study is common, the odds ratio may differ substantially from the risk ratio. For example, the theoretic data presented in Tables 6 and 7 compare two studies that yield an odds ratio of 2.0 for the outcome of death in subjects who received test A compared with those who received test B. When the outcome of death was rare (Table 6), the odds ratio and the risk ratio were both about 2.0. However, when death was common, the same odds ratio of 2.0 corresponds to a risk ratio of only 1.1. Thus, for common diseases or outcomes, the odds ratio may not approximate the risk ratio.

TABLE 6: Two-by-Two Table for Cohort Study When the Outcome Is Rare (Odds Ratio = 2.00, Risk Ratio = 1.98)
Group                Death Yes   Death No    Total
Exposed (test A)     [a] 2       [b] 100     [a + b] 102
Unexposed (test B)   [c] 10      [d] 1,000   [c + d] 1,010
Odds ratio: ad / bc = (2)(1,000) / (100)(10) = 2.00
Risk ratio: [a / (a + b)] / [c / (c + d)] = (2 / 102) / (10 / 1,010) = 1.98

TABLE 7: Two-by-Two Table for Cohort Study with Very Common Outcome (Odds Ratio = 2.00, Risk Ratio = 1.09)
Group                Death Yes    Death No   Total
Exposed (test A)     [a] 100      [b] 10     [a + b] 110
Unexposed (test B)   [c] 1,000    [d] 200    [c + d] 1,200
Risk ratio: [a / (a + b)] / [c / (c + d)] = (100 / 110) / (1,000 / 1,200) = 1.09
Odds ratio: ad / bc = (100)(200) / (10)(1,000) = 2.00

Subject Selection

In case-control studies, bias can arise if the selection of cases and controls is affected by exposure status other than through the influence of the exposure on outcome. Similarly, selection bias can arise in cohort studies if the outcome affects the selection of the exposed or unexposed subjects. A useful approach to avoid selection bias is to define
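The contrast between Tables 6 and 7 is easy to check directly. A short Python sketch using the counts from those tables (variable names are ours):

```python
# Table 6 (rare outcome): exposed a, b and unexposed c, d counts
a, b, c, d = 2, 100, 10, 1000
rare_or = a * d / (b * c)                  # odds ratio
rare_rr = (a / (a + b)) / (c / (c + d))    # risk ratio
print(round(rare_or, 2), round(rare_rr, 2))    # 2.0 1.98

# Table 7 (common outcome): the same odds ratio, a much smaller risk ratio
a, b, c, d = 100, 10, 1000, 200
common_or = a * d / (b * c)
common_rr = (a / (a + b)) / (c / (c + d))
print(round(common_or, 2), round(common_rr, 2))  # 2.0 1.09
```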



the target clinical population to which the results are expected to be applied. This population represents the ideal study population. Study subjects should be drawn from this target population when possible [4, 18, 19].

For example, a number of studies have been undertaken to define clinical risk factors for cervical spine fracture in order to help guide the care and evaluation of these patients in the emergency department. The target clinical population for these studies consisted of patients who were evaluated in the emergency department for possible cervical spine fracture. In the case-control study by Blackmore et al. [14] described earlier, case and control subjects (regardless of exposure) were selected from among those who presented to the emergency department, including patients who were discharged from the emergency department as well as those who were admitted to the hospital. Head injury was a strong risk factor for cervical spine fracture, with an odds ratio of 10.0 (95% CI, 5.2, 19.1) (p < 0.0001).

Other investigators have studied clinical predictors of cervical spine fracture but have used different subject selection criteria, with correspondingly different results [14, 20–22]. In a large cohort study by Williams et al. [22], exposed (i.e., head-injured) and unexposed (i.e., not head-injured) subjects were selected from an inpatient trauma registry. The rate of cervical spine fracture was similar in both groups (risk ratio = 1.1; 95% CI, 0.93, 1.3), suggesting no association between head injury and cervical spine fracture, and conflicting with the results from the study by Blackmore et al. [14].

The different results from these studies can be understood by applying both subject selection strategies to the case-control study data from the study by Blackmore et al. (Tables 8 and 9) [14]. When the controls for this study were selected from all emergency department trauma patients (the clinically relevant target population), the results revealed a strong association between head injury and cervical spine fracture (Table 8). Another approach would have been to select the subjects only from those admitted to the hospital (Table 9). However, admitted subjects had a much greater proportion of head injuries than did the group consisting of all emergency department subjects. This difference was expected, because patients with head injury were almost always admitted, whereas those without head injury were more likely to be discharged from the emergency department. However, the increased proportion of head-injured control subjects in the inpatient study led to an odds ratio of only 1.4 for cervical spine fracture among subjects with head injury when compared with those without head injury. The exposure, head injury, affected whether subjects were admitted and therefore affected whether subjects would be eligible for the study—leading to selection bias when only admitted patients were considered. Thus, to study predictors of cervical spine fracture in emergency department patients, it is most appropriate to select subjects from the target population, emergency department patients.

TABLE 8: Case-Control Study of Head Injury as a Predictor of Cervical Spine Fracture Using Emergency Department Trauma Patients as Cases and Controls (Odds Ratio = 10.0)
Group            Fracture   No Fracture
Head injury      [a] 52     [b] 13
No head injury   [c] 116    [d] 291
Odds ratio: ad / bc = (52)(291) / (13)(116) = 10.0
Note.—Derived from Blackmore et al. [14].

TABLE 9: Case-Control Study of Head Injury as a Predictor of Cervical Spine Fracture Using Admitted Trauma Patients as Cases and Controls (Odds Ratio = 1.4)
Group            Fracture   No Fracture
Head injury      [a] 52     [b] 11
No head injury   [c] 116    [d] 35
Odds ratio: ad / bc = (52)(35) / (11)(116) = 1.4
Note.—Derived from Blackmore et al. [14].

Confounding

In randomized clinical trials, the randomization process helps to ensure that, on average, the study groups are alike with respect to all known and unknown confounders [23]. In observational studies, on the other hand, confounding can occur if the groups being compared differ with respect to some factor that is associated with the outcome. For example, in the study of contrast agents by Bettmann et al. [16], subjects with a history of reaction to contrast material were more likely to receive low-osmolar contrast material than were subjects without a history of contrast reaction. Furthermore, those with a history of contrast reaction were more likely to have a new adverse reaction than were those without a history of reaction. Therefore, the group that received low-osmolar contrast material included more persons with a propensity to have a reaction than did the high-osmolar contrast group. Failure to account for history of reaction would bias the risk ratio estimate for adverse outcomes. Thus, a history of contrast reactions confounded the relationship between the type of contrast material and the outcome [16].

Several strategies may mitigate the bias induced by confounding variables. The first is to restrict the study to those subjects with only one level of the potential confounder. In this case, that could mean restricting the study to subjects without a history of contrast reactions. A second strategy is to stratify subjects on the basis of the confounder, create an estimate within each stratum, and then combine results across strata. For the contrast media example used by Bettmann et al. [16], separate analyses could be done for subjects with and without a history of previous contrast reaction. The relative risk estimates for the two strata could then be combined using Mantel-Haenszel techniques described later in this article [24]. Such stratification may be effective for a small number of potential confounders but can become impractical when multiple potential confounders must be considered. A third strategy is to adjust for potential confounders using regression methods. In the results reported for the study by Bettmann et al. [16], adjustment was made for potentially confounding variables in a regression model. The results showed that low-osmolar contrast material was associated with fewer reactions than high-osmolar contrast material was, after accounting for the effects of previous contrast reaction, asthma, steroid pretreatment, race, sex, and other potential confounders [16].

Finally, matching may be used to control for a potentially confounding variable. Matching in a cohort study involves selecting unexposed subjects who have equivalent values of a confounding variable as the exposed subjects. For the contrast media example, a
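The stratify-and-combine strategy described above can be sketched numerically. Assuming the stratified counts reported by Bettmann et al. [16] for subjects with and without a history of contrast reaction (Tables 10 and 11, later in this article), a minimal Python implementation of the Mantel-Haenszel risk ratio (the function name is ours):

```python
def mantel_haenszel_rr(strata):
    """Mantel-Haenszel combined risk ratio over a list of 2x2 strata.

    Each stratum is (a, b, c, d): exposed with and without the outcome,
    then unexposed with and without the outcome. The estimator is
    RR_MH = sum(a*(c+d)/T) / sum(c*(a+b)/T), with T the stratum total."""
    num = sum(a * (c + d) / (a + b + c + d) for (a, b, c, d) in strata)
    den = sum(c * (a + b) / (a + b + c + d) for (a, b, c, d) in strata)
    return num / den

# Strata: subjects with / without a history of contrast reaction
strata = [
    (145, 892, 75, 268),      # history of reaction
    (797, 7599, 1526, 9564),  # no history of reaction
]
print(round(mantel_haenszel_rr(strata), 2))  # 0.69
```

The combined estimate of about 0.69 is slightly lower than the crude risk ratio of 0.71, matching the adjusted value reported later in the article.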




matched cohort study could be designed whereby an unexposed (high-osmolar contrast material) subject was selected who had a history of contrast reaction for each exposed (low-osmolar contrast material) subject who had a previous contrast reaction, and an unexposed subject without contrast reaction selected for each exposed subject without a previous contrast reaction. This matching would control for potential confounding by past reaction to contrast material. Matching can also be used in case-control studies. However, if controls in a case-control study are selected on the basis of the presence of a potential confounder, then the frequency of the potential confounder will no longer be equal in the study controls and the underlying population. Matching in case-control studies can actually introduce bias unless it is accounted for in the analysis using stratification or regression.

Matching has several disadvantages. First, it is not possible to study the effects of the variable that was used for matching. Second, matching can be expensive and difficult. Third, matching may decrease the power of a study if some cases cannot be matched to appropriate controls. In general, matching should be used sparingly, or not at all.

Analysis

The basic analysis for an observational study of a binary exposure and binary outcome can be expressed in 2 × 2 tables. Measures of association—the relative risk for cohort studies and the odds ratio for case-control studies—are derived from the 2 × 2 table as described earlier. However, the 2 × 2 table allows consideration of only a single binary exposure and single binary outcome. Confounding variables may complicate the relationship between exposure and outcome.

The Mantel-Haenszel method allows consideration of one or more potentially confounding variables in assessment of the 2 × 2 table. Separate 2 × 2 tables are constructed for each level of the potentially confounding variable. The numerators and denominators for the odds ratios derived from each 2 × 2 table are then weighted on the basis of the total number of subjects in each and combined. The calculations required to determine Mantel-Haenszel estimators are explained in detail in standard epidemiology texts [1, 2].

As an example, from the contrast study by Bettmann et al. [16], it is possible to use the Mantel-Haenszel risk ratio to account for any effect of previous contrast reaction on determination of the association between low-osmolar contrast media and any adverse reaction. Separate 2 × 2 tables for subjects with and without previous contrast reactions are shown in Tables 10 and 11. These tables are combined using the Mantel-Haenszel method to yield a Mantel-Haenszel risk ratio of 0.69, slightly lower than the crude estimate of risk ratio = 0.71.

TABLE 10: Contrast Reaction Rates Using Low-Osmolar Versus High-Osmolar Intraarterial Contrast Media in Subjects with a History of Reaction (Risk Ratio = 0.64)
Group          Reaction    No Reaction   Total
Low-osmolar    [a] 145     [b] 892       [a + b] 1,037
High-osmolar   [c] 75      [d] 268       [c + d] 343
Risk ratio: [a / (a + b)] / [c / (c + d)] = (145 / 1,037) / (75 / 343) = 0.64
Note.—Derived from Bettmann et al. [16].

TABLE 11: Contrast Reaction Rates Using Low-Osmolar Versus High-Osmolar Intraarterial Contrast Media in Subjects Without a History of Reaction (Risk Ratio = 0.69)
Group          Reaction    No Reaction   Total
Low-osmolar    [a] 797     [b] 7,599     [a + b] 8,396
High-osmolar   [c] 1,526   [d] 9,564     [c + d] 11,090
Risk ratio: [a / (a + b)] / [c / (c + d)] = (797 / 8,396) / (1,526 / 11,090) = 0.69
Note.—Derived from Bettmann et al. [16].

Analyses involving multiple confounders, and analyses involving multiple exposures or outcomes, may be analyzed using regression techniques. Regression allows estimation of the odds ratio or risk ratio associated with a given variable after accounting for the effects of all other variables in the model [25, 26]. Logistic regression and other regression techniques will be discussed in future articles in this series.

useful in determining the influence of a radiology intervention on patient outcome, and in determining clinical risk factors for disease, in order to aid determination of optimal imaging strategies. However, radiologists should be aware of the uses, limitations, and techniques of observational study designs.

References

1. Rothman K, Greenland S. Modern epidemiology, 2nd ed. Philadelphia, PA: Lippincott, 1998
2. Kelsey J, Whittemore A, Evans A, Thompson W. Methods in observational epidemiology. New York, NY: Oxford University Press, 1996
3. Weiss NS. Clinical epidemiology: the study of outcome of illness. New York, NY: Oxford, 1996
4. Schlesselman JJ. Case-control studies. New York, NY: Oxford University Press, 1982
5. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. (Eugene W. Caldwell Lecture) AJR 1994;162:1–8
6. Blackmore CC, Black WB, Jarvik JG, Langlotz CP. A critical synopsis of the diagnostic and screening radiology outcomes literature. Acad Radiol 1999;6[suppl 1]:S8–S18
7. Hillman BJ. Outcomes research and cost-effec-
tion of the Mantel-Haenszel odds ratio for a tiveness analysis for diagnostic imaging. Radiol-
case-control study and a calculation of a Man- ogy 1994;193:307–310
Conclusion 8. Moss SM, Summerley ME, Thomas BT, Ellman R,
tel-Haenszel version of the risk ratio for cohort
Chamberlain JO. A case-control evaluation of the
study data are provided in Appendix 1 [2, 24]. Case-control and cohort study designs are effect of breast cancer screening in the United
Methods for determining variance estimates valuable alternatives to randomized clinical Kingdom trial of early detection of breast cancer. J
and confidence intervals for the Mantel-Haenszel trials. These study designs are particularly Epidemiol Community Health 1992;46:362–364

AJR:183, November 2004 1207


Blackmore and Cummings

9. Palli D, Del Turco MR, Buiatti E, Ciatto S, Crocetti E, Paci E. Time interval since last test in a breast cancer screening programme: a case-control study in Italy. J Epidemiol Community Health 1989;43:241–248
10. Shapiro S. Evidence on screening for breast cancer from a randomized trial. Cancer 1977;39:2772–2782
11. Stiell IG, Greenberg GH, Wells GA, et al. Prospective validation of a decision rule for the use of radiography in acute knee injuries. JAMA 1996;275:611–615
12. Hoffman J, Mower W, Wolfson A, Todd K, Zucker M. Validity of a set of clinical criteria to rule out injury to the cervical spine in patients with blunt trauma. N Engl J Med 2000;343:94–99
13. Hanson JA, Blackmore CC, Mann FA, Wilson AJ. Cervical spine screening: a decision rule can identify high risk patients to undergo screening helical CT of the cervical spine. AJR 2000;174:713–718
14. Blackmore CC, Emerson SS, Mann FA, Koepsell TD. Cervical spine imaging in patients with trauma: determination of fracture risk to optimize use. Radiology 1999;211:759–765
15. Rivara F, Cummings P, Koepsell T, Grossman D, Maier R. Injury control: a guide to research and program evaluation. New York, NY: Cambridge University Press, 2001
16. Bettmann MA, Heeren T, Greenfield A, Goudey C. Adverse events with radiographic contrast agents: results of the SCVIR contrast agent registry. Radiology 1997;203:611–620
17. Blackmore CC, Ramsey SD, Mann FA, Deyo RA. Cost-effectiveness of cervical spine CT in trauma patients. Radiology 1999;212:117–125
18. Eng J, Siegelman SS. Improving radiology research methods: what is being asked and who is being studied? Radiology 1997;205:651–655
19. Kazerooni E. Population and sample. AJR 2001;177:993–999
20. Cadoux CG, White JD, Hedberg MC. High-yield roentgenographic criteria for cervical spine injuries. Ann Emerg Med 1987;16:738–742
21. Sinclair D, Schwartz M, Gruss J, McLellan B. A retrospective review of the relationship between facial fractures, head injuries, and cervical spine injuries. J Emerg Med 1988;6:109–112
22. Williams J, Jehle D, Cottington E, Shufflebarger C. Head, facial, and clavicular trauma as a predictor of cervical spine injury. Ann Emerg Med 1992;21:70–73
23. Beam CA. Statistically engineering the study for success. AJR 2002;179:47–52
24. Mantel N, Haenszel W. Statistical aspects of the analysis of data from retrospective studies. J Natl Cancer Inst 1959;22:719–748
25. Hosmer D, Lemeshow S. Applied logistic regression, 2nd ed. New York, NY: John Wiley and Sons, 2000
26. Kleinbaum DG, Kupper LL, Muller KE. Applied regression analysis and other multivariate methods. Belmont, NY: Duxbury, 1988

APPENDIX 1. Calculation of Mantel-Haenszel Odds Ratio and Risk Ratio Estimates


The Mantel-Haenszel odds ratio (ORMH) for a case-control study is derived as follows:

ORMH = (a1d1 / n1 + a2d2 / n2 + ... + aidi / ni) / (b1c1 / n1 + b2c2 / n2 + ... + bici / ni)

where each stratum of the confounding variable is denoted by the subscript 1 to i, and ni is the total number of subjects (a + b + c + d) in stra-
tum i [2, 24]. The ORMH can also be expressed as the weighted sum of the stratum odds ratios.

ORMH = [Σi (ORi)(bici / ni)] / [Σi (bici / ni)]

A Mantel-Haenszel version of the risk ratio (RRMH) can be calculated for cohort study data [2]:

RRMH = [a1(c1 + d1) / n1 + a2(c2 + d2) / n2 + ... + ai(ci + di) / ni] / [c1(a1 + b1) / n1 + c2(a2 + b2) / n2 + ... + ci(ai + bi) / ni]

The RRMH can also be expressed as the weighted sum of the stratum risk ratios:

RRMH = {Σi (RRi)[ci(ai + bi) / ni]} / {Σi [ci(ai + bi) / ni]}
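The two pooled estimators in this appendix can be sketched in a few lines of Python. This is an illustrative sketch, not code from the article: the function names mh_odds_ratio and mh_risk_ratio are our own, and the worked stratum is the 2 × 2 table from Table 11 (subjects without a history of reaction). With a single stratum, the pooled estimate simply reduces to that stratum's crude ratio.

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel pooled odds ratio.

    Each stratum is a 2 x 2 table given as (a, b, c, d), following the
    layout of Table 11: a, b are the outcome-positive and outcome-negative
    counts in the exposed row; c, d are the same counts in the unexposed row.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


def mh_risk_ratio(strata):
    """Mantel-Haenszel pooled risk ratio for cohort data."""
    num = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in strata)
    den = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Stratum from Table 11 (no history of reaction):
# a = 797, b = 7,599 (low-osmolar); c = 1,526, d = 9,564 (high-osmolar).
no_history = (797, 7599, 1526, 9564)
print(round(mh_risk_ratio([no_history]), 2))  # 0.69
```

Passing both strata from the Bettmann et al. [16] registry data (the previous-reaction table, Table 10, shown on the preceding page, together with Table 11) would yield the pooled Mantel-Haenszel risk ratio of 0.69 quoted in the text.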

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003



Fundamentals of Clinical Research
for Radiologists

Randomized Controlled Trials

Harald O. Stolberg1
Geoffrey Norman2
Isabelle Trop3

Received June 14, 2004; accepted after revision July 2, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 12th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment; staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1 Department of Radiology, McMaster University Medical Centre, 1200 Main St. W, Hamilton, ON L8N 3Z5, Canada. Address correspondence to H. O. Stolberg.

2 Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON L8N 3Z5, Canada.

3 Department of Radiology, Hôpital Saint-Luc, 1058 St. Denis St., Montreal, QC H2X 3J4, Canada.

AJR 2004;183:1539–1544
0361–803X/04/1836–1539
© American Roentgen Ray Society

Preceding articles in this series have provided a great deal of information concerning research design and methodology, including research protocols, statistical analyses, and assessment of the clinical importance of radiologic research studies. Many methods of research design have already been presented, including descriptive studies (e.g., case reports, case series, and cross-sectional surveys), and some analytical designs (e.g., cohort and case-control studies).

Case-control and cohort studies are also called observational studies, which distinguishes them from interventional (experimental) studies because the decision to seek one treatment or another, or to be exposed to one risk or another, was made by someone other than the experimenter. Consequently, the researcher's role is one of observing the outcome of these exposures. By contrast, in experimental studies, the researcher (experimenter) controls the exposure. The most powerful type of experimental study is the randomized controlled trial. The basic principles of randomized controlled trials will be discussed in this article.

History of Randomized Controlled Trials

The history of clinical trials dates back to approximately 600 B.C. when Daniel of Judah [1] conducted what is probably the earliest recorded clinical trial. He compared the health effects of the vegetarian diet with those of a royal Babylonian diet over a 10-day period. The trial had obvious deficiencies by contemporary medical standards (allocation bias, ascertainment bias, and confounding by divine intervention), but the report has remained influential for more than two millennia [2].

The 19th century saw many major advances in clinical trials. In 1836, the editor of the American Journal of Medical Sciences wrote an introduction to an article that he considered "one of the most important medical works of the present century, marking the start of a new era of science," and stated that the article was "the first formal exposition of the results of the only true method of investigation in regard to the therapeutic value of remedial agents." The article that evoked such effusive praise was the French study on bloodletting in treatment of pneumonia by P. C. A. Louis [2, 3].

Credit for the modern randomized trial is usually given to Sir Austin Bradford Hill [4]. The Medical Research Council trials on streptomycin for pulmonary tuberculosis are rightly regarded as a landmark that ushered in a new era of medicine. Since Hill's pioneering achievement, the methodology of the randomized controlled trial has been increasingly accepted and the number of randomized controlled trials reported has grown exponentially. The Cochrane Library already lists more than 150,000 such trials, and they have become the underlying basis for what is currently called "evidence-based medicine" [5].

General Principles of Randomized Controlled Trials

The randomized controlled trial is one of the simplest but most powerful tools of research. In essence, the randomized controlled trial is a study in which people are allocated at random to receive one of several clinical interventions [2]. On most occasions, the term "intervention" refers to treatment, but it should be used in a much wider sense to include any clinical maneuver offered to study participants that may

AJR:183, December 2004 1539


Stolberg et al.

have an effect on their health status. Such clinical maneuvers include prevention strategies, screening programs, diagnostic tests, interventional procedures, the setting in which health care is provided, and educational models [2]. Randomized controlled trials in radiology can play a major role in the assessment of screening programs, diagnostic tests, and procedures in interventional radiology [6–13].

Randomized controlled trials are used to examine the effect of interventions on particular outcomes such as death or the recurrence of disease. Some consider randomized controlled trials to be the best of all research designs [14], or "the most powerful tool in modern clinical research" [15], mainly because the act of randomizing patients to receive or not receive the intervention ensures that, on average, all other possible causes are equal between the two groups. Thus, any significant differences between groups in the outcome event can be attributed to the intervention and not to some other unidentified factor. However, randomized controlled trials are not a panacea to answer all clinical questions; for example, the effect of a risk factor such as smoking cannot ethically be addressed with randomized controlled trials. Furthermore, in many situations randomized controlled trials are not feasible, necessary, appropriate, or even sufficient to help solve important problems [2]. Randomized controlled trials are not appropriate for cancer screening, a situation in which the outcome is rare and frequently occurs only after a long delay. Thus, although the test for appraising the ultimate value of a diagnostic test may be a large well-designed randomized controlled trial that has patient outcomes as the end point [16], the trial should presumably be performed after other smaller studies have examined the predictive value of the test against some accepted standard.

An excellent example of the controversies that can arise with randomized controlled trials is an overview of the publications on mammography screening. The most important references concern the article by Miettinen et al. [17] linking screening for breast cancer with mammography and an apparently substantial reduction in fatalities and the responses that it elicited [18–22].

Randomized controlled trials may not be appropriate for the assessment of interventions that have rare outcomes or effects that take a long time to develop. In such instances, other study designs such as case-control studies or cohort studies are more appropriate. In other cases, randomized controlled trials may not be feasible because of financial constraints or because of the expectation of low compliance or high drop-out rates.

Many randomized controlled trials involve large sample sizes because many treatments have relatively small effects. The size of the expected effect of the intervention is the main determinant of the sample size necessary to conduct a successful randomized controlled trial. Obtaining statistically significant differences between two samples is easy if large differences are expected. However, the smaller the expected effect of the intervention, the larger the sample size needed to be able to conclude, with enough power, that the differences are unlikely to be due to chance. For example, let us assume that we wish to study two groups of patients who will undergo different interventions, one of which is a new procedure. We expect a 10% decrease in the morbidity rate with the new procedure. To be able to detect this difference with a probability (power) of 80%, we need 80 patients in each treatment arm. If the expected difference in effect between the two groups increases to 20%, the number of patients required per arm decreases to 40. Conversely, if the difference between the groups is expected to be only 1%, the study population must increase to 8,000 per treatment arm. The sample size required to achieve power in a study is inversely proportional to the treatment effect squared [23]. Standard formulas are available to calculate the approximate sample size necessary when designing a randomized controlled trial [24–26].

Randomization: The Strength of the Randomized Controlled Trial

The randomization procedure gives the randomized controlled trial its strength. Random allocation means that all participants have the same chance of being assigned to each of the study groups [27]. The allocation, therefore, is not determined by the investigators, the clinicians, or the study participants [2]. The purpose of random allocation of participants is to assure that the characteristics of the participants are as likely to be similar as possible across groups at the start of the comparison (also called the baseline). If randomization is done properly, it reduces the risk of a serious imbalance in known and unknown factors that could influence the clinical course of the participants. No other study design allows investigators to balance these factors.

The investigators should follow two rules to ensure the success of the randomization procedure. They must first define the rules that will govern allocation and then follow those rules strictly throughout the entire study [2]. The crucial issue is that after the procedure for randomization is determined, it should not be modified at any point during the study. There are many adequate methods of randomization, but their common element is that no one should be able to determine ahead of time to which group a given patient will be assigned. Detailed discussion of randomization methods is beyond the scope of this article.

Numerous methods are also available to ensure that the sample of patients is balanced whenever a small predetermined number of patients have been enrolled. Unfortunately, the methods of allocation in studies described as randomized are poorly and infrequently reported [2, 28]. As a result, it is not possible to determine, on most occasions, whether the investigators used proper methods to generate random sequences of allocation [2].

Bias in Randomized Controlled Trials

The main appeal of the randomized controlled trial in health care derives from its potential for reducing allocation bias [2]. No other study design allows researchers to balance unknown prognostic factors at baseline. Random allocation does not, however, protect randomized controlled trials against other types of bias. During the past 10 years, randomized controlled trials have been the subject rather than the tool of important, albeit isolated, research efforts usually designed to generate empiric evidence to improve the design, reporting, dissemination, and use of randomized controlled trials in health care [28]. Such studies have shown that randomized controlled trials are vulnerable to multiple types of bias at all stages of their workspan. A detailed discussion of bias in randomized controlled trials was offered by Jadad [2].

In summary, randomized controlled trials are quantitative, comparative, controlled experiments in which a group of investigators studies two or more interventions by administering them to groups of individuals who have been randomly assigned to receive each intervention. Alternatively, each individual might receive a series of interventions in random order (crossover design) if the outcome can be uniquely associated with each intervention, through, for example, use of a "washout" period. This step ensures that the




effects from one test are not carried over to the next one and subsequently affect the independent evaluation of the second test administered. Apart from random allocation to comparison groups, the elements of a randomized controlled trial are no different from those of any other type of prospective, comparative, quantitative study.

Types of Randomized Controlled Trials

As Jadad observed in his 1998 book Randomised Controlled Trials [2]:

    Over the years, multiple terms have been used to describe different types of randomized controlled trials. This terminology has evolved to the point of becoming real jargon. This jargon is not easy to understand for those who are starting their careers as clinicians or researchers because there is no single source with clear and simple definitions of all these terms.

The best classification of frequently used terms was offered by Jadad [2], and we have based our article on his work.

According to Jadad, randomized controlled trials can be classified as to the aspects of intervention that investigators want to explore, the way in which the participants are exposed to the intervention, the number of participants included in the study, whether the investigators and participants know which intervention is being assessed, and whether the preference of nonrandomized individuals and participants has been taken into account in the design of the study. In the context of this article, we can offer only a brief discussion of each of the different types of randomized controlled trials.

Randomized Controlled Trials Classified According to the Different Aspects of Interventions Evaluated

Randomized controlled trials used to evaluate different interventions include explanatory or pragmatic trials; efficacy or equivalence trials; and phase 1, 2, 3, and 4 trials.

Explanatory or pragmatic trials.—Explanatory trials are designed to answer a simple question: Does the intervention work? If it does, then the trial attempts to establish how it works. Pragmatic trials, on the other hand, are designed not only to determine whether the intervention works but also to describe all the consequences of the intervention and its use under circumstances corresponding to daily practice. Although both explanatory and pragmatic approaches are reasonable, and even complementary, it is important to understand that they represent extremes of a spectrum, and most randomized controlled trials combine elements of both.

Efficacy or effectiveness trials.—Randomized controlled trials are also often described in terms of whether they evaluate the efficacy or effectiveness of an intervention. Efficacy refers to interventions carried out under ideal circumstances, whereas effectiveness evaluates the effects of an intervention under circumstances similar to those found in daily practice.

Phase 1, 2, 3, and 4 trials.—These terms describe the different types of trials used for the introduction of a new intervention, traditionally a new drug, but could also encompass trials used for the evaluation of a new embolization material or type of prosthesis, for example. Phase 1 studies are usually conducted after the safety of the new intervention has been documented in animal research, and their purpose is to document the safety of the intervention in humans. Phase 1 studies are usually performed on healthy volunteers. Once the intervention passes phase 1, phase 2 begins. Typically, the intervention is given to a small group of real patients, and the purpose of this study is to evaluate the efficacy of different modes of administration of the intervention to patients. Phase 2 studies focus on efficacy while still providing information on safety. Phase 3 studies are typically effectiveness trials, which are performed after a given procedure has been shown to be safe with a reasonable chance of improving patients' conditions. Most phase 3 trials are randomized controlled trials. Phase 4 studies are equivalent to postmarketing studies of the intervention; they are performed to identify and monitor possible adverse events not yet documented.

Randomized Controlled Trials Classified According to Participants' Exposure and Response to the Intervention

These types of randomized controlled trials include parallel, crossover, and factorial designs.

Parallel design.—Most randomized controlled trials have parallel designs in which each group of participants is exposed to only one of the study interventions.

Crossover design.—Crossover design refers to a study in which each of the participants is given all of the study interventions in successive periods. The order in which the participants receive each of the study interventions is determined at random. This design, obviously, is appropriate only for chronic conditions that are fairly stable over time and for interventions that last a short time within the patient and that do not interfere with one another. Otherwise, false conclusions about the effectiveness of an intervention could be drawn [29].

Factorial design.—A randomized controlled trial has a factorial design when two or more experimental interventions are not only evaluated separately but also in combination and against a control [2]. For example, a 2 × 2 factorial design generates four sets of data to analyze: data on patients who received none of the interventions, patients who received treatment A, patients who received treatment B, and patients who received both A and B. More complex factorial designs, involving multiple factors, are occasionally used. The strength of this design is that it provides more information than parallel designs. In addition to the effects of each treatment, factorial design allows evaluation of the interaction that may exist between two treatments. Because randomized controlled trials are generally expensive to conduct, the more answers that can be obtained, the better.

Randomized Controlled Trials Classified According to the Number of Participants

Randomized controlled trials can be performed in one or many centers and can include from one to thousands of participants, and they can have fixed or variable (sequential) numbers of participants.

"N-of-one trials."—Randomized controlled trials with only one participant are called "n-of-one trials" or "individual patient trials." Randomized controlled trials with a simple design that involve thousands of patients and limited data collection are called "megatrials" [30, 31]. Usually, megatrials require the participation of many investigators from multiple centers and from different countries [2].

Sequential trials.—A sequential trial is a study with parallel design in which the number of participants is not specified by the investigators beforehand. Instead, the investigators continue recruiting participants until a clear benefit of one of the interventions is observed or until they become convinced that there are no important differences between the interventions [27]. This element applies to the comparison of some diagnostic interventions and some procedures in interventional radiology. Strict rules govern when trials can be




stopped on the basis of cumulative results, and important statistical considerations come into play.

Fixed trials.—Alternatively, in a fixed trial, the investigators establish deductively the number of participants (sample size) that will be studied. This number can be decided arbitrarily or can be calculated using statistical methods. The latter is a more commonly used method. Even in a fixed trial, the design of the trial usually specifies whether there will be one or more interim analyses of data. If a clear benefit of one intervention over the other can be shown with statistical significance before all participants are recruited, it may not be ethical to pursue the trial, and it may be prematurely terminated.

Randomized Controlled Trials Classified According to the Level of Blinding

In addition to randomization, the investigators can incorporate other methodologic strategies to reduce the risk of other biases. These strategies are known as "blinding." The purpose of blinding is to reduce the risk of ascertainment and observation bias. An open randomized controlled trial is one in which everybody involved in the trial knows which intervention is given to each participant. Many radiology studies are open randomized controlled trials because blinding is not feasible or ethical. One cannot, for example, perform an interventional procedure with its associated risks without revealing to the patient and the treating physician to which group the patient has been randomized. A single-blinded randomized controlled trial is one in which a group of individuals involved in the trial (usually patients) does not know which intervention is given to each participant. A double-blinded randomized controlled trial, on the other hand, is one in which two groups of individuals involved in the trial (usually patients and treating physicians) do not know which intervention is given to each participant. Beyond this, triple-blinded (blinding of patients, treating physicians, and study investigators) and quadruple-blinded randomized controlled trials (blinding of patients, treating physicians, study investigators, and statisticians) have been described but are rarely used.

Randomized Controlled Trials Classified According to Nonrandomized Participant Preferences

Eligible individuals may refuse to participate in a randomized controlled trial. Other eligible individuals may decide to participate in a randomized controlled trial but have a clear preference for one of the study interventions. At least three types of randomized controlled trials take into account the preferences of eligible individuals as to whether or not they take part in the trial. These are called preference trials because they include at least one group in which the participants are allowed to choose their preferred treatment from among several options offered [32, 33]. Such trials can have a Zelen design, comprehensive cohort design, or Wennberg's design [33–36]. For a detailed discussion of these designs of randomized controlled trials, the reader is directed to the excellent detailed discussion offered by Jadad [2].

The Ethics of Randomized Controlled Trials

Despite the claims of some enthusiasts for randomized controlled trials, many important aspects of health care cannot be subjected to a randomized trial for practical and ethical reasons. A randomized controlled trial is the best way of evaluating the effectiveness of an intervention, but before a randomized controlled trial can be conducted, there must be equipoise—genuine doubt about whether one course of action is better than another [16]. Equipoise then refers to that state of knowledge in which no evidence exists showing that any intervention in the trial is better than another, or that any intervention outside the trial is better than those in the trial. It is not ethical to build a trial in which, before enrollment, evidence suggests that patients in one arm of the study are more likely to benefit from enrollment than patients in the other arm. Equipoise thus refers to the fine balance that exists between being hopeful a new treatment will improve a condition and having enough evidence to know that it does (or does not). Randomized controlled trials can be planned only in areas of uncertainty and can be carried out only as long as the uncertainty remains. Ethical concerns that are unique to randomized controlled trials as well as other research designs will be addressed in subsequent articles in this series. Hellman and Hellman [37] offered a good discussion on this subject.

Reporting of Randomized Controlled Trials

The Quality of Randomized Controlled Trial Reporting

Awareness concerning the quality of reporting randomized controlled trials and the limitations of the research methods of randomized controlled trials is growing. A major barrier hindering the assessment of trial quality is that, in most cases, we must rely on the information contained in the written report. A trial with a biased design, if well reported, could be judged to be of high quality, whereas a well-designed but poorly reported trial could be judged to be of low quality.

Recently, efforts have been made to improve the quality of randomized controlled trials. In 1996, a group of epidemiologists, biostatisticians, and journal editors published "CONSORT (Consolidated Standards of Reporting Trials)" [38], a statement that resulted from an extensive collaborative process to improve the standards of written reports of randomized controlled trials. The CONSORT statement was revised in 2001 [39]. It was designed to assist the reporting of randomized controlled trials with two groups and those with parallel designs. Some modifications will be required to report crossover trials and those with more than two groups [40]. Although the CONSORT statement was not evaluated before its publication, it was expected that it would lead to an improvement in the quality of reporting of randomized controlled trials, at least in the journals that endorse it [41].

Recently, however, Chan et al. [42] pointed out that the interpretation of the results of randomized controlled trials has emphasized statistical significance rather than clinical importance:

    The lack of emphasis on clinical importance has led to frequent misconceptions and disagreements regarding the interpretation of the results of clinical trials and a tendency to equate statistical significance with clinical importance. In some instances, statistically significant results may not be clinically important and, conversely, statistically insignificant results do not completely rule out the possibility of clinically important effects.

Limitations of the Research Methods Used in Randomized Controlled Trials

The evaluation of the methodologic quality of randomized controlled trials is central to the appraisal of individual trials, the conduct of unbiased systematic reviews, and the performance of evidence-based health care. However, important methodologic details may be omitted from published reports, and the quality of reporting is, therefore, often

1542 AJR:183, December 2004



used as a proxy measure for methodologic quality. High-quality reporting may hide important differences in methodologic quality, and well-conducted trials may be reported badly [43]. As Devereaux et al. [41] observed, "[h]ealth care providers depend upon authors and editors to report essential methodological factors in randomized controlled trials (RCTs) to allow determination of trial validity (i.e., likelihood that the trials' results are unbiased)."

The most important limitations of research methods include the following:

Insufficient power.—A survey of 71 randomized controlled trials showed that most of these trials were too small (i.e., had insufficient power to detect important clinical differences) and that the authors of these trials seemed unaware of these facts [44].

Poor reporting of randomization.—A study of 206 randomized controlled trials showed that randomization, one of the main design features necessary to prevent bias in randomized controlled trials, was poorly reported [45].

Other limitations.—Additional limitations identified by Chalmers [46] were inadequate randomization, failure to blind the assessors to the outcomes, and failure to follow up all patients in the trials.

Intent to Treat

A method to correct for differential dropout rates between one arm of the study and another is to analyze data by the intent to treat—that is, data are analyzed according to the way patients were randomized, regardless of whether or not they received the intended intervention. The intent-to-treat correction is a form of protection against bias and strengthens the conclusions of a study. A detailed discussion of the assessment of the quality of randomized controlled trials was offered by Jadad [2].

In the appraisal of randomized controlled trials, a clear distinction should be made between the quality of the reporting and the quality of the methodology of the trials [43].

Recent Randomized Controlled Trials in Radiology

In recent years, randomized controlled trials have become increasingly popular in radiology research. In 1997, for instance, there were only a few good randomized studies in diagnostic imaging, such as the one by Jarvik et al. [47]. Since 2000, the number of good randomized controlled trials has significantly increased in both diagnostic and interventional radiology. Examples of randomized controlled trials in diagnostic imaging include the works of Gottlieb et al. [48] and Kaiser et al. [49]. Examples of interventional randomized controlled trials are the studies by Pinto et al. [50] and Lencioni et al. [51].

Randomized controlled trials are equally important in screening for disease. Our initial experience with breast screening was unfortunate, and controversy over this issue continues to this day [52, 53]. On the other hand, positive developments have occurred, such as the work of the American College of Radiology Imaging Network. Writing for this group, Berg [54] has offered a commentary on the rationale for a trial of screening breast sonography.

Radiologists have a great deal to learn about randomized controlled trials. Academic radiologists who perform research and radiologists who translate research results into practice should be familiar with the different types of these trials, including those conducted for diagnostic tests and interventional procedures. Radiologists also must be aware of the limitations and problems associated with the methodologic quality and reporting of the trials. It is our hope that this article proves to be a valuable source of information about randomized controlled trials.

Acknowledgments

We thank Alejandro Jadad for his support and Monika Ferrier for her patience and support in keeping us on track and for preparing the manuscript.

References

1. Book of Daniel 1:1–21
2. Jadad AR. Randomised controlled trials: a user's guide. London, England: BMJ Books, 1998
3. Louis PCA. Research into the effects of bloodletting in some inflammatory diseases and on the influence of tartarized antimony and vesication in pneumonitis. Am J Med Sci 1836;18:102–111
4. Hill AB. The clinical trial. N Engl J Med 1952;247:113–119
5. Cochrane Library Web site. Available at: www.update-software.com/cochrane. Accessed September 10, 2004
6. Bree RL, Kazerooni EA, Katz SJ. Effect of mandatory radiology consultation on inpatient imaging use. JAMA 1996;276:1595–1598
7. DeVore GR. The routine antenatal diagnostic imaging with ultrasound study: another perspective. Obstet Gynecol 1994;84:622–626
8. Fontana RS, Sanderson DR, Woolner LB, et al. Screening for lung cancer: a critique of the Mayo Lung Project. Cancer 1991;67[suppl 4]:1155–1164
9. [No authors listed]. Impact of follow-up testing on survival and health-related quality of life in breast cancer patients: a multicenter randomized controlled trial—the GIVIO Investigators. JAMA 1994;271:1587–1592
10. Jarvik JG, Maravilla KR, Haynor DR, Levitz M, Deyo RA. Rapid MR imaging versus plain radiography in patients with low back pain: initial results of a randomized study. Radiology 1997;204:447–454
11. Kinnison ML, Powe NR, Steinberg EP. Results of randomized controlled trials of low- versus high-osmolality contrast media. Radiology 1989;170:381–389
12. Rosselli M, Palli D, Cariddi A, Ciatto S, Pacini P, Distante V. Intensive diagnostic follow-up after treatment of primary breast cancer. JAMA 1994;271:1593–1597
13. Swingler GH, Hussey GD, Zwarenstein M. Randomised controlled trial of clinical outcome after chest radiograph in ambulatory acute lower-respiratory infection in children. Lancet 1998;351:404–408
14. Cochrane Library Web site. Available at: www.update-software.com/abstracts/ab001877.htm. Accessed September 10, 2004
15. Nystrom L, Rutqvist LE, Wall S, et al. Breast cancer screening with mammography: overview of Swedish randomised trials. Lancet 1993;341:973–978
16. Duffy SW. Interpretation of the breast screening trials: a commentary on the recent paper by Gotzsche and Olsen. Breast 2001;10:209–212
17. Miettinen OS, Henschke CI, Pasmantier MW, Smith JP, Libby DM, Yankelevitz DF. Mammographic screening: no reliable supporting evidence? Lancet 2002;359:404–405
18. Tabar L, Vitak B, Chen HHT, Yen MF, Duffy SW, Smith RA. Beyond randomized controlled trials: organized mammographic screening substantially reduces breast carcinoma mortality. Cancer 2001;91:1724–1731
19. Hoey J. Does mammography save lives? CMAJ 2002;166:1187–1188
20. Norman GR, Streiner DL. Biostatistics: the bare essentials, 2nd ed. Hamilton, ON, Canada: B. C. Decker, 2000
21. Silverman WA. Gnosis and random allotment. Control Clin Trials 1981;2:161–164
22. Gray JAM. Evidence-based health care. Edinburgh, Scotland: Churchill Livingstone, 1997
23. Rosner B. Fundamentals of biostatistics, 5th ed. Duxbury, England: Thomson Learning, 2000
24. Norman GR, Streiner DL. PDQ statistics, 2nd ed. St. Louis, MO: Mosby, 1997
25. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with confidence, 2nd ed. London, England: BMJ Books, 2000
26. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994;272:122–124
27. Altman DG. Practical statistics for medical research. London, England: Chapman & Hall, 1991
28. Jadad AR, Rennie D. The randomized controlled



Stolberg et al.

trial gets a middle-aged checkup. JAMA 1998;279:319–320
29. Louis TA, Lavori PW, Bailar JC III, Polansky M. Crossover and self-controlled designs in clinical research. In: Bailar JC III, Mosteller F, eds. Medical uses of statistics, 2nd ed. Boston, MA: New England Medical Journal Publications, 1992:83–104
30. Woods KL. Megatrials and management of acute myocardial infarction. Lancet 1995;346:611–614
31. Charlton BG. Megatrials: methodological issues and clinical implications. J R Coll Physicians Lond 1995;29:96–100
32. Till JE, Sutherland HJ, Meslin EM. Is there a role for performance assessments in research on quality of life in oncology? Qual Life Res 1992;1:31–40
33. Silverman WA, Altman DG. Patient preferences and randomized trials. Lancet 1996;347:171–174
34. Zelen M. A new design for randomized clinical trials. N Engl J Med 1979;300:1242–1245
35. Olschewski M, Scheurlen H. Comprehensive Cohort Study: an alternative to randomized consent design in a breast preservation trial. Methods Inf Med 1985;24:131–134
36. Brewin CR, Bradley C. Patient preferences and randomized clinical trials. BMJ 1989;299:684–685
37. Hellman S, Hellman DS. Of mice but not men: problems of the randomized trial. N Engl J Med 1991;324:1585–1592
38. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637–639
39. Moher D, Schulz KF, Altman DG, CONSORT Group (Consolidated Standards of Reporting Trials). The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001;285:1987–1991
40. Altman DG. Better reporting of randomized controlled trials: the CONSORT statement. BMJ 1996;313:570–571
41. Devereaux PJ, Manns BJ, Ghali WA, Quan H, Guyatt GH. The reporting of methodological factors in randomized controlled trials and the association with a journal policy to promote adherence to the Consolidated Standards of Reporting Trials (CONSORT) checklist. Control Clin Trials 2002;23:380–388
42. Chan KBY, Man-Son-Hing M, Molnar FJ, Laupacis A. How well is the clinical importance of study results reported? An assessment of randomized controlled trials. CMAJ 2001;165:1197–1202
43. Huwiler-Müntener K, Jüni P, Junker C, Egger M. Quality of reporting of randomized trials as a measure of methodologic quality. JAMA 2002;287:2801–2804
44. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error, and sample size in design and interpretation of randomized controlled trials. N Engl J Med 1978;299:690–694
45. Schulz KF, Chalmers I, Hayes RJ, Altman DG. Empirical evidence of bias: dimensions of methodologic quality associated with estimates of treatment effects in controlled trials. JAMA 1995;273:408–412
46. Chalmers I. Applying overviews and meta-analyses at the bedside: discussion. J Clin Epidemiol 1995;48:67–70
47. Jarvik JG, Maravilla KR, Haynor DR, Levitz M, Deyo RA. Rapid MR imaging versus plain radiography in patients with low back pain: initial results of a randomized study. Radiology 1997;204:447–454
48. Gottlieb RH, Voci SL, Syed L, et al. Randomized prospective study comparing routine versus selective use of sonography of the complete calf in patients with suspected deep venous thrombosis. AJR 2003;180:241–245
49. Kaiser S, Frenckner B, Jorulf HK. Suspected appendicitis in children: US and CT—a prospective randomized study. Radiology 2002;223:633–638
50. Pinto I, Chimeno P, Romo A, et al. Uterine fibroids: uterine artery embolization versus abdominal hysterectomy for treatment—a prospective, randomized, and controlled clinical trial. Radiology 2003;226:425–431
51. Lencioni RA, Allgaier HP, Cioni D, et al. Small hepatocellular carcinoma in cirrhosis: randomized comparison of radio-frequency thermal ablation versus percutaneous ethanol injection. Radiology 2003;228:235–240
52. Dean PB. Gotzsche's quixotic antiscreening campaign: nonscientific and contrary to Cochrane principles. JACR 2004;1:3–7
53. Gotzsche PC. The debate on breast cancer screening with mammography is important. JACR 2004;1:8–14
54. Berg WA. Rationale for a trial of screening breast ultrasound: American College of Radiology Imaging Network (ACRIN) 6666. AJR 2003;180:1225–1228

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
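The "insufficient power" limitation discussed in the randomized-controlled-trials article above (trials too small to detect important clinical differences [44]) can be made concrete with a sample-size calculation for comparing two proportions. The following is a sketch using the common normal-approximation formula, which is not taken from the article; the function name is ours.

```python
# Sketch: approximate sample size per arm for a two-arm trial comparing
# proportions p1 and p2, two-sided significance alpha, power 1 - beta.
# Standard normal-approximation formula; illustrative only.
from statistics import NormalDist

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> float:
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g., 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g., 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

# Detecting an improvement from 30% to 40% at 80% power needs roughly
# 356 patients per arm; a trial with 50 patients per arm is far too small.
print(round(n_per_arm(0.30, 0.40)))
```

Raising the desired power (for example, to 90%) increases the required sample size further, which is why many of the surveyed trials could not rule out clinically important effects.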




Fundamentals of Clinical Research for Radiologists

Clinical Evaluation of Diagnostic Tests

Susan Weinstein1
Nancy A. Obuchowski2
Michael L. Lieber2

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 13th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.
Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1Department of Radiology, University of Pennsylvania Medical Center, Philadelphia, PA 19104. Address correspondence to S. Weinstein.
2Departments of Biostatistics and Epidemiology and Radiology, The Cleveland Clinic Foundation, Cleveland, OH 44195.

AJR 2005;184:14–19
0361–803X/05/18405–14
© American Roentgen Ray Society

The evaluation of the accuracy of diagnostic tests and the appropriate interpretation of test results are the focus of much of radiology and its research. In this article, we first will review the basic definitions of diagnostic test accuracy, including a brief introduction to receiver operating characteristic (ROC) curves. Then we will evaluate how diagnostic tests can be used to address clinical questions such as "Should this patient undergo this diagnostic test?" and, after ordering the test and seeing the test result, "What is the likelihood that this patient has the disease?" We will finish with a discussion of some important concepts for designing research studies that estimate or compare diagnostic test accuracy.

Defining Diagnostic Test Accuracy

Sensitivity and Specificity

There are two basic measures of the inherent accuracy of a diagnostic test: sensitivity and specificity. They are equally important, and one should never be reported without the other. Sensitivity is the probability of a positive test result (that is, the test indicates the presence of disease) for a patient with the disease. Specificity, on the other hand, is the probability of a negative test result (that is, the test does not indicate the presence of disease) for a patient without the disease. We use the term "disease" here loosely to mean the condition (e.g., breast cancer, deep venous thrombosis, intracranial aneurysm) that the diagnostic test is supposed to detect. We calculate the test's specificity based on patients without this condition, but these patients often have other diseases.

Table 1 summarizes the definitions of sensitivity and specificity [1]. The table rows give the results of the diagnostic test, as either positive for the disease of interest or negative for the disease of interest. The columns indicate the true disease status, as either disease present or disease absent. True-positives (TPs) are those patients with the disease who test positive. True-negatives (TNs) are those without the disease who test negative. False-negatives (FNs) are those with the disease for whom the test falsely indicates that the disease is not present. False-positives (FPs) are those without the disease for whom the test falsely indicates the presence of disease. Sensitivity, then, is the probability of a TP among patients with the disease (TPs + FNs). Specificity is the probability of a TN among patients without the disease (TNs + FPs).

TABLE 1: Defining Sensitivity and Specificity

Test    Disease Present       Disease Absent
+       True-positive (TP)    False-positive (FP)
–       False-negative (FN)   True-negative (TN)

Note.—Sensitivity = TPs/(TPs + FNs), specificity = TNs/(TNs + FPs).

Consider the following example. Carpenter et al. [2] evaluated the diagnostic accuracy of MR venography (MRV) to detect deep venous thrombosis (DVT). They performed MRV in a group of 85 patients who presented with clinical symptoms of DVT. The patients also underwent contrast venography, an invasive procedure considered to provide an unequivocal diagnosis of DVT (the so-called "gold standard test" or "standard of reference"). Of a total of 101 venous systems evaluated, 27 had DVT by contrast venography. All 27 cases were detected on MRV; thus, the sensitivity of MRV was 27/27, or 100%. Of 74 venous systems without DVT, as confirmed by contrast venography, three tested positive on MRV (that is, three FPs). The specificity of MRV was therefore 71/74, or 96% (Table 2).

TABLE 2: Sensitivity and Specificity of MRV in 101 Venous Systems

MRV    DVT Present    DVT Absent
+      27             3
–      0              71

Note.—MRV = magnetic resonance venography.

Combining Multiple Tests

Few diagnostic tests are both highly sensitive and highly specific. For this reason, patients sometimes are diagnosed using two or more tests. These tests may be performed either in parallel (i.e., at the same time and interpreted together) or in series (i.e., the results of the first test determine whether the second test is performed at all) [3]. The latter has the advantage of avoiding unnecessary tests, but the disadvantage of potentially delaying treatment for diseased patients by lengthening the diagnostic testing period.

Tests can be interpreted in parallel in two ways. The first, called "the OR rule," yields a positive diagnosis if either test (let's assume there are two tests) is positive and a negative diagnosis if both tests are negative. That is, if test A and test B are both negative, then the combined result is negative, but if either or both are positive, then the combined result is positive.

The second rule, called "the AND rule," yields a positive diagnosis only if both tests are positive and a negative diagnosis if either test is negative. That is, if test A and test B are both positive, then the combined result is positive, but if either or both are negative, then the combined result is negative.

Let us denote the sensitivities of the two tests by SEa and SEb, and their specificities by SPa and SPb. To calculate the sensitivity of the combined test in parallel using the OR rule, the formula is: SEa + SEb − (SEa × SEb). Specificity under the OR rule is simply SPa × SPb. Conversely, to calculate sensitivity using the AND rule, the formula is: SEa × SEb, while specificity under the AND rule is SPa + SPb − (SPa × SPb).

Under the OR rule, the sensitivity of the combined result is higher than that of either test alone, but the combined specificity is lower than that of either test. With the AND rule, this is reversed: the specificity of the combined result is higher than that of either test alone, but the combined sensitivity is lower than that of either test.

Serial testing is an alternative to parallel testing that is particularly cost-efficient when screening for rare conditions and often is used when the second test is expensive and/or risky. Under the OR rule, if the first test is positive, the diagnosis is positive; otherwise, the second test is performed. If the second test is positive after a negative first test, then the diagnosis also is positive; otherwise, the diagnosis is negative. The OR rule, then, leads to a higher overall sensitivity than either test by itself. With the AND rule, if the first test is positive, the second test is performed. If the second test is positive, the diagnosis is positive; otherwise, the diagnosis is negative. The AND rule, then, leads to a higher overall specificity than either test by itself.

To calculate the sensitivity of the combined test using serial testing with the OR rule, the formula is: SEa + (1 − SEa) × SEb. Specificity under the OR rule is simply SPa × SPb. Conversely, to calculate sensitivity using the AND rule, the formula is: SEa × SEb, while specificity under the AND rule is SPa + (1 − SPa) × SPb.

ROC Curves

While some tests provide dichotomous results (that is, positive or negative), other tests yield results that are numeric values (for example, attenuation of a lesion on CT) or ordered categories (for example, BI-RADS scoring used in mammography). Consider CT attenuation as a diagnostic test for distinguishing papillary renal cell carcinomas from other types of renal masses [4]. In Table 3, the ratio of tumor enhancement to normal kidney enhancement (T–K ratio) of 10 masses is listed.

TABLE 3: T–K Ratio Values of 5 Papillary and 5 Nonpapillary Renal Masses

Cell Type    T–K Ratio    Sensitivity    Specificity    FPR
PRCC         0.05         0.0            1.0            0.0
PRCC         0.11         0.2            1.0            0.0
Other        0.20         0.4            1.0            0.0
PRCC         0.22         0.4            0.8            0.2
PRCC         0.25         0.6            0.8            0.2
Other        0.29         0.8            0.8            0.2
Other        0.38         0.8            0.6            0.4
PRCC         0.43         0.8            0.4            0.6
Other        0.56         1.0            0.4            0.6
Other        0.66         1.0            0.2            0.8

Note.—PRCC = papillary renal cell carcinoma, FPR = false-positive rate, or 1 − specificity.

How do we calculate the basic measures of accuracy, that is, sensitivity and specificity, for T–K ratio as a diagnostic test for papillary masses? We shall consider each unique T–K ratio value as a "cutoff," or "decision threshold," and calculate the sensitivity and specificity associated with each cutoff. Masses with T–K ratio values greater than or equal to the cutoff are called "negative" for papillary lesions and masses with T–K ratio values less than the cutoff are called "positive" for papillary lesions. In Table 3, the third and fourth columns give the calculated sensitivity and specificity, respectively, using the T–K ratio value in column 2 as the cutoff. Note that as the value of the cutoff increases, the specificity decreases while the sensitivity increases.

In Figure 1, we have plotted the 10 pairs of sensitivity and specificity calculated in Table 3. The y-axis is the sensitivity and the x-axis is 1 minus the specificity, or the false-positive rate (FPR). Connecting these points with line segments, we have constructed an ROC curve [5]. A test with an ROC curve that lies near the "chance diagonal line" in Figure 1 has no ability, beyond mere guessing, to distinguish between patients with and without the disease. In contrast, a test with an ROC curve that passes near the upper left corner (that is, near 100% sensitivity and 0% FPR [100% specificity]) is nearly perfect at distinguishing disease from no disease. The T–K ratio has moderate accuracy, with its ROC curve falling between these two extremes.

[Figures 1 and 2 appear here: plots of sensitivity (y-axis) against false-positive rate (x-axis). Figure 1 shows the T–K ratio ROC curve with the chance diagonal, the perfect-test corner, and the cutoffs 0.11 and 0.38 marked; Figure 2 shows the single point A together with hypothetical superior and inferior ROC curves.]

Fig. 1.—10 pairs of sensitivity and specificity as calculated in Table 3. The y-axis is the sensitivity and the x-axis is 1 minus the specificity, or the false-positive rate (FPR). Receiver operating characteristic (ROC) curve is created by connecting points with line segments.

Fig. 2.—Single cutoff point (labeled A) in relation to the receiver operating characteristic (ROC) curve for T–K (tumor enhancement to normal kidney enhancement) ratio.

Suppose now that an investigator proposes the ratio of the attenuation of the mass to the attenuation of the abdominal aorta (T–A ratio) as a new diagnostic test for papillary lesions. This investigator, however, arbitrarily chooses a single cutoff and reports only the sensitivity and specificity at that cutoff. Figure 2 illustrates this single point (labeled A) in relation to the ROC curve for T–K ratio. We might be tempted to conclude that T–K ratio is superior to T–A ratio because point A falls below the ROC curve for T–K ratio. There are, however, an infinite number of ROC curves that could pass through point A, two of which are depicted by dashed curves in Figure 2. Some of these ROC curves could be superior to the ROC curve for T–K ratio for most FPRs and others inferior. Based on the single sensitivity and specificity reported by the investigator, we cannot determine whether the T–A ratio is superior or inferior to the T–K ratio. However, if we had been given the ROC curves of both the T–A and T–K ratios, then we could compare these two diagnostic tests and determine, for any range of FPRs, which test is preferred.

This example illustrates the importance of ROC curves and why they have become the state-of-the-art method for describing the diagnostic accuracy of a test. In a future module in this series, Obuchowski [6] provides a detailed account of ROC curves, including constructing smooth ROC curves, estimating various summary measures of accuracy derived from them, finding the optimal cutoff on the ROC curve for a particular clinical application, and identifying available software.

Interpretation of Diagnostic Tests

Calculating the Positive and Negative Predictive Values

Clinicians are faced each day with the challenge of deciding appropriate management for patients, based at least in part on the results of less than perfect diagnostic tests. These clinicians need answers to the following questions: "What is the likelihood that this patient has the disease when the test result is positive?" and "What is the likelihood that this patient does not have the disease when the test result is negative?" The answers to these questions are known as the positive and negative predictive values, respectively. We illustrate these with the following example.

The lemon sign has been described as an important indicator of spina bifida. Nyberg et al. [7] describe the sensitivity and specificity of the lemon sign in the detection of spina bifida in a high-risk population (elevated maternal serum α-fetoprotein level, suspected hydrocephalus or neural tube defect, or family history of neural tube defect). A portion of their data is summarized in Table 4.

Spina bifida occurred in 6.1% (14/229) of the sample; that is, the sample prevalence was 6.1%. The lemon sign was seen in 92.9% (13/14) of the fetuses with spina bifida (92.9% sensitivity), and was absent in 98.6% (212/215) of the fetuses without spina bifida (98.6% specificity).

We also can calculate the positive and negative predictive values of the lemon sign from the available data. The positive predictive value (PPV) is the probability that the fetus has spina bifida when the lemon sign is present. The PPV is calculated as follows:

PPV = TP / (TP + FP) = 13 / (13 + 3) × 100% = 81.3%   (1)

The PPV differs from sensitivity. While the PPV tells us the probability of a fetus having spina bifida following detection of the lemon sign (that probability is 0.813, or 81.3%), the sensitivity tells us the probability that the lemon sign will be present among fetuses with spina bifida (that probability is 0.929, or 92.9%). The PPV helps the clinician decide how to treat the patient after the diagnostic test comes back positive. Sensitivity, on the other hand, is a property of the diagnostic test and helps the clinician decide which test to use.

The corollary to the PPV is the negative predictive value (NPV), that is, the probability that spina bifida will not be present when the lemon sign is absent. The NPV is calculated as follows:

NPV = TN / (TN + FN) = 212 / (212 + 1) × 100% = 99.5%   (2)

If the lemon sign is absent, there is a 99.5% chance that the fetus will not have spina bifida. The NPV is different from the test's specificity. Specificity tells us the probability that the lemon sign will be absent among fetuses without spina bifida (that probability is 0.986, or 98.6%).
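Equations 1 and 2 can be verified with a few lines of Python. This is a sketch using the Nyberg et al. lemon-sign counts (TP = 13, FP = 3, FN = 1, TN = 212); the helper names are ours, not from any library.

```python
# Worked check of equations 1 and 2 using the lemon-sign 2x2 counts
# (TP = 13, FP = 3, FN = 1, TN = 212). Helper names are illustrative only.

def ppv(tp: int, fp: int) -> float:
    """Positive predictive value: P(disease present | positive test)."""
    return tp / (tp + fp)

def npv(tn: int, fn: int) -> float:
    """Negative predictive value: P(disease absent | negative test)."""
    return tn / (tn + fn)

print(f"PPV = {ppv(13, 3) * 100:.2f}%")   # 13/16 = 81.25%, reported as 81.3% in the text
print(f"NPV = {npv(212, 1) * 100:.2f}%")  # 212/213, about 99.5%
```

Note that predictive values computed this way inherit the sample's prevalence (here 6.1%); as the article goes on to show, they change markedly when prevalence changes.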




TABLE 4: Lemon Sign Versus Spinal Cord Defect in Fetuses Prior to 24 Weeks

Lemon Sign    Spina Bifida    No Spina Bifida    Total
+             13              3                  16
–             1               212                213
Total         14              215                229

Note.—SE = 92.9%, SP = 98.6%, PPV = 81.3%, NPV = 99.5%, prevalence = 6.1%.

TABLE 5: The PPV of the Lemon Sign in the General Population

Lemon Sign    Spina Bifida    No Spina Bifida    Total
+             9               140                149
–             1               9,850              9,851
Total         10              9,990              10,000

Note.—SE = 90.0%, SP = 98.6%, PPV = 6.0%, NPV = 99.99%, prevalence = 0.1%.

The PPV and NPV can also be calculated from Bayes' theorem. Bayes' theorem allows us to compute the PPV and NPV from estimates of the test's sensitivity and specificity and the probability of the disease before the test is applied. The latter is referred to as the pretest probability and is based on the patient's previous medical history, previous and recent exposures, current signs and symptoms, and results of other screening and diagnostic tests performed. When this information is unknown, or when calculating the PPV or NPV for a population, the prevalence of the disease in the population is used as the pretest probability. The PPV and NPV, then, are called posttest probabilities (also, revised or posterior probabilities) and represent the probability of the disease after the test result is known.

Let p denote the pretest probability of disease, and SE and SP the sensitivity and specificity of the diagnostic test. Recalling the expression for a conditional probability (see module 10 [8]),

PPV = P(disease | + test) = [SE × p] / [SE × p + (1 − SP) × (1 − p)]   (3)

NPV = P(no disease | − test) = [SP × (1 − p)] / [SP × (1 − p) + (1 − SE) × p]   (4)

Thus, the posttest probability of disease for any patient can be calculated if one knows the accuracy of the test and the patient's pretest probability of disease.

The PPV and NPV can vary markedly, depending on the patient's pretest probability, or the prevalence of disease in the population. In the Nyberg et al. [7] study, the prevalence of spina bifida in their high-risk sample was 6.1%. In the general population, however, the prevalence of spina bifida is much lower, about 0.1%. Filly [9] studied the predictive ability of the lemon sign in the general population. He assumed that the sensitivity of the lemon sign was 90.0% and the specificity was 98.6% (very similar to the values in Nyberg's small study, 92.9% and 98.6%, respectively). In a sample of 10,000 fetuses from a low-risk population (see Table 5), Filly showed that the positive predictive value is only 6%. This is in contrast to the PPV of 81.3% in the Nyberg study. The drastic difference in PPVs is due to the different prevalence rates of spina bifida in the two samples: 6.1% in Nyberg's and 0.1% in Filly's. Thus, although a high-risk fetus with a lemon sign may have an 81% chance of having spina bifida, "a low risk fetus with a lemon sign has a 94% chance of being perfectly normal" [9]. This example illustrates the importance of reporting the pretest probability or prevalence rate of disease whenever one presents a PPV or NPV.

Rationale for Ordering a Diagnostic Test

The previous section described how clinicians can use the results of a diagnostic test to plan a patient's management. Let's back up a bit in the clinical decision-making process and look at the rationale for ordering a diagnostic test. In the simplest scenario (ignoring monetary costs, insurance reimbursement rates, etc.), there are three pieces of information that a clinician needs to determine whether a diagnostic test should or should not be ordered:

1. From the patient's previous medical history, previous and recent exposures, current signs and symptoms, and results of other screening and diagnostic tests performed, what is the probability that this patient has the disease (that is, the pretest probability)?
2. How accurate (sensitivity and specificity) is the diagnostic test being considered?
3. Could the results of this test affect the patient's management?

In the previous section, we saw how the pretest probability and the test's sensitivity and specificity fit into Bayes' theorem to tell us the posttest probability of disease. We also saw, even for a very accurate test, how the PPV can be quite low when the pretest probability is low. The clinician ordering a test needs to consider how the patient will be managed if the test result is negative versus if the test result is positive. If the probability of disease will still be low after a positive test, then the test may have no impact on the patient's management.

An example is screening for intracranial aneurysms in the general population. The prevalence of aneurysms in the general population is low, maybe 1%. Even though magnetic resonance angiography (MRA) may have excellent accuracy, say 95% sensitivity and specificity, the PPV is still quite low: 0.16 (16%) from equation 3. Considering the nontrivial risks of invasive catheter angiography (which is the usual presurgical tool) [10], the clinician may decide that even after a positive MRA, the patient should not undergo catheter angiography. In this scenario, the clinician may decide not to order the MRA, given that its result, either positive or negative, will not impact the patient's management.

Designing Studies to Estimate and Compare Tests' Diagnostic Accuracy

As with all new medical devices, treatments, and procedures, the efficacy of diagnostic tests must be assessed in clinical studies. In the second module of this series, Jarvik [11] described six levels of diagnostic efficacy. Here, we will focus on the second level, which is the stage at which investigators assess the diagnostic accuracy of a test.

Phases in the Assessment of Diagnostic Test Accuracy

There typically are three phases to the assessment of a diagnostic test's accuracy [3]. The first is the exploratory phase. It usually is the first clinical study performed to assess the efficacy of a new diagnostic test. These tend to be small, inexpensive studies, typically involving 10 to 50 patients with and without the disease of interest. The patients selected for the study samples often are cases with classical overt disease (for example, symptomatic lung cancer) and healthy volunteer controls. If the test results of these two populations do not differ, then it is not worth pursuing the diagnostic test further.

The second phase is the challenge phase. Here, we recognize that a diagnostic test's sensitivity and specificity can vary with the extent and stage of the disease and the comorbidities present. Thus, in this phase we select patients with subtle, or early, disease and




with comorbidities that could interfere with the diagnostic test [12]. For example, in a study to assess the ability of MRI to detect lung cancer, the study patients might include those with small nodules (<3 cm) and patients with nodules and interstitial disease. The controls might have diseases in the same anatomic location as the disease of interest, for example, interstitial disease but no nodules. These studies often include competing diagnostic tests to compare their accuracies with the test under evaluation. ROC curves are most often used to assess and compare the tests. If the diagnostic test shows good accuracy, then it can be considered for the third phase of assessment.

The third phase is the advanced phase. These studies often are multicenter studies involving large numbers of patients (100 or more). The patient sample should be representative of the target clinical population. For example, instead of selecting patients with known lung cancer and controls without cancer, we might recruit patients presenting to their primary care physician with a persistent cough or bloody sputum. Further testing and follow-up will determine which patients have lung cancer and which do not.

It is from this third phase that we obtain reliable estimates of a test's accuracy for the target clinical population. Estimates of accuracy from the exploratory phase usually are too optimistic because the "sickest of the sick" are compared with the "wellest of the well" [13]. In contrast, estimates of accuracy from the challenge phase often are too low because the patients are exceptionally difficult to diagnose.

Common Features of Diagnostic Test Accuracy Studies

The studies in the three phases differ in terms of their objectives, sampling of patients, and sample sizes. There are, however, some features common to all studies of diagnostic test accuracy, as summarized in Table 6. We elaborate here on a few important issues.

TABLE 6: Common Features of Diagnostic Test Accuracy Studies

Feature                           Explanation
Two samples of patients           One sample of patients with and one sample without the disease are needed to estimate both sensitivity and specificity.
Well-defined patient samples      Regardless of the sampling scheme used to obtain patients for the study, the characteristics of the study patients (e.g., age, gender, comorbidities, stage of disease) should be reported.
Well-defined diagnostic test      The diagnostic test must be clearly defined and applied in the same fashion to all study patients.
Gold standard/reference standard  The true disease status of each study patient must be determined by a test or procedure that is infallible, or nearly so.
Sample of interpreters            If the test relies on a trained observer to interpret it, then two or more such observers are needed to independently interpret the test [15].
Blinded interpretations           The gold standard should be conducted and interpreted blinded to the results of the diagnostic test, and the diagnostic test should be performed and interpreted blinded to the results of the gold standard.
Standard reporting of findings    The results of the study should be reported following published guidelines for the reporting of diagnostic test accuracy [16].

Studies of diagnostic test accuracy require both subjects with and subjects without the disease of interest. If one of these populations is not represented in the study, then either sensitivity or specificity cannot be calculated. We stress that reporting one without reference to the other is uninformative and often misleading. The number of patients needed for diagnostic accuracy studies depends on the phase of the study, the clinical setting in which the test will be applied (for example, screening or diagnostic), and certain characteristics of the patients and the test itself (for example, does the test require interpretation by human observers?). Statistical methods are available for determining the appropriate sample size for diagnostic accuracy studies [3, 14].

Studies of diagnostic test accuracy also require a test or procedure for determining the true disease status of each patient. This test or procedure is called the "gold standard" (or "standard of reference" or "reference standard," particularly when there is no perfect gold standard). The gold, or reference, standard must be conducted and interpreted blinded to the diagnostic test results to avoid bias. Common standards of reference in radiology studies are surgery, pathology results, and clinical follow-up. For example, in the study of Carpenter et al. [2] of the accuracy of MR venography for detecting deep venous thrombosis, contrast venography was used as the reference standard. Sometimes a study uses more than one type of reference standard. For example, in a study to assess the accuracy of mammography, patients with a suspicious lesion on mammography might undergo core biopsy and/or surgery, whereas patients with a negative mammogram would need to be followed for 2 years either to confirm that the patient was cancer free or to detect missed cancers on follow-up screenings. Note that when using different reference standards for patients with positive and negative test results, it is important that all the reference standards be infallible, or nearly so. One form of workup bias occurs when patients with one test result undergo a less rigorous reference standard than patients with a different test result [3].

Determining the appropriate reference standard for a study often is the most difficult part of designing a diagnostic accuracy study. Reference standards should be infallible, or nearly so. This is difficult, however, because even pathology is not infallible: it is an interpretive field relying on subjective assessment by human observers with varying skill levels. One example is the reader variability in the pathologic interpretation of borderline intraductal breast carcinoma versus atypical ductal hyperplasia. Some pathologists may interpret the lesion as intraductal cancer, whereas others may interpret the same lesion as atypical ductal hyperplasia. Although often we have to accept that a reference standard is not perfect, it is important that it be nearly infallible. If the reference standard is not nearly infallible, then imperfect gold standard bias can lead to unreliable and misleading estimates of accuracy. Zhou et al. [3] discuss imperfect gold standard bias and possible solutions in detail.

In other situations, no reference standard is available (for example, epilepsy), or it is unethical to subject patients to the reference standard because it poses a risk (for example, an invasive test such as catheter angiography). In these situations, we can at least correlate the test results with other tests' findings and with clinical outcome, even if we cannot report the test's sensitivity and specificity.

It is never an option to omit from the calculation of sensitivity and specificity those patients without a diagnosis confirmed by a reference standard. Such studies yield erroneous estimates of test accuracy due to a form of workup bias called verification bias [17, 18].
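The distortion produced by verification bias can be seen with a small worked example. The numbers below are hypothetical, chosen only for illustration and not taken from the article:

```python
# Hypothetical cohort: 1,000 patients, 10% disease prevalence, and a test
# with true sensitivity 0.80 and specificity 0.90.
tp, fn = 80, 20    # 100 diseased patients:    80 test-positive, 20 test-negative
fp, tn = 90, 810   # 900 nondiseased patients: 90 test-positive, 810 test-negative

# Suppose all test-positive patients receive the reference standard but
# only 10% of test-negative patients do, and the unverified patients are
# (wrongly) dropped from the analysis.
verified = 0.10
v_fn = fn * verified   # 2 verified false-negatives
v_tn = tn * verified   # 81 verified true-negatives

biased_se = tp / (tp + v_fn)    # 80/82  ~ 0.98 (true value 0.80)
biased_sp = v_tn / (v_tn + fp)  # 81/171 ~ 0.47 (true value 0.90)
```

In this scenario, sensitivity appears inflated and specificity deflated; statistical corrections for this bias are discussed by Zhou et al. [3].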




This is one of the most common types of bias in radiology studies [19], and it is counterintuitive: investigators often believe they are getting more reliable estimates of accuracy by excluding cases in which the reference standard was not performed. If, however, the diagnostic test results were used in the decision of whether to perform the reference standard procedure, then verification bias most likely is present. For example, if the results of MR venography are used to determine which patients will undergo contrast venography, and if patients who did not undergo contrast venography are excluded from the calculations of the test's accuracy, then verification bias exists. Zhou et al. [3] discuss verification bias from a statistical standpoint and offer a variety of solutions.

Summary

We conclude with a summary of five key points in the clinical evaluation of diagnostic tests:
1. Sensitivity and specificity always should be reported together.
2. ROC curves allow a comprehensive assessment and comparison of diagnostic test accuracy.
3. PPV and NPV cannot be interpreted correctly without knowing the prevalence of disease in the study sample.
4. Patients who did not undergo the reference standard procedure should never be omitted from studies of diagnostic test accuracy.
5. Published guidelines should be followed when reporting the findings from studies of diagnostic test accuracy.

References
1. Gehlbach SH. Interpretation: sensitivity, specificity, and predictive value. In: Gehlbach SH, ed. Interpreting the medical literature. New York: McGraw-Hill, 1993:129–139
2. Carpenter JP, Holland GA, Baum RA, Owen RS, Carpenter JT, Cope C. Magnetic resonance venography for the detection of deep venous thrombosis: comparison with contrast venography and duplex Doppler ultrasonography. J Vasc Surg 1993;18:734–741
3. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York: Wiley & Sons, 2002
4. Herts BR, Coll DM, Novick AC, Obuchowski N, Linnell G, Wirth SL, Baker ME. Enhancement characteristics of papillary renal neoplasms revealed on triphasic helical CT of the kidneys. AJR 2002;178:367–372
5. Metz CE. ROC methodology in radiological imaging. Invest Radiol 1986;21:720–733
6. Obuchowski NA. Receiver operating characteristic (ROC) analysis. AJR 2005 (in press)
7. Nyberg DA, Mack LA, Hirsch J, Mahony BS. Abnormalities of fetal cranial contour in sonographic detection of spina bifida: evaluation of the "lemon" sign. Radiology 1988;167:387–392
8. Joseph L, Reinhold C. Introduction to probability theory and sampling distributions. AJR 2003;180:917–923
9. Filly RA. The "lemon" sign: a clinical perspective. Radiology 1988;167:573–575
10. Levey AS, Pauker SG, Kassirer JP, et al. Occult intracranial aneurysms in polycystic kidney disease: when is cerebral arteriography indicated? N Engl J Med 1983;308:986–994
11. Jarvik JG. The research framework. AJR 2001;176:873–877
12. Ransohoff DJ, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med 1978;299:926–930
13. Sox HC Jr, Blatt MA, Higgins MC, Marton KI. Medical decision making. Boston: Butterworth-Heinemann, 1988
14. Beam CA. Strategies for improving power in diagnostic radiology research. AJR 1992;159:631–637
15. Obuchowski NA. How many observers in clinical studies of medical imaging? AJR 2004;182:867–869
16. Bossuyt PM, Reitsma JB, Bruns DE, et al. Toward complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Acad Radiol 2003;10:664–669
17. Begg CB, McNeil BJ. Assessment of radiologic tests, control of bias, and other design considerations. Radiology 1988;167:565–569
18. Black WC. How to evaluate the radiology literature. AJR 1990;154:17–22
19. Reid MC, Lachs MS, Feinstein AR. Use of methodologic standards in diagnostic test research: getting better but still not good. JAMA 1995;274:645–651

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004




Fundamentals of Clinical Research for Radiologists

ROC Analysis

Nancy A. Obuchowski1

Received October 28, 2004; accepted after revision November 3, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 14th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1Department of Biostatistics and Epidemiology, Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH. Address correspondence to N. Obuchowski.

AJR 2005;184:364–372
0361–803X/05/1842–364
© American Roentgen Ray Society

In this module we describe the standard methods for characterizing and comparing the accuracy of diagnostic and screening tests. We motivate the use of the receiver operating characteristic (ROC) curve, provide definitions and interpretations for the common measures of accuracy derived from the ROC curve (e.g., the area under the ROC curve), and present recent examples of ROC studies in the radiology literature. We describe the basic statistical methods for fitting ROC curves, comparing them, and determining sample size for studies using ROC curves. We briefly describe the MRMC (multiple-reader, multiple-case) ROC paradigm. We direct the interested reader to available software for analyzing ROC studies and to literature on more advanced statistical methods of ROC analysis.

Why ROC?

In module 13 [1], we defined the basic measures of accuracy: sensitivity (the probability the diagnostic test is positive for disease for a patient who truly has the disease) and specificity (the probability the diagnostic test is negative for disease for a patient who truly does not have the disease). These measures require a decision rule (or positivity threshold) for classifying the test results as either positive or negative. For example, in mammography the BI-RADS (Breast Imaging Reporting and Data System) scoring system is used to classify mammograms as normal, benign, probably benign, suspicious, or malignant. One positivity threshold is classifying probably benign, suspicious, and malignant findings as positive (and classifying normal and benign findings as negative). Another positivity threshold is classifying suspicious and malignant findings as positive. Each threshold leads to different estimates of sensitivity and specificity. Here, the second threshold would have higher specificity than the first but lower sensitivity. Also, note that trained mammographers use the scoring system differently. Even the same mammographer may use the scoring system differently on different reviewing occasions (e.g., classifying the same mammogram as probably benign on one interpretation and as suspicious on another), leading to different estimates of sensitivity and specificity even with the same threshold.

Which decision threshold should be used to classify test results? How will the choice of a decision threshold affect comparisons between two diagnostic tests or between two radiologists? These are critical questions when computing sensitivity and specificity, yet the choice of the decision threshold is often arbitrary.

ROC curves, although constructed from sensitivity and specificity, do not depend on the decision threshold. In an ROC curve, every possible decision threshold is considered. An ROC curve is a plot of a test's false-positive rate (FPR), or 1 − specificity (plotted on the horizontal axis), versus its sensitivity (plotted on the vertical axis). Each point on the curve represents the sensitivity and FPR at a different decision threshold. The plotted (FPR, sensitivity) coordinates are connected with line segments to construct an empiric ROC curve. Figure 1 illustrates an empiric ROC curve constructed from the fictitious mammography data in Table 1. The empiric ROC curve has four points corresponding to the four decision thresholds described in Table 1.

An ROC curve begins at the (0, 0) coordinate, corresponding to the strictest decision threshold whereby all test results are negative for disease (Fig. 1). The ROC curve ends at the (1, 1) coordinate, corresponding to the

364 AJR:184, February 2005



most lenient decision threshold whereby all test results are positive for disease. An empiric ROC curve has h − 1 additional coordinates, where h is the number of unique test results in the sample. In Table 1 there are 200 test results, one for each of the 200 patients in the sample, but there are only five unique results: normal, benign, probably benign, suspicious, and malignant. Thus, h = 5, and there are four coordinates plotted in Figure 1 corresponding to the four decision thresholds described in Table 1.

TABLE 1: Construction of Receiver Operating Characteristic Curve Based on Fictitious Mammography Data

Mammography Results    Pathology/Follow-Up Results        Decision Rules 1–4
(BI-RADS Score)        Not Malignant     Malignant        FPR          Sensitivity
Normal                 65                5                (1) 35/100   95/100
Benign                 10                15               (2) 25/100   80/100
Probably benign        15                10               (3) 10/100   70/100
Suspicious             7                 60               (4) 3/100    10/100
Malignant              3                 10
Total                  100               100

Note.—Decision rule 1 classifies normal mammography findings as negative; all others are positive. Decision rule 2 classifies normal and benign mammography findings as negative; all others are positive. Decision rule 3 classifies normal, benign, and probably benign findings as negative; all others are positive. Decision rule 4 classifies normal, benign, probably benign, and suspicious findings as negative; malignant is the only finding classified as positive. BI-RADS = Breast Imaging Reporting and Data System, FPR = false-positive rate.

Fig. 1.—Empiric and fitted (or "smooth") receiver operating characteristic (ROC) curves constructed from mammography data in Table 1. Four labeled points on empiric curve (dotted line) correspond to four decision thresholds used to estimate sensitivity and specificity. Area under curve (AUC) for empiric ROC curve is 0.863 and for fitted curve (solid line) is 0.876.

The line connecting the (0, 0) and (1, 1) coordinates is called the "chance diagonal" and represents the ROC curve of a diagnostic test with no ability to distinguish patients with versus those without disease. An ROC curve that lies above the chance diagonal, such as the ROC curve for our fictitious mammography example, has some diagnostic ability. The farther away an ROC curve is from the chance diagonal, and therefore the closer to the upper left-hand corner, the better the discriminating power and diagnostic accuracy of the test.

In characterizing the accuracy of a diagnostic (or screening) test, the ROC curve of the test provides much more information about how the test performs than just a single estimate of the test's sensitivity and specificity [1, 2]. Given a test's ROC curve, a clinician can examine the trade-offs in sensitivity versus specificity for various decision thresholds. Based on the relative costs of false-positive and false-negative errors and the pretest probability of disease, the clinician can choose the optimal decision threshold for each patient. This idea is discussed in more detail in a later section of this article. Often, patient management is more complex than is allowed with a decision threshold that classifies the test results into positive or negative. For example, in mammography suspicious and malignant findings are usually followed up with biopsy, probably benign findings usually result in a follow-up mammogram in 3–6 months, and normal and benign findings are considered negative.

When comparing two or more diagnostic tests, ROC curves are often the only valid method of comparison. Figure 2 illustrates two scenarios in which an investigator, comparing two diagnostic tests, could be misled by relying on only a single sensitivity–specificity pair. Consider Figure 2A. Suppose a more expensive or risky test (represented by ROC curve Y) was reported to have the following accuracy: sensitivity = 0.40, specificity = 0.90 (labeled as coordinate 1 in Fig. 2A), whereas a less expensive or less risky test (represented by ROC curve X) was reported to have the following accuracy: sensitivity = 0.80, specificity = 0.65 (labeled as coordinate 2 in Fig. 2A). If the investigator is looking for the test with better specificity, then he or she may choose the more expensive, risky test, not realizing that a simple change in the decision threshold of the less expensive test could provide the desired specificity at an even higher sensitivity (coordinate 3 in Fig. 2A).

Now consider Figure 2B. The ROC curve for test Z is superior to that of test X for a narrow range of FPRs (0.0–0.08); otherwise, diagnostic test X has superior accuracy. A comparison of the tests' sensitivities at low FPRs would be misleading unless the diagnostic tests are useful only at these low FPRs.

Fig. 2.—Two examples illustrate advantages of receiver operating characteristic (ROC) curves (see text for explanation) and comparing summary measures of accuracy.
A, ROC curve Y (dotted line) has same area under curve (AUC) as ROC curve X (solid line), but lower partial area under curve (PAUC) when false-positive rate (FPR) is ≤ 0.20, and higher PAUC when false-positive rate > 0.20.
B, ROC curve Z (dashed line) has same PAUC as curve X (solid line) when FPR ≤ 0.20 but lower AUC.

To compare two or more diagnostic tests, it is convenient to summarize the tests' accuracies with a single summary measure. Several such summary measures are used in the literature. One is Youden's index, defined as sensitivity + specificity − 1 [2]. Note, however, that Youden's index is affected by the choice of the decision threshold used to define sensitivity and specificity. Thus, different decision thresholds yield different values of Youden's index for the same diagnostic test.

Another summary measure commonly used is the probability of a correct diagnosis, often referred to simply as "accuracy" in the literature. It can be shown that the probability of a correct diagnosis is equivalent to

probability (correct diagnosis) = PREVs × sensitivity + (1 − PREVs) × specificity,   (1)

where PREVs is the prevalence of disease in the sample. That is, this summary measure of accuracy is affected not only by the choice of the decision threshold but also by the prevalence of disease in the study sample [2]. Thus, even slight changes in the prevalence of disease in the population of patients being tested can lead to different values of "accuracy" for the same test.

Summary measures of accuracy derived from the ROC curve describe the inherent accuracy of a diagnostic test because they are affected neither by the choice of the decision threshold nor by the prevalence of disease in the study sample. Thus, these summary measures are preferable to Youden's index and the probability of a correct diagnosis [2]. The most popular summary measure of accuracy is the area under the ROC curve, often denoted as "AUC" for area under curve. It ranges in value from 0.5 (chance) to 1.0 (perfect discrimination or accuracy). The chance diagonal in Figure 1 has an AUC of 0.5. In Figure 2A the areas under both ROC curves are the same, 0.841. There are three interpretations for the AUC: the average sensitivity over all false-positive rates; the average specificity over all sensitivities [3]; and the probability that, when presented with a randomly chosen patient with disease and a randomly chosen patient without disease, the results of the diagnostic test will rank the patient with disease as having higher suspicion for disease than the patient without disease [4].

The AUC is often too global a summary measure. Instead, for a particular clinical application, a decision threshold is chosen so that the diagnostic test will have a low FPR (e.g., FPR < 0.10) or a high sensitivity (e.g., sensitivity > 0.80). In these circumstances, the accuracy of the test at the specified FPRs (or specified sensitivities) is a more meaningful summary measure than the area under the entire ROC curve. The partial area under the ROC curve, PAUC (e.g., the PAUC where FPR < 0.10, or the PAUC where sensitivity > 0.80), is then an appropriate summary measure of the diagnostic test's accuracy. In Figure 2B, the PAUCs for the two tests where the FPR is between 0.0 and 0.20 are the same, 0.112. For interpretation purposes, the PAUC is often divided by its maximum value, given by the range (i.e., maximum − minimum) of the FPRs (or false-negative rates [FNRs]) [5]. The PAUC divided by its maximum value is called the partial area index and takes on values between 0.5 and 1.0, as does the AUC. It is interpreted as the average sensitivity for the FPRs examined (or the average specificity for the FNRs examined). In our example, the range of the FPRs of interest is 0.20 − 0.0 = 0.20; thus, the average sensitivity for FPRs less than 0.20 for diagnostic tests X and Z in Figure 2B is 0.56.

Although the ROC curve has many advantages in characterizing the accuracy of a diagnostic test, it also has some limitations. One criticism is that the ROC curve extends beyond the clinically relevant range of interpretation; the PAUC was developed to address this criticism. Another criticism is that it is possible for a diagnostic test with perfect discrimination between diseased and nondiseased patients to have an AUC of 0.5. Hilden [6] describes this unusual situation and offers solutions. When comparing two diagnostic tests' accuracies, the tests' ROC curves can cross, as in Figure 2; a comparison of these tests based only on their AUCs can be misleading. Again, the PAUC attempts to address this limitation. Last, some [6, 7] criticize the ROC curve, and especially the AUC, for not incorporating the pretest probability of disease and the costs of misdiagnoses.

The ROC Study

Weinstein et al. [1] describe the common features of a study of the accuracy of a diagnostic test. These include samples of patients both with and without the disease of interest and a reference standard for determining whether positive test results are true-positives or false-positives and whether negative test results are true-negatives or false-negatives. They also discuss the need to blind reviewers who are interpreting test images and other relevant biases common to these types of studies.

Fig. 3.—Unobserved binormal distribution that was assumed to underlie test results in Table 1. Distribution for nondiseased patients was arbitrarily centered at 0 with SD of 1 (i.e., µo = 0 and σo = 1). Binormal parameters were estimated to be A = 2.27 and B = 1.70. Thus, distribution for diseased patients is centered at µ1 = 1.335 with SD of σ1 = 0.588. Four cutoffs, z1, z2, z3, and z4, correspond to four decision thresholds in Table 1. If underlying test value is less than z1, then mammographer assigns test result of "normal." If the underlying test value is less than z2 but greater than z1, then mammographer assigns test result of "benign," and so forth.

In ROC studies we also require that the test results, or the interpretations of the test images, be assigned a numeric value or rank. These numeric measurements or ranks are the basis for defining the decision thresholds that yield the estimates of sensitivity and specificity that are plotted to form the ROC curve. Some diagnostic tests yield an objective measurement (e.g., attenuation value of a lesion). The decision thresholds for constructing the ROC curve are

366 AJR:184, February 2005


ROC Analysis
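For a diagnostic test that yields an objective, continuous measurement, this threshold sweep is easy to carry out directly. Below is a minimal sketch; the attenuation values are invented purely for illustration and are not data from the article:

```python
# Sketch: empiric ROC points for a test with an objective, continuous
# result. Every observed measurement value serves as a decision
# threshold; "test-positive" means the measurement is at or above it.
def roc_points(nondiseased, diseased):
    thresholds = sorted(set(nondiseased) | set(diseased), reverse=True)
    points = [(0.0, 0.0)]  # strictest rule: call no one positive
    for t in thresholds:
        fpr = sum(x >= t for x in nondiseased) / len(nondiseased)
        sens = sum(x >= t for x in diseased) / len(diseased)
        points.append((fpr, sens))
    return points  # the final, most lenient threshold gives (1.0, 1.0)

# Invented attenuation values (illustration only).
healthy = [12, 15, 18, 20, 22, 25]
lesions = [19, 24, 26, 30, 33, 35]
curve = roc_points(healthy, lesions)
```

Lowering the threshold can only add test-positives, so both coordinates are nondecreasing along the curve.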

Other diagnostic tests must be interpreted by a trained observer, often a radiologist, and so the interpretation is subjective. Two general scales are often used in radiology for observers to assign a value to their subjective interpretation of an image. One scale is the 5-point rank scale: 1 = definitely normal, 2 = probably normal, 3 = possibly abnormal or equivocal, 4 = probably abnormal, and 5 = definitely abnormal.

The other popular scale is the 0–100% confidence scale, where 0% implies that the observer is completely confident in the absence of the disease of interest, and 100% implies that the observer is completely confident in the presence of the disease of interest. The two scales have strengths and weaknesses [2, 8], but both are reasonably well suited to radiology research. In mammography a rating scale already exists, the BI-RADS score, which can be used to form decision thresholds from least to most suspicion for the presence of breast cancer.

When the diagnostic test requires a subjective interpretation by a trained reviewer, the reviewer becomes part of the diagnostic process [9]. Thus, to properly characterize the accuracy of the diagnostic test, we must include multiple reviewers in the study. This is the so-called MRMC, multiple-reader multiple-case, ROC study. Much has been written about the design and analysis of MRMC studies [10–20]. We mention here only the basic design of MRMC studies, and in a later subsection we describe their statistical analysis.

The usual design for the MRMC study is a factorial design, in which every reviewer interprets the image (or images if there is more than one test) of every patient. Thus, if there are R reviewers, C patients, and I diagnostic tests, then each reviewer interprets C × I images, and the study involves R × C × I total interpretations. The accuracy of each reviewer with each diagnostic test is characterized by an ROC curve, so R × I ROC curves are constructed. Constructing pooled or consensus ROC curves is not the goal of these studies. Rather, the primary goals are to document the variability in diagnostic test accuracy between reviewers and report the average, or typical, accuracy of reviewers. In order for the results of the study to be generalizable to the relevant patient and reviewer populations, representative samples from both populations are needed for the study. Often expert reviewers take part in studies of diagnostic test accuracy, but the accuracy for a nonexpert may be considerably less. An excellent illustration of the issues involved in sampling reviewers for an MRMC study can be found in the study by Beam et al. [21].

Examples of ROC Studies in Radiology

The radiology literature, and the clinical laboratory and more general medical literature, contain many excellent examples of how ROC curves are used to characterize the accuracy of a diagnostic test and to compare accuracies of diagnostic tests. We briefly describe here three recent examples of ROC curves being used in the radiology literature.

Kim et al. [22] conducted a prospective study to determine if rectal distention using warm water improves the accuracy of MRI for preoperative staging of rectal cancer. After MRI, the patients underwent surgical resection, considered the gold standard regarding the invasion of adjacent structures and regional lymph node involvement. Four observers, unaware of the pathology results, independently scored the MR images using 4- and 5-point rating scales. Using statistical methods for MRMC studies [13], the authors determined that typical reviewers' accuracy for determining outer wall penetration is improved with rectum distention, but that reviewer accuracy for determining regional lymph node involvement is not affected.

Osada et al. [23] used ROC analysis to assess the ability of MRI to predict fetal pulmonary hypoplasia. They imaged 87 fetuses, measuring both lung volume and signal intensity. An ROC curve based on lung volume showed that lung volume has some ability to discriminate between fetuses who will have good versus those who will have poor respiratory outcome after birth. An ROC curve based on the combined information from lung volume and signal intensity, however, has superior accuracy. For more information on the optimal way to combine measures or test results, see the article by Pepe and Thompson [24].

In a third study, Zheng et al. [25] assessed how the accuracy of a mammographic computer-aided detection (CAD) scheme was affected by restricting the maximum number of regions that could be identified as positive. Using a sample of 300 cases with a malignant mass and 200 normals, the investigators applied their CAD system, each time reducing the maximum number of positive regions that the CAD system could identify from seven to one. A special ROC technique called "free-response receiver operating characteristic curves" (FROC) was used. The horizontal axis of the FROC curve differs from the traditional ROC curve in that it gives the average number of false-positives per image. Zheng et al. concluded that limiting the maximum number of positive regions that the CAD could identify improves the overall accuracy of CAD in mammography. For more information on FROC curves and related methods, I refer you to other articles [26–29].

Obuchowski

Statistical Methods for ROC Analysis

Fitting Smooth ROC Curves

In Figure 1 we saw the empiric ROC curve for the test results in Table 1. The curve was constructed with line segments connecting the observed points on the ROC curve. Empiric ROC curves often have a jagged appearance, as seen in Figure 1, and often lie slightly below the "true," smooth, ROC curve—that is, the test's ROC curve if it were constructed with an infinite number of points (not just the four points in Fig. 1) and an infinitely large sample size. A smooth curve gives us a better idea of the relationship between the diagnostic test and the disease. In this subsection we describe some methods for constructing smooth ROC curves.

The most popular method of fitting a smooth ROC curve is to assume that the test results (e.g., the BI-RADS scores in Table 1) come from two unobserved distributions, one distribution for the patients with disease and one for the patients without the disease. Usually it is assumed that these two distributions can be transformed to normal distributions, referred to as the binormal assumption. It is the unobserved, underlying distributions that we assume can be transformed to follow a binormal distribution, and not the observed test results. Figure 3 illustrates the hypothesized unobserved binormal distribution estimated for the observed BI-RADS results in Table 1. Note how the distributions for the diseased and nondiseased patients overlap.

Let the unobserved binormal variables for the nondiseased and diseased patients have means µ0 and µ1, and variances σ0² and σ1², respectively. Then it can be shown [30] that the ROC curve is completely described by two parameters:

A = (µ1 − µ0) / σ1   (2)

B = σ0 / σ1.   (3)

(See Appendix 1 for a formula that links parameters A and B to the ROC curve.) Figure 4 illustrates three ROC curves. Parameter A was set to be constant at 1.0 and parameter B varies as follows: 0.33 (the underlying distribution of the diseased patients is three times more variable than that of the nondiseased patients), 1.0 (the two distributions have the same SD), and 3.0 (the underlying distribution of the nondiseased patients is three times more variable than that of the diseased patients). As one can see, the curves differ dramatically with changes in parameter B. Parameter A, on the other hand, determines how far the curve is above the chance diagonal (where A = 0); for a constant B parameter, the greater the value of A, the higher the ROC curve lies (i.e., greater accuracy).

[Figure 4 here: sensitivity vs false-positive rate for B = 0.33, 1.0, and 3.0.]
Fig. 4.—Three receiver operating characteristic (ROC) curves with same binormal parameter A (i.e., A = 1.0) but different values for parameter B of 3.0 (3σ1 = σ0), 1.0 (σ1 = σ0), and 0.33 (σ1 = 3σ0). When B = 3.0, ROC curve dips below chance diagonal; this is called an improper ROC curve [2].

Parameters A and B can be estimated from data such as in Table 1 using maximum likelihood methods [30, 31]. For the data in Table 1, the maximum likelihood estimates (MLEs) of parameters A and B are 2.27 and 1.70, respectively; the smooth ROC curve is given in Figure 1. Fortunately, some useful software [32] has been written to perform the necessary calculations of A and B, along with estimation of the area under the smooth curve (see next subsection), its SE and confidence interval (CI), and CIs for the ROC curve itself (see Appendix 1).

Dorfman and Alf [30] suggested a statistical test to evaluate whether the binormal assumption was reasonable for a given data set. Others [33, 34] have shown through empiric investigation and simulation studies that many different underlying distributions are well approximated by the binormal assumption.

When the diagnostic test results are themselves a continuous measurement (e.g., CT attenuation values, or measured lesion diameter), it may not be necessary to assume the existence of an unobserved, underlying distribution. Sometimes continuous-scale test results themselves follow a binormal distribution, but caution should be taken that the fit is good (see the article by Goddard and Hinberg [35] for a discussion of the resulting bias when the distribution is not truly binormal yet the binormal distribution is assumed). Zou et al. [36] suggest using a Box-Cox transformation to transform data to binormality.

TABLE 2: Estimating Area Under Empirical Receiver Operating Characteristic Curve

Test Result (Nondiseased)  Test Result (Diseased)  Score  No. of Pairs      Score × No. of Pairs
Normal                     Normal                  1/2    65 × 5 = 325      162.5
Normal                     Benign                  1      65 × 15 = 975     975
Normal                     Probably benign         1      65 × 10 = 650     650
Normal                     Suspicious              1      65 × 60 = 3,900   3,900
Normal                     Malignant               1      65 × 10 = 650     650
Benign                     Normal                  0      10 × 5 = 50       0
Benign                     Benign                  1/2    10 × 15 = 150     75
Benign                     Probably benign         1      10 × 10 = 100     100
Benign                     Suspicious              1      10 × 60 = 600     600
Benign                     Malignant               1      10 × 10 = 100     100
Probably benign            Normal                  0      15 × 5 = 75       0
Probably benign            Benign                  0      15 × 15 = 225     0
Probably benign            Probably benign         1/2    15 × 10 = 150     75
Probably benign            Suspicious              1      15 × 60 = 900     900
Probably benign            Malignant               1      15 × 10 = 150     150
Suspicious                 Normal                  0      7 × 5 = 35        0
Suspicious                 Benign                  0      7 × 15 = 105      0
Suspicious                 Probably benign         0      7 × 10 = 70       0
Suspicious                 Suspicious              1/2    7 × 60 = 420      210
Suspicious                 Malignant               1      7 × 10 = 70       70
Malignant                  Normal                  0      3 × 5 = 15        0
Malignant                  Benign                  0      3 × 15 = 45       0
Malignant                  Probably benign         0      3 × 10 = 30       0
Malignant                  Suspicious              0      3 × 60 = 180      0
Malignant                  Malignant               1/2    3 × 10 = 30       15
Total                                                     10,000 pairs      8,632.5
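Given estimates of A and B, the smooth binormal curve and its area can be evaluated directly. The sketch below relies on the standard binormal relationships — sensitivity = Φ(A + B × Φ⁻¹(FPR)) and area = Φ(A / √(1 + B²)) — which this article defers to its Appendix 1; only the Python standard library is assumed:

```python
from statistics import NormalDist

_phi = NormalDist().cdf          # standard normal CDF
_phi_inv = NormalDist().inv_cdf  # its inverse

def binormal_sensitivity(fpr, A, B):
    # Sensitivity of the binormal ROC curve at a given false-positive rate.
    return _phi(A + B * _phi_inv(fpr))

def binormal_auc(A, B):
    # Area under the smooth binormal ROC curve.
    return _phi(A / (1.0 + B * B) ** 0.5)

# The MLEs reported in the text for the Table 1 data:
auc = binormal_auc(2.27, 1.70)   # roughly 0.87

# With A = 1.0 and B = 3.0 the curve dips below the chance diagonal
# at low FPRs (the "improper" curve of Fig. 4).
improper = binormal_sensitivity(0.1, 1.0, 3.0) < 0.1
```

Note that when B = 1 the area reduces to Φ(A / √2), which is the same relationship used in the sample-size example later in the article (A = φ⁻¹(AUC) × 1.414).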


Alternatively, one can use software like ROCKIT [32] that will bin the test results into an optimal number of categories and apply the same maximum likelihood methods as mentioned earlier for rating data like the BI-RADS scores.

More elaborate models for the ROC curve that can take into account covariates (e.g., the patient's age, symptoms) have also been developed in the statistics literature [37–39] and will become more accessible as new software is written.

Estimating the Area Under the ROC Curve

Estimation of the area under the smooth curve, assuming a binormal distribution, is described in Appendix 1. In this subsection, we describe and illustrate estimation of the area under the empiric ROC curve. The process of estimating the area under the empiric ROC curve is nonparametric, meaning that no assumptions are made about the distribution of the test results or about any hypothesized underlying distribution. The estimation works for tests scored with a rating scale, a 0–100% confidence scale, or a true continuous-scale variable.

The process of estimating the area under the empiric ROC curve involves four simple steps: First, the test result of a patient with disease is compared with the test result of a patient without disease. If the former test result indicates more suspicion of disease than the latter test result, then a score of 1 is assigned. If the test results are identical, then a score of 1/2 is assigned. If the diseased patient has a test result indicating less suspicion for disease than the test result of the nondiseased patient, then a score of 0 is assigned. It does not matter which diseased and nondiseased patient you begin with. Using the data in Table 1 as an illustration, suppose we start with a diseased patient assigned a test result of "normal" and a nondiseased patient assigned a test result of "normal." Because their test results are the same, this pair is assigned a score of 1/2.

Second, repeat the first step for every possible pair of diseased and nondiseased patients in your sample. In Table 1 there are 100 diseased patients and 100 nondiseased patients, thus 10,000 possible pairs. Because there are only five unique test results, the 10,000 possible pairs can be scored easily, as in Table 2.

Third, sum the scores of all possible pairs. From Table 2, the sum is 8,632.5.

Fourth, divide the sum from step 3 by the number of pairs in the study sample. In our example we have 10,000 pairs. Dividing the sum from step 3 by 10,000 gives us 0.86325, which is our estimate of the area under the empiric ROC curve. Note that this method of estimating the area under the empiric ROC curve gives the same result as one would obtain by fitting trapezoids under the curve and summing the areas of the trapezoids (the so-called trapezoid method).

The variance of the estimated area under the empiric ROC curve is given by DeLong et al. [40] and can be used for constructing CIs; software programs are available for estimating the nonparametric AUC and its variance [41].

Comparing the AUCs or PAUCs of Two Diagnostic Tests

To test whether the AUC (or PAUC) of one diagnostic test (denoted by AUC1) equals the AUC (or PAUC) of another diagnostic test (AUC2), the following test statistic is calculated:

Z = [AUC1 − AUC2] / √[var1 + var2 − 2 × cov],   (4)

where var1 is the estimated variance of AUC1, var2 is the estimated variance of AUC2, and cov is the estimated covariance between AUC1 and AUC2. When different samples of patients undergo the two diagnostic tests, the covariance equals zero. When the same sample of patients undergoes both diagnostic tests (i.e., a paired study design), then the covariance is not generally equal to zero and is often positive. The estimated variances and covariances are standard output for most ROC software [32, 41].

The test statistic Z follows a standard normal distribution. For a two-tailed test with significance level of 0.05, the critical values are −1.96 and +1.96. If Z is less than −1.96, then we conclude that the accuracy of diagnostic test 2 is superior to that of diagnostic test 1; if Z exceeds +1.96, then we conclude that the accuracy of diagnostic test 1 is superior to that of diagnostic test 2.

A two-sided CI for the difference in AUC (or PAUC) between two diagnostic tests can be calculated from

LL = [AUC1 − AUC2] − zα/2 × √[var1 + var2 − 2 × cov]   (5)

UL = [AUC1 − AUC2] + zα/2 × √[var1 + var2 − 2 × cov],   (6)

where LL is the lower limit of the CI, UL is the upper limit, and zα/2 is a value from the standard normal distribution corresponding to a probability of α/2. For example, to construct a 95% CI, α = 0.05, thus zα/2 = 1.96.

Consider the ROC curves in Figure 2A. The estimated areas under the smooth ROC curves of the two tests are the same, 0.841. The PAUCs where the FPR is less than 0.20, however, differ. From the estimated variances and covariance in Table 3, the value of the Z statistic for comparing the PAUCs is 1.77, which is not statistically significant. The 95% CI for the difference in PAUCs is more informative: (−0.004 to 0.086); the CI for the partial area index is (−0.02 to 0.43). The CI contains large positive differences, suggesting that more research is needed to investigate the relative accuracies of these two diagnostic tests for FPRs less than 0.20.

TABLE 3: Fictitious Data Comparing the Accuracy of Two Diagnostic Tests

                                   ROC Curve X   ROC Curve Y
Estimated AUC                      0.841         0.841
Estimated SE of AUC                0.041         0.045
Estimated PAUC where FPR < 0.20    0.112         0.071
Estimated SE of PAUC               0.019         0.014
Estimated covariance               0.00001
Z test comparing PAUCs             Z = [0.112 − 0.071] / √[0.019² + 0.014² − 2 × 0.00001]
95% CI for difference in PAUCs     [0.112 − 0.071] ± 1.96 × √[0.019² + 0.014² − 2 × 0.00001]
Note.—AUC = area under the curve, PAUC = partial area under the curve, CI = confidence interval.

Analysis of MRMC ROC Studies

Multiple published methods exist for performing the statistical analysis of MRMC studies [13–20]. The methods are used to construct CIs for diagnostic accuracy and statistical tests for assessing differences in accuracy between tests. A statistical overview of the methods is given elsewhere [10]. Here, we briefly mention some of the key issues of MRMC ROC analyses.
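The four scoring steps and the comparisons of equations 4–6 can be verified numerically. A minimal sketch, using the category counts tallied in Table 2 and the fictitious values of Table 3:

```python
from math import sqrt

# Counts per BI-RADS category, ordered least to most suspicious
# (the Table 1 data, as tallied in Table 2).
nondiseased = [65, 10, 15, 7, 3]   # normal ... malignant
diseased = [5, 15, 10, 60, 10]

# Steps 1-3: score each (nondiseased, diseased) pair 1, 1/2, or 0,
# and sum over all pairs; step 4: divide by the number of pairs.
total = 0.0
for i, n in enumerate(nondiseased):
    for j, d in enumerate(diseased):
        if j > i:
            total += 1.0 * n * d    # diseased result more suspicious
        elif j == i:
            total += 0.5 * n * d    # tied results
pairs = sum(nondiseased) * sum(diseased)   # 100 x 100 = 10,000
auc = total / pairs                        # 0.86325, as in the text

# Equation 4 and the CI of equations 5 and 6, with the Table 3 values:
diff = 0.112 - 0.071
se = sqrt(0.019 ** 2 + 0.014 ** 2 - 2 * 0.00001)
z = diff / se                               # about 1.77
ci = (diff - 1.96 * se, diff + 1.96 * se)   # about (-0.004, 0.086)
```

Because |Z| is less than 1.96, the difference in PAUCs is not statistically significant at the 0.05 level, matching the conclusion drawn in the text.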


Fixed- or random-effects models.—The MRMC study has two samples, a sample of patients and a sample of reviewers. If the study results are to be generalized to patients similar to ones in the study sample and to reviewers similar to ones in the study sample, then a statistical analysis that treats both patients and reviewers as random effects should be used [13, 14, 17–20]. If the study results are to be generalized to just patients similar to ones in the study sample, then the patients are treated as random effects but the reviewers should be treated as fixed effects [13–20]. Some of the statistical methods can treat reviewers as either random or fixed, whereas other methods treat reviewers only as fixed effects.

Parametric or nonparametric.—Some of the methods rely on models that make strong assumptions about how the accuracies of the reviewers are correlated and distributed (parametric methods) [13, 14], other methods are more flexible [15, 20], and still others make no assumptions [16–19] (nonparametric methods). The parametric methods may be more powerful when their assumptions are met, but often it is difficult to determine if the assumptions are met.

Covariates.—Reviewers' accuracy may be affected by their training or experience or by characteristics of the patients (e.g., age, sex, stage of disease, comorbidities). These variables are called covariates. Some of the statistical methods [15, 20] have models that can include covariates. These models provide valuable insight into the variability between reviewers and between patients.

Software.—Software is available for public use for some of the methods [32, 42, 43]; the authors of the other methods may be able to provide software if contacted.

Determining Sample Size for ROC Studies

Many issues must be considered in determining the number of patients needed for an ROC study. We list several of the key issues and some useful references here, followed by a simple illustration. Software is also available for determining the required sample size for some ROC study designs [32, 41].

1. Is it an MRMC ROC study? Many radiology studies include more than one reviewer but are not considered MRMC studies. MRMC studies usually involve five or more reviewers and focus on estimating the average accuracy of the reviewers. In contrast, many radiology studies include two or three reviewers to get some idea of the interreviewer variability. Estimation of the required sample size for MRMC studies requires balancing the number of reviewers in the reviewer sample with the number of patients in the patient sample. See [14, 44] for formulae for determining sample sizes for MRMC studies and [45] for sample size tables for MRMC studies. Sample size determination for non-MRMC studies is based on the number of patients needed.

2. Will the study involve a single diagnostic test or compare two or more diagnostic tests? ROC studies comparing two or more diagnostic tests are common. These studies focus on the difference between AUCs or PAUCs of the two (or more) diagnostic tests. Sample size can be based either on planning for enough statistical power to detect a clinically important difference, or on constructing a CI for the difference in accuracies that is narrow enough to make clinically relevant conclusions from the study. In studies of one diagnostic test, we often focus on the magnitude of the test's AUC or PAUC, basing sample size on the desired width of a CI.

3. If two or more diagnostic tests are being compared, will it be a paired or unpaired study design, and are the accuracies of the tests hypothesized to be different or equivalent? Paired designs almost always require fewer patients than an unpaired design, and so are used whenever they are logistically, ethically, and financially feasible. Studies that are performed to determine whether two or more tests have the same accuracy are called equivalency studies. Often in radiology a less invasive diagnostic test, or a quicker imaging sequence, is developed and compared with the standard test. The investigator wants to know if the new test is similar in accuracy to the standard test. Equivalency studies often require a larger sample size than studies in which the goal is to show that one test has superior accuracy to another test. The reason is that to show equivalence the investigator must rule out all large differences between the tests—that is, the CI for the difference must be very narrow.

4. Will the patients be recruited in a prospective or retrospective fashion? In prospective designs, patients are recruited based on their signs or symptoms, so at the time of recruitment it is unknown whether the patient has the disease of interest. In contrast, in retrospective designs patients are recruited based on their known true disease status (as determined by the gold or reference standard) [2]. Both designs are used commonly in radiology. Retrospective studies often require fewer patients than prospective designs.

5. What will be the ratio of nondiseased to diseased patients in the study sample? Let k denote the ratio of the number of nondiseased to diseased patients in the study sample. For retrospective studies k is usually decided in the design phase of the study. For prospective designs k is unknown in the design phase but can be estimated by (1 − PREVp) / PREVp, where PREVp is the prevalence of disease in the relevant population. A range of values for PREVp should be considered when determining sample size.

6. What summary measure of accuracy will be used? In this article we have focused mainly on the AUC and PAUC, but others are possible (see [2]). The choice of summary measure determines which variance function formula will be used in calculating sample size. Note that the variance function is related to the variance by the following formula: variance = VF / N, where VF is the variance function and N is the number of study patients with disease.

7. What is the conjectured accuracy of the diagnostic test? The conjectured accuracy is needed to determine the expected difference in accuracy between two or more diagnostic tests. Also, the magnitude of the accuracy affects the variance function. In the following example, we present the variance function for the AUC; see Zhou et al. [2] for formulae for other variance functions.

Consider the following example. Suppose an investigator wants to conduct a study to determine if MRI can distinguish benign from malignant breast lesions. Patients with a suspicious lesion detected on mammography will be prospectively recruited to undergo MRI before biopsy. The pathology results will be the reference standard. The MR images will be interpreted independently by two reviewers; they will score the lesions using a 0–100% confidence scale. An ROC curve will be constructed for each reviewer; AUCs will be estimated, and 95% CIs for the AUCs will be constructed. If MRI shows some promise, the investigator will plan a larger MRMC study.

The investigator expects 20–40% of patients to have pathologically confirmed breast cancer (PREVp = 0.2–0.4); thus, k = 1.5–4.0. The investigator expects the AUC of MRI to be approximately 0.80 or higher. The variance function of the AUC often used for sample size calculations is as follows:

VF = 0.0099 × e^(−A²/2) × [(5 × A² + 8) + (A² + 8) / k],   (7)

where A is the parameter from the binormal distribution.
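The variance function of equation 7, combined with the CI-based sample-size formula worked through in the text that follows (N = [zα/2² × VF] / L²), reproduces the example's numbers. A sketch, using the rounded value φ⁻¹(0.80) ≈ 0.84 exactly as the worked example does:

```python
from math import exp, ceil

A = 1.414 * 0.84   # A = phi_inv(AUC) x 1.414 with AUC = 0.80, as in the text

def variance_function(A, k):
    # Equation 7: VF = 0.0099 x e^(-A^2/2) x [(5A^2 + 8) + (A^2 + 8) / k]
    return 0.0099 * exp(-A * A / 2) * ((5 * A * A + 8) + (A * A + 8) / k)

def n_diseased(vf, half_width=0.05, z=1.96):
    # Patients with disease needed so the 95% CI has the given half-width.
    return (z * z * vf) / (half_width ** 2)

for k in (4.0, 1.5):   # k = ratio of nondiseased to diseased patients
    vf = variance_function(A, k)
    n = n_diseased(vf)
    print(k, round(vf, 5), round(n, 1), ceil(n * (1 + k)))
```

The totals agree with the text: about 654 patients when the sample prevalence of disease is 20% (k = 4.0) and about 401 when it is 40% (k = 1.5).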


Parameter A can be calculated from A = φ⁻¹(AUC) × 1.414, where φ⁻¹ is the inverse of the cumulative normal distribution function [2]. For our example, AUC = 0.80; thus φ⁻¹(0.80) = 0.84 and A = 1.18776. The variance function, VF, equals 0.00489 × [15.05387 + 9.41077 / 4.0] = 0.08512, where we have set k = 4.0. For k = 1.5, the VF = 0.10429.

Suppose the investigator wants a 95% CI no wider than 0.10. That is, if the estimated AUC from the study is 0.80, then the lower bound of the CI should not be less than 0.75 and the upper bound should not exceed 0.85. A formula for calculating the required sample size for a CI is

N = [zα/2² × VF] / L²   (8)

where zα/2 = 1.96 for a 95% CI and L is the desired half-width of the CI. Here, L = 0.05. N is the number of patients with disease needed for the study; the total number of patients needed for the study is N × (1 + k). For our example, N equals [1.96² × 0.08512] / 0.05² = 130.8 for k = 4.0, and 160.3 for k = 1.5. Thus, depending on the unknown prevalence of breast cancer in the study sample, the investigator needs to recruit perhaps as few as 401 total patients (if the sample prevalence is 40%) but perhaps as many as 654 (if the sample prevalence is only 20%).

Finding the Optimal Point on the Curve

Metz [46] derived a formula for determining the optimal decision threshold on the ROC curve, where "optimal" is in terms of minimizing the overall costs. "Costs" can be defined as monetary costs, patient morbidity and mortality, or both. The slope, m, of the ROC curve at the optimal decision threshold is

m = [(1 − PREVp) / PREVp] × [CFP − CTN] / [CFN − CTP]   (9)

where CFP, CTN, CFN, and CTP are the costs of false-positive, true-negative, false-negative, and true-positive results, respectively. Once m is estimated, the optimal decision threshold is the one for which sensitivity and specificity maximize the following expression: [sensitivity − m(1 − specificity)] [47].

Examining the ROC curve labeled X in Figure 2, we see that the slope is very steep in the lower left where both the sensitivity and FPR are low, and is close to zero at the upper right where the sensitivity and FPR are high. The slope takes on a high value when the patient is unlikely to have the disease or the cost of a false-positive is large; for these situations, a low FPR is optimal. The slope takes on a value near zero when the patient is likely to have the disease or treatment for the disease is beneficial and carries little risk to healthy patients; in these situations, a high sensitivity is optimal [3]. A nice example of a study using this equation is given in [48]. See also work by Greenhouse and Mantel [49] and Linnet [50] for determining the optimal decision threshold when a desired level for the sensitivity, specificity, or both is specified a priori.

Conclusion

Applications of ROC curves in the medical literature have increased greatly in the past few decades, and with this expansion many new statistical methods of ROC analysis have been developed. These include methods that correct for common biases like verification bias and imperfect gold standard bias, methods for combining the information from multiple diagnostic tests (i.e., optimal combinations of tests) and multiple studies (i.e., meta-analysis), and methods for analyzing clustered data (i.e., multiple observations from the same patient). Interested readers can search directly for these statistical methods or consult two recently published books on ROC curve analysis and related topics [2, 39]. Available software for ROC analysis allows investigators to easily fit, evaluate, and compare ROC curves [41, 51], although users should be cautious about the validity of the software and check the underlying methods and assumptions.

Acknowledgments

I thank the two series' coeditors and an outside statistician for their helpful comments on an earlier draft of this manuscript.

References

1. Weinstein S, Obuchowski NA, Lieber ML. Clinical evaluation of diagnostic tests. AJR 2005;184:14–19
2. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: Wiley-Interscience, 2002
3. Metz CE. Some practical issues of experimental design and data analysis in radiologic ROC studies. Invest Radiol 1989;24:234–245
4. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29–36
5. McClish DK. Analyzing a portion of the ROC
6. Hilden J. The area under the ROC curve and its competitors. Med Decis Making 1991;11:95–101
7. Hilden J. Prevalence-free utility-respecting summary indices of diagnostic power do not exist. Stat Med 2000;19:431–440
8. Wagner RF, Beiden SV, Metz CE. Continuous versus categorical data for ROC analysis: some quantitative considerations. Acad Radiol 2001;8:328–334
9. Beam CA, Baker ME, Paine SS, Sostman HD, Sullivan DC. Answering unanswered questions: proposal for a shared resource in clinical diagnostic radiology research. Radiology 1992;183:619–620
10. Obuchowski NA, Beiden SV, Berbaum KS, et al. Multireader multicase receiver operating characteristic analysis: an empirical comparison of five methods. Acad Radiol 2004;11:980–995
11. Obuchowski NA. Multi-reader ROC studies: a comparison of study designs. Acad Radiol 1995;2:709–716
12. Roe CA, Metz CE. Variance-component modeling in the analysis of receiver operating characteristic index estimates. Acad Radiol 1997;4:587–600
13. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992;27:723–731
14. Obuchowski NA. Multi-reader multi-modality ROC studies: hypothesis testing and sample size estimation using an ANOVA approach with dependent observations. With rejoinder. Acad Radiol 1995;2:S22–S29
15. Toledano AY, Gatsonis C. Ordinal regression methodology for ROC curves derived from correlated data. Stat Med 1996;15:1807–1826
16. Song HH. Analysis of correlated ROC areas in diagnostic testing. Biometrics 1997;53:370–382
17. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 2000;7:341–349
18. Beiden SV, Wagner RF, Campbell G, Metz CE, Jiang Y. Components-of-variance models for random-effects ROC analysis: the case of unequal variance structure across modalities. Acad Radiol 2001;8:605–615
19. Beiden SV, Wagner RF, Campbell G, Chan HP. Analysis of uncertainties in estimates of components of variance in multivariate ROC analysis. Acad Radiol 2001;8:616–622
20. Ishwaran H, Gatsonis CA. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000;28:731–750
21. Beam CA, Layde PM, Sullivan DC. Variability in the interpretation of screening mammograms by US radiologists: findings from a national sample. Arch Intern Med 1996;156:209–213
22. Kim MJ, Lim JS, Oh YT, et al. Preoperative MRI of rectal cancer with and without rectal water filling: an intraindividual comparison. AJR 2004;182:1469–1476
23. Osada H, Kaku K, Masuda K, Iitsuka Y, Seki K, Sekiya S. Quantitative and qualitative evaluations of fetal lung with MR imaging. Radiology 2004;231:887–892
24. Pepe MS, Thompson ML. Combining diagnostic test results to increase accuracy. Biostatistics
tient is unlikely to have the disease or the cost curve. Med Decis Making 1989;9:190–195 2000;1:123–140

AJR:184, February 2005 371


Obuchowski

25. Zheng B, Leader JK, Abrams G, et al. Computer-aided detection schemes: the effect of limiting the number of cued regions in each case. AJR 2004;182:579–583
26. Chakraborty DP, Winter LHL. Free-response methodology: alternative analysis and a new observer-performance experiment. Radiology 1990;174:873–881
27. Chakraborty DP. Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data. Med Phys 1989;16:561–568
28. Swensson RG. Unified measurement of observer performance in detecting and localizing target objects on images. Med Phys 1996;23:1709–1725
29. Obuchowski NA, Lieber ML, Powell KA. Data analysis for detection and localization of multiple abnormalities with application to mammography. Acad Radiol 2000;7:516–525
30. Dorfman DD, Alf E. Maximum likelihood estimation of parameters of signal detection theory: a direct solution. Psychometrika 1968;33:117–124
31. Dorfman DD, Alf E. Maximum-likelihood estimation of parameters of signal detection theory and determination of confidence intervals: rating method data. J Math Psychol 1969;6:487–496
32. ROCKIT and LABMRMC. Available at: xray.bsd.uchicago.edu/krl/KRL_ROC/software_index.htm. Accessed December 13, 2004
33. Swets JA. Empirical ROCs in discrimination and diagnostic tasks: implications for theory and measurement of performance. Psychol Bull 1986;99:181–198
34. Hanley JA. The robustness of the binormal assumption used in fitting ROC curves. Med Decis Making 1988;8:197–203
35. Goddard MJ, Hinberg I. Receiver operating characteristic (ROC) curves and non-normal data: an empirical study. Stat Med 1990;9:325–337
36. Zou KH, Tempany CM, Fielding JR, Silverman SG. Original smooth receiver operating characteristic curve estimation from continuous data: statistical methods for analyzing the predictive value of spiral CT of ureteral stones. Acad Radiol 1998;5:680–687
37. Pepe MS. A regression modeling framework for receiver operating characteristic curves in medical diagnostic testing. Biometrika 1997;84:595–608
38. Pepe MS. An interpretation for the ROC curve using GLM procedures. Biometrics 2000;56:352–359
39. Pepe MS. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press, 2003
40. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988;44:837–844
41. ROC analysis. Available at: www.bio.ri.ccf.org/Research/ROC/index.html. Accessed December 13, 2004
42. OBUMRM. Available at: www.bio.ri.ccf.org/OBUMRM/OBUMRM.html. Accessed December 13, 2004
43. The University of Iowa Department of Radiology: The Medical Image Perception Laboratory. MRMC 2.0. Available at: perception.radiology.uiowa.edu. Accessed December 13, 2004
44. Hillis SL, Berbaum KS. Power estimation for the Dorfman-Berbaum-Metz method. Acad Radiol (in press)
45. Obuchowski NA. Sample size tables for receiver operating characteristic studies. AJR 2000;175:603–608
46. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978;8:283–298
47. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–577
48. Somoza E, Mossman D. "Biological markers" and psychiatric diagnosis: risk-benefit balancing using ROC analysis. Biol Psychiatry 1991;29:811–826
49. Greenhouse SW, Mantel N. The evaluation of diagnostic tests. Biometrics 1950;6:399–412
50. Linnet K. Comparison of quantitative diagnostic tests: type I error, power, and sample size. Stat Med 1987;6:147–158
51. Stephan C, Wesseling S, Schink T, Jung K. Comparison of eight computer programs for receiver-operating characteristic analysis. Clin Chem 2003;49:433–439
52. Ma G, Hall WJ. Confidence bands for receiver operating characteristic curves. Med Decis Making 1993;13:191–197

APPENDIX 1. Area Under the Curve and Confidence Intervals with Binormal Model

Under the binormal assumption, the receiver operating characteristic (ROC) curve is the
collection of points given by
[1 – φ(c), 1 – φ(B × c – A)]
where c ranges from –∞ to +∞ and represents all the possible values of the underlying binormal
distribution, and φ is the cumulative normal distribution evaluated at c. For example, for a false-
positive rate of 0.10, φ(c) is set equal to 0.90; from tables of the cumulative normal distribution,
we have φ(1.28) = 0.90. Suppose A = 2.0 and B = 1.0; then the sensitivity = 1 – φ(− 0.72) = 1 –
0.2358 = 0.7642.
ROCKIT [32] gives a confidence interval (CI) for sensitivity at particular false-positive
rates (i.e., pointwise CIs). A CI for the entire ROC curve (i.e., simultaneous CI) is described
by Ma and Hall [52].
Under the binormal distribution assumption, the area under the smooth ROC curve (AUC)
is given by
AUC = φ[A / √ (1 + B2)].
For the example above, AUC = φ[2.0 / √ (2.0)] = φ[1.414] = 0.921.
The variance of the full area under the ROC curve is given as standard output in programs like
ROCKIT [32]. An estimator for the variance of the partial area under the curve (PAUC) was given
by McClish [5]; a Fortran program is available for estimating the PAUC and its variance [41].
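The appendix calculations are easy to check numerically. The following Python sketch (not part of the original article; the function names are my own) evaluates the binormal sensitivity and AUC formulas using only the standard library:

```python
from statistics import NormalDist

ND = NormalDist()  # standard normal distribution

def binormal_sensitivity(a, b, fpr):
    """Sensitivity at a given false-positive rate under the binormal model:
    FPR = 1 - Phi(c), so c = Phi^-1(1 - FPR); sensitivity = 1 - Phi(B*c - A)."""
    c = ND.inv_cdf(1.0 - fpr)
    return 1.0 - ND.cdf(b * c - a)

def binormal_auc(a, b):
    """Area under the smooth binormal ROC curve: AUC = Phi(A / sqrt(1 + B^2))."""
    return ND.cdf(a / (1.0 + b * b) ** 0.5)

# Worked example from the appendix: A = 2.0, B = 1.0, FPR = 0.10
print(round(binormal_sensitivity(2.0, 1.0, 0.10), 3))  # 0.764 (0.7642 in the text, which rounds c to 1.28)
print(round(binormal_auc(2.0, 1.0), 3))                # 0.921
```

The small difference from the appendix's 0.7642 arises only because the appendix rounds the cutoff c to 1.28 before evaluating the sensitivity.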

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005



Fundamentals of Clinical Research for Radiologists

Statistical Inference for Continuous Variables

Lawrence Joseph1,2
Caroline Reinhold3,4

Joseph L, Reinhold C
Received November 5, 2004; accepted after revision November 11, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 15th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1Department of Medicine, Division of Clinical Epidemiology, Montreal General Hospital, 1650 Cedar Ave., Montreal, QC H3G 1A4, Canada. Address correspondence to L. Joseph (Lawrence.Joseph@mcgill.ca).
2Department of Epidemiology and Biostatistics, 1020 Pine Ave. W, McGill University, Montreal, QC H3A 1A2, Canada.
3Department of Diagnostic Radiology, Montreal General Hospital, McGill University Health Centre, 1650 Cedar Ave., Montreal, QC H3G 1A4, Canada.
4Department of Oncology, Synarc, 575 Market St., San Francisco, CA 94105.

AJR 2005;184:1047–1056
0361–803X/05/1844–1047
© American Roentgen Ray Society

Consider the following statements from an abstract reporting results from a study of CT in large cell neuroendocrine carcinoma of the lung [1]:

    In the 38 patients, six central tumors and 32 peripheral tumors, with diameters ranging from 12 to 92 mm (mean ± SD, 32 ± 19 mm), were identified. None of the tumors had air bronchograms or calcification in the mass or nodule… On contrast-enhanced CT scans, inhomogeneously enhanced tumors appeared to be larger (51 ± 18 mm) than homogeneously enhanced tumors (25 ± 10 mm; p < 0.001).

Proper interpretation of the above results, and of similar reports from much of the modern clinical literature, depends in large part on the understanding of statistical terms. In this case, terms such as "SD" were used for descriptive purposes, and p values were given to support evidence of between-group differences in tumor size. In other reports, one may see terms such as "confidence intervals," "t tests," "type 1 and type 2 errors," and so on. Clearly, radiologists who wish to keep pace with new technologies must at least have a basic understanding of statistical language. This is true not only if they desire to plan and perform their own research, but also if they simply want to read the medical literature with a keen critical eye or to make informed decisions about which new treatments or diagnostic techniques they may wish to use to treat their own patients.

Descriptive terms such as "means," "medians," and "SDs" have been covered in a previous article in this series [2]. Before reading this article, reviewing the previous modules on descriptive statistics [2] and probability and sampling [3] may be a good idea. In this module, we introduce the basic notions of inferential statistics—that is, we discuss how to draw inferences about one or more populations' characteristics using data from samples from these populations. We focus on continuous variables, including inferences for means and simple nonparametric methods. Rather than simply providing a catalogue of which formulas to use in which situation, we explain the logic behind each technique. In this way, informed choices and decisions can be made on the basis of a deeper understanding of exactly what information each type of statistical inference provides.

Recall from the discussion in a previous module [3] that there are two main schools of statistical inference: the frequentist school and the Bayesian school. These are each based on a different definition of probability, the frequentist school based on a long-run frequency definition and the Bayesian school based on a more subjective view of probability. We discuss these paradigms for statistical inference.

In the Statistical Inferences for Means section, the classical or frequentist school of statistical inferences for means is covered, and in the Nonparametric Inference section, we present a brief introduction to nonparametric inferences. In these sections, we explain exactly what is meant by ubiquitous statistical statements such as "p < 0.05"—which may not mean what many medical journal readers believe it to mean—and examine confidence intervals as an attractive alternative to p values. The problem of choosing an appropriate sample size for a given experiment is discussed in the Sample Size Calculations section. Increasingly important Bayesian alternatives to the

AJR:184, April 2005 1047



classical statistical techniques are presented in the Bayesian Inference section.

Statistical Inferences for Means
In this section, we consider how to draw inferences about populations by statistically analyzing samples of data using standard frequentist methods. We first consider inferences for a single mean when the variance in the population is known. We also initially assume that the data follow a normal distribution, so we are estimating the mean of this normal distribution. Once the basic concepts are understood in this simple case, we indicate how to extend the same ideas to cases in which the variance is unknown or more than one mean is of interest and to cases in which the normal distribution is not assumed.

In addition to the two different schools of inference (i.e., frequentist or Bayesian), statistical inferences can be divided into procedures that test a hypothesis and those that estimate parameters. We begin with hypothesis testing procedures that lead to p values, and then compare the information they provide to that provided by parameter estimation via confidence intervals.

Standard Frequentist Hypothesis Testing
Suppose we wish to test the hypothesis that a new accelerated radiation schedule for patients with brain cancer leads to smaller mean tumor diameters compared with the standard schedule versus a null hypothesis that the tumor diameters are the same regardless of schedule. Suppose further that it is known that patients on the standard schedule have a tumor diameter of 3.5 cm, on average, after completing their radiation therapy. Although it is somewhat unrealistic to assume this perfect knowledge of past tumor diameters, this example approximates the situation in which a large case series (e.g., a historical control series) of tumor diameters is available, so that most uncertainty arises from the data from the new schedule. Formally, we can state the hypotheses as:

H0 (null hypothesis): µ = 3.5
HA (alternative hypothesis): µ < 3.5

where µ represents the unknown true average tumor diameter of the accelerated radiation schedule.

There are four possible results when considering hypothesis testing, depending on the true state of nature, which is typically unknown, and the statistical test result, which depends on the data collected. The four possibilities are shown in Table 1.

TABLE 1: Results of Hypothesis Testing

                          True State of Nature
Test                      HA          H0
Reject H0                 1 – β       α
Do not reject H0          β           1 – α

According to Table 1, if the accelerated schedule in fact leads to smaller tumor diameters than the standard and we reject the null hypothesis, then we have made a correct decision, as also happens if the null hypothesis is in fact correct and we do not reject it. On the other hand, if we reject the null hypothesis as false when it is in fact true, we make a so-called type 1 error, which occurs with probability α, and if we fail to reject the null hypothesis when it is in fact false, we make a type 2 error, which occurs with probability β. The power of a study is defined as the probability of rejecting the null hypothesis when the alternative hypothesis is in fact true, so that the power is equal to 1 – β. To summarize, we have equations 1–4:

Pr{reject H0 | H0 true} = α    (1)

Pr{do not reject H0 | H0 true} = 1 – α    (2)

Pr{reject H0 | HA true} = 1 – β    (3)

Pr{do not reject H0 | HA true} = β    (4)

Recall from a previous module in this series [3] that probabilities written in the form of Pr{A | B} are called "conditional probabilities," and the notation is read as the probability that the event A occurs, given that the event B is known to have occurred. Thus, all of the quantities are conditional on knowing whether the null or alternative hypotheses are in fact true. Of course, we generally do not know whether the null hypothesis is true or not, so these conditional statements are at best of indirect interest. Once we obtain our data, we would ideally like to know the probability that the null hypothesis is true—not assume the null hypothesis is true. We will discuss this point further in the Bayesian Inference section.

Although it is important to understand the types of errors that can be made when hypothesis testing, the results of a hypothesis test are usually reported as a p value, which we now define: The p value is the probability of obtaining a result as extreme as or more extreme than that observed assuming that the null hypothesis is in fact true.

It is important to note that the p value is not the probability that the null hypothesis is true after having seen the data, even though many clinicians often falsely interpret it this way. The p value does not directly or indirectly provide this probability and in fact can be orders of magnitude different from it. In other words, it is possible to have a p value equal to 0.05 when the probability of the null hypothesis is 0.5, different from the p value by a factor of 10 (see the Bayesian Inference section for how to calculate a more easily interpreted hypothesis test from a Bayesian viewpoint).

Given the definition of a p value, how would we calculate it? Suppose that we perform tumor measurements on 10 patients under the accelerated schedule and that these tumors have a mean diameter of x̄ = 3.0 cm, with a known SD of σ = 1.5 cm. The definition implies that we need to calculate the probability of obtaining mean tumor diameters of 3.0 cm or less (i.e., as extreme as or more extreme than what was observed), given that the true mean tumor diameter under the standard treatment schedule is exactly 3.5 cm (i.e., given the null hypothesis is true). Now, recall from a previous article in this series [3] that the probability density of our mean, x̄, is usually considered as normal. Because for purposes of calculating a p value the null hypothesis is considered as exactly correct, the mean of our normal distribution is assumed to be 3.5 cm. The SD of our mean (known as the SE) is given as the SD in the population (assumed here to be σ = 1.5 cm) divided by the square root of the sample size [3]. Thus here, our SE is given by 1.5 / √10 = 0.474.

Therefore, we calculate equations 5 and 6:

p = Pr{x̄ ≤ 3.0 | µ = 3.5}    (5)

  = Pr{Z ≤ (3.0 – 3.5) / 0.474}    (6)

This probability can be calculated from tables of the normal distribution, as explained in Joseph and Reinhold [3]. Normalizing, we find Z = [(3.0 – 3.5) / 0.474] = –1.05, and looking up –1.05 on standard normal tables, we find p = 0.147. Thus, there is about a 14.7% chance of obtaining results as extreme as or more extreme than the 3.0 cm observed, if the true mean tumor diameter for the new schedule is exactly 3.5 cm. Therefore, the observed result
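The single-sample p value calculation above can be reproduced in a few lines of Python (a sketch using only the standard library; not part of the original article):

```python
from math import sqrt
from statistics import NormalDist

# One-sided one-sample z test: H0: mu = 3.5 vs HA: mu < 3.5
xbar, mu0, sigma, n = 3.0, 3.5, 1.5, 10
se = sigma / sqrt(n)        # standard error of the mean, about 0.474
z = (xbar - mu0) / se       # about -1.05
p = NormalDist().cdf(z)     # one-sided (lower-tail) p value
print(round(se, 3), round(z, 2), round(p, 3))  # 0.474 -1.05 0.146
```

The result, 0.146, matches the article's 0.147 to rounding (the article rounds z to –1.05 before consulting the normal table).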




is not unusual (i.e., it is compatible with the null hypothesis), so we cannot reject H0.

Notice that if we had observed the same mean tumor diameter but with a larger sample size of 100, say, the p value would have been 0.0004. With a sample size of 100, the event of the observed data or data more extreme occurring would be a rare event if the null hypothesis were true, so the null hypothesis could be rejected. Therefore, p values depend not only on the observed mean tumor diameter, but also on the sample size.

The test described earlier was one-sided—that is, we a priori believed (perhaps from preliminary data or theoretic considerations) that the accelerated schedule would lead to equal or better results and not larger tumor sizes. To generalize, to perform a one-sided test of the null hypothesis that a single mean µ has value µ0, calculate the statistic in equation 7:

Z = (x̄ – µ0) / (σ / √n)    (7)

and determine the p value from normal distribution tables as in equation 8:

p = Pr{N(0, 1) ≤ Z}    (8)

On the other hand, often we may not want to specify the direction ahead of time. In this case, the alternative hypothesis is two-sided (i.e., the alternative hypothesis is HA: µ ≠ µ0 rather than the one-sided HA: µ < µ0), and one performs the calculation in equation 9:

Z = | x̄ – µ0 | / (σ / √n)    (9)

where | a | indicates the absolute value of a, and one determines the p value from normal distribution tables as in equation 10:

p = 2 × Pr{N(0, 1) ≥ Z}    (10)

In the one-sided case, we reject the null hypothesis only if we observe an extreme result in the direction specified by the alternative hypothesis. In the two-sided case, we reject if we observe an extreme result in either direction (larger or smaller tumor sizes). This results in a doubling of the p value, so for a two-sided alternative hypothesis (HA: µ ≠ 3.5 in this case), we find p = 2 × 0.147 = 0.294. The doubling results from adding the areas under the normal curve both below –1.05 (as in the one-sided case) and above 1.05.

Similar methods are available for tests involving comparisons between two means. For example, to test the null hypothesis that means in two different groups are equal to each other versus a two-sided alternative hypothesis, calculate as in equation 11:

Z = (x̄1 – x̄2) / √(σ1²/n1 + σ2²/n2)    (11)

For example, suppose we wish to again look at the difference in mean tumor diameter between two groups of patients with brain cancer, but this time in a clinical trial setting, with subjects randomized into accelerated and standard schedule groups (this would, of course, be a better design because concurrent groups are compared, minimizing potential confounding). Suppose we observe a mean tumor diameter of x̄1 = 3.0 cm (σ1 = 1.5 cm) in 200 subjects under the new schedule, and a mean tumor diameter of x̄2 = 3.7 cm (σ2 = 1.4 cm) in 200 subjects under the standard schedule. Plugging into the above formula, we get equation 12:

Z = (3.0 – 3.7) / √(1.5²/200 + 1.4²/200) = –4.82    (12)

Looking up 4.82 on normal tables gives a p value of 2 × (0.0000007) = 0.0000014. Because this indicates a very rare event under H0, we can reject the null hypothesis that the two means are equal.

These formulas can be extended in a variety of directions, which we describe in the subsequent sections.

Paired versus unpaired tests.—In comparing the two mean tumor diameters, we have assumed that the design of this study was unpaired, meaning that the data were composed of two independent samples, one from each treatment group. In some experiments, for example, if one wishes to compare quality of life before and after any medical procedure is performed, a paired design is appropriate because the patient is being compared with him- or herself—that is, the patient serves as his or her own control. Here, one would subtract the value measured on an appropriate quality-of-life scale before the procedure from that measured on the same scale after the procedure to create a single set of before-to-after differences. Once this subtraction has been done for each patient, one in fact has reduced the two measures on each patient (i.e., before and after) to a single set of numbers representing the differences. Therefore, paired data can be analyzed using the same formulas as used for single-sample analyses. Paired designs are often more efficient than unpaired designs, as between-group variability is reduced by the pairing.

Assumptions behind the Z tests.—For ease of exposition, we have presented all of the test formulas using percentiles that came from the normal distribution, but in practice there are two assumptions behind this use of the normal distribution. The first assumption is that the data arise either from a normal distribution or the sample size is large enough for the central limit theorem [3] to apply. The second assumption is that the variance or variances involved in the calculations are known exactly.

The first of these assumptions is often satisfied at least approximately in practice, but the second assumption almost never holds in real applications. We usually have to use estimates s², s1², and s2² in the above formulas rather than the exact values σ², σ1², and σ2², respectively, because the variances would usually be estimated from the data rather than being known exactly. To account for the extra uncertainty due to the fact that the variance is estimated rather than known, the distribution of our test statistic changes. We thus use t distribution tables rather than normal distribution tables. In calculations, this means that the z values used in all of the formulas need to be switched to the corresponding values from t tables. This requires knowledge of the degrees of freedom (df), which for single-mean problems is simply the sample size minus 1. This of course applies to paired designs as well, because they reduce to single-sample problems. For two sample unpaired problems, a conservative number for the df is the minimum of the two sample sizes minus 1 (n – 1, where n is the sample size) [4]. These tests are called t tests.

Equal or unequal variances.—The tests described earlier assume that the variances in the two groups are unequal. Slightly more efficient formulas can be derived if the variances are the same, as a single pooled estimate of the variance can be derived from combining the information in both samples together. We do not discuss pooled variances further here, in part because in practice the difference in analyses done with pooled or unpooled variances is usually quite small and
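The two-sample calculation in equations 11 and 12 can likewise be sketched in Python (standard library only; the variable names are my own, not from the article):

```python
from math import sqrt
from statistics import NormalDist

# Two-sided two-sample z test for the randomized trial example
x1, s1, n1 = 3.0, 1.5, 200   # accelerated schedule
x2, s2, n2 = 3.7, 1.4, 200   # standard schedule

se_diff = sqrt(s1**2 / n1 + s2**2 / n2)   # SE of the difference in means
z = (x1 - x2) / se_diff                   # about -4.82
p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p value, about 1.4e-06
print(round(z, 2), p)
```

Swapping the 1.96-style normal percentiles for t percentiles, as the text describes, would turn this into the t test appropriate when the variances are estimated from the data.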




TABLE 2: Tests and Confidence Intervals Required for One- and Two-Sample Problems for Continuous Variables

Note.—In all cases, the data are assumed to be normally distributed or the sample size large enough for the central limit theorem to apply. The data are assumed to be represented by xi, i = 1,…, n for a single-sample problem or by xi, i = 1,…, n1 and yi, i = 1,…, n2 for a two-sample problem. Sample sizes are n for a single-sample problem and n1 and n2 for the two-sample problem. The z indicates a normal table is used; t indicates a t table is required. When a t table is required, the degrees of freedom are equal to n – 1 for a single-sample problem, while the degrees of freedom are n1 + n2 – 2 for a two-sample problem with equal variances, and min(n1 – 1, n2 – 1) for unequal variances (conservative value). The x0 and y0 indicate null values under the null hypothesis (usually but not always equal to zero). For paired two-sample problems, form the within-individual differences, and use the formulas for the one-sample case. N/A = not applicable.

in part because it is rarely appropriate to pool the variances, because the variability is usually not exactly the same in both groups.

Analysis of variance: more than two means.—We have seen tests for one or two means, but sometimes one wishes to test the equality of three or more means. Although this topic is not covered here, readers should be aware that analysis of variance is a technique that extends our two-sample procedure to three or more means. See, for example, Armitage and Berry [5] or Rosner [6] for details.

Table 2 provides the test statistics used for all possible cases with one or two means, as discussed earlier.

How Useful Are p Values for Medical Decision Making?
Although p values are still often found in the literature, there are several major problems associated with their use. First, as mentioned earlier, they are often misinterpreted as the probability of the null hypothesis given the data, when in fact they are calculated assuming the null hypothesis to be true. Second, clinicians often use them to dichotomize results into important or unimportant depending on whether p < 0.05 or p > 0.05, respectively. However, there is not much difference between p values of 0.049 and 0.051, so the cutoff of 0.05 is arbitrary. Third, p values concentrate attention away from the magnitude of treatment differences. For example, one could have a p value that is very small but is associated with a clinically unimportant difference. This is especially prone to occur in cases in which the sample size is large. Conversely, results of potentially great clinical interest are not necessarily ruled out if p > 0.05, especially in studies with small sample sizes. Therefore, one should not confuse statistical significance (i.e., p < 0.05) with practical or clinical importance. Fourth, the null hypothesis is almost never exactly true. In the example described, does one seriously think that the mean tumor diameter of the patients on the standard treatment schedule could be exactly 3.5 cm (rather than, say, 3.50001 cm)? Because one knows the null hypothesis is almost surely false to begin with, it makes little sense to test it. Instead, one should concern oneself with the question, By how much are the two treatments different?

There are so many problems associated with p values that most statisticians now recommend against their use, in favor of confidence intervals or Bayesian methods. In fact, some prominent journals have virtually banished p values from publication [7], others strongly discourage their use [8], and many others have published articles and editorials encouraging the use of Bayesian methodology [9, 10]. We cover these more informative techniques for drawing statistical inferences, starting with confidence intervals.

Frequentist Confidence Intervals
Although the p value provides some information concerning the rarity of events as extreme as or more extreme than that observed assuming the null hypothesis to be exactly true, it provides no information about what the true parameter values might be. In the two-mean example described earlier, we observed a tumor diameter difference of 0.7 cm, which was shown to be "statistically significant," with a p value of approximately 0.000001. Although we observed a difference of 0.7 cm, we know that our data are from a random sample of patients to whom this procedure could be applied, so the true mean difference could in fact be higher or lower than our observed difference. How likely is it that the true mean difference in tumor diameter is clinically important?

One way to answer this question is with a confidence interval. The formula in equation 13 provides 95% confidence interval limits for means (the value 1.96 could be changed to other values if intervals with coverage other than 95% are of interest) [3]:

(x̄ – 1.96 × σ/√n, x̄ + 1.96 × σ/√n)    (13)

where x̄ is the sample mean and σ is the known SD from a sample of size n. As before, if σ is not known, it is replaced by its estimate from the data, s, and 1.96 is increased somewhat, as a percentile from the t distribution replaces the normal percentile.

1050 AJR:184, April 2005



Applying this formula to the single-mean example we first discussed, where x̄ = 3.0, n = 10, and σ = 1.5, we obtain a 95% confidence interval of (2.1–3.9 cm). We cannot conclude very much from this interval because we have not ruled out mean tumor diameters as small as 2.1 cm, which is clinically superior to the 3.5 cm from the old schedule; however, on the other hand, diameters as large as 3.9 cm have also not been ruled out, which is even worse than the tumor diameter in the standard group. Thus, further data would need to be collected before any conclusions could be drawn about this new schedule.

Our two-group clinical trial example had larger sample sizes, so it will presumably provide a more accurate estimate. We can calculate a 95% confidence interval for the difference in means for the two groups using the formula in equation 14,

(x̄1 − x̄2) ± 1.96 × √(σ1²/n1 + σ2²/n2)    (14)

where the same comment regarding unknown variances again applies. Plugging in the values we obtained from our clinical trial example given earlier, we find a confidence interval of −0.46 to −0.94 cm. Thus, roughly speaking, it is likely that the true tumor diameter difference between our two schedules is between approximately 0.5 cm less under the new schedule (−0.46 cm) and up to almost a 1-cm reduction (−0.94 cm). Although our p value for this same data set was small, which enabled us to reject the null hypothesis, we can see that the confidence interval provides more clinically useful information about the magnitude of the difference. We can also see that, in contrast to what may be believed after seeing the p value, we are still uncertain about the clinical utility of the new schedule, because values near the lower limit of the confidence interval would not be interesting clinically—it would represent less than a 30% change from the mean baseline tumor size—while differences near 1 cm may be clinically interesting. Therefore, our conclusions from the confidence interval are more detailed than those from the p value. This is true in general, as we now discuss.

Interpreting Confidence Intervals

Confidence intervals are derived from procedures that are set up to “work” 95% of the time (if a 95% confidence interval is used). The two confidence interval equations discussed earlier provide procedures that, when used repeatedly across different problems, will capture the true value of the mean (or difference in means) 95% of the time and fail to capture the true value 5% of the time. In this sense, we have confidence that the procedure works well in the long run, although in any single application, of course, the interval either does or does not contain the true mean.

Note that we are careful not to say that our confidence interval has a 95% probability of containing the true parameter value. For example, we did not say that the true difference in mean tumor diameter is in the interval −0.46 to −0.94 cm with 95% probability. This is because the confidence limits and the true mean tumor diameters are both fixed numbers, and it makes no more sense to say that the true mean is in this interval than it does to say that the number 2 is inside the interval (1, 6) with probability 95%. Of course, 2 is inside this interval, just like the number 8 is outside of the interval (1, 6). However, the procedure used to calculate confidence intervals provides random upper and lower limits that depend on the data collected; in repeated uses of this formula across a range of problems, we expect the random limits to capture the true value 95% of the time and exclude the true mean 5% of the time. Refer to Figure 1. If we look at the set of confidence intervals as a whole, we see that about 95% of them include the true parameter value. However, if we pick out a single trial, it either contains the true value (≈ 95% of the time) or excludes this value (≈ 5% of the time).

Fig. 1.—Drawing shows series of 95% confidence intervals for unknown parameter.

Despite their somewhat unnatural interpretation, confidence intervals are generally preferred to p values. This is because confidence intervals focus attention on the range of values compatible with the data on a scale of direct clinical interest. Given a confidence interval, one can assess the clinical meaningfulness of the result, as can be seen in Figure 2.

Depending on where the upper and lower confidence interval limits fall in relation to the upper and lower limits of the region of clinical equivalence, different conclusions should be drawn. The region of clinical equivalence, sometimes called the region of clinical indifference, is the region inside of which two treatments, say, would be considered to be the same for all practical purposes. The point 0, indicating no difference in results between the two treatments, is usually included in the region of clinical equivalence, but values above and below 0 are usually also included. How wide this region is depends on each individual clinical situation. For example, if one treatment schedule is much more expensive than another, one may want at least a 50% reduction in tumor diameter to consider it the preferred treatment.

There are five different conclusions that can be made after a confidence interval has been calculated, as illustrated by the five hypothetic intervals displayed in Figure 2. The first conclusion (interval 1) is that the confidence interval includes zero and that both upper and lower confidence interval limits, if they were the true values, would not be clinically interesting. Therefore, this variable has been shown to have no important effect.
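The long-run behavior described here can be checked with a short simulation (an illustration, not part of the article): repeatedly sample n = 10 observations from a population whose true mean is known, build the 95% interval each time, and count how often the true mean is captured.

```python
import math
import random

random.seed(1)
TRUE_MEAN, SIGMA, N, TRIALS = 3.0, 1.5, 10, 2000
half_width = 1.96 * SIGMA / math.sqrt(N)

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    xbar = sum(sample) / N
    # in any single trial the interval either does or does not contain the true mean
    if xbar - half_width <= TRUE_MEAN <= xbar + half_width:
        covered += 1

print(covered / TRIALS)  # close to 0.95 over many repetitions
```

About 95% of the simulated intervals cover the true mean, mirroring the set of intervals drawn in Figure 1.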



Joseph and Reinhold

Fig. 2.—Drawing shows how to interpret confidence intervals. Depending on where confidence interval lies in relation to region of clinical equivalence, different conclusions can be drawn.

The second conclusion (interval 2) is that the confidence interval includes zero but that one or both of the upper or lower confidence interval limits, if they were the true values, would be interesting clinically. Therefore, the results of this variable in this study are inconclusive, and further evidence needs to be collected.

The third conclusion (interval 3) is that the confidence interval does not include zero and that all values inside the upper and lower confidence interval limits, if they were the true values, would be clinically interesting. Therefore, this study shows this variable to be important.

The fourth conclusion (interval 4) is that the confidence interval does not include zero but that all values inside the upper and lower confidence interval limits, if they were the true values, would not be clinically interesting. Therefore, this study shows this variable, although having some small effect, is not clinically important.

The fifth conclusion (interval 5) is that the confidence interval does not include zero but that only some of the values inside the upper and lower confidence interval limits, if they were the true values, would be clinically interesting. Therefore, this study shows this variable has at least a small effect and may be clinically important. Further study is required to better estimate the magnitude of this effect.

Revisiting the two confidence intervals discussed earlier in light of Figure 2, we see that the interval based on our single-sample experiment, which ranged from 2.1 to 3.9 cm, is clearly of type 2 and the interval based on the two-group clinical trial is of type 5. Once again, note that these confidence intervals provide much more detailed conclusions than the information contained in a p value.

The p values group together intervals 1 and 2 as “nonsignificant” and intervals 3, 4, and 5 as “significant.” This can lead to misleading conclusions from a clinical viewpoint. For example, similar clinical conclusions should be drawn from intervals 1 and 4, even though one is “significant” and the other is not. It should now be clear why many journals discourage reporting results in terms of p values and encourage confidence intervals.

Summary of Frequentist Statistical Inference

The main tools for statistical inference from the frequentist point of view are p values and confidence intervals. The p values have fallen out of favor among statisticians, and although they continue to appear in medical journal articles, their use is likely to greatly diminish in the coming years. Confidence intervals provide more clinically useful information than p values, so confidence intervals are to be preferred in practice. Confidence intervals still do not allow the formal incorporation of preexisting knowledge into any final conclusions. For example, in some cases there may be compelling medical reasons why a new technique may be better than a standard technique, so if faced with an inconclusive confidence interval, a radiologist may still wish to switch to the new technique, at least until more data become available. On what basis could this decision be justified? We return to this question in the Bayesian Inference section, which appears later in this article.

Nonparametric Inference

Thus far, statistical inferences on populations have been made by assuming a mathematic model for the population (e.g., a normal distribution) and estimating parameters from that distribution based on a sample. Once the parameters have been estimated—for example, the mean or variance for a normal distribution—the distribution is fully specified. This is known as parametric inference. Sometimes we may be unwilling to specify the general shape of the distribution in advance and prefer to base the inference only on the data, without a parametric model. In this case, we have distribution-free or nonparametric methods.

For example, consider the following data, which represent the tumor diameters of the marker liver metastases for two different chemotherapy regimens in patients with colorectal carcinoma: conventional treatment, 21, 12, 11, 28, 3, 10, 9, 5, 7, 10, 6; new treatment, 4, 3, 4, 5, 20, 22, 5, 12, 15, 5, 1, 14, 13.

Because we are making nonparametric inferences, we no longer refer to tests of similarity of group means. Rather, the null and alternative hypotheses here are defined as follows: For the null hypothesis (H0), there is no treatment effect—that is, conventional treatment tends to give rise to tumor sizes similar to those from the new treatment. For the alternative hypothesis (HA), the new treatment tends to give rise to different values for tumor sizes compared with those from the conventional treatment group.

The first step to nonparametrically test these hypotheses is to order and rank the data from lowest to highest values, keeping track of which data points belong to each treatment group, as shown in Table 3.

Thus, in ranking the data, we simply sort the data from the smallest to the largest value regardless of group membership and assign a rank to each data point depending on where its value lies in relation to other values in the data set. Hence, the lowest value receives a rank of 1, the second lowest a rank of 2, and so on. Because there are many “ties” in this data set, we need to rank the data accounting for the ties, which we do by grouping all tied values together and distributing the sum of the available ranks evenly among the tied values. For example, the second and third lowest values in this data set are both 3, and there is a total of five ranks (2 + 3) to be divided among them. Hence, each of these values receives a rank of 2.5 (5 / 2). Similarly, the sixth through ninth values are all tied at 5. There are 30 total ranks (6 + 7 + 8 + 9) to divide up among four tied values, so each receives a value of 7.5 (30 / 4), and so on.

The next step is to sum the ranks for the values belonging to the conventional treatment group, which yields a total of 147.5 (2.5 + 7.5 + 10 + 11 + 12 + 13.5 + 13.5 + 15 + 16.5 + 22 + 24).

We now reason as follows: There is a total of 300 ranks (1 + 2 + 3 + … + 23 + 24) that can be distributed among the conventional and new treatment groups. If the sample sizes were equal, therefore, and if the null hypothesis were exactly true, we would expect that these ranks should divide equally among the two groups, so each would have a sum of ranks of 150. Now, the sample sizes are not




TABLE 3 First Step to Nonparametrically Test Null and Alternative Hypotheses: Order and Rank the Data
Treatment Group N N C N N N N N C C C C C C C C N N N N N C N C
Data 1 3 3 4 4 5 5 5 5 6 7 9 10 10 11 12 12 13 14 15 20 21 22 28
Ranks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Ranks with ties 1 2.5 2.5 4.5 4.5 7.5 7.5 7.5 7.5 10 11 12 13.5 13.5 15 16.5 16.5 18 19 20 21 22 23 24
Note.—N = new treatment, C = conventional treatment.

quite equal, so here we expect 300 × (11 / 24) = 137.5 of the ranks to go to the conventional group, and 300 × (13 / 24) = 162.5 of the ranks to go to the new treatment group. Note that 137.5 + 162.5 = 300, which is the total sum of ranks available. We have in fact observed a sum of ranks of 147.5 in the conventional group, which is higher than expected. Is it high enough that we can reject the null hypothesis?

To answer this question, we must refer to computer programs that will calculate the probability of obtaining a sum of ranks of 147.5 or greater given that the null hypothesis of no treatment difference is true (remember the definition of the p value discussed earlier). Most statistical computer packages will perform this calculation, which in this case gives p = 0.58. Hence, the null hypothesis cannot be rejected, because our result and those more extreme are not rare under the null hypothesis.

This nonparametric test is called the Wilcoxon’s rank sum test. An exactly equivalent test can be based on counts rather than ranks, and it is called the Mann-Whitney test. The Mann-Whitney test always provides the same p value as the Wilcoxon’s rank sum test, so either can be used. The analogous parametric test, the unpaired t test for the same data, also gives a p value of 0.58, so the same conclusion is reached.

Because the two tests do not always provide the same conclusions, which of these tests is to be preferred? The answer is situation-specific. Remember that the t test assumes either that the data are from a normal distribution—here, it would imply that the tumor diameters are approximately normally distributed—or that the sample size is large. A histogram would show that the data are skewed toward the right, so that normality is unlikely, and the sample sizes are 11 and 13, hardly large. Hence, in this example the nonparametric test is preferred because the assumptions behind the t test do not seem to hold. In general, if the assumptions required by a parametric test may not hold, a nonparametric test is to be preferred, whereas if the distributional assumptions do likely hold, a parametric test provides slightly increased power compared with a nonparametric test.

The Wilcoxon’s rank sum test is appropriate for unpaired designs. A similar test exists for paired designs, called the Wilcoxon’s signed rank test. Nonparametric confidence intervals are also available, as are tests for two or more groups, such as the Kruskal-Wallis test. See Sprent [11] for further details about these methods.

Sample Size Calculations

As previously discussed, there has been a strong trend away from hypothesis testing and p values toward the use of confidence intervals in the reporting of results from biomedical research. Because the design phase of a study should be in sync with the analysis that will eventually be performed, sample size calculations should be performed on the basis of ensuring adequate numbers for accurate estimation of important quantities that will be estimated in the study, rather than by power calculations. This distinction is important because it has been shown [12] that sample sizes calculated from a power viewpoint are often insufficient when viewed from a confidence interval viewpoint. In other words, although high power ensures rejection of the null hypothesis with high probability, it does not ensure that the confidence interval will be narrow enough to allow good clinical decision making. Therefore, in this section, we focus on sample size methods based on confidence interval width. For similar methods based on power, see the book by Lemeshow et al. [13].

The question of how accurate is “accurate enough” can be addressed by carefully considering the results you would expect to get (a bit of a catch-22 situation, because if you knew the results you will get, there would be no need to perform the experiment) and making sure your interval will be small enough to land in intervals numbered 1, 3, or 4 of Figure 2. The determination of an appropriate width is a nontrivial exercise, but a reasonable target confidence interval width can usually be found.

For estimating the sample size requirements in experiments involving population means, two different formulas are available, depending on whether there is a single sample or two samples. These are derived by solving for the sample size n in the formulas for the confidence intervals discussed.

Single Sample

Let µ be the mean that is to be estimated, and assume that we wish to estimate µ to an accuracy of a total confidence interval width of w (so that the confidence interval will be x̄ ± d, where 2 × d = w). Let σ be the SD in the population. Then the required sample size, n, is given by equation 15,

n = (2 × z × σ / w)²    (15)

where, as usual, z is replaced by the appropriate normal distribution quantile (z = 1.96, 1.64, or 2.58 for 95%, 90%, or 99% intervals, respectively).

For example, suppose that we would like to estimate average tumor size to an accuracy of d = 2 mm with a 95% confidence interval and that we expect the patient-to-patient variability will be σ = 10 mm. Then, from the previous formula, we need to perform the calculation in equation 16,

n = (2 × 1.96 × 10 / 4)² = 96.04    (16)

rounding up to the next highest integer. The most difficult problem in using this equation is to decide on a value for the SD σ, because it is usually unknown. A conservative approach would be to use the maximum value of σ that seems reasonably likely to occur in the experiment.

Two Samples

Let µ1 and µ2 be the means of two populations, and suppose that we would like an accurate estimate of µ1 − µ2. Again assume a total confidence interval width of w (so that again 2 × d = w). Let σ1 and σ2 be the SD in each population, respectively.
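As a check on the worked calculations in this section (an illustrative sketch, not part of the article), the midranks of Table 3, the conventional-group rank sum of 147.5, and the single-sample size from equation 15 can all be reproduced in a few lines:

```python
import math
from collections import Counter

conventional = [21, 12, 11, 28, 3, 10, 9, 5, 7, 10, 6]
new = [4, 3, 4, 5, 20, 22, 5, 12, 15, 5, 1, 14, 13]

def midranks(values):
    """Rank values from lowest to highest, sharing the average rank among ties (as in Table 3)."""
    ordered = sorted(values)
    first = {}
    for position, v in enumerate(ordered, start=1):
        first.setdefault(v, position)  # rank of the first occurrence of each value
    counts = Counter(ordered)
    # midrank = average of the tied positions first, first + 1, ..., first + count - 1
    return {v: first[v] + (counts[v] - 1) / 2 for v in counts}

rank_of = midranks(conventional + new)
w = sum(rank_of[v] for v in conventional)  # Wilcoxon rank sum for the conventional group
print(w)  # 147.5, as computed in the text

# Single-sample size (equation 15): accuracy d = 2 mm (so w = 4 mm), sigma = 10 mm, 95% interval
n = math.ceil((2 * 1.96 * 10 / 4) ** 2)
print(n)  # 97, after rounding 96.04 up to the next integer
```

The midranks for all 24 observations sum to 300, matching the total used in the text's reasoning.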




Then the sample size is given in equation 17,

n = (2 × z / w)² × (σ1² + σ2²)    (17)

where now n represents the required sample size for each group. As usual, z is chosen as we did earlier and is usually 1.96, corresponding to a 95% confidence interval.

Bayesian Inference

Consider again the single-sample tumor diameter problem introduced in the Statistical Inferences for Means section. Recall that in this example patients undergoing the standard radiation therapy schedule are assumed to have a mean of 3.5 cm, whereas the data collected so far for the new accelerated schedule indicate a mean of 3.0 cm, but are based on only 10 subjects. The frequentist confidence interval was wide, ranging from approximately 2.1 to 3.9 cm, so it has not been particularly helpful in making a decision about which technique to use for the next patient. At this point, with the data being relatively uninformative, the treating physician may decide to be conservative and remain with the standard schedule until more information becomes available about the new schedule or may go with their “gut feeling” as to the likelihood that the new schedule is truly better or not. If there have been data from animal experiments or strong theoretic reasons why the new schedule may be better, there may be temptation to try the new one. Can anything be done to aid in this decision-making process?

Bayesian analysis has several advantages over the standard or frequentist statistical analyses discussed in this article so far, including the ability to formally incorporate relevant information not directly contained in the current data set into any statistical analysis. We will see how this can help with the problem discussed earlier, but first we will cover some basics of Bayesian analysis.

Let us generically denote our parameter of interest by θ. Hence, θ can be a binomial parameter, the mean from a normal distribution, an odds ratio, a set of regression coefficients, and so on. Note in particular that θ can be two or more dimensional. The parameter of interest is sometimes usefully thought of as the “true state of nature.”

The three basic elements of any Bayesian analysis are, first, the prior probability distribution, f(θ). This prior distribution summarizes what is known about θ before the experiment is performed. It is based on a “subjective” assessment of the available past information, so may vary from investigator to investigator.

The second basic element of Bayesian analysis is the likelihood function: f(x | θ). The likelihood function summarizes the information contained in the data, x. For instance, it may be created from a normal distribution for a mean. It is important to realize that Bayesians and frequentists can use the same likelihood function because both need to calculate the probability of data given various values for the parameter θ. The way the likelihood function is used, however, differs between the two paradigms.

The third basic element is the posterior distribution: f(θ | x). The posterior distribution summarizes the information in the data, x, together with the information in the prior distribution. Thus, it summarizes what is known about the parameter of interest θ after the data are collected.

Bayes’ theorem, posthumously published by Thomas Bayes [14] in 1763, relates the three quantities: posterior distribution = [likelihood of the data × prior distribution] / a normalizing constant, or using our notation above in equation 18,

f(θ | x) = [f(x | θ) × f(θ)] / f(x)    (18)

or, omitting the normalizing constant in equation 19,

f(θ | x) ∝ f(x | θ) × f(θ)    (19)

where ∝ indicates “is proportional to.” Thus, we update the prior distribution to a posterior distribution after seeing the data via Bayes’ theorem. The current posterior distribution can be used as a prior distribution for the next study, so Bayesian inference provides a natural way to represent the learning that occurs as science progresses.

Radiologists are already familiar with the Bayesian way of thinking, using it every day in the context of interpreting diagnostic tests. The prior probability used in Bayes’ theorem is analogous to the background rate of a condition in the population, which is updated to a positive or negative predictive value (analogous to a posterior distribution) after seeing the results of a diagnostic test (analogous to seeing the data). It is thus just a short step from using predictive values in a clinical setting to using Bayes’ theorem in a research setting.

The most contentious element in Bayesian analysis is the need to specify a prior distribution. Because there is no unique way to derive prior distributions, they are necessarily subjective, in the sense that one radiologist may derive a different prior distribution than another and, hence, arrive at a different posterior distribution. Several points can be made regarding this controversy.

First, Bayesians can use diffuse, flat, or reference prior distributions that, for all practical purposes, consider all values in the feasible range as equally likely. Hence, if little prior information exists or if a Bayesian wishes to see what information the data themselves provide, this choice of prior distribution can be used. In fact, in many situations, a Bayesian analysis using reference priors will result in similar interval estimates as those provided by frequentist confidence intervals, but with a more natural interpretation: Unlike confidence intervals, Bayesian intervals (often called credible intervals) can be directly interpreted as containing the true parameter value with the indicated probability. Thus, no references to long runs of other trials are necessary to properly interpret a credible interval.

Second, although many frequentists have been quick to criticize Bayesian analysis because of the difficulty in deriving prior distributions, frequentist analysis formally ignores this information, which can hardly be considered as a better solution.

Third, if different clinicians have a range of prior opinions and hence a range of prior distributions, there will also be a range of posterior distributions. Presenting several Bayesian analyses matching this range of prior opinions helps to raise the level of debate after the publication of results in medical journals, because it accurately reflects the range of clinical opinion that exists in the community. Furthermore, it can be shown that as more data accumulate, the posterior distributions from different priors tend to converge toward a single distribution, accurately mirroring the process of eventual consensus among clinicians as data accumulate. When viewed in this light, prior distributions can be seen as a great advantage. See Spiegelhalter et al. [15] or a more introductory level article [9] for more information on using a range of prior distributions when carrying out a Bayesian analysis.

Having discussed the basic elements, let us see how Bayesian analysis works in practice by again considering our example of tumor diameters after radiation for brain cancer. We will discuss the three elements that lead to the posterior




distribution calculated from Bayes’ theorem, which are listed in the previous section.

Fig. 3.—Graph shows two prior and corresponding posterior densities for tumor diameter example.

Recall that in our data set we had x̄ = 3.0, σ = 1.5, and n = 10, so that our likelihood function is a normal distribution with mean 3.0 and SE of 0.474, the same as was used in the frequentist inferences discussed previously. In general, the choice of prior distribution is based on any information that is available at the time of the experiment. We will consider two different prior distributions. The first (prior distribution 1 in Fig. 3) will be a normal distribution with a mean of 3.5 cm and a very large variance, say, 10,000. This is a noninformative prior, because all values in the likely range have an approximately equal chance of being the true value, the curve being quite flat over a wide range. Note that an equal 50% chance is given to both the null and alternative hypotheses that the new schedule is superior to that of the old, because the distribution is centered at 3.5 cm. The second prior distribution (prior distribution 2 in Fig. 3) will be centered at 3.0, with an SD of 0.5 (variance of 0.25). This would represent the opinion of a radiologist who is enthusiastic about the new schedule, with a prior opinion that the new mean tumor diameter will be between about 2.0 and 4.0 cm, with 95% probability (as calculated from the range of the normal [µ = 3.0, τ² = 0.25] distribution, where τ² is our prior variance). Do not be confused by the two distinct SDs that are used here: σ represents the variability of the tumor diameters among the patients, whereas τ represents how certain we are of our prior mean value.

We now wish to combine this prior density with the information in the data as represented by the likelihood function to derive the posterior distribution, using Bayes’ theorem. After some algebra, the posterior distribution can be shown to be given by the normal distribution shown in equation 20,

N(A × µ + B × x̄, (τ²σ²) / (nτ² + σ²))    (20)

where A = (σ²/n) / (τ² + σ²/n) and B = τ² / (τ² + σ²/n). Note that the posterior mean value depends on both the prior mean, µ, and the observed mean in the data set, x̄. Plugging these values into the previous equation and using the first (very flat) prior distribution, we find that the posterior distribution for our mean tumor diameter is N(A × µ + B × x̄ = 3.0, (τ²σ²) / (nτ² + σ²) = 0.225). For the second more informative prior, the corresponding posterior distribution is N(3.0, 0.118).

The two prior and two posterior densities are displayed in Figure 3. Note that the second posterior distribution is narrower, because a stronger prior distribution was used. These posterior distributions can be used to derive 95% credible intervals and to test hypotheses from a Bayesian viewpoint. These calculations can be done using normal tables. Because these posterior distributions directly represent the probability distribution for our unknown parameter, interpretation of these quantities is straightforward. For example, a 95% credible interval from posterior distribution 1 is given by (2.1–3.9). In comparing this interval to the 95% confidence interval calculated earlier in the Statistical Inferences for Means section, we see that they are numerically identical (at least to one decimal place). However, the interpretations of these two intervals are different because the Bayesian credible interval is directly interpreted as the probability that the true mean tumor diameter lies in the given interval, given the data and the prior information used. This is in contrast to the less direct interpretation of a confidence interval, discussed earlier. Many people misinterpret confidence intervals as if they were Bayesian intervals. This error is often not too serious, because if little prior information is available, the two intervals are numerically similar. Therefore, even though it is technically incorrect, one does not go too far wrong thinking of confidence intervals as approximate Bayesian intervals, when there is little prior information. A 95% credible interval from our second posterior distribution is given by (2.3–3.7), which is somewhat narrower than the first interval.

We can also perform Bayesian hypothesis tests, again just using the posterior distributions. For example, suppose we wish to test H0: µ ≥ 3.5 versus HA: µ < 3.5. We can calculate Pr{H0 | data} = Pr{µ ≥ 3.5 | data}, which is equal to 14.5% for posterior 1 and 7.3% for posterior 2. Thus, we are approximately 85.5% or 92.7% sure that the tumor diameter under the accelerated schedule is better than the standard schedule, depending on which prior we use. Based on this, each clinician can make a decision about which schedule to apply to the next patient. Note again the very direct statements available for Bayesian hypothesis tests, compared with the nonintuitive interpretation of a p value. This clarity, however, comes at the expense of having to specify a prior distribution.

Carrying out Bayesian analyses is made easier via the use of freely available customized software. The posterior distributions shown earlier were calculated using the First Bayes package [16], and more complex Bayesian analyses can be done via specialized Monte Carlo numeric routines implemented in WinBUGS software [17] made freely available by the Medical Research Council of Great Britain [18]. An excellent introductory text on Bayesian analysis is one written by Gelman et al. [19].

Conclusions

This module has introduced some of the major ideas behind statistical inference, with emphasis on the simple methods for continuous variables. Rather than a simple catalogue listing of which tests to use for which types of data, we have tried to explain the logic behind the common statistical procedures seen in the medical literature, the correct way to interpret the results, and what their advantages and drawbacks may be. We have also introduced Bayesian inference as a strong alternative to standard frequentist statistical methods, both for its ability to incorporate the available prior information into the analysis and for its ability to address questions of direct clinical interest.

The next few modules in this series will cover techniques suitable for other types of data, including proportions and regression methods.

References
1. …docrine carcinoma of the lung in 38 patients. AJR 2004;182:87–91
2. Karlik SJ. Visualizing radiologic data. AJR 2003;180:607–619
3. Joseph L, Reinhold C. Fundamentals of clinical research for radiologists: introduction to probability theory and sampling distributions. AJR 2003;180:917–923
4. Moore D, McCabe G. Introduction to the practice of statistics, 3rd ed. New York, NY: Freeman and Company, 1988
5. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford, England: Blackwell Scientific Publications, 1994
6. Rosner B. Fundamentals of biostatistics. Belmont, MA: Duxbury, 1995
7. Rothman K. Writing for epidemiology. Epidemiology 1998;9:333–337
8. Evans S, Mills P, Dawson J. The end of the p-value. Br Heart J 1988;60:177–180
9. Brophy J, Joseph L. Placing trials in context using Bayesian analysis: GUSTO revisited by Reverend Bayes. JAMA 1995;273:871–875
10. Lilford R, Braunholz D. The statistical basis of
11. Sprent P. Applied nonparametric statistical methods. New York, NY: Chapman and Hall, 1989
12. Bristol D. Sample sizes for constructing confidence intervals and testing hypotheses. Stat Med 1989;8:803–811
13. Lemeshow S, Hosmer D, Klar J, Lwanga S. Adequacy of sample size in health studies. Chichester, England: Wiley, 1990
14. Bayes T. An essay towards solving a problem in the doctrine of chances: 1763. Philos Trans R Soc 1763;53:370–418
15. Spiegelhalter D, Freedman L, Parmar M. Bayesian approaches to randomized trials. J R Stat Soc [Ser A] 1994;157:387–416
16. O’Hagan A. First Bayes software. Available at: www.shef.ac.uk/st1ao/1b.html. Accessed December 25, 2003
17. Spiegelhalter D, Thomas A, Best N. WinBUGS version 1.4 user manual. Cambridge, England: MRC Biostatistics Unit, 2003
18. WinBUGS, version 1.4. Available at: www.mrc-bsu.cam.ac.uk/bugs/. Accessed December 25, 2003
19. Gelman A, Carlin J, Stern H, Rubin D. Bayesian
1. Oshiro Y, Kusumoto M, Matsuno Y, et al. CT public policy: a paradigm shift is overdue. BMJ data analysis, 2nd ed. London, England: Chap-
findings of surgically resected large cell neuroen- 1996;313:603–607 man and Hall, 2003

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005



Fundamentals of Clinical Research for Radiologists

Statistical Inference for Proportions

Lawrence Joseph1,2
Caroline Reinhold3,4

Received November 5, 2004; accepted after revision November 10, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 16th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1Division of Clinical Epidemiology, Montreal General Hospital, Department of Medicine, 1650 Cedar Ave., Montreal, QC H3G 1A4, Canada.
2Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Ave. W, Montreal, QC H3A 1A2, Canada. Address correspondence to L. Joseph (Lawrence.Joseph@mcgill.ca).
3Department of Diagnostic Radiology, Montreal General Hospital, McGill University Health Centre, 1650 Cedar Ave., Montreal, QC H3G 1A4, Canada.
4Synarc Inc., 575 Market St., San Francisco, CA 94105.

AJR 2005;184:1057–1064
0361–803X/05/1844–1057
© American Roentgen Ray Society

This module will discuss the most commonly used statistical procedures when the parameters of interest arrive in the form of proportions. Understanding these methods is especially important to radiologists because so much radiologic research and clinical work involves dichotomous (e.g., yes or no, present or absent) outcomes summarized as proportions. For example, a given disease or condition may be present or absent in any given subject, and any time a diagnostic tool is used, test characteristics such as sensitivity, specificity, and positive and negative predictive values are all summarized as proportions.

We will continue to use the three basic methods for statistical inferences, including p values and confidence intervals (CIs) from a frequentist viewpoint, and posterior distributions leading to credible intervals from a Bayesian viewpoint. We will only briefly review the basic ideas behind these generic inferential approaches, so readers may wish to ensure they have a good understanding of the previous module [1] in this series before tackling this one. It may also be useful to recall the basic properties of the binomial distribution [2] because it is the central distribution used for inferences involving proportions.

We begin with inferences for single proportions, which are covered in the next section. Then we discuss inferences for two or more proportions from independent groups, inferences for dependent proportions, sample size determination for studies involving one or two proportions, and Bayesian methods for proportions. Finally, we will summarize what we have learned in this module.

Inferences for Single Proportions
Standard Frequentist Hypothesis Testing
Suppose a new computer-aided automated system for the detection of lung nodules on chest radiographs has been developed [3]. Suppose further that one wishes to investigate whether this new system provides improved sensitivity compared with standard detection via non-computer-aided methods of analyzing chest radiographs. In other words, suppose that chest radiographs are taken from a series of subjects who all truly have lung nodules, and we know that using standard (non-computer-aided) methods 90% of them will be found to have lung nodules and 10% of these cases will be missed. Is there evidence that the new computer-aided automated system provides increased sensitivity compared with the standard method of detection?

To look for evidence of improved sensitivity in the new automated system, we might wish to test the null hypothesis (H0) that the automated system is in fact not better than standard detection, versus an alternative hypothesis (HA) that it is better. Formally, we can state these hypotheses as:

H0: p ≤ 0.9
HA: p > 0.9

where p represents the unknown true probability of success of the new automated system in detecting lung nodules.

Suppose that we observe the results from 10 subjects with lung nodules, and all 10 test positively with the new automated system. Recalling the correct definition of a p value [1] (it is the probability of obtaining a result as extreme as or more extreme than the result observed, given that the null hypothesis is exactly correct), how would we calculate the p value in this case? For our example of the new automated technique, the definition implies that we need to calculate the probability of obtaining 10 (or more, but in this case more than 10 is impossible) successful


lung nodule detections in the 10 patients to whom the technique was applied, given that the true rate of success is exactly 90%. Recall [2] that if x follows a binomial distribution with probability of success p, then Pr(x successes in n trials) = [n! / (x!(n − x)!)] p^x (1 − p)^(n−x), where x! is read as "x factorial" and is equal to x(x − 1)(x − 2) … (2)(1). For example, 5! = (5)(4)(3)(2)(1) = 120, and by convention 0! = 1. Using this binomial probability function, we can calculate the probability of 10 successes in a row with p = probability of success = 0.9 as shown in equation 1:

Pr(10 successes in 10 trials) = [10! / (10! × 0!)] × 0.9^10 × (1 − 0.9)^0 = 0.9^10 ≈ 0.349    (1)

So there is about a 34.9% chance of obtaining results as extreme as or more extreme than the 10 of 10 results observed, if the true rate for the new technique is exactly 90%. Therefore, the observed result is not unusual, and hence compatible with the null hypothesis, so we cannot reject H0.

This calculation could be done exactly, because the sample size was quite small. For larger sample sizes, the normal approximation to the binomial distribution [2] could be used. Also, this test was one-sided, but two-sided hypotheses are also of interest. For example, suppose we wish to test a similar null hypothesis as above (H0: p = 0.9) but against a two-sided alternative (HA: p ≠ 0.9). Suppose we observed 98 successes in 100 trials. Because our test is two-sided, according to the definition of a p value we need to calculate the probability of obtaining data as extreme as or more extreme than the observed 98 of 100. Now, 98 is 8 higher than the 90 expected under the null hypothesis, so that to be as extreme as or more extreme than the 98 observed, we need to be 8 or more above or below the expected 90. That is, we need to calculate the probability of 98, 99, or 100 successes on one side, and 82, 81, 80, …, 2, 1, 0 on the other side. This lengthy calculation, involving the sum of 85 binomial calculations, can be well approximated by using the normal approximation to the binomial distribution [2]. Let our estimate of the unknown proportion be p̂ = 98/100 = 0.98. We can calculate equations 2–4:

z = (p̂ − p0) / √(p0(1 − p0)/n)    (2)
  = (0.98 − 0.90) / √(0.9 × 0.1/100)    (3)
  = 0.08 / 0.03 ≈ 2.67    (4)

Looking up 2.67 on normal tables, we find 0.004, and doubling this value gives us our two-sided p value, which is 0.008. It is unlikely that rates of 98% or more extreme will be observed in 100 trials if the true rate is in fact only 90%. Therefore, in this case, sufficient evidence exists to reject the null hypothesis in favor of the alternative.

Although p values are still often found in the literature, several major problems are associated with their use, as we have previously discussed [1]. Briefly, the null hypothesis is virtually never exactly true (is it possible that the true underlying sensitivity is exactly 90%, as opposed to, say, 89.9999% or 90.0001%?), so we know it should be rejected regardless of the data we observe. Furthermore, the p value says nothing about the effect size, which is crucial to clinical decision making, with large sizes usually implying a more clinically important effect than small sizes. A much more interesting question is to estimate the rate or proportion of interest, together with a measure of the accuracy of the estimate. CIs are one answer to this question, and we discuss them next. The Bayesian solution—credible intervals—is discussed later.

Confidence Intervals for Single Proportions
Continuing the previous example, we have observed rates of 100% (10/10 in our smaller sample) or 98% (98/100 in our larger sample), but we know that these are estimates only, not guaranteed (in fact, unlikely) to exactly equal the true rates. On the basis of these data, however, what can we say about what we would expect the true rate to be?

One way to answer this question is with a CI. CIs usually have the form

estimate ± k × standard error

where the estimate and SE are calculated from the data, and where k is a constant dependent on the width of the CI desired. The value of k is usually near 2 (e.g., k is 1.96 for a 95% CI).

If one observes x = 98 positive tests in n = 100 subjects known to have lung nodules, a point estimate of the success rate is p̂ = x/n = 0.98 or 98%. We use the notation p̂ rather than p to indicate that this is an estimated rate, not necessarily equal to the true rate, which we denote by p. Following this generic formula, a CI for a binomial probability of success parameter is given by the formula in equation 5,

(5)

where z is derived from normal tables, and is given by z = 1.96 for the usual 95% CI (z = 1.64 for a 90% CI and z = 2.58 for a 99% CI). Therefore, the 95% CI in our example is calculated as shown in equation 6,

(6)

which here gives (0.930–0.994).

Technical note: This formula uses the normal approximation to the binomial distribution [2]. Exact formulae are also available [4], which are especially useful for small sample sizes or for estimates p̂ near 0 or 1. For example, using an exact approach to this CI yields (0.930–0.998), which is very close to but not identical to that given by the normal-approximation interval indicated above. In addition, when p̂ equals 0 or 1 exactly, the normal approximation breaks down, because the variance is estimated to be 0. Here one has no choice but to use a different procedure. The exact method yields a wider 95% CI of (0.741–1.000) in the case of our smaller data set, where 10 positive values were found in 10 subjects. There is also an easy-to-use and reasonably accurate rule of thumb when calculating a binomial CI and one observes 0 events. The rule is this: If you observe n patients, and none of these patients have an event, then a 95% CI for the probability of the event goes from 0 to 3/n. For example, if you observe 0 events in 10 binomial trials, then an approximate 95% CI would go from 0 to 3/10 = 0.3. By symmetry, the rule would say that if you observe only events in n trials, then the 95% CI would go from (1 − 3/n) to 1. For example, if you observe 10 events in 10 trials, then the 95% CI would go from 0.7 to 1, which is reasonably close to the exact solution of (0.741–1.000) given here.

How does one interpret this CI? Recall from the previous module [1] that the 95% confidence value (often called the confidence coefficient) is a long-run probability over repeated uses of the CI procedure. In practice, there are five different interpretations associated with CIs, depending on where the upper and lower CI limits fall with respect to clinical cut points of interest (see Fig. 2 of Joseph and Reinhold [1]). The formula displayed in equation 5 of this article provides a procedure that, when used repeatedly across different problems, will capture the true value of p 95% of the time and fail to capture the true value 5% of the time. In this sense, we have confidence that the procedure works well in the long run, although in any single application, of course, the interval either does or does not contain the true proportion p.

For our smaller data set, with 10 subjects found to be positive in 10 trials, the 95% CI ranges from 74.1% to 100%, providing a large and inconclusive interval, because the new system may well be better or worse than the standard diagnosis, which is assumed to be successful 90% of the time. In our larger data set, the 95% CI ranged from 93.0% to 99.4%, so we can be quite certain that it is better than standard diagnoses. However, it can be as little as 3% better (90% compared with the lower CI


limit of 93%). Whether this is enough evidence to switch to the new automated system or not depends on clinical judgment. This in turn depends on many factors, including the cost and availability of the new automated system and the average clinical benefits that will accrue to those diagnosed earlier by the more sensitive diagnostic method.

Inferences for Two or More Independent Proportions
Let us continue with our example comparing the diagnostic properties of a new automated system for the detection of lung nodules on chest radiographs compared with standard detection via non-computer-aided methods. Earlier we assumed that the rate in the standard diagnosis group was exactly known before the study, but this is somewhat unrealistic. We will now relax this assumption, and consider the data from the two-group study shown in Table 1 (presented in the form of a 2 × 2 table of data because we have two possible outcomes in each of the two groups being compared).

TABLE 1: Data from a Two-Group Study
Diagnostic Method     Test Positive   Test Negative   Total
Automated system            285              15         300
Standard diagnosis          265              45         310
Total                       550              60         610

Again, we assume that all 610 subjects studied are truly positive, so that one would like to draw inferences about whether the automated system has increased sensitivity compared with the usual diagnosis group. Although one observes p̂1 = 285/300 = 0.95 sensitivity for the automated system compared with p̂2 = 265/310 = 0.855 sensitivity using standard diagnosis, for a 9.5% observed difference, a CI will provide us with a range of values compatible with the data that will help draw a better conclusion than simply looking at the observed point estimates. To calculate a CI for this difference in proportions, we can use the formula in equation 7,

(p̂1 − p̂2) ± z √(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2)    (7)

which extends equation 5 to the case of two proportions. In this formula, p̂1 and p̂2 are the observed proportions in the two groups out of sample sizes n1 and n2, respectively, and z is the relevant percentile from normal tables, chosen according to the desired level of the CI. For example, for a 95% CI z = 1.96, for a 90% interval z = 1.64, and so on. Using this formula for the diagnosis data given, one finds that a 95% CI for the difference in sensitivity is (0.049–0.141). This interval suggests that the automated system is indeed better, likely by at least as much as 0.049. Unless cost is a prohibitive factor, from these data it looks like the automated system is worthwhile (at least in these hypothetical data).

Although CIs are preferred for reasons we have briefly discussed here and which were more extensively discussed in a previous module in this series [1], we will also discuss hypothesis testing for proportions, because one often sees such tests in the literature. Suppose we wish to test the null hypothesis that p1 = p2—that is, the null hypothesis states that the success rates are identical in the two groups. Because we hypothesize p1 = p2, we expect to observe, on average, the data in Table 2.

TABLE 2: Expected Data for the Example in Table 1 Under the Null Hypothesis
Diagnostic Method     Test Positive   Test Negative   Total
Automated system         270.49           29.51        300
Standard diagnosis       279.51           30.49        310
Total                       550              60         610

Why do we expect to observe this table of data if the null hypothesis is true? We have observed a total of 550 "successes" divided among the two groups. If p1 = p2 and if the sample sizes were equal in the two groups, we would have expected 550/2 = 275 successes in each group. However, because the sample sizes are not equal, we expect 550 × (300/610) = 270.49 to go to the automated system group, and 550 × (310/610) = 279.51 to go to the standard diagnosis group. Similarly, expected values for the 60 negatively testing patients can be calculated. Observed discrepancies from these expected values are evidence against the null hypothesis. To perform a chi-square test, we now calculate as shown in equations 8–10:

χ² = Σ (observed − expected)² / expected    (8)
   = (285 − 270.49)²/270.49 + (15 − 29.51)²/29.51 + (265 − 279.51)²/279.51 + (45 − 30.49)²/30.49    (9)
   = 0.778 + 7.133 + 0.753 + 6.903 ≈ 15.57    (10)

Comparing the χ² = 15.57 value on chi-square tables with 1 degree of freedom (df) (see Armitage and Berry [4] or almost any basic textbook on statistics to find such tables), we find that p ≈ 0.0001, so that we have strong evidence to reject the null hypothesis. This coincides with our conclusion from the CI, but note that the CI is more informative than simply looking at the p value from the chi-square test, because a range for the difference in sensitivities is provided by the CI. Thus, the clinical importance of any differences can be more easily evaluated.

The chi-square test can be extended to include tables larger than the so-called 2 × 2 table of this example. For instance, a 3 × 2 table could arise if, rather than classifying patients as positive or negative, we included a third outcome category, such as "chest radiograph is inconclusive." A 3 × 2 table could also arise if we considered comparing a third method of diagnosis rather than the two considered here. In these cases we would sum over 3 × 2 = 6 terms rather than the four terms of a 2 × 2 table. Although for 2 × 2 tables the df is always equal to 1, in general the df for chi-square tests is given by (r − 1) × (c − 1), where the number of rows in the table is r and the number of columns is c. Thus, in the case of a 3 × 2 table, we would have (3 − 1) × (2 − 1) = 2 df. In general, cases with arbitrary numbers of rows and columns can be constructed and analyzed using the chi-square test.

In order for the chi-square test to be valid, one needs to ensure that the expected value for each cell in the table is at least 5. This was satisfied in the previous example, in which our smallest expected table value was 29.51, much larger than 5. Fisher's exact test [4] is often used if this criterion is not satisfied for a particular table. Fisher's exact test is valid for tables of any size, and in particular for small sample sizes.
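The calculations in this section are easy to reproduce in software. The following Python sketch (our own illustration; the variable names are not from the article) computes the 95% CI for the difference in sensitivities (equation 7) and the chi-square statistic (equations 8–10) for the data in Table 1:

```python
from math import sqrt

# Data from Table 1: all 610 subjects truly have lung nodules
x1, n1 = 285, 300  # automated system: test positive, group total
x2, n2 = 265, 310  # standard diagnosis: test positive, group total

# 95% CI for the difference in sensitivities (equation 7)
p1, p2 = x1 / n1, x2 / n2
z = 1.96  # normal quantile for a 95% CI
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
ci = (p1 - p2 - z * se, p1 - p2 + z * se)

# Chi-square statistic (equations 8-10): sum of (O - E)^2 / E over the
# four cells, with expected counts computed from the table margins
n = n1 + n2
observed = [x1, n1 - x1, x2, n2 - x2]
expected = [r * c / n for r in (n1, n2) for c in (x1 + x2, n - x1 - x2)]
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(f"difference = {p1 - p2:.3f}, 95% CI ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"chi-square = {chi2:.2f} on 1 df")
```

Running this reproduces the interval (0.049–0.141) and χ² = 15.57 reported above. For larger r × c tables, library routines such as scipy.stats.chi2_contingency perform the same computation, including the expected-count step.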

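When expected cell counts fall below 5, Fisher's exact test is the usual alternative. As a sketch of what the test does (the function below is our own illustration, not code from the article), one can enumerate all 2 × 2 tables with the same margins and sum the hypergeometric probabilities of those no more likely than the observed table:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = table_prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum probabilities of all margin-preserving tables that are
    # as probable as or less probable than the observed table
    return sum(p for x in range(lo, hi + 1)
               if (p := table_prob(x)) <= p_obs * (1 + 1e-9))

# Toy example with small counts, where the chi-square approximation
# would be doubtful
p_value = fisher_exact_2x2(8, 2, 3, 7)
print(f"two-sided p = {p_value:.3f}")
```

For routine use, scipy.stats.fisher_exact implements the same 2 × 2 test.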

Inferences for Dependent Proportions
A two-group clinical trial, where n1 subjects receive treatment A and n2 different subjects receive treatment B, usually results in independent samples. That is, the results under treatment regimen A (the number of successful outcomes among the n1 subjects given treatment A) do not depend on the outcomes in group B (the number of successful outcomes among the n2 subjects given treatment B).

Sometimes, however, subjects or data points may come in pairs, so that dependencies among the groups are naturally induced. Consider, for example, the frequently occurring situation in which two diagnostic tests are given to each of a series of subjects. Each subject may test positively or negatively on each of the two tests, so that the data arising from such a study may be summarized in a 2 × 2 table, as seen in Table 3.

TABLE 3: Generic Setup of a 2 × 2 Table
                        First Test
Second Test        Positive   Negative        Total
Positive               a          b           a + b
Negative               c          d           c + d
Total                a + c      b + d     N = a + b + c + d

Thus, we observe a subjects who are positive on both tests, b subjects who are negative on the first test but positive on the second test, c subjects who are positive on the first test but negative on the second, and d subjects who test negatively on both tests. The cells with a and d contain concordant pairs, because the two test results agree with each other, whereas the cells with b and c contain discordant pairs.

Similar data can arise from a matched case–control study. In this type of study design, cases (e.g., those with a particular disease) are first found and then matched to a particular control with similar characteristics but without the condition of interest.

As a concrete example, suppose we wish to investigate whether impaired renal function is related to diminished renal size. Because we would otherwise require large numbers of subjects to be followed up over a long period of time, a case–control design may be considered. Thus, one finds patients with impaired renal function and control subjects without impaired renal function, and discovers whether there is a tendency of those with impaired renal function to show diminished renal size on sonography compared with those without impaired renal function. Of course, patients with impaired renal function may tend to be different from subjects without (control subjects) in many ways, so to minimize possible confounding one may want to control for age, sex, height, hypertension, diabetes, and so on. For each patient, one may want to find a control subject with similar age, sex, height, and other characteristics, thus forming a series of matched pairs. Within each of these pairs, one then classifies each patient and control subject according to whether or not they have diminished renal size at sonography.

Within each matched pair are four possibilities: Both the patient and control subject may show diminished renal size, or both may not show diminished renal size. These two possibilities form concordant pairs (introduced in the previous text) because similar renal size is shown for each subject forming the pair. Of course, the other two possibilities are that the patient shows diminished renal size and the control does not, and vice versa, forming the discordant pairs. As was the case with diagnostic test studies, the data may be formed into a 2 × 2 table, as shown in Table 4.

TABLE 4: Data in a Case–Control Study
                          Patient Has             Patient Does Not Have
                      Diminished Renal Size      Diminished Renal Size        Total
Control has                    a                           b                  a + b
Control does not have          c                           d                  c + d
Total                        a + c                       b + d          N = a + b + c + d

Note that there are a total of N pairs of subjects in this study, meaning that we in fact have 2N individuals (similarly, in the diagnostic test case, we have 2N tests, but only N subjects). We have a pairs in which both the patient and the matched control subject showed diminished renal size, b pairs in which the control but not the patient showed diminished renal size, and so on.

Suppose we would like to test the null hypothesis that diminished renal size is unrelated to impaired renal function versus the alternative hypothesis that a relation exists between diminished renal size and impaired renal function. The McNemar test focuses on the discordant pairs, represented in Table 4 by b and c. We can formulate the statistic shown in equation 11,

χ² = (|b − c| − 1)² / (b + c)    (11)

which approximately follows a chi-square distribution with 1 df. Thus, a p value can be calculated for this test.

For example, suppose we observe the following data: a = 200, b = 100, c = 75, and d = 300. According to the McNemar test, we calculate as shown in equation 12:

χ² = (|100 − 75| − 1)² / (100 + 75) = 576 / 175 ≈ 3.29    (12)

Looking up 3.29 on chi-square tables yields a p value of 0.069, so that it is close to but does not cross the (admittedly arbitrary) threshold of 0.05. Thus, at least at the type 1 error level of 0.05, we do not have evidence to reject the null hypothesis.

Of course, the McNemar test can also be used for testing hypotheses relating to diagnostic test data of the type described at the beginning of this section.

The general criticisms relating to hypothesis testing and p values carry over to the particular case of testing dependent proportions through the McNemar test. Odds ratios and associated CIs can be calculated from matched pair studies, and these will be covered in a future module in this series.

Sample Size Determination for One and Two Proportions
As previously discussed [1], there has been a strong trend away from hypothesis testing and p values toward the use of CIs in the reporting of results from biomedical research. Because the design phase of a study should synchronize with the analysis that will eventually be performed, sample size calculations should be performed on the basis of ensuring adequate numbers for


accurate estimation of the important quantities that will be estimated in the study, rather than by power calculations. For one- and two-sample problems, the formulae are as given in the following paragraphs.

Single Sample
Let p be the proportion that is to be estimated, and assume that we wish to estimate p to an accuracy of a total CI width of w = 2 × h, where h is half the total CI width. Then we can perform the calculation shown in equation 13,

n = z² p(1 − p) / h²    (13)

where, again, z is the appropriate normal quantile (e.g., z = 1.96 for a 95% CI).

Two Sample
Let p1 and p2 be the two proportions whose difference we would like to estimate to a total CI width of w = 2 × h. Then we can perform the calculation shown in equation 14,

n = z² [p1(1 − p1) + p2(1 − p2)] / h²    (14)

where n represents the required sample size for each group.

As an example, suppose we want to design a study to measure the difference in diagnostic accuracy for two types of imaging techniques, say MRI versus CT for staging cervical carcinoma. Suppose that CT is thought to be successful in staging patients with cervical carcinoma with probability p1 = 0.70, and MRI may improve this to p2 = 0.80. We would like to estimate the true difference to within h = 0.05, so that not only will we be able to detect any differences of 10%, but the 95% CI will be far enough away from 0 (if our predicted rates are correct) that we can make a more definitive conclusion as to the clinical usefulness of MRI. We calculate as shown in equations 15 and 16:

n = 1.96² × [0.70 × 0.30 + 0.80 × 0.20] / 0.05²    (15)
  = 3.84 × 0.37 / 0.0025 ≈ 569 subjects per group    (16)

[…] value is conservative in the sense that the desired CI width will be respected regardless of the estimated value of p that will be observed in the study. This conservative value, however, may provide too large a sample size and therefore be wasteful of resources if the true proportion is far from 0.5. A conservative rule of thumb is to use the value of p that is closest to 0.5, selected from the set of all plausible values. Similarly, equation 14 is maximized for p1 = p2 = 0.5, so a similar rule of thumb applies for each of p1 and p2.

Bayesian Inference for Proportions
Consider again the problem introduced in the section called Inferences for Single Proportions. Recall that in that example the sensitivity of standard interpretation of the radiographs is assumed to be 90%, whereas the small data set collected so far for the new automated radiograph interpretation system indicates a 100% success rate but is based on only 10 subjects. The frequentist CI was very wide, ranging from 74.1% to 100%. Therefore, the data themselves have not been particularly helpful in making a decision as to which technique to use for the next patient, because values indicating a new test that is both more and less sensitive than the standard diagnostic method have not been ruled out by the CI. At this point, with the data being relatively uninformative, the radiologist may decide to be conservative and remain with the standard method until more information becomes available about the new automated technique, or may go with his or her "gut feeling" as to the likelihood that the new technique is truly better or not. If there have been data from animal experiments or strong theoretic reasons why the new technique may be better, the radiologist may be tempted to try the new one. Can anything be done to aid in this decision-making process?

Bayesian analysis has several advantages over standard or frequentist statistical analyses. These advantages include the following:

First is the ability to address questions of direct clinical interest, such as direct probability statements about hypotheses of interest and credible intervals with similarly easy interpretations [1]. Hence, results of Bayesian analyses are straightforward to […]

[…] in the form of prior information about parameters of interest.

The third advantage is that Bayesian analysis is a natural way to update statistical analyses as new information becomes available.

A main theoretic difference between frequentist and Bayesian statistical analyses is that Bayesian analysis permits parameters of interest (binomial probabilities, population means, and so on) to be considered as random quantities, so that probabilities can be attached to the possible values that they may attain. On the other hand, frequentists consider these parameters to be fixed (albeit possibly unknown) constants, so they have no choice but to attach their probabilities to the data that could arise from the experiment, rather than to the parameters. This distinction is the main reason Bayesian analysis can answer direct questions of interest, whereas frequentist analyses must settle for answering more obscure questions in the form of p values and CIs.

The ability to address questions of direct interest, however, comes at the cost of having to do a bit more work. Not only do Bayesians have to collect data from their experiments, but they also have to quantify the state of knowledge of all parameters before collecting these data. This nontrivial step is summarized in a prior distribution. The information in the prior distribution is updated by the information in the data to arrive at a posterior distribution, which summarizes all available information, past and current. We will apply a Bayesian analysis to our radiologist's decision later in this section, but first we need to recall the basic elements of all Bayesian analyses and see how they are applied to drawing inferences about our parameter of interest here, the binomial success rate of the new automated radiographic technique.

Let us generically denote our parameter of interest as θ. Hence, θ can be a binomial parameter, a set of two independent or dependent binomial parameters, the mean and variance from a normal distribution, an odds ratio, a set of regression coefficients, and so on. Note in particular that θ can be two- or more dimensional. The parameter of interest is sometimes usefully thought of as the "true state of nature."

As discussed in more detail in the previous module in this series [1], the basic elements of a Bayesian analysis are as follows:
interpret, in contrast to the obscure and dif- First is the prior probability distribution, f
ficult-to-understand (and frequently misin- (θ). This subjective prior distribution summa-
so that 569 patients are required in each group. terpreted) inferences provided by p values rizes what is known about θ before the exper-
The main practical difficulty with equations and CIs [1]. iment is performed.
13 and 14 is assigning appropriate values for p, Second is the ability to incorporate rele- Second is the likelihood function, f (x | θ).
p1, and p2. It is therefore useful to note that equa- vant information not directly contained in the The likelihood function provides the distribu-
tion 13 is maximized when p = 0.5, so using this data into any statistical analysis. This enters tion of the data, x, given the parameter value θ.

AJR:184, April 2005 1061


Joseph and Reinhold
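Returning to the sample-size calculation in equations 14–16: the arithmetic is easy to script, and the sketch below (our own helper function, not part of the article) reproduces the 569-per-group result.

```python
import math

def n_per_group_two_proportions(p1, p2, h, z=1.96):
    """Sample size per group (equation 14): z is the normal quantile,
    h is the desired half-width of the CI for the difference p1 - p2."""
    n = z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / h**2
    return math.ceil(n)  # round up to the next whole patient

# MRI vs CT staging example: p1 = 0.70, p2 = 0.80, h = 0.05
print(n_per_group_two_proportions(0.70, 0.80, 0.05))  # prints 569
```

Using p1 = p2 = 0.5 in the same function gives the conservative (largest) answer when the true rates are unknown.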

For instance, for proportions it may be a binomial likelihood, as in equation 17,

l(x | p) = [N! / (x!(N − x)!)] × p^x × (1 − p)^(N − x)    (17)

Third is the posterior distribution, f (θ | x). The posterior distribution summarizes the information in the data, x, together with the information in the prior distribution, f (θ). Thus, it summarizes what is known about the parameter of interest θ after the data are collected.
Bayes' theorem relates the above three quantities:

posterior distribution = [likelihood of the data × prior distribution] / a normalizing constant,

or, using our notation and omitting the normalizing constant, as shown in equation 18,

f (θ | x) ∝ f (x | θ) × f (θ)    (18)

where ∝ indicates "is proportional to."
Thus, we update the prior distribution to a posterior distribution after seeing the data via Bayes' theorem. The current posterior distribution can be used as a prior distribution for the next study; hence, Bayesian inference provides a natural way to represent the learning that occurs as science progresses.
The prior distribution is subjective and chosen by each investigator according to his or her appreciation of the past literature regarding the unknown parameters of interest. Hence, the prior distribution is not unique to each experiment but can vary from investigator to investigator. This can be seen as accurately reflecting clinical reality. Different clinicians can have different initial opinions about a parameter value, although these opinions tend to concentrate about a constantly narrowing range of values as more data accumulate. This is how Bayes' theorem operates, because the prior becomes a less important contributor to the posterior distribution as more data become available. See the previous module for more discussion about prior distributions [1].
We now will apply the general Bayesian technique we have described to the specific problem of inferences for binomial proportions.
Suppose that in a given experiment x "successes" are observed in N binomial trials. Let θ = p denote the parameter of interest—the true but unknown probability of success—and suppose that the problem is to find an interval that covers the most likely locations for p given the data.
The Bayesian solution to this problem follows the usual pattern, as outlined previously. Hence, the main steps can be summarized as follows: first, write down the likelihood function for the data; second, write down the prior distribution for the unknown parameter p; and third, use Bayes' theorem (i.e., multiply the equation for the likelihood function of the data by the prior distribution) to derive the posterior distribution. Use this posterior distribution, or summaries of it like 95% credible intervals, for statistical inferences. Credible intervals are the Bayesian analogues of frequentist CIs.

Fig. 1.—Series of four beta densities.
A–D, Graphs show beta(1,1) (A), beta(10,10) (B), beta(2,8) (C), and beta(8,2) (D) densities. Beta(1,1) distribution (A) is also known as the uniform density.

Statistical Inference for Proportions

For the case of a single binomial parameter, these steps are realized in this manner:

Step 1
The likelihood function is the usual binomial probability formula shown in equation 17, where l(x | p) represents the likelihood function for the success rate p given data x.

Step 2
Although any prior distribution can be used, two distributions are of particular interest. The first prior distribution we will discuss is the uniform prior distribution, which specifies that all possible values (for proportions, this implies all values in the range of 0–1) are equally probable, a priori. See Figure 1A. The uniform distribution is suitable for use as a "diffuse" or a "noninformative" distribution, when little or no prior information is available or when one wishes to see the information contained in the data by itself.
A second particularly convenient prior distribution, for reasons to be explained, is the beta distribution. A random variable, θ, has a distribution that belongs to the beta family if it has a probability density given by equation 19,

f (θ) = θ^(α − 1) × (1 − θ)^(β − 1) / B(α,β)    (19)

for 0 ≤ θ ≤ 1, and α, β > 0. B(α,β) represents the beta function evaluated at (α,β). It is simply the normalizing constant that is necessary to make the total area under the curve equal to 1, but otherwise plays no role.
Some beta distributions are illustrated in Figure 1. For example, using a beta(α = 1, β = 1) distribution reproduces the perfectly flat or uniform distribution discussed previously. Thus, the uniform distribution is really just a special case of the beta distribution. On the other hand, a beta(α = 10, β = 10) density produces a curve similar in shape to a normal density centered at θ = 0.5. If α > β the curve is skewed toward values near 1, whereas if α < β the curve is skewed toward values near 0.
The mean of the beta distribution is given by equation 20,

µ = α / (α + β)    (20)

and the SD is given by equation 21,

σ = sqrt{αβ / [(α + β)^2 × (α + β + 1)]}    (21)

To choose a prior distribution, one needs only to specify values for α and β. This can be done by finding the α and β values that give the correct prior mean and SD values. Solving these two equations in two unknowns, the formulae are shown in equations 22 and 23,

α = µ × [µ(1 − µ) / σ^2 − 1]    (22)

β = (1 − µ) × [µ(1 − µ) / σ^2 − 1]    (23)

For example, if we wish to find a member of the beta family centered near µ = 0.9 and with σ = 0.05, then plugging these values for µ and σ into these two equations gives α = 31.5 and β = 3.5, so that a beta(31.5, 3.5) will have the desired properties. This curve, pictured in Figure 2, may be an appropriate prior distribution for the problem introduced at the beginning of this section if the radiologist believes, a priori, that the new technique is likely to be successful between 80% and 100% of the time, with a best guess of 90% for the rate. Note that this clinician has centered the prior around the rate of the standard method. Thus, this prior distribution would give equal a priori weight to both the null and alternative hypotheses given at the start of the section on Inferences for Single Proportions. We will return to this example again shortly.

Step 3
As always, Bayes' theorem says

posterior distribution ∝ prior distribution × likelihood function.

In this case, it can be shown (by relatively simple algebra) that if the prior distribution is beta(α,β) and the data are x successes in N trials, then the posterior distribution is again a beta distribution, beta(α + x, β + N − x). This simplicity arises from noticing that both the beta prior distribution as represented in equation 19 and the binomial likelihood as given in equation 17 have the general form p^a × (1 − p)^b, so that when multiplying them as required by Bayes' theorem, the exponents simply add, and the form is once again recognized to be from the beta family of distributions.
Hence, if we observe the new automated computer-aided radiologic method to correctly identify 10 patients in a row with lung nodules, and if we use the prior distribution discussed previously, then the posterior distribution is a beta(31.5 + 10, 3.5 + 0) = beta(41.5, 3.5) distribution, which is illustrated in Figure 2. The mean of this distribution is [41.5 / (41.5 + 3.5)] = 0.922, and the 95% posterior credible interval is (0.844–0.988).

Fig. 2.—Prior (dotted line) and posterior (solid line) beta densities for automated radiology example.
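The prior elicitation in equations 22 and 23 and the conjugate update above take only a few lines of Python; this is a sketch (the function name is ours):

```python
def beta_params_from_mean_sd(mu, sigma):
    """Equations 22 and 23: alpha and beta for a beta prior
    with mean mu and standard deviation sigma."""
    common = mu * (1 - mu) / sigma**2 - 1
    return mu * common, (1 - mu) * common

# Radiologist's prior: centered at 0.9 with SD 0.05 -> beta(31.5, 3.5)
a, b = beta_params_from_mean_sd(0.9, 0.05)

# Conjugate update with x = 10 successes in N = 10 trials -> beta(41.5, 3.5)
x, N = 10, 10
a_post, b_post = a + x, b + N - x
post_mean = a_post / (a_post + b_post)  # = 41.5 / 45, about 0.922
```

If SciPy is available, `scipy.stats.beta.ppf(0.025, a_post, b_post)` and `scipy.stats.beta.ppf(0.975, a_post, b_post)` give equal-tailed 95% credible limits, which should land close to the (0.844–0.988) interval quoted above.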



The probability that p is greater than 90% is 0.748 (area under the curve to the right of 0.9 in Fig. 2). Therefore, the radiologist may or may not be tempted to try the automated technique on the next patient but should realize that this decision is mostly based on the prior information, to which the data contributed only a small amount of new information. Looking at Figure 2, we see that the prior density was shifted only a small amount by the data. If instead the radiologist "lets the data speak for themselves" by using a beta(1,1) or uniform prior distribution (Fig. 1), then the 95% interval is (0.773–0.971), very similar numerically to the frequentist CI of the section Inferences for Single Proportions, although their interpretations are quite different. Bayesian intervals (deliberately called credible intervals to distinguish them from frequentist confidence intervals) are interpreted directly as the posterior probability that p is in the interval, given the data and the prior distribution. No references to long-run frequencies or other experiments are required, as is the case for CIs.
In general, one should usually perform a Bayesian analysis using a diffuse prior distribution like a beta(1,1) distribution, to examine what information the current data set provides. Then one or more Bayesian analyses with more informative prior distributions could be performed, depending on the available prior information. If opinions in the medical community are widely divergent concerning the parameters of interest, then several prior distributions should be used. If the data set is large, then similar conclusions will be reached no matter which prior distribution one starts with. On the other hand, with smaller data sets, diversity of opinions will still exist, even after the new data are analyzed. Bayesian analysis allows this situation to be accurately represented and assessed.
Although we discuss only the simple case of Bayesian inference for a single binomial proportion, these methods are easily extended to the case of two or more proportions. For a clinical example using Bayesian analysis to compare two proportions, see Brophy and Joseph [5]. This example also illustrates the use of a range of prior distributions and shows that Bayesian analysis can often come up with answers that are quite different from those obtained using a frequentist approach.

Discussion
This module has introduced some of the major ideas behind statistical inference for proportions, with emphasis on the simple methods for one and two samples. Rather than a simple catalogue listing of which methods to use for which types of dichotomous data, we have tried to explain the logic behind the common statistical procedures seen for binary data in the medical literature, the correct way to interpret the results, and what their advantages and drawbacks may be. We have also introduced Bayesian inference as a strong alternative to standard frequentist statistical methods, for both its ability to incorporate the available prior information into the analysis and its ability to address questions of direct clinical interest.
For more information about inferences on proportions, see the books by Fleiss [6] for the frequentist perspective and by Gelman et al. [7] for the Bayesian view. General books on statistical inferences in medicine [8–10] all contain many techniques on inferences for proportions that are beyond the scope of this module.
Software is available that makes carrying out all the analyses discussed in this module relatively easy. From the frequentist viewpoint, there are literally dozens of statistical packages available for purchase, but much excellent free software is also available. For example, the R package [11] is freely available for most computer platforms, including Windows (Microsoft) and Linux PCs and MacOS (Apple). It is a comprehensive package that is constantly being updated. Free Bayesian software includes First Bayes [12] for simple problems and WinBUGS [13, 14] for more complicated problems.
The previous module covered similar techniques to those covered here for continuous data, and future modules in this series will cover techniques suitable for other types of study designs and questions that arise in radiology, including linear and logistic regression methods. The latter is especially relevant because logistic regression allows one to analyze dichotomous outcomes from one or more groups while adjusting the analysis for potential confounding factors.

References
1. Joseph L, Reinhold C. Fundamentals of clinical research for radiologists. Statistical inferences for continuous variables. AJR 2005;184:1047–1056
2. Joseph L, Reinhold C. Fundamentals of clinical research for radiologists. Introduction to probability theory and sampling distributions. AJR 2003;180:917–923
3. Kakeda S, Moriya J, Sato H, et al. Improved detection of lung nodules on chest radiographs using a commercial computer-aided diagnosis system. AJR 2004;182:505–510
4. Armitage P, Berry G. Statistical methods in medical research, 3rd ed. Oxford, England: Blackwell Scientific Publications, 1994
5. Brophy J, Joseph L. Placing trials in context using Bayesian analysis: GUSTO revisited by Reverend Bayes. JAMA 1995;273:871–875
6. Fleiss J. Statistical methods for rates and proportions. New York, NY: Wiley, 1981
7. Gelman A, Carlin J, Stern H, Rubin D. Bayesian data analysis, 2nd ed. London, England: Chapman and Hall, 2003
8. Rosner B. Fundamentals of biostatistics. Belmont, MA: Duxbury, 1995
9. Bland M. An introduction to medical statistics, 3rd ed. Oxford, England: Oxford University Press, 2000
10. Le C. Introductory biostatistics. New York, NY: Wiley, 2003
11. R, version 1.8.0. Available at: cran.r-project.org/. Accessed February 2, 2004
12. O'Hagan A. First Bayes software. Available at: www.shef.ac.uk/~st1ao/1b.html. Accessed December 25, 2003
13. Spiegelhalter D, Thomas A, Best N. WinBUGS version 1.4 user manual. Cambridge, UK: MRC Biostatistics Unit, 2003
14. WinBUGS, version 1.4. Available at: www.mrc-bsu.cam.ac.uk/bugs/. Accessed February 2, 2004

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005



Fundamentals of Clinical Research for Radiologists

Reader Agreement Studies

Philip E. Crewson1

Received November 17, 2004; accepted after revision November 23, 2004.
This is the 17th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).
Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.
Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.
1 Health Services Research and Development Service (124), Department of Veterans Affairs, 810 Vermont Ave., NW, Washington, DC 20420. Address correspondence to P. E. Crewson (philip.crewson@va.gov).
AJR 2005;184:1391–1397
0361–803X/05/1845–1391
© American Roentgen Ray Society

This article presents several approaches for evaluating reader agreement. The dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Percent agreement is an intuitive approach to measuring agreement but does not adjust for chance. Kappa provides a measure of agreement beyond that which would be expected by chance, as estimated by the observed data. Both the bi-rater and multirater kappa statistics have several limitations that are difficult to resolve. Although there are alternative approaches to measuring agreement, kappa remains the most commonly used measure.

Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Extremely common in the radiology literature, reader agreement studies determine the magnitude of agreement between or among readers. Potential applications include developing reliable diagnostic rules [1], understanding variability in treatment recommendations [2], evaluating the effects of training on interpretation consistency [3], determining the reliability of classification systems (lexicon development) [4], and comparing the consistency of different sources of medical information [5]. Agreement studies should not be confused with studies of accuracy, in which measures of sensitivity and specificity and ROC curves are commonplace for comparisons when a reference standard (known truth) exists. Whereas accuracy studies evaluate the validity of a measure and require a reference standard, agreement studies most commonly focus on the reliability of evaluations between different readers or in the same reader on different occasions; agreement studies do not require a reference standard.
Several methods are available for evaluating reader agreement, but the dominant technique in the radiology literature is weighted and unweighted Cohen's kappa and the associated measure, percent agreement. Because of the popularity of kappa in radiology research, this paper will focus on bi-rater and multirater kappa. Included in this presentation will be a discussion of the basic data requirements, calculation formulas, interpretation of the kappa coefficient as a measure of strength of agreement, and statistical significance testing. This discussion will be followed by an exploration of several limitations of kappa, especially those that pertain to comparability across studies. Formulas are provided in sufficient detail for those who wish to replicate the calculations, but an in-depth understanding of the mathematics is not necessary to appreciate the application and limitations of kappa.

Bi-Rater Kappa
Cohen's kappa is a common technique for estimating paired interrater agreement for nominal and ordinal-level data [6]. Kappa is a coefficient that represents agreement obtained between two readers beyond that which would be expected by chance alone [7]. A value of 1.0 represents perfect agreement. A value of 0.0 represents no agreement. Although such instances are rare, kappa can also exhibit negative values when observed agreement is less (worse) than chance. Key assumptions for using kappa include the following: elements being rated (images, diagnoses, clinical indications, and so forth) are independent of each other, one rater's classifications are made

AJR:184, May 2005 1391



independently of the other rater's classifications, the same two raters provide the classifications used to determine kappa, and the rating categories are independent of one another [8]. The last assumption may be difficult to satisfy in some imaging studies in which there are subtle differences in lesion characteristics and decision criteria. When differences between rating categories are not clear, careful study design is essential to maximize the independence among rating categories. Alternatives include dropping confusing categories or merging related categories. Although not always possible, adjustments in the classification scheme should be consistent with clinical practice.
Bi-rater kappa is used to test the hypothesis that agreement exists between two raters beyond that which would be expected by chance. It provides a measure of the relative intensity of agreement or disagreement between two readers rating the same elements using an identical classification system. A two-by-two contingency table illustrates hypothetic data in which two readers independently viewed the same set of 100 images from diagnostic mammograms with a simple classification criterion, malignant or benign (Table 1).

TABLE 1: Two Readers Evaluating 100 Images (Counts)

                        Reader 1
Reader 2      Benign    Malignant    Total
Benign          20           5          25
Malignant       15          60          75
Total           35          65         100

To estimate kappa, both raters must use the same number of rating criteria so that the number of columns representing the rating categories used by rater 1 equals the number of rows representing the rating categories used by rater 2. Kappa is calculated using the formula:

k̂ = (po − pe) / (1 − pe)

where po is the proportion of cases in which agreement exists between two raters, and pe is the proportion of cases in which raters would agree by chance.
If we divide each cell count by the total sample size (n = 100), a matrix of probabilities is created (Table 2). Each cell contains the proportion of the total number of images (n = 100), not the count. As an example, the proportion of images in which reader 1 and reader 2 agree that an image is benign is 0.20 (20/100) or 20% (0.20 × 100). The overall proportion of readings in which reader 1 and reader 2 agree is calculated by summing the diagonal probabilities in Table 2:

po = 0.20 + 0.60 = 0.80

This "proportion agreement" is converted to a percentage and reported as "percent agreement." The interpretation of percent agreement is straightforward: reader 1 and reader 2 agreed with each other on 80% of the classifications. The approach and calculations are the same for larger tables in which readers must consider more than two options in their decision making. Using as an example the American College of Radiology's BI-RADS lexicon [9] for final assessment, agreement could be based on each reader assigning each case to one of four categories: benign, probably benign, suspicious, or highly suggestive of malignancy. The resulting data would be reported in a four-by-four table in which the sum of the probabilities in the four diagonal cells represents the proportion agreement (po).
The advantage of the kappa statistic over percent agreement is its adjustment for the proportion of cases in which the raters would agree by chance alone. Because we are unlikely to know the true value of chance, the marginal probabilities from the observed data are used to estimate a surrogate for chance. The proportions in the total column and in the total row represent the marginal probabilities. Chance agreement is derived from the observed data, so it will likely change if different readers evaluate the same images. Using Table 2, the proportion of chance agreement (pe) is computed as follows:

pe = (0.35 × 0.25) + (0.65 × 0.75)
pe = 0.09 + 0.49
pe = 0.58

Once the proportion of observed agreement (po) and the proportion of chance agreement (pe) are established, kappa is calculated using the formula:

k̂ = (po − pe) / (1 − pe)
k̂ = (0.80 − 0.58) / (1 − 0.58)
k̂ = 0.52

Using a common interpretation guideline offered by Landis and Koch [7], a kappa of 0.52 reflects a moderate level of agreement (Table 3).

Statistical Significance
To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), an estimate of the standard error (SE) for a one-sample test is calculated from the formula [10]:

SEk0 = sqrt[pe / (n(1 − pe))]
SEk0 = sqrt[0.58 / (100 × 0.42)]
SEk0 = 0.12

A kappa test statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows:

z = k̂ / SEk0
z = 0.52 / 0.12
z = 4.33

Using a one-tailed test, the test statistic is statistically significant because it exceeds the critical value of 1.645 (alpha, 0.05) [6]. This result supports the alternative hypothesis that the kappa coefficient is different from zero (i.e., better than chance).
Although some effort has been directed toward estimating sample size requirements for comparisons among two or more kappa coefficients [11, 12], methods for calculating power for one kappa coefficient have not received much attention [11]. As a general rule of thumb, 30 cases with two readers is a reasonable minimum sample size as long as a moderate-level or better kappa coefficient (κ > 0.40) is expected and you want to show that kappa is different from a value of zero.
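The hand calculations above are easy to reproduce in a short script. The sketch below carries full precision rather than the rounded intermediate values, so kappa comes out as 0.529 (vs. 0.52) and z as about 4.55 (vs. 4.33), but the conclusion is unchanged:

```python
import math

# Table 1 counts: rows = reader 2, columns = reader 1 (benign, malignant)
table = [[20, 5],
         [15, 60]]
n = sum(sum(row) for row in table)                      # 100 images

po = sum(table[i][i] for i in range(2)) / n             # observed agreement = 0.80
reader2 = [sum(row) / n for row in table]               # row marginals (0.25, 0.75)
reader1 = [sum(col) / n for col in zip(*table)]         # column marginals (0.35, 0.65)
pe = sum(r1 * r2 for r1, r2 in zip(reader1, reader2))   # chance agreement = 0.575
kappa = (po - pe) / (1 - pe)                            # about 0.529

se0 = math.sqrt(pe / (n * (1 - pe)))                    # SE under the null hypothesis
z = kappa / se0                                         # about 4.55, well above 1.645
```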


Confidence Intervals
For estimating confidence intervals, a different formula is used for the SE [10]. There are other more accurate and complicated formulas for the SE [6, 13, 14]:

SEk ≅ sqrt[po(1 − po) / (n(1 − pe)^2)]
SEk ≅ sqrt[0.80(0.20) / (100 × 0.42^2)]
SEk ≅ 0.095

Given an estimate of kappa of 0.52, the 95% confidence interval would be 0.33–0.71:

CI95% = k̂ ± 1.96 (SEk)
CI95% = 0.52 ± 1.96 (0.095)
CI95% = 0.52 ± 0.19

TABLE 2: Two Readers Evaluating 100 Images (Proportions)

                        Reader 1
Reader 2      Benign    Malignant    Total
Benign         0.20        0.05       0.25
Malignant      0.15        0.60       0.75
Total          0.35        0.65       1.00

TABLE 3: Interpretation Guidance for Strength of Agreement

Kappa Coefficient    Strength of Agreement
< 0.00               Poor
0.00–0.20            Slight
0.21–0.40            Fair
0.41–0.60            Moderate
0.61–0.80            Substantial
0.81–1.00            Almost perfect
Note.—Data are taken from Landis and Koch [17].

Weighted Kappa
Kappa treats disagreements the same regardless of whether a close decision on a rank-ordered classification system has clinical relevance. As an example, on a rank-ordered rating scale of benign, probably benign, suspicious, and highly suspicious of malignancy, a case in which one rater concludes a lesion is suspicious and the other rater concludes that the lesion is highly suspicious may result in the same clinical decision: immediate follow-up with biopsy. In this event, a disagreement between these two categories is much less important than a disagreement in which one rater rates a lesion as highly suspicious and the other rater rates the same lesion as probably benign.
Weighted kappa was developed to provide partial credit. The observed and expected proportions of each cell are multiplied by a weight before using them to calculate kappa. Weights can be established a priori (before data collection) using clinical experience [10], or they can be calculated after data collection using a simple algorithm for assigning weights that uses the same weighting strategy regardless of the data characteristics or rating criteria. Weighted kappa and unweighted kappa will be the same when there are only two decision categories. An example based on the BI-RADS classification system is provided in Appendix 1. For another example of calculating kappa weights, see Kundel and Polansky [15].

Special Considerations When Using Bi-Rater Kappa
For small sample sizes, kappa may be underestimated. In this case, a resampling technique (jackknifing) can be used to calculate an unbiased estimate of kappa [8]. Kappa may also be lower if the number of decision categories is excessive. Possible responses to compensate for this effect are to use weighted kappa if the categories are rank-ordered, to combine similar categories, or both. In any good study design, the choice of a weighting or classification scheme should be addressed and resolved before data collection. Overall, the precision (SE) of kappa is expected to improve as the number of patients and raters increases [16]. Although the preceding discussion was limited to two raters, the next section presents a technique for improving precision by comparing more than two raters.

Multirater Generalized Kappa
When there are more than two raters, generalized kappa is the recommended approach for evaluating interrater agreement [6, 13, 17]. This statistic measures the degree to which interpretation variability arises from differences among cases relative to differences among readers interpreting the same case. It is analogous to analysis of variance and the intraclass correlation used in the assessment of agreement when measured on a continuous scale.
The discussion that follows focuses on estimating agreement among more than two raters, when the number of raters is kept constant and the number of rating categories is greater than two. Slight modifications in the calculations are required when generalized kappa is estimated for only two rating categories or when the number of raters does not remain constant from one classification to another (see Fleiss [6] for alternative calculations). The approach presented here satisfies the likely characteristics of a prospective imaging study design [3].
Table 4 presents hypothetic data for five raters evaluating imaging from 10 patients using three decision categories: benign, suspicious, and malignant. The formulas that follow are from Woolson [13]. Assume the following notation: N = total number of patients, K = total number of raters, R = number of decision categories, and nij = number of raters who classified patient i (rows in Table 4) in category j (columns in Table 4).

TABLE 4: Ratings by Five Radiologists for 10 Patients

Patient (i)    Benign    Suspicious    Malignant    Σj nij(nij − 1) / [K(K − 1)]
1                 1           4             0                 0.60
2                 2           0             3                 0.40
3                 0           0             5                 1.00
4                 4           0             1                 0.60
5                 3           0             2                 0.40
6                 1           4             0                 0.60
7                 5           0             0                 1.00
8                 0           4             1                 0.60
9                 1           0             4                 0.60
10                3           0             2                 0.40
Total            20          12            18
p̂j             0.40        0.24          0.36              p̄ = 0.62

The proportion (p̂j) of all classifications that fall within each decision category is presented at the bottom of each column. In this example, 0.40 (40%) of the classifications are in the benign category, 0.24 (24%) are suspicious, and 0.36 (36%) are classified as malignant.

AJR:184, May 2005 1393


Crewson

TABLE 5. Implications of Case Distribution

Benign and Malignant Cases Evenly Distributed
                        Reader 1
Reader 2      Benign   Malignant   Total
Benign        0.45     0.05        0.50
Malignant     0.05     0.45        0.50
Total         0.50     0.50
po = 0.90, pe = 0.50, kappa = 0.80

Malignant Cases Dominate Distribution (90%)
                        Reader 1
Reader 2      Benign   Malignant   Total
Benign        0.05     0.05        0.10
Malignant     0.05     0.85        0.90
Total         0.10     0.90
po = 0.90, pe = 0.82, kappa = 0.44

Note.—po = proportion of cases in which agreement exists between two raters (proportion observed); pe = proportion of cases in which raters would agree by chance (proportion expected).

Limitations of Kappa
Considerable debate surrounds the use of bi-kappa and generalized kappa as a measure of agreement [18]. As a result, several alternative approaches to measuring agreement have been proposed but have yet to gain wide acceptance in the peer-reviewed literature. A convenient listing of several alternative approaches and references is available on the Internet [19]. Given the dominance of kappa as a measure of agreement in imaging studies, it is important for both investigators and consumers of the literature to understand the limitations of kappa. Following is a brief discussion of the negative effects resulting from variations in case distribution, improper use of weights, and restrictions on the overall generalizability (external validity) of studies using kappa. This is not a complete listing of all the limitations, but rather basic considerations in interpreting any agreement study that uses kappa.
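The two Table 5 scenarios can be reproduced with a short calculation. The sketch below is illustrative (the helper name cohen_kappa and the variable names are ours, not from the article), applied to the joint proportion tables above:

```python
# Bi-rater kappa for the 2 x 2 joint proportion tables in Table 5.
# Rows = reader 2, columns = reader 1; entries are proportions of cases.

def cohen_kappa(table):
    """Kappa = (po - pe) / (1 - pe) from a square matrix of joint proportions."""
    n_cat = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    p_o = sum(table[i][i] for i in range(n_cat))                    # observed agreement
    p_e = sum(row_totals[i] * col_totals[i] for i in range(n_cat))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

even = [[0.45, 0.05],
        [0.05, 0.45]]    # benign and malignant cases evenly distributed
skewed = [[0.05, 0.05],
          [0.05, 0.85]]  # malignant cases dominate the distribution (90%)

print(round(cohen_kappa(even), 2))    # prints 0.8
print(round(cohen_kappa(skewed), 2))  # prints 0.44
```

The same function applied to both tables shows the paradox directly: identical observed agreement (0.90) yields a kappa of 0.80 in the balanced distribution but 0.44 in the skewed one.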
proper use of weights, and restrictions on the
For each patient, the proportion of all possible pairings on which radiologists agree is calculated using the formula:

  sum over j = 1 to R of nij(nij − 1) / [K(K − 1)]

For patient 1, this would be calculated as:

  [1(1 − 1) + 4(4 − 1) + 0(0 − 1)] / [5(5 − 1)] = 12/20 = 0.60

The proportion of pairs agreeing for each patient is provided in the right column of Table 4. The overall proportion of agreement (p̄) is the mean agreement of all patients, or 0.62. In other words, we estimate that, on average, any two of the five radiologists will agree on a classification about 62% of the time.

As in bi-rater kappa, a correction for chance agreement is necessary to calculate the kappa coefficient. To estimate chance agreement for generalized kappa, the proportion (p̂j) of classifications in each decision category is squared and summed. For Table 4, the expected chance agreement is:

  p̂e = sum over j = 1 to R of p̂j² = 0.40² + 0.24² + 0.36² = 0.35

Using the proportion of observed agreement and chance agreement, the generalized kappa statistic is:

  K̂G = (p̄ − p̂e) / (1 − p̂e) = (0.62 − 0.35) / (1 − 0.35) = 0.42

Statistical Significance
To test the null hypothesis that the kappa coefficient is not different from zero (i.e., no better than chance), the generalized kappa statistic is compared with the standard normal distribution. The equation for obtaining the test statistic is as follows (see Appendix 2 for SE calculations):

  z = K̂G / SE(K̂G) = 0.42 / 0.075 = 5.6

For a one-tailed test (alpha = 0.05), the kappa coefficient is statistically significantly different from zero. Because of rounding, the SE and z-test statistic will be slightly different when calculated by computer algorithm, and there are other calculation methods for SE not presented here [6]. A confidence interval is created using the same procedure as that presented for bi-rater kappa, using the generalized kappa coefficient and its SE.

Effects of Case Distribution
A fundamental aspect of agreement studies is the distribution of cases. Because it is unlikely that a study reflects the population prevalence, marginals (row and column totals) based on reader agreement patterns are routinely used as surrogates for prevalence [18]. This surrogate measure of chance agreement is based on the distribution of the cases classified by readers (both bi-rater kappa and generalized kappa). It is possible to find a consistently high level of percent agreement while reporting widely differing kappa values from one study or one comparison to another because of the case distributions. Table 5 provides an example in which two readers with the same percent agreement are presented with differing distributions of cases. The examples provided in Table 5 assume a high level of accuracy by both readers, so that the marginal probabilities match the study case distribution. In both examples, the readers agree in 90% of the classifications; however, kappa is significantly reduced if one classification category dominates. As shown, an increase in the dominance of malignant cases from 50% to 90% resulted in kappa dropping from 0.80 to 0.44.

Limitation 1.—Because of variations in case mix, reported kappa values may vary dramatically from one study to another even when the overall percent agreement is similar.

Limitation 2.—Because varying rater pairs will likely change the category distributions, bi-kappa values on the same set of elements may vary dramatically from one reader pair to another, even when percent agreement is relatively stable.
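The Table 4 walk-through can be sketched end to end in a few lines. This is an illustrative sketch (variable names are ours), reproducing the observed agreement p̄, the chance agreement, and the generalized kappa:

```python
# Generalized (multirater) kappa for the Table 4 data: 5 radiologists (K),
# 10 patients (N), 3 categories (benign, suspicious, malignant).

ratings = [            # n_ij: number of raters choosing each category, per patient
    [1, 4, 0], [2, 0, 3], [0, 0, 5], [4, 0, 1], [3, 0, 2],
    [1, 4, 0], [5, 0, 0], [0, 4, 1], [1, 0, 4], [3, 0, 2],
]
N = len(ratings)            # patients
K = sum(ratings[0])         # raters per patient

# Per-patient proportion of agreeing rater pairs: sum_j n_ij(n_ij - 1) / [K(K - 1)]
p_i = [sum(n * (n - 1) for n in row) / (K * (K - 1)) for row in ratings]
p_bar = sum(p_i) / N        # overall observed agreement

# Chance agreement: squared category proportions, summed
p_j = [sum(row[j] for row in ratings) / (N * K) for j in range(3)]
p_e = sum(p ** 2 for p in p_j)

kappa_g = (p_bar - p_e) / (1 - p_e)
print(round(p_bar, 2), round(p_e, 2), round(kappa_g, 2))  # prints 0.62 0.35 0.42
```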



Reader Agreement Studies

Weighted Kappa
Adding to the limited comparability of the kappa statistic from one study to another is the use of weighted kappa. There are multiple methods to weight kappa, so the comparability between studies is often limited. This concern, however, is minor when compared with the problem of weight justification [13]. The assignment of weights is an arbitrary exercise, even when an established algorithm is used [6, 7]. The subjectivity of assigning weights should be balanced with a clear explanation of why and how the weights are used [10]. Unfortunately, it is not rare for agreement studies to report weighted kappa with little if any discussion regarding the justification for the weighting scheme used in the study.

Limitation 3.—Weighting schemes are often subjective.

TABLE 6. Calculations for Weighted Kappa: Cell Counts

                                 Reader A (j)
Reader B (i)        Benign   Probably Benign   Suspicious   Malignant   Row Total   Row Proportion Observed
Benign              4        1                 0            0           5           0.17
Probably benign     1        3                 1            1           6           0.20
Suspicious          1        4                 5            0           10          0.33
Malignant           0        1                 2            6           9           0.30
Column total        6        9                 8            7           30
Column proportion
observed            0.20     0.30              0.27         0.23

TABLE 7. Calculations for Weighted Kappa: Proportions Observed and Proportions Expected

Proportions observed (poij = nij / n):
Reader B            Benign   Probably Benign   Suspicious   Malignant
Benign              0.13     0.03              0.00         0.00
Probably benign     0.03     0.10              0.03         0.03
Suspicious          0.03     0.13              0.17         0.00
Malignant           0.00     0.03              0.07         0.20

Proportions expected (peij = row proportion × column proportion):
Reader B            Benign   Probably Benign   Suspicious   Malignant
Benign              0.03     0.05              0.04         0.04
Probably benign     0.04     0.06              0.05         0.05
Suspicious          0.07     0.10              0.09         0.08
Malignant           0.06     0.09              0.08         0.07

TABLE 8. Calculations for Weighted Kappa: Quadratic Cell Weights

Weight factors 1 through 4 are assigned to the categories; wij = 1 − (iw − jw)² / (k − 1)².

Reader B weights (iw)   Reader A weights (jw)
                        1       2       3       4
1                       1.00    0.89    0.56    0.00
2                       0.89    1.00    0.89    0.56
3                       0.56    0.89    1.00    0.89
4                       0.00    0.56    0.89    1.00

Generalizability
Several factors affect the generalizability (external validity) of an agreement study. These include rater background, clarity of the decision categories, and clinical relevance.

Rater Background.—When using kappa, we assume that the raters have similar levels of experience, training, and specialization (e.g., general radiology residents are not paired with seasoned subspecialists). If this is not the case, kappa may not be an appropriate technique [6].

Limitation 4.—Agreement is likely to be underestimated when raters have dissimilar experience and training.

Characteristic Clarity.—Clear classification definitions and independence are essential in an agreement study. As a result, if a general understanding regarding the basic concepts being rated has not been reached, conducting an agreement study is premature and inappropriate. Similarly, if the difference between classification categories is not clear, agreement will suffer and may not reflect the actual domain of interest. As an example, is there an actual difference between "probably benign" and "suspicious," or do radiologists treat them clinically the same? In this case, reasons for possible differences among radiologists may include variation in attitudes toward the risk associated with false-negatives and unfamiliarity with subtle differences among the rating categories [2, 20]. It is unwise to give much credence to an agreement study that was based on a questionable classification scheme. An exception would be pilot studies such as lexicon development efforts, but they should be treated as experimental (efficacy) studies.

Limitation 5.—Agreement is likely to be underestimated and not generalizable when rating categories have questionable face validity.

Clinical Relevance.—A general question for any agreement study is whether the observed agreement is representative of clinical practice. Factors to consider include the type of imaging technology used, amount of background information provided, type of imaging (diagnostic or screening), prior imaging results, time allowed for interpretation, prior risk of disease, and comorbidity.

Limitation 6.—Agreement studies often do not reflect actual clinical practice (less information) or imaging prevalence (case mix), so the generalizability of the findings may be overstated.

Conclusion
Reader agreement studies have an important role in advancing radiology practice, technique, training, and quality control. Although the limitations of kappa are known, it remains a common statistical technique for estimating agreement for nominal and ordinal scale variables. The purpose of this article has been to build a better understanding of both the bi-rater and multirater kappa statistic. As has been shown, several weaknesses are intrinsic to kappa that are difficult to resolve.


TABLE 9. Calculations for Weighted Kappa: Weighted Proportions Observed and Weighted Proportions Expected

Weighted proportions observed, po(w)ij = poij × wij; Po(w) = sum of all cells = 0.93:
Reader B            Benign   Probably Benign   Suspicious   Malignant
Benign              0.13     0.03              0.00         0.00
Probably benign     0.03     0.10              0.03         0.02
Suspicious          0.02     0.12              0.17         0.00
Malignant           0.00     0.02              0.06         0.20

Weighted proportions expected, pe(w)ij = peij × wij; Pe(w) = sum of all cells = 0.75:
Reader B            Benign   Probably Benign   Suspicious   Malignant
Benign              0.03     0.04              0.02         0.00
Probably benign     0.04     0.06              0.05         0.03
Suspicious          0.04     0.09              0.09         0.07
Malignant           0.00     0.05              0.07         0.07

Although there are alternative approaches to measuring agreement, kappa will likely remain the most commonly used measure. Issues hindering the use of alternatives include mathematic complexity, reduced understanding and interpretability, and lack of consistency with prior research. At present, agreement studies will continue to use bi-rater kappa, multirater kappa, and weighted kappa as measures of agreement. However, it is essential that researchers respond to the limitations of kappa not only by improving study design but also by reporting and interpreting the findings appropriately. Recommended steps to improve the quality and usefulness of published reader agreement studies include reporting the characteristics of the raters and their similarities and differences; reporting the source and characteristics of the elements (images) presented to raters; including percent agreement with any kappa coefficient, and including both percent agreement and unweighted kappa if weighted kappa is used; and tempering overgeneralization by reflecting on how the raters, the elements they rated, and the study design differ from general clinical practice. Although the limitations of the kappa statistic may seem insurmountable, the key to proper use and interpretation of kappa, and any other statistic, is understanding its limitations and reporting sufficient data so that others may judge the results.

Acknowledgments
I thank Caryn Cohen, the series editors, and anonymous reviewers for comments on earlier drafts of the manuscript.

References
1. Kinkel K, Helbich TH, Esserman LJ, et al. Dynamic high-spatial-resolution MR imaging of suspicious breast lesions: diagnostic criteria and interobserver variability. AJR 2000;175:35–43
2. Elmore JG, Wells CK, Lee CH, et al. Variability in radiologists' interpretations of mammograms. N Engl J Med 1994;331:1493–1499
3. Berg WA, D'Orsi CJ, Jackson VP, et al. Does training in the Breast Imaging Reporting and Data System (BI-RADS) improve biopsy recommendations or feature analysis agreement with experienced breast imagers at mammography? Radiology 2002;224:871–880
4. Ikeda DM, Hylton NM, Kinkel K, et al. Development, standardization, and testing of a lexicon for reporting contrast-enhanced breast magnetic resonance imaging studies. J Magn Reson Imaging 2001;13:889–895
5. Kashner TM. Agreement between administrative files and written medical records: a case of the Department of Veterans Affairs. Med Care 1998;36:1324–1336
6. Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York, NY: Wiley, 1981
7. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174
8. Cyr L, Francis K. Measures of clinical agreement for nominal and categorical data: the kappa coefficient. Comput Biol Med 1992;22:239–246
9. American College of Radiology. Illustrated Breast Imaging Reporting and Data System (BI-RADS), 3rd ed. Reston, VA: American College of Radiology, 1998
10. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 1968;70:213–220
11. Lin H-M, Williamson JM, Lipsitz SR. Calculating power for the comparison of dependent K-coefficients. Appl Stat 2003;52:391–404
12. Donner A. Sample size requirements for the comparison of two or more coefficients of inter-observer agreement. Stat Med 1998;17:1157–1168
13. Woolson RF. Statistical methods for the analysis of biomedical data. New York, NY: Wiley, 1987
14. Lee JJ, Tu ZN. A better confidence interval for kappa on measuring agreement between two raters with binary outcomes. J Comput Graph Stat 1994;3:301–321
15. Kundel HL, Polansky M. Measurement of observer agreement. Radiology 2003;228:303–308
16. Kraemer HC. Evaluating medical tests: objective and quantitative guidelines. Newbury Park, CA: Sage Publications, 1992
17. Landis JR, Koch GG. A one-way components of variance model for categorical data. Biometrics 1977;33:671–679
18. Feinstein AR, Cicchetti DV. High agreement but low kappa. I. The problems of two paradoxes. J Clin Epidemiol 1990;43:543–549
19. Uebersax J. ourworld.compuserve.com/homepages/jsuebersax/agree.htm. Accessed November 16, 2003
20. Beam CA, Sullivan DC, Layde PM. Effect of human variability on independent double reading in screening mammography. Acad Radiol 1996;3:891–897

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005


APPENDIX 1. Weighted Kappa

The data and formulas used to calculate weighted kappa are shown in Tables 6–9. This example of weighted kappa is based on
a four-category BI-RADS scale. Using a weighting scheme from Fleiss [6], a weight factor of 1 is used for benign, 2 for probably
benign, 3 for suspicious, and 4 for malignant. The difference between the weight factors is used to estimate a weight for each cell.
For example, if both readers classify the same set of lesions as malignant (an exact match), each decision has a weight factor of 4.
Using the Fleiss formula results in a cell weight of 1. A weight of 1 allows the entire proportion of lesion classifications in this cell
(observed and expected proportions) to contribute to the kappa estimate (see diagonal data in Table 8).
In contrast, all other lesion classification alternatives (mismatches) are adjusted according to the difference between their weight
factors. As an example, if one reader classifies a set of lesions as malignant (weight factor of 4) and the other reader classifies the
same set of lesions as suspicious (weight factor of 3), the proportion of lesion classifications in this cell that contribute to kappa are
reduced (i.e., given less importance for estimating kappa than an exact match)—in this case, 89% (.89) as much weight as an exact
match. As the difference between weight factors increases, the contribution to kappa from that cell decreases to the point at which
none of the observations in a cell contribute to kappa (i.e., instances in which the same set of lesions is classified malignant by one
rater and benign by the other).
Observed proportions of agreement and expected proportions of agreement are calculated for each cell and then weighted (multiplied)
by the quadratic cell weight and summed. The resulting weighted kappa is 0.72, which is greater than the unweighted kappa (0.47).

  κ(w) = [Po(w) − Pe(w)] / [1 − Pe(w)] = (0.93 − 0.75) / (1 − 0.75) = 0.72
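The appendix calculation can be cross-checked in code. The sketch below (variable names are ours) uses exact arithmetic throughout; because Tables 7–9 round intermediate cell proportions to two decimals, the published values (0.93, 0.75, and a weighted kappa of 0.72) differ slightly from the exact results (about 0.92, 0.75, and 0.69). The unweighted kappa of 0.47 is reproduced exactly.

```python
# Weighted kappa with Fleiss quadratic weights for the Table 6 cell counts
# (the four-category BI-RADS example of Appendix 1).

counts = [            # rows = Reader B, columns = Reader A
    [4, 1, 0, 0],     # benign
    [1, 3, 1, 1],     # probably benign
    [1, 4, 5, 0],     # suspicious
    [0, 1, 2, 6],     # malignant
]
n = sum(map(sum, counts))   # 30 lesions
k = len(counts)             # 4 categories

# Quadratic cell weights from Table 8: w_ij = 1 - (i - j)^2 / (k - 1)^2
w = [[1 - (i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]

row = [sum(r) for r in counts]        # row totals (Reader B)
col = [sum(c) for c in zip(*counts)]  # column totals (Reader A)

po_w = sum(w[i][j] * counts[i][j] / n for i in range(k) for j in range(k))
pe_w = sum(w[i][j] * row[i] * col[j] / n ** 2 for i in range(k) for j in range(k))
kappa_w = (po_w - pe_w) / (1 - pe_w)

# Unweighted kappa for comparison (diagonal cells only)
po = sum(counts[i][i] for i in range(k)) / n
pe = sum(row[i] * col[i] for i in range(k)) / n ** 2
kappa_unw = (po - pe) / (1 - pe)

print(round(kappa_w, 2), round(kappa_unw, 2))  # prints 0.69 0.47
```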

APPENDIX 2. SE for Generalized Kappa

To test the null hypothesis that the kappa coefficient obtained in Table 4 is not different from zero (i.e., no better than chance), an estimate
of the SE is calculated using the formula shown.

  SE(K̂G) = sqrt( [2 / (NK(K − 1))] × [ Σj p̂j² − (2K − 3)(Σj p̂j²)² + 2(K − 2) Σj p̂j³ ] / (1 − Σj p̂j²)² )

For the data in Table 4 (N = 10 patients, K = 5 raters):

  SE(K̂G) = sqrt( [2 / (50 × 4)] × [0.35 − 7(0.35)² + 2(3)(0.40³ + 0.24³ + 0.36³)] / (1 − 0.35)² ) = 0.075
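The SE and z computation can be sketched as follows (variable names are ours). The sketch uses the exact category proportions; the appendix rounds the sum of squared proportions to 0.35 and so reports SE = 0.075 and z = 5.6, and, as the article itself notes, computer algorithms give slightly different values because of rounding:

```python
# SE and z test for the generalized kappa of the Table 4 data (Appendix 2).
import math

N, K = 10, 5                      # patients, raters
p_j = [0.40, 0.24, 0.36]          # category proportions from Table 4

s2 = sum(p ** 2 for p in p_j)     # sum of squared proportions, about 0.347
s3 = sum(p ** 3 for p in p_j)     # sum of cubed proportions

variance = (2 / (N * K * (K - 1))) * \
           (s2 - (2 * K - 3) * s2 ** 2 + 2 * (K - 2) * s3) / (1 - s2) ** 2
se = math.sqrt(variance)

kappa_g = (0.62 - s2) / (1 - s2)  # overall observed agreement 0.62 from Table 4
z = kappa_g / se
print(round(se, 3), round(z, 1))  # prints 0.077 5.5
```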



Research • Fundamentals of Clinical Research for Radiologists

Correlation and Regression


Nandini Dendukuri1,2
Caroline Reinhold3,4

Dendukuri N, Reinhold C

Received November 17, 2004; accepted after revision November 23, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 18th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1Technology Assessment Unit, Royal Victoria Hospital, Montreal, QC H3A 1A1, Canada.
2Department of Epidemiology and Biostatistics, McGill University, 1020 Pine Ave. W, Montreal, QC H3A 1A2, Canada. Address correspondence to N. Dendukuri (nandini.dendukuri@mcgill.ca).
3Department of Diagnostic Radiology, Montreal General Hospital, McGill University Health Centre, 1650 Cedar Ave., Montreal, QC H3G 1A4, Canada.
4Department of Oncology, Synarc, 575 Market St., San Francisco, CA 94105.

AJR 2005;185:3–18
0361–803X/05/1851–3
© American Roentgen Ray Society

This module covers common statistical methods used in radiologic applications for measuring relations between variables. Under the topic of correlation we describe Pearson's and Spearman's correlation coefficients and partial correlation, all of which are suitable for evaluating the association between two continuous variables. In the section on regression we cover linear and logistic regression models. Regression models are used to study the association between an outcome variable and one or more predictor variables that may be continuous or dichotomous. For linear regression models the outcome variable is continuous, whereas for logistic regression models it is dichotomous. We also briefly describe methods for model selection and sample size determination.

In a hypothetical study evaluating the use of MRI for the assessment of myocardial viability, researchers were interested in characterizing the nature of the relation between myocardial infarct volume and ejection fraction. Their objective was to answer questions such as: Is there any relation between infarct volume and ejection fraction? What is the strength of this relation? Does ejection fraction increase or decrease with increasing myocardial infarct volume? By how much would we expect the ejection fraction to change when the myocardial infarct volume increases by 1 mL? Can we predict a patient's ejection fraction when given his or her myocardial infarct volume? How accurate is this prediction?

Questions such as these arise in situations in which more than one variable has been measured on each patient (or observational unit) in a sample, and the relationship between the different variables is of interest. This module covers some of the most commonly used statistical tools to answer such questions: correlation coefficients and regression models. We will cover methods for studying the relation between two variables that may be both continuous, both dichotomous (i.e., having only two values), or a mix (one dichotomous and the other continuous). We will also cover situations in which we wish to study the relation between more than two variables.

To illustrate the methods in this tutorial we have used hypothetical examples that are all inspired by studies appearing in radiology research journals. Some of the concepts covered in this tutorial assume knowledge of earlier articles in this series, to which the reader is encouraged to refer [1–4].

Correlation
In Figure 1 we have two scatterplots between ejection fraction and myocardial infarct volume. At first glance, it appears that the relation between the two variables is stronger in Figure 1A than in Figure 1B. In fact, the two figures are based on the same data from a hypothetical study of 30 patients. Altering the scale of the ejection fraction axis makes the relation observed in Figure 1B appear less strong than in Figure 1A. The purpose of this figure is to illustrate that a scatterplot alone is not sufficient to make conclusions about the strength of the relationship between two variables. The plot needs to be accompanied by an objective measure.

Pearson's Correlation Coefficient
Pearson's correlation coefficient is one such objective measure of the linear relation between two variables. Pearson's correlation coefficient (which we denote by rP) between two variables X (e.g., infarct volume) and Y (e.g., ejection fraction) is given by:

  rP = Correlation(X, Y) = Covariance(X, Y) / sqrt[Variance(X) × Variance(Y)]
     = [ sum over i = 1 to N of (xi − x̄)(yi − ȳ) ] / sqrt[ (sum over i of (xi − x̄)²) × (sum over i of (yi − ȳ)²) ]
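The formula can be translated directly into code. The data below are hypothetical (the study values behind Figure 1 are not published here), so only the mechanics of the calculation are illustrated:

```python
# Pearson's correlation coefficient computed from the definition above.
import math

def pearson_r(x, y):
    """r_P = sum((xi - xbar)(yi - ybar)) / sqrt(sum((xi - xbar)^2) * sum((yi - ybar)^2))"""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))  # covariance term
    sx = sum((a - xbar) ** 2 for a in x)
    sy = sum((b - ybar) ** 2 for b in y)
    return num / math.sqrt(sx * sy)

# Hypothetical data: larger infarct volumes (mL) paired with lower
# ejection fractions (%), so r_P comes out negative.
volume = [5, 10, 15, 20, 30, 40]
ef = [65, 60, 55, 50, 40, 35]
print(round(pearson_r(volume, ef), 2))  # prints -0.99
```

Because X and Y enter the formula symmetrically, swapping the two arguments leaves the result unchanged.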

AJR:185, July 2005 3


Dendukuri and Reinhold

where xi and yi are the values of variables X and Y observed on each individual in the sample, x̄ and ȳ are the sample means of X and Y, and N is the number of individuals in the sample. The denominator of this expression is the square root of a positive quantity and is always taken to be positive. The numerator, on the other hand, can be positive or negative depending on the nature of the relation between X and Y. If X tends to increase when Y increases, then it is likely that when an individual xi exceeds the sample mean x̄, the corresponding yi also exceeds its mean ȳ. This would cause the numerator, and thus rP itself, to be positive. If, on the other hand, X decreases as Y increases, it is likely that xi is less than x̄ when yi is greater than ȳ. This would result in a negative value of the numerator and of rP. In the example in Figure 1, we find that ejection fraction tends to decrease with increasing myocardial infarct volume. Thus, patients whose ejection fraction exceeds the mean ejection fraction of the sample are more likely to have myocardial infarct volumes that are smaller than the mean myocardial infarct volume of the sample, resulting in a negative value of rP.

Fig. 1—Scatterplots show relation between myocardial infarct volume and ejection fraction and illustrate effect of changing scale of ejection fraction axis.
A and B, Relation between the two variables may appear stronger in A than in B, but both figures are based on same data. Altering scale of ejection fraction axis makes relation in B appear less strong than in A.

Pearson's correlation coefficient can range from a minimum value of −1 to a maximum value of 1. Figure 2 illustrates the value of rP in various prototypical situations. A value of rP = 1 is obtained when an increase in X is always associated with an increase in Y and the points in the scatterplot between X and Y can be joined to form a perfect straight line (Fig. 2A). A value of rP = −1 is indicative of a perfect negative linear relation between X and Y (Fig. 2B). As the strength of the linear relation between X and Y diminishes, the value of rP approaches 0 (Figs. 2C and 2D). A correlation coefficient of 0 indicates that there is no linear relation between the two variables. For the hypothetical data in Figure 1 we find that rP is −0.91, suggesting a fairly strong negative relation between myocardial infarct volume and ejection fraction. The interested reader is referred to the table at the end of the appendix for a more detailed explanation of how to calculate the correlation coefficient.

Figures 2E and 2F illustrate two situations in which there is a perfect, though nonlinear, relation between X and Y. In Figure 2E, an increase in X is always accompanied by an increase in Y. Here, rP is quite high (0.92), although not equal to 1. In Figure 2F we have a U-shaped relation between the variables, with both low and high values of X being associated with high values of Y. Here rP is close to 0, suggesting only a weak linear relation between X and Y. These plots serve to illustrate that a value of rP close to 0 does not rule out the possibility of a strong nonlinear relationship between the variables.

Interpreting Pearson's correlation coefficient—A few things need to be kept in mind when interpreting a correlation coefficient:

(1) Correlation is independent of the units in which the two variables are measured. If our interest is in measuring the strength of the relation between ejection fraction and myocardial infarct volume, it does not matter whether the latter was measured in milliliters (mL) or liters (L).

(2) High correlation may indicate a strong association but not causation. Note that in the expression for rP, X and Y may be interchanged with no difference to the result. This means that the variables X and Y are not distinguished as "predictor" and "outcome," and it does not matter whether X causes Y or vice versa. It would be incorrect to assume that a high correlation between myocardial infarct volume and ejection fraction means that one of them is the cause of the other. Rather, we can only say that there is a strong association between them.

(3) The observed correlation (or lack of it) may be due to a confounding variable. In some situations the observed association (or lack of it) may be spurious and, in fact, reflect the effect of a third variable, referred to by epidemiologists as a "confounding variable" [5]. Such a variable is associated with both X and Y. Figure 3A is a scatterplot of the relation between endometrial thickness (measured at transvaginal sonography) and peak systolic velocity (measured at Doppler imaging) in postmenopausal women presenting with abnormal vaginal bleeding. The value of rP for the entire sample is only moderate (rP = 0.36). The sample was then divided into women with endometrial atrophy, those with endometrial hyperplasia, and those with endometrial carcinoma, and rP was calculated separately within each group. We find that the true strong relation between endometrial thickness and peak systolic velocity is obscured because both variables have an association with the histologic subgroups.

(4) Correlation between aggregate values is stronger than at the individual level. In Figure 3B, the blank circles form a scatterplot of endometrial thickness versus peak systolic velocity in postmenopausal women presenting at three health centers (university-based, community hospital, and walk-in clinic). The dark circles plot the relation between the average endometrial thickness and average peak systolic velocity for each of the three health centers. The correlation between the average values is almost 1, despite a weaker correlation at the patient level.

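Two of the properties of rP discussed above can be checked numerically. The sketch below (hypothetical data and helper names, not from the article) verifies point (1), unit independence, and the Figure 2F point that a perfect U-shaped relation can still give an rP of 0:

```python
# Checking two properties of Pearson's r on toy data.
import math

def pearson_r(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sx = sum((a - xbar) ** 2 for a in x)
    sy = sum((b - ybar) ** 2 for b in y)
    return num / math.sqrt(sx * sy)

vol_ml = [5, 10, 15, 20, 30, 40]    # hypothetical volumes in milliliters
ef = [65, 60, 55, 50, 40, 35]
vol_l = [v / 1000 for v in vol_ml]  # the same volumes in liters

# Property (1): changing units (mL to L) leaves r_P unchanged.
assert abs(pearson_r(vol_ml, ef) - pearson_r(vol_l, ef)) < 1e-12

# Figure 2F: a perfect but U-shaped (nonlinear) relation gives r_P near 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
print(round(pearson_r(x, y), 2))  # prints 0.0
```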


Correlation and Regression

A B

C D

E F
Fig. 2—Examples of different values of Pearson’s (rP) and Spearman’s (rS) correlation coefficients.
A, Value of rP = 1 is obtained when increase in X is always associated with increase in Y and points in scatterplot form a straight line.
B, Value of rP = –1 is indicative of negative linear relation between X and Y.
C and D, As strength of linear relation between X and Y diminishes, value of rP approaches 0.
E and F, Plots show nonlinear relation between X and Y.

The greater the range of the X and Y variables in the sample, the greater the correlation between them. Thus, a single outlying observation might give us a falsely elevated correlation coefficient.

(6) High correlation does not mean measurement equivalence. When comparing two imperfect measurements of the same underlying quantity, a high correlation is often used as a proof of strong agreement, but that is not correct. For example, we might be interested to determine whether measurements of the length of liver lesions using MRI and sonography are equivalent. A high positive correlation suggests only that increasing values of one measure are associated with increasing values of the second; it does not necessarily mean that they are measuring the same thing. A better approach to evaluating equivalence would be to examine the difference in magnitude of the observations on each patient. A large mean difference would suggest that the two measures are in fact not equivalent [6].

Assumptions used in calculating Pearson's correlation coefficient—Some important things need to be kept in mind before calculating rP. First, it is based on the assumption that both X and Y are measured on an interval scale. When we say myocardial infarct volume has been measured on an interval scale, we mean that a myocardial infarct volume of 4 mL is twice as large as a myocardial infarct volume of 2 mL. This would not have been true if it were measured by a nominal variable having values 1 (small), 2 (medium), and 3 (large), because we cannot say that a patient rated as "medium" has twice the myocardial infarct volume of a patient rated as "small." Second, both X and Y are assumed to follow a normal probability distribution [2]. This assumption allows us to perform hypothesis tests and construct confidence intervals for rP, as we will see.

Inference for Pearson's correlation coefficient—The sample correlation coefficient, rP, is a statistic the value of which changes depending on the sample collected. It is only an estimate of the population correlation coefficient, ρP, that we would have obtained if it were possible to observe the entire population of patients (or study units) from which the sample was collected. When reporting the sample correlation coefficient, we also need to report some measure of our uncertainty in the knowledge of the population correlation coefficient. This uncertainty may be expressed in terms of a p value or a confidence interval [3]. Confidence intervals are preferred to p values because they provide more information regarding the parameter estimated. An earlier article in this series



Dendukuri and Reinhold

Fig. 3—Pearson’s correlation coefficients.
A and B, Graphs show correlation coefficients in the presence of confounding (A) and from aggregate data (B).

explains in detail the distinction between confidence intervals and p values [3]. However, p values are still frequently reported in the medical literature, so we cover methods for their calculation and interpretation here.

p value: A p value measures the strength of the evidence in favor of a null hypothesis of the form H0: ρP = ρ0, where ρ0 is a predetermined value of the correlation coefficient of interest. In our example on myocardial infarct volume and ejection fraction, we can set ρ0 = 0 to measure the evidence in favor of "no association between the two variables." When the p value is very low (typically < 0.05 or 0.01), we reject the null hypothesis. Details on how to calculate the p value are provided for the interested reader in Appendix 1. We find that the p value for our example is very, very small (<< 0.001). In other words, the probability that we would have observed a correlation as strong as rP = −0.91, when in fact the true correlation between myocardial infarct volume and ejection fraction was ρP = 0, is very, very small—much less than 0.0001. Therefore, we reject the null hypothesis of H0: ρP = 0 and conclude that there is an association between myocardial infarct volume and ejection fraction.

Confidence interval: The hypothesis testing approach limits us to a single hypothesis, which is often artificially set up. Rather than simply concluding that the population correlation coefficient is not 0, we might want to say a little more about the strength of the correlation. A confidence interval is more informative in that it gives us the range of possible values of ρP that are compatible with the observed value of the correlation coefficient. Details of the calculation of the confidence interval are given in Appendix 1. The 95% confidence interval for the correlation coefficient between myocardial infarct volume and ejection fraction is (−0.96 to −0.81). If our hypothetical study were repeated several times and a confidence interval calculated each time, then 95% of the confidence intervals would capture the true value of ρP. However, we cannot say if the interval obtained from our sample is one of the 95% that capture the true value of ρP (see [3] for more details on how to interpret a confidence interval). The 95% confidence interval may also be interpreted as the range of values of the null hypothesis (ρ0) that cannot be rejected at the 1 − 0.95 = 0.05 level of significance.

The fact that our 95% confidence interval does not include 0 means that the null hypothesis of ρ0 = 0 would be rejected, which is the same conclusion we reached earlier using the p value. A better approach would be to compare the confidence interval with a predetermined range of values indicative of no relation between the variables. For example, let us say that a correlation coefficient in the range from –0.1 to 0.1 is in practice indicative of no relation between myocardial infarct volume and ejection fraction. Then the fact that our confidence interval clearly lies outside this region leads us to conclude there is a strong, negative relation between myocardial infarct volume and ejection fraction.
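The calculations are left to Appendix 1, but a standard construction for such an interval is Fisher's z transformation of the sample correlation. A minimal sketch (the sample size of 25 is invented for illustration; the article does not state the n behind its interval):

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a population correlation coefficient,
    via Fisher's z transformation of the sample value r (n = sample size)."""
    z = math.atanh(r)            # z = 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)  # approximate SE of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# e.g., a sample correlation of -0.91 observed in 25 patients:
lo, hi = fisher_ci(-0.91, 25)
```

Because the transformation is nonlinear, the resulting interval is not symmetric around r, which is why published correlation intervals such as (−0.96 to −0.81) sit closer to 0 on one side.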


Partial Correlation

It is possible that the observed correlation between two variables (X and Y) may be in part because of a third variable (Z) that is related to both of these variables. When this third confounding variable is also observed, we may be interested in estimating the correlation between X and Y after eliminating the effect of their correlation with Z. For example, in a study of liver lesion characterization using three diagnostic tests—sonography, CT, and MRI—the Pearson's correlation coefficient between the accuracy of the different diagnostic tests was as shown in the following equations:

rP(sonography, MRI) = 0.7
rP(CT, sonography) = 0.8
rP(CT, MRI) = 0.9

Clearly, all three methods are correlated with each other. What is the correlation between the diagnostic performance of sonography and MRI alone, after eliminating the effect of the correlation that both have with CT? To estimate this, we can calculate a partial correlation coefficient. The partial correlation between X and Y after having eliminated the effect of a third variable Z is given by:

rXY.Z = [rP(X, Y) − rP(X, Z) rP(Y, Z)] / {√[1 − rP(X, Z)²] √[1 − rP(Y, Z)²]}

If Z is not a confounding variable, one or both of rP(X, Z) and rP(Y, Z) would be 0 or very small. In such a situation, the partial correlation between X and Y (rXY.Z) would be similar to the Pearson's correlation coefficient between them (rP[X, Y]).

The partial correlation coefficient between performance in sonography and MRI in our example is shown in these equations (where US = sonography):

rUS MRI.CT = [rP(US, MRI) − rP(US, CT) rP(MRI, CT)] / {√[1 − rP(US, CT)²] √[1 − rP(MRI, CT)²]}
= (0.7 − 0.8 × 0.9) / [√(1 − 0.8²) √(1 − 0.9²)] = −0.08

Thus, after eliminating the contribution of CT, we find that the strong relation between sonography and MRI vanishes. Moreover, it appears that the direction of the relation changes as well, suggesting that after removing the contribution of CT, lesions that are accurately diagnosed with sonography in fact are poorly diagnosed with MRI and vice versa.

This concept can be extended to calculate the partial correlation between two variables after adjusting for the effect of two or more variables. Multiple regression, which is discussed later in this article, can be used for the same purpose and is more straightforward to perform using commonly available statistical software packages.

Spearman's Rank Correlation

Spearman's rank correlation, which we denote by rS, is another statistic used for measuring the correlation between a pair of variables. It is called a nonparametric measure and is preferred when assumptions required for calculating Pearson's correlation coefficient are violated—that is, when X and/or Y are not measured on an interval scale, or when X and/or Y do not follow a normal probability distribution. To calculate Spearman's correlation coefficient, we need to assign a rank to the individual values of X and Y—that is, sort each of X and Y in increasing order and assign them ranks so that the smallest observation has a rank of 1 and the highest observation has a rank of N. The expression for Spearman's correlation coefficient is similar to Pearson's correlation coefficient, except that xi and yi are replaced by rank(xi) and rank(yi) as follows:

rS = Σ [rank(xi) − (N + 1)/2] [rank(yi) − (N + 1)/2] / √{ Σ [rank(xi) − (N + 1)/2]² · Σ [rank(yi) − (N + 1)/2]² }

(all sums running over i = 1 to N). Spearman's correlation coefficient ranges between −1 and 1, with these extreme values indicating a perfect negative or positive relationship, respectively, between X and Y. It takes the value 0 when there is no relation between the variables (Figs. 2A–2D). An advantage of Spearman's correlation coefficient over Pearson's correlation coefficient is that it can be used to evaluate a nonlinear relation between variables when the direction of the relationship does not change. In Figure 2E, where Y continuously increases with X, we see that the perfect nonlinear relationship between the variables is captured by Spearman's correlation coefficient, although not by Pearson's correlation coefficient. However, like rP, rS is inappropriate for measuring the strength of a nonlinear relationship that both increases and decreases, such as the U-shaped relation in Figure 2F.

Regression

The correlation coefficients described thus far can be used to measure the strength and the direction of an association. Regression models go a step further and can be used to predict the value of one variable given the other. This quality makes them suitable for the study of relationships when the two variables can be distinguished as "predictor" and "outcome." Note, however, that fitting a regression equation between two variables does not imply a causal relation between them. Regression models also provide a more straightforward approach to adjusting for the effect of confounding variables. They can be used to deal with a variety of types of outcome variables (continuous, dichotomous, ordinal, count data, and so forth). Here, we focus on two of the most commonly used models for radiologic applications—linear regression models, in which the outcomes are continuous, and logistic regression models, in which the outcomes are dichotomous.

Regression is a broad area to which this article provides but a brief introduction. Greater detail on estimation and inference for linear and logistic regression is covered in introductory biostatistics textbooks [7–9]. More complex topics, such as regression model diagnostics, variable selection, and logistic regression for ordinal variables, are covered in greater depth in advanced textbooks [10–13].

Simple Linear Regression

Like Pearson's correlation coefficient, simple linear regression is also used to characterize linear relationships between variables. It is distinguished from multiple variable linear regression (discussed later) in that it involves only two variables, the outcome or dependent variable and the predictor or independent variable. The standard form of the simple linear regression equation is as follows:

Y = α + βX + ε,

where X and Y are the observed values of the predictor and the outcome variables, respectively. The parameters α and β are called the intercept and the slope, respectively. For a given value of X, the predicted value of Y is α + βX. The term ε, the residual (or error), is the difference between the observed value of Y and the predicted value of Y. The intercept and slope parameters are estimated with the aim of reducing this difference. The estimated values of the intercept and slope are denoted by a and b, respectively. An important assumption of the linear regression model is that the residuals are assumed to follow a normal distribution with mean 0 and a variance σ², which remains constant for all values of X.
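Both the partial correlation formula and the rank substitution used by Spearman's coefficient are short computations. A sketch (function names are ours; the last line reproduces the worked sonography/MRI/CT example):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y after removing the effect of Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def ranks(v):
    """Ranks 1..N; tied values get the average of the tied ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def spearman_rs(x, y):
    """Spearman's rS: Pearson's formula applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

print(round(partial_corr(0.7, 0.8, 0.9), 2))  # -0.08, as in the text
```

Because rS depends only on the ranks, a perfectly monotone but nonlinear relation (as in Fig. 2E) still yields rS = 1.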


Fig. 4—Graph shows simple linear regression line between ejection fraction and myocardial infarct volume (MIV).

These assumptions imply that for a given value of X, the error in predicting the outcome is 0 on the average. Moreover, the magnitude of the error is not associated with X.

For our hypothetical example of the relation between myocardial infarct volume and ejection fraction, the estimated simple linear regression equation is as follows:

ejection fraction = 70 − 3.6 (myocardial infarct volume) + ε

(see the solid line in Fig. 4).

The intercept of the regression model is equal to the predicted value of the outcome when the predictor variable is 0. This parameter is of interest only in those situations in which 0 lies within the plausible range of X values. Figure 4 shows that when the myocardial infarct volume is 0 mL, the ejection fraction is predicted to be equal to the intercept, or 70%. The slope of the regression model is the change in the outcome corresponding to a unit change in the predictor variable. A slope of 0 indicates that no relation exists between the predictor and outcome variables. From Figure 4, we see that when the myocardial infarct volume increases by 1 mL, the predicted value of the ejection fraction decreases by an amount equal to the slope, or –3.6%.

Selecting the "best-fitting" line—We need an objective criterion to help us estimate α and β so that we have a best-fitting straight line. As explained earlier, we would like to use the regression equation to predict the outcome variable using the predictor variable. Clearly, we would like to do so in a way that minimizes the error in prediction (i.e., results in the lowest possible residual), εi, for each patient. We use a criterion that minimizes the sum of the squared residual terms:

Σ εi² = Σ (yi − a − bxi)²   (sums over i = 1 to N)

This is known as the method of least squares. The expressions for the estimated values of the intercept and the slope obtained using the method of least squares are given as follows:

a = ȳ − b x̄

where

b = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

(See the table in Appendix 2 for an illustrative example of how to calculate a and b for a smaller sample of five patients. Notice that much of the calculation involves the terms already used in the calculation of Pearson's correlation coefficient.) In addition to a and b, we also obtain an estimate for the SE (i.e., square root of the variance) of the residuals, which we denote by s:

s = √[ Σ (yi − a − bxi)² / (N − 2) ]

For our example, the SE of the residuals is given by s = 3.53. This tells us that the average error in predicting the ejection fraction by the myocardial infarct volume is about 3.53%. This error is quite small when compared with the range of ejection fraction values—roughly 40–70%—suggesting that our regression equation has a good predictive ability on average.

The residual SE, s, can be used to obtain estimates of the SEs of a and b and of the predicted value of the outcome variable using the formulae given in Appendix 2. These SEs can be used to perform inferences for these parameters via hypothesis tests or confidence intervals. In our example we find that the confidence interval for the slope of the regression line is (−4.3% to −2.9%). Because this interval does not include 0, we can conclude that there is an association between myocardial infarct volume and ejection fraction.

Model diagnostics—After having obtained the intercept and slope of a regression model, we need to verify whether the basic assumptions on which the model was built were satisfied. We need to evaluate whether the residuals follow a normal probability distribution, whether the variance of the residuals is constant for all values of X, and whether the relation between Y and X is linear. All of these assumptions can be verified using the following simple plots of the residuals.

Normal probability plot—A normal probability plot is used to verify whether the residuals follow a normal probability distribution. Most standard statistical software packages can be used to produce this plot.
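The least-squares expressions for a, b, and s given above translate directly into code. A minimal sketch with invented data (the function name is ours):

```python
import math

def fit_line(x, y):
    """Least-squares intercept a, slope b, and residual SE s (divisor N - 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
         sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(sse / (n - 2))

# Points lying exactly on y = 70 - 3.6x are recovered with essentially zero
# residual error (data invented to echo the ejection-fraction example):
a, b, s = fit_line([0.0, 2.0, 5.0, 10.0], [70.0, 62.8, 52.0, 34.0])
```

With real, noisy data, s would be positive and would feed the SE formulas of Appendix 2.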


Fig. 5—Prototype normal probability plots.
A and B, Graphs show plots with normally distributed residuals (A) and with residuals skewed to the right (B).

Figure 5A illustrates the ideal situation, in which the residuals do indeed follow a normal distribution and we observe a straight line along the diagonal of the plot. Any departure of the residuals from a normal distribution will show up as a deviation from this straight line. Figure 5B illustrates a case in which the residuals are skewed to the right and we observe a curved line below the diagonal. A possible corrective measure for this problem is to model the natural logarithm of the outcome instead of the outcome itself.

Scatterplot of residuals versus X—Figures 6A–6C are prototype scatterplots of the residuals versus the predictor variable, X. In Figure 6A, we have the ideal situation, in which the model is appropriate. The residuals are randomly scattered about the value of 0 for the entire range of X. Furthermore, the residuals fall in a horizontal band of equal width for the entire range of X, meaning that they have a constant variance. In Figure 6B, we have a situation in which the residuals indicate that the relation between outcome and predictor is nonlinear. We find that values of X that are close to its minimum or maximum are associated with positive residuals, whereas values of X in the middle of its range are associated with negative residuals. The parabolic relation between the residuals and X in this plot suggests that Y is in fact a quadratic function of X—that is, Y is a function of both X and X². In Figure 6C, we see an increase in the magnitude of the residuals with increasing X. This tells us that our assumption of a constant variance has been violated. As a result, the prediction of the outcome is better for lower values of X than for higher values.

Model fit—The usefulness of the regression model is determined by how well it predicts the outcome—that is, how well it fits the data. In the absence of information on myocardial infarct volume, our best guess at predicting the ejection fraction for patients in our sample would have been the sample mean ejection fraction ȳ—that is, the predicted value of the ejection fraction would be identical for all patients and equal to ȳ = 54.2%. This would be equivalent to assuming a = ȳ and b = 0 (the horizontal dotted line in Fig. 4) and would result in the maximum possible value for the sum of the squared residuals. A commonly used method to estimate the usefulness of a linear regression line is to compare the decrease in the sum of the squared residuals with this maximum value. This is done using the R² statistic, which is an estimate of the proportion of the total variation in Y that is explained by X. The R² statistic ranges from a minimum of 0% when X is not related to Y to 100% when there is a perfect relation between the two variables. In our example, we found that R² = 82.5%, meaning that myocardial infarct volume explains 82.5% of the observed variation in the ejection fraction.
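The R² comparison just described, the fitted line's sum of squared residuals against the sum of squares around the mean ȳ, can be sketched as follows (function name ours):

```python
def r_squared(x, y, a, b):
    """Proportion of variation in y explained by the fitted line a + b*x."""
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # around the line
    sst = sum((yi - my) ** 2 for yi in y)                        # around the mean
    return 1.0 - sse / sst

xs = [1.0, 2.0, 3.0, 4.0]
# The "no information" fit (a = mean of y, b = 0) gives R^2 = 0:
base = r_squared(xs, [3.0, 5.0, 4.0, 8.0], 5.0, 0.0)  # 0.0
```

A perfect fit drives the residual sum of squares to 0 and R² to 1 (100%).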


Fig. 6—Graphs show prototype plots for linear regression diagnostics using residuals.
A–C, In ideal situation (A), model is appropriate; in B, residuals indicate that relation
between outcome and predictor is nonlinear; in C, prediction of outcome is better for
lower values of X than for higher values.

Multiple Variable Linear Regression

Simple linear regression can be extended to accommodate more than one predictor variable. For example, a patient's glomerular filtration rate (GFR) can be predicted by a linear combination of the patient's age, weight, sex, and the inverse of his or her serum creatinine value by using an equation of the form:

GFR = α + β1 (age) + β2 (weight) + β3 (sex) + β4 (1 / serum creatinine) + ε.

As in the case of the simple linear regression model, the unknown parameters α, β1, β2, β3, and β4 are estimated with the objective of minimizing the sum of the squared residuals (i.e., the sum of the squared differences between the observed GFR values for each patient and the predicted values according to the regression model). We do not present the expressions for calculating the different coefficients and their confidence intervals because these are cumbersome, requiring knowledge of matrix theory. Moreover, most widely available statistical software programs can calculate these quantities. We focus instead on the interpretation of the model.

Table 1 presents the results from a hypothetical study relating the GFR to the predictor variables mentioned here among 100 patients with ages ranging from 40 to 60 years, weight ranging from 40 to 100 kg, and serum creatinine levels between 180 and 200 mmol/L. The intercept is the predicted value of the outcome in the event that all predictor variables are equal to 0. This quantity is of interest only when it is possible for all predictor variables in the model to be simultaneously equal to 0. In the example in Table 1, the intercept is not of interest because the values age = 0, weight = 0, and 1 / serum creatinine = 0 are not possible. The regression coefficients (estimates of the β parameters) corresponding to continuous predictors are interpreted as the change in the outcome variable for a unit change in the predictor variable, while the remaining predictor variables are constant. This means that among a group of patients with a common weight, sex, and serum creatinine, an increase of 1 year in a patient's age is associated with a decrease in the GFR of 0.06 mL/min.

Ordinal and nominal predictor variables—When including nominal predictors (e.g., variables such as sex or country of origin that have no natural ordering) or ordinal predictors (e.g., age measured in 5-year categories) in a regression model, we need to create what are called "dummy variables" or "indicator variables." To do this, we identify one of the categories of the predictor as a reference category. In the case of ordinal variables, the reference category is typically the lowest category. For example, if age is a three-category ordinal variable having values 61–65 years, 66–70 years, and 71–75 years, the 61–65 year category could be selected as the reference. In the case of nominal variables, where there is no clear ordering of the categories, any category may be arbitrarily selected as the reference. Once the reference category has been determined, we create indicator variables corresponding to each of the remaining categories of the predictor.


TABLE 1: Multiple Variable Linear Regression Model for Predicting Glomerular Filtration Rate

Predictor   Estimated Regression Coefficient   SE of Regression Coefficient   t Statistic   p^a     95% CI for Regression Coefficient
Intercept   –9.30                              17.90                          –0.52         0.60    (–44.38 to 25.78)
1 / SCR     9859.28                            3194.50                         3.09         0.003   (3598.05–16120.51)
Age         –0.06                              0.08                           –0.75         0.46    (–0.21 to 0.09)
Sex         –2.60                              1.06                           –2.46         0.02    (–4.67 to –0.53)
Weight      0.07                               0.05                            1.42         0.16    (–0.03 to 0.17)

Note—CI = confidence interval, SCR = serum creatinine.
^a Obtained from the tables of the t-distribution with N − k = 100 − 4 = 96 degrees of freedom.

TABLE 2: Comparing Different Candidate Models for Predicting Glomerular Filtration Rate

Independent Variables in Model   R² (%)   Bayesian Information Criterion
1 / SCR                          10       –6.22
Age                               1        4.02
Sex                              10       –5.65
Weight                            7       –2.76
1 / SCR + sex + weight           20       –9.44
1 / SCR + sex + weight + age     21       –5.42

Note—SCR = serum creatinine.
The indicator variables take the value of 1 if a patient is in the category to which it corresponds or 0 otherwise. Because three categories were defined for the variable age, this means we need to create two indicator variables—one would take the value 1 for patients in the 66–70 year category, and the second would take the value 1 for patients in the 71–75 year category. Both indicator variables are added to the regression model as predictors.

In the example for GFR, the only noncontinuous predictor is sex. The category "male" was regarded as the reference category. Thus, the variable "sex" is an indicator for the female sex. It takes the value 1 if the patient is female and 0 if the patient is male. The regression coefficient corresponding to sex tells us that after adjusting for the effect of other predictor variables, female patients have a GFR that is 2.60 mL/min lower than that of male patients.

Inference for regression coefficients—Along with regression coefficients, we can report confidence intervals that give an idea of the uncertainty in estimating them. If the confidence interval corresponding to a predictor variable does not include 0, we conclude that it is statistically significant. Alternatively, we could perform a hypothesis test based on the t distribution and report a p value that tells us the probability of observing our estimated regression coefficient if its true value is 0. If the p value is much smaller than a predetermined level of significance (typically 0.05 or 0.01), we reject the null hypothesis that the regression coefficient is equal to 0. If there are k parameters in a model, the p value is obtained from the tables of the t distribution with N – k degrees of freedom (df), where N is the sample size and k is the number of predictors in the regression model. In our example, we can deduce from the 95% confidence intervals that the regression coefficients corresponding to both 1 / serum creatinine and sex are significantly different from 0, and those corresponding to age and weight are not. A similar conclusion is obtained on the basis of the p values.

Model fit—The R² statistic introduced earlier can also be used to evaluate model fit for multiple variable linear regression models. The R² statistic is defined as the proportion of the variance in the outcome variable explained by the regression model. It ranges between 0% and 100%, with values closer to 100% indicating a better model fit. In our example for predicting GFR from age, weight, sex, and serum creatinine level, the R² statistic was quite low, meaning that the information obtained explained only 21% of the observed variation in GFR. A low value of R² is not unusual in real-life applications.

Model selection—When we have several candidate predictor variables, we are often faced with the challenge of choosing between different models that are based on different predictors. Besides assessing the fit of a model, the R² statistic may also be used to compare two different models for the same outcome. Table 2 lists R² values for different candidate multiple regression models with GFR as the outcome. The model with the highest value of R²—that is, the model that best explains the observed variation in GFR—is the model with all four predictor variables included simultaneously. In interpreting these results, it must be noted that the R² statistic is influenced by the number of predictor variables in the model. Notice that in Table 2 the R² statistic increases with every additional predictor added to the model. Thus, when comparing two models, the R² statistic may simply favor the model with the greater number of predictors.

Besides the R² statistic, several other criteria have been proposed for model selection. One such criterion is the Bayesian information criterion (BIC). This criterion assesses model fit while simultaneously applying a penalty for every additional predictor added. Our interest is not in the actual value of the BIC for a given model, but rather the difference in the BIC between two models. The lower the BIC, the better the fit of the model. From Table 2, we see that according to the BIC criterion, adding age to the model worsens the model fit. Although criteria such as the R² and BIC may be used to assess model fit, the choice of which predictor variables go into a model depends also on their clinical relevance, their impact on the magnitude of regression coefficients associated with the remaining predictors, and their statistical significance.

Model validation—An important way to evaluate a model is to use it to predict the outcome in a data set that is independent of the one used to fit the regression model. This step is referred to as "model validation." Repeating the study to collect new data may not always be a feasible option because of the cost and time involved. Instead, if we have a sufficiently large sample, we may choose to split the data set into two parts—a model-building or training data set that is used to estimate the regression coefficients, and a validation data set. This is known as cross-validation [11]. The model-building data set needs to be sufficiently large to obtain the required precision in estimating the regression coefficients. If this is not possible with half the data, the model-building data set may be larger than the validation data set.

Confounding and effect modification—A multiple linear regression model allows us to study the relation between a primary predictor, X (e.g., the experimental treatment), and the outcome, Y, while adjusting for the effect of one or more secondary predictor variables (e.g., the patient's demographic characteristics).
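The dummy-variable coding described for ordinal and nominal predictors can be sketched as follows (the function name is ours, and the age categories are reused from the text purely for illustration):

```python
def indicator_columns(values, reference):
    """One 0/1 indicator column per non-reference category of a predictor."""
    categories = [c for c in sorted(set(values)) if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in categories}

ages = ["61-65", "66-70", "71-75", "61-65"]
cols = indicator_columns(ages, reference="61-65")
# {'66-70': [0, 1, 0, 0], '71-75': [0, 0, 1, 0]}
```

Each returned column would enter the regression model as a separate predictor, with the reference category absorbed into the intercept.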


For illustration, we will consider only one secondary predictor, Z, but the concepts discussed here can be extended to the case of more than one secondary predictor. A variable Z is said to be a confounder if it is associated with both X and Y. The true relation between Y and X is not determined by Z. However, not including Z in the regression model results in an incorrect estimate of magnitude or direction of the regression coefficient of X. A variable Z is said to be an effect modifier if it affects the magnitude of the association between Y and X. To determine if Z is an effect modifier, we must add both Z and the product XZ to the regression model between Y and X. It is possible for a variable to be both a confounder and an effect modifier.

The difference between a confounder and an effect modifier is illustrated graphically in Figure 7. In this example, we are interested in studying the relation between the primary predictor variable, weight (kg), and the outcome variable, bone density (mass/volume units). In our hypothetical sample, the patient's sex is a variable that is associated with both the outcome (bone density) and the predictor (weight)—women tend to have a lower bone density and a lower weight than men.

Figure 7A illustrates the case when sex is a confounding variable but not an effect modifier. Fitting a single regression line for both men and women that includes only weight as a predictor, we obtain a regression coefficient of 0.4 mass/volume units corresponding to weight. Fitting two separate regression lines—one among men and the other among women—we find that the slope of the two lines is the same and is equal to 0.2 mass/volume units (Fig. 7B). This is the correct value of the slope, which can also be obtained by fitting a single multiple variable regression model in the entire sample that includes both weight and sex as predictors, as follows:

Bone density = 105 + 0.2 weight − 10 sex,

where the predictor "sex" is an indicator variable for female sex.

Figure 7C illustrates the case when sex is an effect modifier of the relation between weight and bone density—that is, the strength of the association between weight and bone density is modified by the variable sex. This means the regression lines between bone density and weight among men and women have different slopes (see Fig. 7D). In our hypothetical example, bone density increases more rapidly with weight among men than among women. We can evaluate whether sex is an effect modifier using a single multiple variable regression model that includes weight, sex, and their product as predictors, as follows:

Bone density = 105 + 0.2 weight − 10 sex − 0.15 weight × sex

From this single equation we can determine the different associations between bone density and weight among men and women.

Fig. 7—Confounding and effect modification. A–D, Graphs illustrate confounding (A and B) and effect modification (C and D).

12 AJR:185, July 2005


Correlation and Regression

weight is 0.2, the same as was obtained by fitting a separate linear regression model among men. Similarly, when setting sex = 1 in the equation, we find that the regression coefficient associated with weight = 0.2 − 0.15 = 0.05 mass/volume units, which is the same as the regression coefficient obtained when fitting the model among women alone. If the regression coefficient corresponding to the product term is significantly different from 0, we conclude that there is an interaction between weight and sex.

Logistic Regression

Logistic regression, like linear regression, can be used to relate a single outcome variable to one or more predictor variables. However, the outcome variable is dichotomous, having only two values (e.g., success or failure of an experimental treatment, survival or death at the end of a 10-year follow-up). One value of the dichotomous outcome variable must be designated as the outcome of interest—for example, success when the outcome has the values success or failure, or death if the outcome has the values death or survival. The odds of the outcome of interest are given by the ratio of the probability of observing the outcome of interest to the probability of not observing it: probability of success / probability of failure, or probability of death / probability of survival. The logistic regression equation relates the logarithm of the odds of the outcome to the predictor variables.

In a hypothetical study, logistic regression was used to predict extremely high breast density on mammography using information on a woman's parity (i.e., number of children), body mass index (BMI), and age. Extremely high breast density was defined as a dichotomous variable taking the value 1 when a woman's breast density was greater than or equal to 75%, and taking the value 0 when a woman's breast density was less than 75%. The resulting multiple logistic regression equation had the following form:

ln[ Probability of EHBD / (1 − Probability of EHBD) ] = α + β1(nulliparous) + β2(BMI) + β3(age),

where ln is the logarithm to the natural base e and EHBD is extremely high breast density.

The predictor variables in a logistic regression equation may be continuous, nominal, or ordinal. As in the case of multiple linear regression, nominal and ordinal predictor variables are entered into the equation as indicator variables. In the logistic regression equation for extremely high breast density, BMI and age are both continuous variables, and nulliparous is an indicator that the woman is nulliparous.

The best estimates for the unknown parameters α, β1, β2, and β3 may be obtained by a statistical method known as maximum likelihood. This method helps us identify the most likely value of the true parameters given the observed data and under the assumption that the number of patients with the outcome of interest follows a binomial distribution [2].

The relation between each predictor variable and the outcome in a logistic regression model is expressed in terms of an odds ratio (for more about odds ratios see the article by Blackmore and Cummings [4] in this series). When the predictor variable is ordinal or nominal, the odds ratio is a comparison between each indicator variable and the reference category. An odds ratio of 1 indicates there is no difference in the odds of the outcome of interest between the category associated with the indicator variable and the reference category. An odds ratio greater (lesser) than 1 indicates the outcome of interest is more (less) likely in the category associated with the indicator variable than in the reference category. Results for the extremely high breast density example are given in Table 3. The odds ratio of 5.53 corresponding to nulliparous tells us that the odds of extremely high breast density are (5.53 − 1) × 100 = 453% greater among women who are nulliparous compared with those who are not. For a continuous predictor variable, the odds ratio gives the relative increase (or decrease) in the odds of the outcome for a change of one unit of the predictor variable. For example, in Table 3, the odds ratio of 0.85 corresponding to BMI means that for a unit increase in the BMI, a woman's odds of extremely high breast density decrease by (1 − 0.85) × 100 = 15%. The odds ratios for all predictor variables are obtained by taking the exponent of the regression coefficient.

TABLE 3: Logistic Regression Model for Predicting Extremely High Breast Density

Predictor | Estimated Regression Coefficient | SE of Regression Coefficient | Chi-Square Value (p)^a | Odds Ratio | 95% CI for Odds Ratio
Nulliparous: No (reference) | 0 | — | — | 1 | —
Nulliparous: Yes | 1.71 | 0.62 | 7.61 (0.006) | exp(1.71) = 5.53 | (1.64, 20)
Body mass index | −0.16 | 0.07 | 5.22 (0.023) | exp(−0.16) = 0.85 | (0.73, 0.98)
Age | −0.02 | 0.04 | 0.25 (0.599) | exp(−0.02) = 0.98 | (0.89, 1.07)

Note—CI = confidence interval. Dash (—) indicates not applicable.
^a Obtained by comparison with chi-square distribution with N − k = 102 − 3 = 99 degrees of freedom.

We can test whether each regression coefficient is different from 0 using a chi-square test with N − k df, where N is the sample size and k is the number of predictors in the regression model. By comparing the chi-square p values in Table 3 with the traditional level of significance of α = 0.05, we conclude that the predictors nulliparous and BMI are statistically significantly associated with an extremely high breast density. Alternatively, we can report a confidence interval for the odds ratio. If the confidence interval does not include 1, then the predictor is considered statistically significant. If the confidence interval includes 1, as in the case of the predictor age in Table 3, we conclude that it is not significantly associated with the outcome.

As in linear regression, a logistic regression model can also be used to determine whether a particular predictor variable is a confounder or effect modifier. The fit of a logistic regression model may be assessed using the BIC or a statistic similar to the R2 statistic.

Sample Size Determination

Any well-designed research study must begin with an idea of the sample size required. An insufficient sample size might leave us with important questions unanswered. On the other hand, too large a sample size might mean an unnecessarily expensive study. The sample size required for a study is calculated so that it provides sufficient evidence to make inferences about the primary parameter(s) of interest in the study. As mentioned throughout this series, there is an increasing emphasis in scientific journals on reporting of confidence intervals rather than p values. Thus, for this article we will limit ourselves to sample size formulae that are suitable for studies having the objective of reporting a confidence interval for the primary parameters of interest. Furthermore, we focus on sample size calculations for Pearson's correlation coefficient and simple linear regression. Sample size formulae for multiple variable linear regression and logistic regression are available but involve complex methods and are typically implemented by software programs [14]. These programs also provide calculations for studies in which the primary objective is to test a null hypothesis and report a p value.

General Concepts for Sample Size Calculation

Whatever the parameter of interest, certain concepts remain common to the exercise of sample size calculation.

First, the sample size calculation requires a guess value for the parameter of interest (e.g., the correlation coefficient or the slope of a regression model) and parameters of its probability distribution (e.g., the SE of the slope). This is rather paradoxical because the goal of the study is to find out more about this parameter. However, some reasonable range of guess values for the parameter can usually be found from the literature.

Second, identify a clinically meaningful range of values for this parameter.

Sample Size for Pearson's Correlation Coefficient

Assume we want to perform a study the goal of which is to measure the correlation between ratings of two experienced radiologists on a series of mammograms. Based on an earlier pilot study, our guess value for the correlation coefficient is ρP = 0.85. A sufficiently high correlation is deemed to be in the order of 0.8–0.9. Any value less than this is considered poor correlation. Ideally, we would like our research study to unequivocally determine whether the true correlation between the reviewers is sufficiently high. This means we would like our sample size to be large enough to ensure that the confidence interval lies entirely within or below the range 0.8–0.9—that is, the half-width of the confidence interval (or precision of our estimate) should be a maximum of 0.85 − 0.8 = 0.9 − 0.85 = 0.05. The calculation of the confidence interval requires the transformation of the correlation coefficient, ρP, into

ZP = (1/2) ln[(1 + ρP) / (1 − ρP)]

(see Appendix 1). Therefore, we need to determine the maximum permissible value of the confidence interval half-width on the transformed scale. To do this, we transform both the guess value of the correlation coefficient and the lower end of the confidence interval and calculate their difference. The maximum permissible half-width of the transformed confidence interval is given by

wZ = (1/2) ln[(1 + 0.85) / (1 − 0.85)] − (1/2) ln[(1 + 0.8) / (1 − 0.8)] = 1.26 − 1.10 = 0.16.

The sample size required to obtain a (1 − α)% confidence interval is then calculated as

N = (Z1−α/2 / wZ)²,

where Z1−α/2 is the (1 − α/2) quantile of the standard normal distribution. Thus, to obtain a 95% confidence interval for our study, we would need a sample size of approximately

N = (Z1−α/2 / wZ)² = (1.96 / 0.16)² = 150.

Sample Size for the Slope of a Simple Linear Regression Model

Sample size calculation for the simple linear regression model typically focuses on determining whether the slope is different from 0. The required sample size can be obtained using the same approach as that given in this article for the correlation coefficient, by exploiting the fact that a slope of 0 in a simple linear regression equation is equivalent to a correlation of 0 between the predictor and outcome variables. Suppose we plan to study the relation between renal length as measured by sonography (predictor) and GFR (outcome) via simple linear regression. Suppose also that a smaller pilot study of the relation between these variables had reported a correlation coefficient of 0.3 (−0.2 to 0.8). To conclusively show a relation between the two variables, we would like the confidence interval to lie within 0.1–0.5 (i.e., to eliminate 0). The required sample size can be calculated using the methods described earlier for Pearson's correlation coefficient.

Conclusion

This article describes some of the most common statistical methods used by radiologists to evaluate the relation between variables. The article stresses the interpretation of these statistics and describes formulae to implement some of the simpler methods. Although it is unlikely that readers will actually perform these calculations by hand because they are all available in standard statistical packages, our aim in discussing them is to give the interested reader a better understanding of the motivation behind the statistical methods. Because of limited space we can only scratch the surface of many of the topics under regression models. More details on the topics discussed here may be found in introductory [7–9] and advanced [10–13] textbooks.

Dendukuri and Reinhold

References
1. Karlik SJ. Exploring and summarizing radiologic data. AJR 2003; 180:47–54
2. Joseph L, Reinhold C. Introduction to probability theory and sampling distributions. AJR 2003; 180:917–923
3. Joseph L, Reinhold C. Statistical inference for continuous variables. AJR 2005; 184:1047–1056
4. Blackmore C, Cummings P. Observational studies in radiology. AJR 2004; 183:1203–1208
5. Hennekens CH, Buring JE. Epidemiology in medicine. Boston, MA: Lippincott, Williams & Wilkins, 1987
6. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986; 1:307–310
7. Moore DS, McCabe GP. Introduction to the practice of statistics, 3rd ed. New York, NY: Freeman, 1998
8. Glantz SA. Primer of biostatistics, 5th ed. New York, NY: McGraw-Hill, 2001
9. Dawson B, Trapp RG. Basic and clinical biostatistics, 3rd ed. New York, NY: McGraw-Hill Lange Medical Series, 2001
10. Harrell F. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis, 1st ed. New York, NY: Springer-Verlag, 2001
11. Kleinbaum DG, Kupper LL, Muller KE, Nizam A. Applied regression analysis and multivariable methods, 3rd ed. Pacific Grove, CA: Duxbury Press, 1998
12. Hosmer D, Lemeshow S. Applied logistic regression, 2nd ed. New York, NY: Wiley, 2000
13. Kleinbaum DG. Logistic regression: a self-learning text, 2nd ed. New York, NY: Springer-Verlag, 2002
14. Hintze JL. PASS [power analysis and sample size] user's guide. Kaysville, UT: NCSS [Number Cruncher Statistical System], 1996
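The confounding and effect-modification ideas discussed with Figure 7 can be sketched numerically. The following simulation is our illustration, not from the article: data are generated so that sex confounds the weight-bone density relation with the same true slope of 0.2 in both sexes, so the unadjusted fit overstates the slope (near 0.4 here) while the adjusted fit recovers 0.2 and the interaction coefficient is near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical bone-density data: sex = 1 for women, who are lighter and have
# lower bone density; the true weight slope is 0.2 in both sexes, so sex is a
# confounder here but not an effect modifier.
sex = rng.integers(0, 2, n)
weight = rng.normal(80, 12, n) - 20 * sex          # women are lighter on average
density = 105 + 0.2 * weight - 10 * sex + rng.normal(0, 1, n)

ones = np.ones(n)

# Unadjusted model (density ~ weight): omitting sex biases the slope upward.
slope_unadjusted = np.linalg.lstsq(
    np.column_stack([ones, weight]), density, rcond=None)[0][1]

# Adjusted model (density ~ weight + sex) recovers the true slope of 0.2.
slope_adjusted = np.linalg.lstsq(
    np.column_stack([ones, weight, sex]), density, rcond=None)[0][1]

# Effect modification: add the weight-by-sex product; its coefficient estimates
# how much the weight slope differs between the sexes (near 0 in these data).
interaction = np.linalg.lstsq(
    np.column_stack([ones, weight, sex, weight * sex]), density, rcond=None)[0][3]
```

A statistical package would also report standard errors and p values for these coefficients; the least-squares fit alone is enough to show how adjusting for the confounder corrects the slope.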

Appendix 1 appears on the next page.
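The sample-size calculation for Pearson's correlation coefficient can be checked in a few lines. This sketch is ours, not the article's; note that carrying full precision in the half-width gives N of about 155, whereas the article's figure of about 150 follows from the rounded half-width of 0.16.

```python
import math

def fisher_z(r):
    """Fisher z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

# Maximum permissible half-width on the transformed scale:
# guess value 0.85, lower end of the acceptable range 0.8.
w_z = fisher_z(0.85) - fisher_z(0.80)   # about 0.158; 0.16 after rounding

z_crit = 1.96                           # standard normal quantile for 95% confidence
n_rounded = (z_crit / 0.16) ** 2        # about 150, the figure quoted in the text
n_exact = (z_crit / w_z) ** 2           # about 155 when no rounding is applied
```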




APPENDIX 1. Inference for Pearson's Correlation Coefficient (rP)

p value

To calculate the p value, we need to transform rP as follows:

ZP = (1/2) ln[(1 + rP) / (1 − rP)],

where ln is the natural logarithm. This transformation is required because even though X and Y may follow a normal distribution, rP does not. However, ZP is known to follow a normal distribution with a standard deviation

σZ = 1 / √(n − 3),

making the calculation of the p value and confidence intervals easier. The remaining steps involved in calculating a p value are as follows.

Compute the test statistic

z = (ZP − Z0) / σZ.

The rule for estimating the p value depends on the alternative hypothesis HA as follows (see [3] for more on hypothesis testing):

When HA: ρP > ρ0, the p value is given by the probability P(Z ≥ z).
When HA: ρP < ρ0, the p value is given by the probability P(Z ≤ z).
When HA: ρP ≠ ρ0, the p value is given by the probability P(Z ≥ |z|).

The p value is calculated by comparing the test statistic with the tables of the normal distribution. Typically, if the p value is less than a predetermined level of significance, such as 0.05 or 0.01, the null hypothesis is rejected in favor of the alternative.

Recall that in our example of myocardial infarct volume and ejection fraction, the correlation coefficient for the entire sample of n = 30 patients was rP = −0.91. To estimate the evidence in favor of the hypothesis "there is no relation between myocardial infarct volume and ejection fraction"—that is, H0: ρP = 0—we begin by calculating the test statistic. First transform rP into

ZP = (1/2) ln[(1 + rP) / (1 − rP)] = (1/2) ln[(1 + (−0.91)) / (1 − (−0.91))] = (1/2) ln[(1 − 0.91) / (1 + 0.91)] = −1.53.

Then transform ρ0 into

Z0 = (1/2) ln[(1 + ρ0) / (1 − ρ0)] = (1/2) ln[(1 + 0) / (1 − 0)] = 0.

Finally, calculate the SD of ZP as

σZ = 1 / √(N − 3) = 1 / √(30 − 3) = 0.19.

Using these three quantities, the test statistic can now be calculated as z = (ZP − Z0) / σZ = (−1.53 − 0) / 0.19 = −8.05. The evidence in favor of the null hypothesis against an alternative hypothesis of "there is a relation between myocardial infarct volume and ejection fraction"—that is, HA: ρP ≠ 0—is equal to P(Z ≥ |−8.05|). This is the probability that a variable following a standard normal distribution is less than −8.05 or greater than 8.05. From the normal distribution tables, we find that this probability is less than 0.0001. See module 10 in this series [2] for an explanation of how to use the tables of the normal distribution.

Confidence interval

As in the case of the p value, to construct a confidence interval for ρP we first need to transform rP into ZP. The upper (uZ) and lower (lZ) limits of the (1 − α)% confidence interval on the transformed scale are given by (lZ = ZP − Z1−α/2 σZ, uZ = ZP + Z1−α/2 σZ), where σZ is the previously defined SD of ZP, and Z1−α/2 is the (1 − α/2) quantile of the standard normal distribution. The latter is the point below which the area under the normal distribution curve is equal to 1 − α/2. We then retransform these limits to obtain the (1 − α)% confidence interval for ρP as (l = [exp(2lZ) − 1] / [exp(2lZ) + 1], u = [exp(2uZ) − 1] / [exp(2uZ) + 1]). In our example of myocardial infarct volume and ejection fraction, we can use the previously calculated values of ZP and σZ to obtain a 95% confidence interval on the transformed scale as (lZ = −1.53 − 1.96[0.19], uZ = −1.53 + 1.96[0.19]) = (−1.90 to −1.16). The value Z1−α/2 = 1.96 is obtained from the normal distribution table. On retransformation, we obtain the limits of the 95% confidence interval for ρP as

l = [exp(2lZ) − 1] / [exp(2lZ) + 1] = [exp(2(−1.90)) − 1] / [exp(2(−1.90)) + 1] = (0.02 − 1) / (0.02 + 1) = −0.96

u = [exp(2uZ) − 1] / [exp(2uZ) + 1] = [exp(2(−1.16)) − 1] / [exp(2(−1.16)) + 1] = (0.1 − 1) / (0.1 + 1) = −0.81

Appendix 2 appears on the next page.




APPENDIX 2. Inference for the Simple Linear Regression Model

Standard errors (SEs) for the intercept (a) and slope (b) of the simple linear regression model, and expressions for calculating the p value and confidence interval for these parameters, are as follows (see the table in this appendix for worked numerical versions):

sa = s √(1/N + x̄² / Σ(xi − x̄)²) and sb = s / √Σ(xi − x̄)²,

where s² is the residual sum of squares divided by N − 2. The test statistic for the slope is tb = b / sb, and the (1 − α)% confidence interval for the slope is (b − t1−α/2,N−2 × sb, b + t1−α/2,N−2 × sb).

Typically, we are more interested in the slope than in the intercept. A natural null hypothesis of interest is H0: β = 0. The SE of the slope in our example is given by sb = 0.32. See the table in this appendix for an illustration of how to calculate sb in a smaller sample of five patients. Note that the results there are slightly different from those in this section because they are based on a different sample. Using the formula above, the test statistic can be calculated as

tb = b / sb = −3.6 / 0.32 = −11.25.

As in the case of the correlation coefficient, the p value that we report depends on the direction of the alternative hypothesis. If the alternative hypothesis is HA: β ≠ 0, then the p value is given by P(tN−2 ≥ |tb|)—that is, the probability that the standard t distribution with N − 2 = 28 degrees of freedom (df) takes values less than or equal to −|tb| = −11.38 or greater than or equal to |tb| = 11.38. (Recall N = our sample size of 30; see [3] for more details on the t distribution. The value −11.38 reflects unrounded estimates of b and sb.) Looking up the t distribution tables corresponding to N − 2 = 30 − 2 = 28 df, we find that this probability is less than 0.001. Because this probability is much less than the traditional significance levels of 0.05 or 0.01, we reject the null hypothesis and conclude that there is a relation between ejection fraction and myocardial infarct volume.

Alternatively, we could construct a 95% confidence interval for the slope. As mentioned previously, this is more informative than simply reporting whether we did or did not reject a single null hypothesis. The term t1−α/2,N−2 in the formula above denotes the 1 − α/2 quantile of the t distribution with 28 df (i.e., the point on the standard t distribution below which there is a 1 − α/2 probability). For a 95% confidence interval, we have α = 1 − 0.95 = 0.05. The value of t1−α/2,N−2 = t0.975,28 = 2.05. For our example, we have already calculated b = −3.6% and sb = 0.32. Thus, the 95% confidence interval is given by

(b − t0.975,28 × sb to b + t0.975,28 × sb) = (−3.6 − 2.05 × 0.32 to −3.6 + 2.05 × 0.32) = (−4.3% to −2.9%).

This interval gives us an idea of the range of values of the slope that is compatible with the data and cannot be rejected by a hypothesis test. Because the interval does not include 0, we can conclude that there is a negative relation between ejection fraction and myocardial infarct volume.

For a given value of myocardial infarct volume, our simple linear regression model may also be used to predict the ejection fraction for an average patient or to predict the ejection fraction for an individual patient. The SEs for the predicted mean ejection fraction and for an individual's ejection fraction are as follows:

SE for predicted mean outcome at x: sM,x = s √(1/N + (x − x̄)² / Σ(xi − x̄)²)

SE for predicted individual outcome at x: sI,x = s √(1 + 1/N + (x − x̄)² / Σ(xi − x̄)²)
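Assuming the quantities quoted above for the 30-patient sample (b = −3.6, sb = 0.32, t0.975,28 = 2.05), the slope test and its confidence interval can be checked in a few lines. This sketch is ours; with these rounded inputs the test statistic is exactly −11.25, while the −11.38 quoted in the appendix reflects unrounded estimates.

```python
# Values given in the appendix for the 30-patient sample.
b, s_b = -3.6, 0.32      # slope (% ejection fraction per mL) and its SE
t_crit = 2.05            # t quantile t_{0.975,28}

t_b = b / s_b            # -11.25 with these rounded inputs
lo = b - t_crit * s_b    # about -4.3
hi = b + t_crit * s_b    # about -2.9
```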




TABLE: Calculating Pearson's Correlation Coefficient and the Simple Regression Equation Between Myocardial Infarct Volume and Ejection Fraction

(1) Patient number (i) | (2) xi (Myocardial infarct volume, mL) | (3) yi (Ejection fraction, %) | (4) xi − x̄ | (5) yi − ȳ | (6) (xi − x̄)² | (7) (yi − ȳ)² | (8) (xi − x̄)(yi − ȳ) | (9) yi − a − bxi | (10) (yi − a − bxi)²
1 | 2.5 | 65 | −2.5 | 15 | 6.25 | 225 | −37.5 | 2 | 4
2 | 3.75 | 55 | −1.25 | 5 | 1.5625 | 25 | −6.25 | −1.5 | 2.25
3 | 5 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 0
4 | 6.25 | 40 | 1.25 | −10 | 1.5625 | 100 | −12.5 | −3.5 | 12.25
5 | 7.5 | 40 | 2.5 | −10 | 6.25 | 100 | −25 | 3 | 9
Sums/means | x̄ = 5 | ȳ = 50 | | | Σ(xi − x̄)² = 15.625 | Σ(yi − ȳ)² = 450 | Σ(xi − x̄)(yi − ȳ) = −81.25 | | Σ(yi − a − bxi)² = 27.5

From these sums:

rP = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² Σ(yi − ȳ)²] = −81.25 / √(15.625 × 450) = −0.97

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = −81.25 / 15.625 = −5.2

a = ȳ − b x̄ = 50 − (−5.2)(5) = 76

s² = residual sum of squares / (N − 2) = Σ(yi − a − bxi)² / (N − 2) = 27.5 / 3 = 9.17; therefore, s = √9.17 = 3.03

sa = s √(1/N + x̄² / Σ(xi − x̄)²) = 3.03 √(1/5 + 5² / 15.625) = 4.06

sb = s / √Σ(xi − x̄)² = 3.03 / √15.625 = 0.77

sM,2 = s √(1/N + (x − x̄)² / Σ(xi − x̄)²) = 3.03 √(1/5 + (2 − 5)² / 15.625) = 2.67

sI,2 = s √(1 + 1/N + (x − x̄)² / Σ(xi − x̄)²) = 3.03 √(1 + 1/5 + (2 − 5)² / 15.625) = 4.04

Note—For ease of illustration, we limit the sample to five patients. These results are slightly different from those reported in the text because they are based on a different sample. The mean myocardial infarct volume among these five patients is 5 mL, and the mean ejection fraction is 50%. First, subtract the mean infarct volume from each patient's infarct volume (see the column xi − x̄). For example, for patient 2, we have xi − x̄ = 3.75 − 5 = −1.25. Then take the square of this value for each patient (see the column (xi − x̄)²). For patient 2, this would be (−1.25)² = 1.5625. Do the same for ejection fraction. Finally, for each patient multiply xi − x̄ and yi − ȳ. For patient 2, this is −1.25 × 5 = −6.25. In each of columns 6, 7, and 8 above, add the values across all patients. The correlation coefficient can then be calculated from the resulting sums. The slope and intercept of the regression model are calculated using columns 2, 3, 6, and 8. In column 10 is the sum of the squared residuals across patients. This is used in the calculation of the SEs for the slope, intercept, predicted average ejection fraction, and predicted individual ejection fraction.
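The table's calculations can be reproduced directly with the Python standard library. This sketch is ours; at full precision sI,2 comes out at about 4.03 rather than the 4.04 obtained in the table from the rounded value s = 3.03.

```python
import math

x = [2.5, 3.75, 5.0, 6.25, 7.5]   # myocardial infarct volume (mL)
y = [65, 55, 50, 40, 40]          # ejection fraction (%)
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)                         # 15.625
syy = sum((yi - y_bar) ** 2 for yi in y)                         # 450
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # -81.25

r_p = sxy / math.sqrt(sxx * syy)   # about -0.97
b = sxy / sxx                      # -5.2
a = y_bar - b * x_bar              # 76

rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))   # 27.5
s = math.sqrt(rss / (n - 2))                                # about 3.03

s_a = s * math.sqrt(1 / n + x_bar ** 2 / sxx)               # about 4.06
s_b = s / math.sqrt(sxx)                                    # about 0.77

# SEs of the predicted mean and individual outcome at x = 2 mL
s_m2 = s * math.sqrt(1 / n + (2 - x_bar) ** 2 / sxx)        # about 2.67
s_i2 = s * math.sqrt(1 + 1 / n + (2 - x_bar) ** 2 / sxx)    # about 4.03
```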




Notice that these two SEs are very similar except for the fact that an additional 1 appears in the term under the square root for the SE of the predicted outcome for an individual. This causes the SE of the predicted outcome for a single individual to always be greater than the SE of the predicted outcome for an average individual, because of the additional variance of the individual outcomes above the average outcome. In our example, sM,2 = 1.14 and sI,2 = 3.71. The predicted value of the outcome when the predictor is equal to x is denoted by ŷx. The predicted average ejection fraction corresponding to a myocardial infarct volume of 2 mL (denoted by ŷ2) can be calculated using the regression equation as 70 − 3.6(2) = 62.8%. The expression for a (1 − α)% confidence interval for the average ejection fraction is

(ŷx − t1−α/2,N−2 × sM,x, ŷx + t1−α/2,N−2 × sM,x).

Recall that we had determined from the tables of the t distribution that t0.975,28 is 2.05. Thus, the 95% confidence interval for the predicted mean ejection fraction when myocardial infarct volume = 2 mL is given by

(ŷ2 − t0.975,28 × sM,2, ŷ2 + t0.975,28 × sM,2) = (62.8 − 2.05 × 1.14 to 62.8 + 2.05 × 1.14) = (60.5% to 65.1%).

The confidence interval for an individual's ejection fraction when myocardial infarct volume is 2 mL is obtained by replacing the SE in this expression by sI,x—that is,

(ŷ2 − t0.975,28 × sI,2, ŷ2 + t0.975,28 × sI,2) = (62.8 − 2.05 × 3.71 to 62.8 + 2.05 × 3.71) = (55.2% to 70.4%).
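The two prediction intervals can be checked with the quantities quoted above for the 30-patient sample (ŷ2 = 62.8%, sM,2 = 1.14, sI,2 = 3.71, t0.975,28 = 2.05); this is our sketch of the arithmetic only:

```python
t_crit = 2.05                  # t quantile t_{0.975,28}
y_hat2 = 70 - 3.6 * 2          # predicted ejection fraction at 2 mL: 62.8%
s_m2, s_i2 = 1.14, 3.71        # SEs quoted in the appendix (30-patient sample)

mean_ci = (y_hat2 - t_crit * s_m2, y_hat2 + t_crit * s_m2)    # about (60.5, 65.1)
indiv_ci = (y_hat2 - t_crit * s_i2, y_hat2 + t_crit * s_i2)   # about (55.2, 70.4)
```

The individual interval is wider than the mean interval, reflecting the extra 1 under the square root in sI,x.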

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005



Research • Fundamentals of Clinical Research for Radiologists

Survival Analysis

Harald O. Stolberg,1,2 Geoffrey Norman,3 Isabelle Trop4

Stolberg HO, Norman G, Trop I

Received November 17, 2004; accepted November 23, 2004.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 19th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

1 Department of Radiology, McMaster University Medical Centre, 1200 Main St. W, Hamilton, ON, L8N 3Z5 Canada.
2 Deceased.
3 Department of Educational Research, Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, ON, Canada.
4 Department of Radiology, Hospital St.-Luc, 1058 St-Denis St., Montreal, QC, H2X 3J4 Canada. Address correspondence to I. Trop.

AJR 2005;185:19–22. 0361–803X/05/1851–19. © American Roentgen Ray Society

The breadth of radiology research is expanding. Previously, a large proportion of radiology research projects were observational studies. Increasingly, research now involves groups of patients to whom specific interventions are administered in a randomized fashion. Analysis of data obtained from these experimental studies varies, depending on the end point of interest. Research protocols that are designed to evaluate the interval between entry of a patient into the study and the time until the event of interest are referred to as time-to-event studies, a form of follow-up study [1]. The event may be death in a diagnostic study of cancer or a progression of various chronic disease entities to a defined stage. In interventional studies, such as vascular and neuroradiologic procedures, the fate of grafts, stents, and other devices may be followed through time. Survival analysis, also called "life table" analysis, refers to the methodology of analysis of data gathered in such protocols. Survival analysis, then, is the topic of this article [2].

Overview

Under ideal circumstances, a study would enroll all its subjects simultaneously and follow them either for a fixed period of time or until they all reach some end point, such as recovery or death. However, more commonly, studies require a large number of subjects or look at relatively rare conditions, and so must enter subjects over a period of several months or even years. When the study finally ends, the subjects will have been followed for varying lengths of time, during which a number of different outcomes have to be considered: the event has not yet occurred (outcome C), some patients are lost to follow-up (outcome L), or the event has occurred (an example of the event or end point is death) (outcome D).

Figure 1 shows how we can illustrate these different outcomes, indicating what happened to the first 10 patients in a study. Subjects A, C, D, and F died during the trial; they are labeled "D" for dead. Subjects B, G, and I were lost to follow-up, hence the label "L," at various times after they started the drug. The other subjects, E, H, and J (labeled "C"), were still alive at the time the trial ended. These last three data points are called "right-censored." Subjects are considered "censored" when their data are incomplete. They are said to be right-censored because they have been followed to the end of the study (the "right-hand part" of the graph), but the outcome of interest has not occurred to them. To be more quantitative about the data, Table 1 shows how long each person was in the study and what the outcome was.

The Kaplan-Meier Approach to Survival Analysis

To do a survival analysis, we must figure out how many people survive for at least 1 year, for at least 2 years, and so on, in what

TABLE 1: Outcomes of the First 10 Subjects

Subject | Length of Time in Trial (months) | Outcome
A | 61 | Died
B | 111 | Lost
C | 29 | Died
D | 46 | Died
E | 92 | Censored
F | 22 | Died
G | 37 | Lost
H | 76 | Censored
I | 14 | Lost
J | 45 | Censored

Note—Reprinted with permission from [4].
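The Kaplan-Meier product-limit computation that this section goes on to develop can be previewed with the Table 1 data. This sketch is ours, not the article's; subjects who were lost to follow-up and subjects still alive at the study's end are both treated as censored, as the text describes.

```python
# Follow-up time (months) and whether the event (death) occurred, from Table 1.
# "Lost" and "Censored" subjects are both treated as censored observations.
subjects = {
    "A": (61, True),  "B": (111, False), "C": (29, True),  "D": (46, True),
    "E": (92, False), "F": (22, True),   "G": (37, False), "H": (76, False),
    "I": (14, False), "J": (45, False),
}

death_times = sorted({t for t, died in subjects.values() if died})

survival = 1.0
km = {}   # Kaplan-Meier survival estimate just after each death time
for t in death_times:
    at_risk = sum(1 for ti, _ in subjects.values() if ti >= t)
    deaths = sum(1 for ti, died in subjects.values() if ti == t and died)
    survival *= (at_risk - deaths) / at_risk
    km[t] = survival
```

For these 10 subjects the estimate steps down at 22, 29, 46, and 61 months, ending at 7/15, or about 0.47; censored subjects leave the risk set without forcing a step in the curve.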




Fig. 1—Entry and withdrawal of subjects in a 10-year study. (Reprinted with permission from [4])

Fig. 2—Figure 1 redrawn so all subjects have a common starting date. (Reprinted with permission from [4])

Fig. 3—Survival curve for data in Table 2.

Fig. 4—Survival curves for both groups in study of patients with intramural hematoma of the aorta. (Reprinted with permission from [4])

Fig. 5—Probability of survival after aortic intramural hematoma (IMH) in 66 study patients. Small triangles indicate censored cases.

Fig. 6—Cumulative survival of patients with intramural hematoma (IMH) with (experimental group) and without (control group) treatment with β-blockers. Upper curve (triangles) indicates treated patients; small squares indicate censored cases. Difference between two subgroups was statistically significant (p = 0.004).
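Figures 1 and 2 differ only in the time origin: Figure 1 plots each subject against calendar time, whereas Figure 2 shifts every subject to a common starting date. A minimal sketch of that shift, using invented entry and exit dates (chosen here so the resulting durations match subjects A–C of Table 1; the dates themselves are not from the article):

```python
from datetime import date

# Hypothetical entry/exit dates for three subjects of a 10-year study
# (invented for illustration; the figures show all 10 subjects).
subjects = {"A": (date(1991, 3, 1), date(1996, 4, 1)),
            "B": (date(1992, 6, 1), date(2001, 9, 1)),
            "C": (date(1995, 1, 1), date(1997, 6, 1))}

# Figure 1 plots calendar dates; Figure 2 keeps only the elapsed time
# since each subject's own entry, so all lines start at time 0.
def months_on_study(entry, exit):
    return (exit.year - entry.year) * 12 + (exit.month - entry.month)

shifted = {name: months_on_study(*span) for name, span in subjects.items()}
print(shifted)  # {'A': 61, 'B': 111, 'C': 29}
```

The absolute calendar dates carry no information for the survival analysis; only the elapsed time on study (and whether the exit was an event or a censoring) is used.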

Survival Analysis

is called a “life table” technique. There are two ways to go about calculating a life table: the actuarial approach and the Kaplan-Meier approach [3]. The Kaplan-Meier approach is far more common in medical literature, so we will describe it.

The first step involves redrawing the graph so that all the people appear to start at the same time. Figure 2 shows the same data as Figure 1; however, instead of the x-axis being Calendar Year, it is now Number of Years in Study. The lines are all the same length as in Figure 1; they have just been shifted to the left so that they all begin at time 0.

The Kaplan-Meier approach uses the exact time of death in the calculation of survival. It also computes the survival function only when an outcome occurs. To show how this is done, let us use the data for the 10 subjects in Table 1. The first step is to rank-order the lengths of time in the trial and flag which entries reflect the outcome of interest (death in this case) and which are due to withdrawal or censoring. We have done this by putting an asterisk after the data for subjects who were lost to follow-up or were censored by the termination of the study:

14* 22 29 37* 45* 46 61 76* 92* 111*

This data set would generate a life table (Table 2) with only four rows, one for each of the four patients who died.

One person was lost to follow-up before the first person died, so the number of remaining patients at risk at 22 months is only nine. Death rate, survival rate, or any other statistical estimate is calculated on the basis of the population at risk (Table 2). At 46 months, two people had died and three were lost to follow-up, so the number of patients at risk is five, and so on. This little data set would generate a survival curve like that shown in Figure 3, except with fewer steps.

TABLE 2: Kaplan-Meier Life Table Analysis of the Data in Table 1

Time (months)   No. at Risk   No. of Deaths   Death Rate   Survival Rate   Cumulative Survival Rate
t               Rt            Dt              qt           pt              Pt
22              9             1               0.1111       0.8889          0.8889
29              8             1               0.1250       0.8750          0.7778
46              5             1               0.2000       0.8000          0.6222
61              4             1               0.2500       0.7500          0.4667

Note—Reprinted with permission from [4].

Comparing Two (or More) Groups with the Log-Rank Test
Although the survival curve shown in Figure 3 tells us what happened to patients over time, we often want to compare two or more groups of patients—for example, patients with different kinds of stents, or patients who were screened (experimental group) versus patients who were not screened (control group). So we will create an expanded survival table with 250 experimental subjects and 250 control patients. These data are presented in Figure 4. This graph shows that the survival curve for the treatment group dropped at a faster rate than that for the control group. But is the difference statistically significant?

The best approach for evaluating whether the difference is indeed significant is to use the Mantel-Cox log-rank test, which is a modification of the Mantel-Haenszel chi-square test [4]. This test is a powerful method for analyzing data when the time to the outcome is important; it deals with censored data and differential length of follow-up of different subjects. As with most chi-square tests, the log-rank test compares the observed number of events with the number expected, under the assumption that the null hypothesis of no group differences is true. That is, if there were no differences between the groups, then at any interval, the total number of events should be divided between the groups roughly in proportion to the number of subjects at risk. The test determines how much the observed event rate differs from the expected rate.

The Cox Proportional Hazards Model
A more sophisticated method of analysis commonly used, which examines the difference in the survival curves while also accounting for other variables (covariates), is the Cox proportional hazards model [5]. Unlike the log-rank test, the proportional hazards model allows adjustment for any number of covariates, whether they are discrete (e.g., the technique used [CT or MRI]) or continuous (e.g., age or serum electrolyte level), and then computes a test for each, including, of course, a statistical test of the difference overall between the treatment and control groups. Both survival and hazard functions can refer to outcomes other than death. In the Cox model, this hazard is assumed to be separable into a product of one function that depends on time and another function that captures all the other variables including, specifically, the relative difference between treatment and control groups.

No matter which form of survival analysis statistical test is used, four assumptions must be met:
• Each person must have an identifiable starting point. All subjects should enter the trial at the same time in the course of their illness. Using diagnosis as an entry time can be problematic, because people may have had the disorder for varying lengths of time.
• A clearly defined and uniform end point is required. This is not a problem if the end point is death, but it can be a problem if the end point is recurrence of disease.
• The reasons that people drop out of the study cannot be related to the outcome. If persons have dropped out because they can no longer travel to their scheduled appointments as a result of the worsening of symptoms of the disease under study, the chances of survival could be seriously overestimated.
• Diagnostic and treatment practices must not change over the life of the study. Otherwise, any changes we see may be due to these secular changes rather than the intervention.

We have said that survival or life table analysis allows us to look at how long people are in one state (e.g., life) followed by a discrete outcome (e.g., death). This analysis can handle situations in which the people enter the trial at different times and are followed up for varying periods; it also allows us to compare two or more groups [4]. The methods of life table (survival) analysis have been increasingly used in diagnostic imaging research in recent years, and we therefore offer a review of a recent, relevant research study [6].

This multicenter study evaluated patients with intramural hematoma of the aorta and hospital admission less than 48 hr after onset of initial symptoms. Patients were enrolled between January 1994 and December 2000 after confirmation of intramural hematoma on two imaging studies (transesophageal

echocardiography, CT, or MRI). Sixty-six patients were consecutively enrolled over the course of 7 years. They were subjected to medical treatment in an ICU setting and surgical treatment if indicated (criteria for surgical intervention are available in the original article). Follow-up of these patients ranged from 6 to 123 months and included outpatient visits and CT 6 months after the event and yearly thereafter.

From the raw data collected from 66 patients, a Kaplan-Meier curve was built (Fig. 5). Dissecting Figure 5, we obtain the following information: survival is set at 100% at the beginning of the study, when patients initially present to the emergency department. Each ladder step indicates a drop in survival—that is, the death of a patient, because that was the event defined as the main outcome. A rapid decline ensues because close to 20% of patients die in the acute phase. The first loss of information occurs around 6 months, when the first follow-up is scheduled. The triangles indicate censored data, and the figure shows that at 20 months, 12 patients have already been censored. Figure 5 shows that the drop in survival is faster in the initial months after intramural hematoma: the curve drops faster between 20 and 60 months than later in the study.

Differential survival of subgroups of the study was assessed using the log-rank test. The resulting Kaplan-Meier curves obtained from comparison of patients who received oral β-adrenergic receptor blockers (experimental group) and those who did not (control group) are displayed in Figure 6. Visual analysis easily reveals that patients taking β-blockers (upper curve) enjoyed much greater survival than patients who did not receive the medication (lower curve). In fact, the upper curve shows that only one patient died early in the study and that subsequently all patients from whom information is available are still alive. Many censored data points are seen, however; if these censored patients had in fact died without the knowledge of the study's investigators, the conclusion that β-blockers have a protective effect would be false, but there is no reason to believe that this occurred. The log-rank test performed on these two subgroups of patients revealed important information that was embedded in the initial Kaplan-Meier curve (Fig. 5) and could not have been obtained had it not been for this separate analysis.

Conclusion
In this article, we address life table and survival analysis and describe life table techniques such as the Kaplan-Meier approach. For the comparison of two or more groups, we describe the Mantel-Cox log-rank test. Finally, we discuss the Cox proportional hazards model, which examines the difference in the survival curves and also accounts for other variables (covariates). These statistical methods allow one to work with nontraditional units of analysis: person–time rather than person only. These tools are seen increasingly in the research literature and are gaining popularity in radiology research. These methods of data analysis have potential applications in many fields of radiology, most notably in the analysis of screening techniques and interventional studies.

Acknowledgments
Sadly, Dr. Stolberg passed away last January. His determination and enthusiasm were key in seeing this project to completion, and we are indebted to him for all he accomplished.

We express our appreciation and gratitude to Monika Ferrier for her patience and support, which is always a difficult task with several authors. She has kept us on track and prepared the manuscript.

References
1. Norman GR, Streiner DL. PDQ statistics, 2nd ed. St. Louis, MO: Mosby, 1997
2. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with confidence, 2nd ed. London, UK: BMJ Books, 2000
3. Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Statist Assoc 1958; 53:457–481
4. Norman GR, Streiner DL. Biostatistics: the bare essentials, 2nd ed. Hamilton, ON: B. C. Decker, 2000
5. Cox DR. Regression models and life tables. J Roy Statist Soc 1972; 34:187–220
6. Von Kodolitsch Y, Csosz SK, Koschyk DH, et al. Intramural hematoma of the aorta: predictors of progression to dissection and rupture. Circulation 2003; 107:1158–1163
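To make the computations described above concrete, here is a minimal pure-Python sketch (not part of the original article; the function names and the toy data passed to the log-rank function are ours). kaplan_meier reproduces the four rows of Table 2 from the Table 1 data, and logrank_chi2 implements the observed-versus-expected bookkeeping described for the Mantel-Cox log-rank test:

```python
# The 10 subjects of Table 1 as (months in trial, death observed) pairs;
# False marks the asterisked subjects (lost to follow-up or censored).
data = [(14, False), (22, True), (29, True), (37, False), (45, False),
        (46, True), (61, True), (76, False), (92, False), (111, False)]

def kaplan_meier(data):
    """One row (t, at risk, deaths, qt, pt, cumulative Pt) per death time."""
    rows, cumulative = [], 1.0
    for t in sorted({u for u, died in data if died}):
        at_risk = sum(1 for u, _ in data if u >= t)
        deaths = sum(1 for u, died in data if u == t and died)
        q = deaths / at_risk          # death rate at this event time
        cumulative *= 1 - q           # survival updated only when an event occurs
        rows.append((t, at_risk, deaths, round(q, 4), round(1 - q, 4),
                     round(cumulative, 4)))
    return rows

def logrank_chi2(groups):
    """Two-group log-rank statistic.
    groups: {0: [(time, event), ...], 1: [(time, event), ...]}."""
    pooled = [(t, e, g) for g, subjects in groups.items() for t, e in subjects]
    obs1 = exp1 = var = 0.0
    for t in sorted({u for u, e, _ in pooled if e}):
        at_risk = [(u, e, g) for u, e, g in pooled if u >= t]
        n = len(at_risk)
        n1 = sum(g for _, _, g in at_risk)                  # group 1 at risk
        d = sum(1 for u, e, _ in at_risk if u == t and e)   # deaths at t
        d1 = sum(1 for u, e, g in at_risk if u == t and e and g == 1)
        obs1 += d1
        exp1 += d * n1 / n            # deaths split in proportion to those at risk
        if n > 1:                     # hypergeometric variance term
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (obs1 - exp1) ** 2 / var

for row in kaplan_meier(data):
    print(row)
```

The log-rank statistic is referred to a chi-square distribution with 1 degree of freedom for a two-group comparison, so values above 3.84 are significant at the 0.05 level.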

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005
18. Correlation and Regression, July 2005

Research • Fundamentals of Clinical Research for Radiologists

Multivariate Statistical Methods

Nancy A. Obuchowski1

Obuchowski NA

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 20th in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

Received November 17, 2004; accepted after revision November 23, 2004.

1Department of Quantitative Health Sciences and Department of Radiology, Rm. Wb4, The Cleveland Clinic Foundation, 9500 Euclid Ave., Cleveland, OH 44195. Address correspondence to N. A. Obuchowski.

AJR 2005; 185:299–309

0361–803X/05/1852–299

© American Roentgen Ray Society

What Is “Multivariate”?
In radiology studies we often measure more than one end point, or outcome variable, on each patient. “Multivariate” means multiple outcome variables measured on the same patient. We might use multiple end points in a study for several reasons. In designing a study we might not know which end point is important, so we measure a variety of end points to find which ones are important. In other studies, we may have a set of variables that have been shown in the past to be important or that are important for clinical reasons, so we measure the set of variables.

Many examples of multivariate data occur in radiology studies. Consider the following five examples. A study was conducted to assess the effects of diagnostic imaging information on patients with lower back pain (Michael T. Modic, personal communication). Half the study patients were given the results of their imaging test; the other patients were not given their results. Six weeks later, the investigators recorded five variables (i.e., pain, function, absenteeism, quality of life, and self-efficacy) on each patient and compared the two groups. In a second study [1], mammographers were randomized to one of two groups: an intervention group to improve reviewer performance or a control (no intervention) group. The two groups interpreted mammograms before and after the intervention period. For this study there were two outcome variables: change in reviewer performance on mammograms with malignant lesions and change in reviewer performance on mammograms not containing malignant lesions. A third study compared the cardiovascular effects of an intensive, cholesterol-reducing diet with those of a standard diet (R. Brunken, C. Esselstyn, personal communication). Each patient in the two groups underwent PET before and after a short trial on the diets. Two variables were measured on the PET scans: the change in size and the change in severity of perfusion abnormalities. A fourth study was performed to assess the image quality of abdominal CT [2]. Patients' images were scored on 18 characteristics (i.e., 18 outcome variables): organ edge sharpness measured at six sites, visibility of 10 different vessels, and motion of the abdominal wall above and below the umbilicus. The final example comes from a study investigating the quantitative characteristics that can be used to distinguish benign and malignant breast lesions on MRI (Radhika Sivaramakrishna, personal communication). Four variables—margin fluctuation, tumor border roughness, entropy from 2D surface temperature, and a function of the convex hull area—were measured on each lesion and were used to distinguish the two types of lesions.

When Should We Use Multivariate Statistical Methods?
There are essentially three situations when multivariate statistical methods are needed [3]. This section describes each situation and provides examples. The appropriate multivariate statistical methods are applied later for each example.

First, multiple individual variables may be of interest to us, and we want to explore each one. The lower back pain study has five variables of interest (pain, function, absenteeism, quality of life, and self-efficacy), and we want to explore the effect of diagnostic imaging information on each. One common approach would be to test each variable and report the resulting p value; if a p value is less than the conventional level of 0.05, or 5%, then we might conclude that diagnostic information has an effect on this variable. Such an approach can provide misleading results. For example, suppose that diagnostic information really has no effect on any of the five variables. If we adopt a 5% significance level for

AJR:185, August 2005 299


each variable, then we have a 5% chance of incorrectly concluding that diagnostic information has an effect on the particular variable, and a 95% chance on each variable of correctly concluding that there is no effect. If the five variables act independently of each other, the probability of drawing the correct conclusion on all five variables is (0.95)^5 = 0.77, or only 77%. There is a 23% chance that we will make at least one mistake. This is known as the experiment-wise error rate [3]. Note that here we have assumed that the variables are independent; often, however, they are correlated in some way. This means that 0.23 is the upper limit on the probability of making at least one mistake (the good news), but now we cannot calculate the exact probability (the bad news) [3].

This example illustrates that p values from individual statistical tests (i.e., univariate analyses) are not necessarily significant just because they are less than 0.05 [4]. A simple solution, which is particularly useful when the variables are only loosely biologically related to one another, is to calculate and report adjusted p values [4]. Adjusted p values can be compared with 0.05, and if they are less than 0.05, then we can conclude that the variable is significant. With adjusted p values, if there really is no effect for any of the variables, then there is a 5% chance that we will make one or more mistakes and a 95% chance that we will make the correct conclusion on all the variables. We describe and illustrate this approach in the section on Adjusted p Values.

In the second situation, we have a set of variables that we are interested in examining as a set. In many situations the variables in the set are measured on two groups, and we are interested in the patterns of differences between the two groups for the set of variables. If differences between the groups are found for the set of variables, then we may want to explore which variable or group of variables is different for the two groups, but this is a secondary issue. It can happen that no one variable distinguishes the two groups, but the combination of variables in the set distinguishes the two groups well.

For this second situation, we present two examples that illustrate slightly different statistical methods. In the mammography study described earlier, the primary question focuses on the differences in the change in performance of the two groups of physicians; for this primary question, the change in performance on images containing a malignant lesion and the change in performance on images not containing a malignant lesion are treated as a single set. If a difference in this set is found between the two groups of physicians, then we might like to investigate whether a difference exists in performance just on images with a malignant lesion, just on images without a malignant lesion, or both. In this example it is not clear whether both measures of reviewer performance will be improved (or worsened) by the intervention, or whether one measure will be affected and the other measure will not be affected, or even whether the measures will be affected in opposite ways. For this example, we will apply a multivariate test, called the Pillai-Bartlett test [3], that looks for any type of difference between the two groups.

In the cardiovascular diet study, the primary question is whether diet affects perfusion abnormalities. If a difference is found in the perfusion abnormalities of patients with and those without the intensive diet, then we would like to investigate which variable is most affected by the diet; however, this is a secondary issue. The two outcome variables in this example (i.e., extent and severity of perfusion abnormalities) are closely biologically related. We expect them both to improve or neither to improve with the intensive diet. They may be improved by the diet to different degrees, or magnitudes, but we expect them to be affected in the same direction by the diet. For this example we will apply a multivariate test, a linear combination test that takes into account the close relationship of the variables and their consistent direction for change.

The third situation in which multivariate methods are needed is when we are not particularly interested in the raw variables themselves, but rather in the use of a combination or a subset of them. We again present two examples, each with different goals for the analysis. First, in the image quality study the 18 questions posed to the reviewers represented a list of important image quality characteristics, but none of the questions by themselves is of primary interest. Furthermore, with 18 variables and only 37 total patients in this study, we have far too many variables to investigate with this sample of patients. To reduce the number of variables, we could just discard some variables on the basis of some preliminary analyses; however, it would be better to keep all of the information if we could condense it into new, fewer variables. Multivariate methods such as cluster analysis can be used to identify similar groups, or clusters, of variables. Then, if the grouping of all variables makes sense to us, we can create a new variable from the variables in each group. In this way we have reduced the data from 18 variables into however many groupings we think are appropriate, and we have created new variables that are functions of the old variables; no variables are omitted.

In the MRI breast imaging study, the goal was to identify the variable or set of variables that best distinguished known benign lesions from known malignant lesions. Once a variable or set of variables is found, then it can be used in the future to differentiate lesions of unknown status. The multivariate methods needed here are different from those needed in the previous example. In the CT image quality study, we wanted to group the variables, not the patients or lesions; we had no way of knowing if the groupings of variables were correct. In the MRI breast lesion study, we want to group the lesions into benign or malignant; because we know the pathology of each lesion, we know whether the groupings are correct. For this example, we will use multivariate methods such as discriminant analysis and multiple-variable logistic regression analysis to identify the best set of variables for grouping lesions of known status.

Adjusted p Values (Lower Back Pain Example)
In an ongoing study, patients with an acute episode of lower back pain were consented for the study and underwent MRI. Patients were randomized at presentation to one of two groups: diagnostic imaging information provided at presentation versus diagnostic imaging information not provided. Patients in the first group were told about the findings on their MRI examination, whereas patients in the latter group were not provided any information about the findings of their examination. Six weeks later, patients in both groups recorded their pain, function, quality of life, self-efficacy, and absenteeism using standardized questionnaires.

Because this is an ongoing study, we do not have raw data to present. For calculation of adjusted p values, however, we just need the p values from the univariate analysis of each variable (i.e., the unadjusted p values). Table 1 provides an illustrative example of the sort of findings that might be obtained. In the second column, p values are presented from Student's t tests on each of the five variables. Quality of life, function, and pain are all significant at the 0.05 level.
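The adjustment just introduced (and detailed step by step in the following section) is easy to script. A minimal Python sketch of this step-down (Holm-type) procedure, using the unadjusted p values of Table 1 (the function name and variable labels are ours, not the article's):

```python
# With five independent tests at alpha = 0.05, the experiment-wise error
# (chance of at least one false-positive) is 1 - 0.95**5, about 0.23.
pvals = {"quality of life": 0.003, "function": 0.007, "pain": 0.048,
         "self-efficacy": 0.070, "absenteeism": 0.145}

def adjust(pvals, alpha=0.05):
    """Step-down adjustment (Steps 1-4 of the text).
    Returns (variable, adjusted p, conclusion) in order of increasing p."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])       # Step 1
    n, out, prev_adj, rejecting = len(items), [], 0.0, True
    for i, (name, p) in enumerate(items, start=1):
        r = round((n - i + 1) * p, 4)                         # Step 2
        adj = max(prev_adj, r)    # adjusted p values must be nondecreasing
        prev_adj = adj
        if rejecting and r <= alpha:                          # Steps 3 and 4
            out.append((name, adj, "Reject"))
        else:
            rejecting = False     # once one test fails, all later ones are NS
            out.append((name, adj, "NS"))
    return out

for row in adjust(pvals):
    print(row)
```

The `max(prev_adj, r)` line is what forces the sequential ordering noted in the Table 1 footnote: self-efficacy's r value (0.140) is smaller than pain's (0.144), so its adjusted p takes on the previous value, 0.144.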

TABLE 1: Calculation of Adjusted p Values for Lower Back Pain Example

Variable (i)          Unadjusted p (pi)   ri      Conclusion   Adjusted p
Quality of life (1)   0.003               0.015   Reject       0.015
Function (2)          0.007               0.028   Reject       0.028
Pain (3)              0.048               0.144   NS           0.144
Self-efficacy (4)     0.070               0.140   NS           0.144a
Absenteeism (5)       0.145               0.145   NS           0.145

Note—NS = not significant.
aAdjusted p values are usually just the ri values. However, because the adjusted p values must be sequentially ordered, the adjusted p value for self-efficacy takes on the value of the previous adjusted p value.

There are several methods for calculating adjusted p values [4]; we describe and illustrate one simple method [5] here.

Step 1
Order the unadjusted p values from smallest to largest, so that p1 < p2 < p3 < … < pi < … < pn, where pi is the i-th variable and n is the total number of outcome variables. The unadjusted p values for the lower back pain study have been ordered in this way in Table 1. Note that n = 5 in this example.

Step 2
Compute the value of ri as (n – i + 1)pi. For example, r1 = np1 and r2 = (n – 1)p2. The values of ri for the lower back pain study are given in the third column of Table 1.

Step 3
Compare r1 with the planned type 1 error rate (usually 0.05). If r1 is less than or equal to the planned type 1 error rate, then conclude that the variable is statistically significant and continue to step 4. If r1 is greater than the planned type 1 error rate, then we conclude that none of the n variables is statistically significant. In the lower back pain example, r1 equals 0.015 and is less than 0.05. So we conclude that quality of life is affected by diagnostic imaging information.

Step 4
Continue to compare ri with the planned type 1 error rate, starting with i = 2 and continuing to i = n. If ri is less than or equal to the planned type 1 error rate and if the previous variable (i.e., i – 1) is determined to be statistically significant, then conclude that the i-th variable is also statistically significant. As soon as a variable is determined not to be statistically significant, then all remaining variables are also considered not statistically significant. In the lower back pain study, the first two variables (i.e., quality of life and function) are statistically significant. The variable “pain” is not statistically significant (i.e., r3 > 0.05); thus, none of the remaining variables (i.e., self-efficacy and absenteeism) is considered to be statistically significant.

On the basis of the unadjusted p values, we would have concluded that quality of life, function, and pain are all affected by diagnostic information. However, we know that the experiment-wise error rate (i.e., the overall significance level) greatly exceeded 5%. On the basis of the adjusted p values, we conclude that quality of life and function (not pain) are affected by diagnostic information. Because these are adjusted p values, the overall significance level has been maintained at equal to or less than 5%.

Pillai-Bartlett Statistic (Mammography Example)
Pepe et al. [1] describe a study design to test whether a specific intervention (i.e., an educational program) improves the performance of mammographers. Radiologists are randomly assigned to either the intervention group or a control (i.e., no intervention) group. The radiologists in both groups first interpret a common set of images. The performance of each reviewer for images with and without breast cancer (e.g., sensitivity and false-positive rate [FPR], respectively) is recorded. After the intervention period, a second set of images is interpreted by the same radiologists. The authors want to test whether the radiologists' performances are altered by the intervention.

(Note that for convenience, we will use the terms “sensitivity” and “false-positive rate” to denote the two measures of reviewer performance. In this example, however, we are emphasizing the performance of a sample of reviewers on sets of fixed images; we are deemphasizing the sampling of the patients for the study. A variety of statistical methods are available [6–8] for characterizing and comparing diagnostic accuracy that take into account the sampling of both patients and reviewers.)

Table 2 summarizes a set of fictitious data (i.e., no actual data were reported by Pepe et al. [1]). The first two columns are the changes in sensitivity and FPR for the intervention and control groups for the 14 reviewers (seven per group). The sensitivity changes are illustrated in Figure 1A, and the FPR changes are illustrated in Figure 1B. Note that in both figures there is considerable overlap—that is, physicians in the two groups have similar increases in sensitivity and similar increases in FPR. In fact, t tests (i.e., univariate analysis) on the changes in sensitivity and FPR indicate no statistically significant differences between the control and intervention groups (last column of Table 3).

Figure 1C illustrates, simultaneously, the changes in sensitivity and FPR. The figure shows two distinctly separate groups of data points—that is, physicians in the control group have changes toward the lower left, whereas physicians in the intervention group have changes toward the upper right; the distinction is not apparent from the univariate displays of the data.

Clearly, we want a test statistic that takes both measures of performance into account simultaneously. There are four well-known and related test statistics that can be applied here: the Pillai-Bartlett trace (also called the Pillai-Bartlett or Bartlett statistic), Wilks' lambda (also called the likelihood ratio test statistic), the Hotelling-Lawley trace, and Roy's largest eigenvalue statistic (also called Roy's maximum characteristic root or the union-intersection statistic). Most statistics packages will output the results of all four. They require certain assumptions about the basic data distributions—that is, that the data follow a multivariate normal distribution and that the variances and covariances in each group are identical (i.e., homoskedastic). Many different methods are available for assessing the multivariate normality and homogeneity of variance and covariance assumptions. Some simple methods are described and illustrated as follows:

Assessing Multivariate Normality Assumption
The following are the steps for assessing the multivariate normality assumption [9, 10].

Step 1—For each (treatment) group and each outcome variable, test that the data follow a univariate normal distribution. This is best accomplished by calculating the Shapiro-Wilk W test and examining the statistical measures called skewness (i.e., symmetry) and kurtosis (i.e., peakedness). Most standard

statistical packages can do this for you. In SAS [11], the code for our mammography example is proc univariate normal; by trt; var sen fpr;. (Note that sen and fpr are the variable names for the change in sensitivity and change in FPR; trt is the variable name for the treatment effect: 1 = intervention, 0 = no intervention.) The p values for the Shapiro-Wilk test are 0.766 and 0.857 for sensitivity and FPR for the control group, and 0.278 and 0.100 for sensitivity and FPR for the intervention group; because these p values exceed 0.05, the univariate assumption is reasonable. The skewness values are –0.07 and 0.26 for sensitivity and FPR for the control group, and –1.09 and –0.70 for sensitivity and FPR for the intervention group. The skewness values should be near 0 for univariate normality. For this small sample size, these values are not unreasonable. (Note that for large sample sizes, we expect the values to be much closer to 0.) Finally, the kurtosis values are –1.31 and –0.79 for sensitivity and FPR for the control group, and 0.72 and –1.29 for sensitivity and FPR for the intervention group. The kurtosis values should be near 3 for univariate normality, but SAS outputs the kurtosis values minus 3. Thus, in our example we examine the kurtosis values to see if they are near 0; again, for this small sample size, these values are not unreasonable.

Step 2—For each (treatment) group, compute all of the principal components and then compute the skewness and kurtosis measurements for each principal component. Most standard statistical packages compute principal components. The SAS code for our mammography example is proc princomp out = prin; by trt; var sen fpr; proc univariate; by trt; var prin1 prin2;. In our mammography example, the skewness values are –1.02 and –0.36 for the first and second principal components for the control group, and 0.01 and 0.43 for the first and second principal components for the intervention group. If the data follow a multivariate normal distribution, then (n / 6) × (sum of the squared skewness measurements) should follow a chi-square distribution with p degrees of freedom, where p is the number of outcome variables. In our example, p = 2, and the test statistic for the control group is (7 / 6)(1.04 + 0.13) = 1.37; the test statistic for the intervention group is (7 / 6)(0.0001 + 0.18) = 0.21. The critical value from a chi-square distribution with 2 degrees of freedom is 5.99; because 1.37 is less than 5.99 and 0.21 is less than 5.99, the multivariate normal assumption is reasonable. The kurtosis values are –0.06 and –1.55 for the first and second principal components for the control group, and –1.07 and 0.06 for the first and second principal components for the intervention group. If the data follow a multivariate normal distribution, then (n / 24)^(1/2) × (sum of the kurtosis measurements) should follow a standard normal distribution. In our example, the test statistic for the control group is (7 / 24)^(1/2) × (–0.06 – 1.55) = –0.87; the test statistic for the intervention group is (7 / 24)^(1/2) × (–1.07 + 0.06) = –0.55. Because both test statistics have values between –1.96 and 1.96, we conclude that the multivariate normal distribution is reasonable.

Steps for Assessing Variance and Covariance Homogeneity Assumption

Step 1—For each variable, test if the variances of the groups are the same. When there are two groups, this can be done easily with most statistical packages. In SAS, a test for homogeneity of variances is performed when the t test procedure is executed. The SAS code for our example is proc ttest; class trt; var sen fpr;. After the t test results, SAS outputs the results of the test of the hypothesis of equal variances. For sensitivity, the p value of this test is 0.577, indicating that we can assume equal variances in the two groups for this variable. For FPR, the test for equal variances gives a p value of 0.445, indicating that we can assume equal variances in the two groups for this variable, as well.

Step 2—For each group, examine the correlation coefficient(s) between the variables. In SAS we can obtain Pearson’s correlation coefficient between sensitivity and FPR for the two groups by executing the following code: proc corr; by trt; var sen fpr;. For the control group, the Pearson’s correlation coefficient

TABLE 2: Changes in Reviewer Performance During the Intervention Period (Fictitious Data)

Change in     Change in   Sensitivity Before   Sensitivity After   FPR Before     FPR After
Sensitivity   FPR         Intervention         Intervention        Intervention   Intervention

Control group (n = 7 reviewers)
0.10          0.05        0.80                 0.90                0.04           0.09
0.12          0.06        0.75                 0.87                0.06           0.12
0.08          0.07        0.88                 0.96                0.05           0.12
0.13          0.08        0.77                 0.90                0.08           0.16
0.07          0.11        0.90                 0.97                0.10           0.21
0.14          0.08        0.66                 0.80                0.07           0.15
0.10          0.10        0.59                 0.69                0.12           0.22

Intervention group (n = 7 reviewers)
0.16          0.05        0.70                 0.86                0.05           0.10
0.16          0.06        0.75                 0.91                0.06           0.12
0.15          0.09        0.80                 0.95                0.08           0.17
0.12          0.10        0.85                 0.97                0.10           0.20
0.14          0.12        0.76                 0.90                0.11           0.23
0.11          0.12        0.88                 0.99                0.09           0.21
0.07          0.12        0.85                 0.92                0.13           0.25

Note—Changes are defined as performance after intervention minus performance before intervention. FPR = false-positive rate.
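The multivariate normality arithmetic in Step 2 can be checked without SAS. The following Python sketch (an editorial illustration, not part of the original analysis) implements the two summary statistics described in the text, (n / 6) × (sum of squared skewness values) and (n / 24)^(1/2) × (sum of kurtosis values), using the principal-component skewness and kurtosis values reported for the mammography example:

```python
import math

def skewness_chi_square(skews, n):
    """Chi-square statistic: (n / 6) * sum of squared PC skewness values."""
    return (n / 6) * sum(s * s for s in skews)

def kurtosis_z(kurts, n):
    """Normal-deviate statistic: sqrt(n / 24) * sum of PC kurtosis values."""
    return math.sqrt(n / 24) * sum(kurts)

# Principal-component values reported in the text (n = 7 reviewers per group).
control_chi = skewness_chi_square([-1.02, -0.36], n=7)  # ~ 1.37
interv_chi = skewness_chi_square([0.01, 0.43], n=7)     # ~ 0.21
control_z = kurtosis_z([-0.06, -1.55], n=7)             # ~ -0.87
interv_z = kurtosis_z([-1.07, 0.06], n=7)               # ~ -0.55

# Compare with the chi-square critical value (2 df) and the normal +/-1.96 bounds.
multivariate_normal_ok = (max(control_chi, interv_chi) < 5.99
                          and all(abs(z) < 1.96 for z in (control_z, interv_z)))
```

Both chi-square statistics fall below the 5.99 critical value and both normal deviates lie within ±1.96, matching the conclusion drawn in the text.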



Multivariate Statistical Methods

[Figure 1: A and B, bar graphs of the frequencies of changes in sensitivity and in false-positive rate; C, scatterplot of change in sensitivity (vertical axis, 0.0–0.30) versus change in FPR (horizontal axis, 0.0–0.30), with intervention and control reviewers plotted separately.]

Fig. 1—Changes in sensitivity and false-positive rate (FPR) during intervention period.
A and B, Bar graphs (where C = control, I = intervention) indicate changes in sensitivity (A) and FPR (B).
C, Scatterplot illustrates multivariate data for changes in sensitivity and FPR. Solid circles = intervention reviewers; open circles = control reviewers.
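The two-group multivariate comparison in this example (the Pillai-Bartlett test giving F = 4.10) can be reproduced from the Table 2 changes with 2 × 2 matrix algebra, because with two groups the Pillai-Bartlett trace is a simple function of Hotelling's two-sample T². The sketch below is a from-scratch Python illustration of the standard pooled-covariance formulation, not the SAS proc glm code used in the text:

```python
# Changes in (sensitivity, FPR) for each reviewer, from Table 2.
control = [(0.10, 0.05), (0.12, 0.06), (0.08, 0.07), (0.13, 0.08),
           (0.07, 0.11), (0.14, 0.08), (0.10, 0.10)]
intervention = [(0.16, 0.05), (0.16, 0.06), (0.15, 0.09), (0.12, 0.10),
                (0.14, 0.12), (0.11, 0.12), (0.07, 0.12)]

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in (0, 1)]

def cov_matrix(rows):
    """Unbiased 2x2 sample covariance matrix."""
    n = len(rows)
    m = mean_vector(rows)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in (0, 1):
            for j in (0, 1):
                s[i][j] += d[i] * d[j] / (n - 1)
    return s

n1, n2 = len(control), len(intervention)
m1, m2 = mean_vector(control), mean_vector(intervention)
s1, s2 = cov_matrix(control), cov_matrix(intervention)

# Pooled covariance matrix and the difference of the mean vectors.
sp = [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
       for j in (0, 1)] for i in (0, 1)]
d = [m2[0] - m1[0], m2[1] - m1[1]]

# Hotelling's two-sample T^2 = (n1*n2 / (n1+n2)) * d' * inv(Sp) * d,
# written out for the 2x2 case.
det = sp[0][0] * sp[1][1] - sp[0][1] * sp[1][0]
quad = (d[0] * d[0] * sp[1][1] - 2 * d[0] * d[1] * sp[0][1]
        + d[1] * d[1] * sp[0][0]) / det
t2 = (n1 * n2 / (n1 + n2)) * quad

# Exact F transformation for p = 2 outcomes; df = (2, n1 + n2 - 3).
p = 2
f_stat = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
# Pillai trace for the two-group case: T^2 / (T^2 + n1 + n2 - 2).
pillai = t2 / (t2 + n1 + n2 - 2)
```

The resulting F statistic, 4.10 on (2, 11) degrees of freedom, matches the value reported in the text.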
TABLE 3: Mean Changes (SDs) in Sensitivity and FPR for Control Group and Intervention Group

Outcome Variables   Control       Intervention   t Statistic   p
Sensitivity         0.11 (0.03)   0.13 (0.03)    1.55          0.148
FPR                 0.08 (0.02)   0.09 (0.03)    1.15          0.273

Note—FPR = false-positive rate.

between sensitivity and FPR is –0.32; the same correlation for the intervention group is –0.75. The two correlations differ in magnitude, but with a small sample size this is not unreasonable. So we conclude that the homogeneity assumption is reasonable.

Several authors [3, 12] have suggested that for general use, the first statistic—the Pillai-Bartlett statistic—should be used. Their rationale is based on several aspects of the tests’ performance, including the fact that the Pillai-Bartlett statistic performs well even when the multivariate normality and homogeneity of variances and covariances assumptions are not entirely met. When two groups are being compared, as in the mammography example, or if there is only one outcome variable (i.e., the univariate case), these four statistics give the same result anyway. The code for SAS to produce these test statistics for the mammography example is proc glm; class trt; model sen fpr = trt; manova h = trt;. For our example data, the F-statistic is 4.10 and has an associated p value of 0.047. Thus, we reject the null hypothesis that the intervention had no effect and conclude that mammographers’ performances are affected by the intervention.

In the multivariate setting, we need to construct simultaneous confidence intervals for the outcome measures. If there are k outcome measures (in our example, k = 2), then the confidence statements for all the k outcome measures hold simultaneously with a specified high probability (usually, 0.95). In other words, it is a guarantee of a specified probability against any of the k statements being incorrect. The formula for constructing a simultaneous confidence interval for the difference between two populations is given in the appendix. For our example, the 95% simultaneous confidence intervals for the difference in the change in sensitivity and FPR between the two groups of physicians are [–0.03 to 0.07] and [–0.03 to 0.05], respectively.

So far we have investigated only the changes in sensitivity and FPR between the two groups; however, the actual sensitivities and FPRs before and after intervention (last four columns of Table 2) may provide other


in sensitivity is negatively correlated to the preintervention sensitivity (Pearson’s correlation coefficient is r = –0.49, p = 0.073); in contrast, the change in FPR is positively correlated to the preintervention FPR (Pearson’s correlation coefficient is r = 0.90, p < 0.001). This suggests that reviewers with higher sensitivities before intervention experience a smaller increase in sensitivity after the intervention period, and reviewers with higher FPRs before intervention experience a larger increase in FPR. Reviewers’ preintervention sensitivities and FPRs do not appear to be related to one another (Pearson’s correlation coefficient, r = 0.03, p = 0.911). We might ask if, by chance, the reviewers in the intervention group tended to have lower preintervention sensitivities and higher preintervention FPRs than those in the control group; if true, this would suggest that the intervention does not affect performance after all. A simple comparison of the mean preintervention measures between the two groups does not support this, but this is one reason that baseline values (e.g., preintervention measures) should be evaluated.

With a larger sample, we could fit a model describing a reviewer’s postintervention performance (i.e., the dependent variable) as a function of baseline performance and treatment group. With data from only 14 reviewers, however, we probably don’t want to fit this complicated a model. Rather, we will use and report the results of the Pillai-Bartlett statistic based on the changes in performance, along with this simple analysis of the baseline performances.

Linear Combination Test (PET Perfusion Imaging Example)

In a pilot study, patients with coronary artery disease were randomized to either an intensive, lipid-lowering, plant-based diet (n = 5) or a standard diet of 30% calories as fat (n = 4). Patients’ hearts were studied by rubidium-82 PET perfusion imaging 3 weeks after beginning the diet; the 3-week images were compared with PET images taken at baseline (prediet). The changes in extent (i.e., size) and severity of perfusion abnormalities were recorded.

TABLE 4: Changes in PET Findings After 3 Weeks (Modified Data)

Intensive Diet (n = 5)    Standard Diet (n = 4)
Extent     Severity       Extent     Severity
8.54       2.97           4.99       –1.68
16.57      5.84           6.85       2.36
1.72       10.25          –19.27     0.0
0.55       –2.04          –1.36      –7.46
6.85       34.23

Note—Positive values for change indicate improvements (i.e., reduction in size or severity) in perfusion abnormalities at 3 weeks.

[Figure 2: scatterplot of change in extent (vertical axis, –20 to 30) versus change in severity (horizontal axis, –20 to 30) for the two diet groups.]

Fig. 2—Scatterplot illustrates data from study of PET perfusion imaging of patients with coronary artery disease who were randomized to an intensive diet (open circles) or to a standard diet (solid circles).

The data are given in Table 4 and illustrated in Figure 2; note that the data have been modified for proprietary reasons. The mean changes and SDs are summarized for the two groups in Table 5. The figure and the means in Table 5 suggest that the intensive diet may improve the extent and severity of perfusion abnormalities; however, the p values from the univariate analysis (last column of Table 5) are not significant at the 0.05 level. The sample sizes in the two groups are quite small, so we can conclude little about the effect of the diet. The Pillai-Bartlett trace test for these data gives an F-value of 1.97, with an associated p value of 0.220. Thus, on the basis of this general purpose multivariate test, we still find insufficient evidence to reject the null hypothesis.

We now perform a multivariate method [13] that takes advantage of the common direction we expect the outcome variables to take. It is a linear combination test, which means that it combines, in a linear fashion, the univariate test statistics, taking into account their correlation. This analysis will test the null hypothesis that diet has no effect on the two imaging variables, versus the alternative hypothesis that the diet affects the variables in the same direction. The same assumptions about the data—that is, that the data follow a multivariate normal distribution and that the variances and covariances in each group are identical—are again required. These assumptions appear reasonable for the data (analysis of assumptions follows the same steps as described in the previous example, but the details are not shown here).

To perform the linear combination test, we need the value of the univariate test statistics—that is, 1.47 and 1.61 (from the fourth column of Table 5), and the value of the correlation between the outcome variables. We will take an average of the Pearson’s correlation coefficients from the two groups, but because the sample sizes in the two groups are different, we will use an average weighted by the sample sizes. Pearson’s correlation coefficient between the changes in extent and severity for the intensive diet group is 0.06, based on five patients; the correlation for the control group is 0.02, based on four patients. The weighted average, or pooled correlation, is 0.04.

Pocock et al. [13] give the formula for the test statistic in matrix form, which can be simplified when there are only two outcome measures (see Appendix 2). The value of the test statistic for our example is 2.14. Pocock et al. report that for two outcome variables, their test statistic has an approximate t distribution with degrees of freedom equal to total sample size minus 4; for our example the degrees of freedom is 5; thus, the associated p value is 0.085. We do not reject the null hypothesis at the 0.05 level; however, this result might be


persuasive enough to encourage us to perform the full-scale study.

TABLE 5: Mean Changes (SDs) in Extent and Severity of Perfusion Abnormalities According to Diet

Outcome Variables   Intensive Diet   Standard Diet   t Statistic   p
Extent              6.85 (6.39)      –2.20 (11.91)   1.47          0.185
Severity            10.25 (14.13)    –1.70 (4.18)    1.61          0.150

Cluster Analysis (CT Image Quality Example)

In this study, Herts et al. [2] compared the image quality of helical CT of the abdomen for two scanning times: 0.75 versus 1 sec per revolution. Three radiologists each evaluated 18 separate image quality variables for 37 patients: 17 patients at 0.75 sec and 20 patients at 1.0 sec. They used a 10-point scale to evaluate each variable: 1 (blurry) to 10 (sharp) to describe organ edge sharpness at six sites; 1 (poorly visualized, unenhanced) to 10 (well visualized, enhanced) to describe vessel visibility at 10 sites; and 1 (frequent or large-scale) to 10 (none detected) to describe motion of the abdominal wall above and below the umbilicus. With 37 patients evaluated by three reviewers on each of 18 image quality questions (i.e., 1,998 total observations), the data are best obtained by downloading them from the ACR Web site where these modules are described.

The analysis plan for this example is as follows: First, find suitable clusters of like variables (the number of clusters should be substantially less than the number of original variables); second, from each cluster, create a new variable from the original variables in the cluster; and third, use these new variables for comparing the two scanning times. Note that this is only one type of cluster analysis. In particular, in this example we are looking for clusters of similar variables. In other situations we might be looking for clusters of similar patients. Although they are not illustrated in this paper, a variety of approaches can be used for this sort of cluster analysis, including simple graphical methods and hierarchical methods (e.g., nearest neighbor, average distance, and minimum variance approaches) [11]. The various options are available in many statistical software packages, including SAS.

Before performing the cluster analysis, we must first examine the correlations between the original variables. If there are no large negative correlations, then we can proceed with the cluster analysis. However, if some of the variables are highly negatively correlated, then the cluster analysis will see those variables as being very dissimilar, when in fact they are highly similar, just their scale of measurement is reversed. In this situation, these variables need to be transformed before the cluster analysis [11]. In the CT image quality study, there are no large negative correlations between any of the variables.

We now perform cluster analysis to see if the data suggest any groupings of like variables. Statistical packages usually offer several options for the analysis. First, the groupings can be based on either the correlations or covariances between the variables. If you want all the variables to be given equal importance, then use the correlation option; if you want variables with larger variances to have more importance, then use the covariance option. (Note that in situations in which the variables are measuring different things with different units of measurement, the correlation option is usually most appropriate.) Second, the variables in each cluster can be either the optimized weighted average of the variables (the first principal component option) or the unweighted average of the standardized (the centroid option based on correlations) or nonstandardized (centroid option based on covariances) variables. Note that a standardized variable is just the value of the variable minus its mean value, all divided by the SD.

The SAS code for performing cluster analysis on the CT image quality example, with the correlation and first principal component options (defaults in SAS), is proc varclus; var quest1-quest18;. The cluster analysis produced six clusters from the original 18 variables, as listed in Table 6.

In contrast, using the centroid option based on covariances (SAS code is proc varclus centroid cov; var quest1-quest18;), the analysis produced 11 clusters from the 18 original variables. These 11 clusters were just further divisions of the six clusters in Table 6; no new grouping of variables was identified. The authors of the study [2] examined the six clusters in Table 6 and determined that the six clusters made biologic sense. Furthermore, for analysis purposes we prefer fewer clusters. The six clusters were subsequently labeled liver and spleen edge sharpness, renal edge sharpness and abdominal wall motion, portal vein and intrahepatic vessels, celiac axis and common hepatic artery, superior mesenteric vessels and mesenteric branches, and renal artery origin.

TABLE 6: Results of Cluster Analysis for CT Image Quality Example

Cluster No.   Quest   Original Variables
1             1       Organ edge sharpness of anterior right lobe of liver
              2       Organ edge sharpness of anterior left lobe of liver
              3       Organ edge sharpness of posterior left lobe of liver
              4       Organ edge sharpness of splenic margin
2             5       Organ edge sharpness of anterior and posterior renal margins
              6       Organ edge sharpness of medial and lateral renal margins
              17      Motion of anterior abdominal wall above umbilicus
              18      Motion of anterior abdominal wall below umbilicus
3             7       Vessel visibility and enhancement of main portal vein
              8       Vessel visibility and enhancement of main portal bifurcation
              9       Vessel visibility and enhancement of intrahepatic portal and hepatic veins
4             10      Vessel visibility and enhancement of celiac axis
              11      Vessel visibility and enhancement of common hepatic artery
5             12      Vessel visibility and enhancement of origin of superior mesenteric artery
              13      Vessel visibility and enhancement of superior mesenteric artery and vein at pancreatic head
              14      Vessel visibility and enhancement of mesenteric branch vessels
6             15      Vessel visibility and enhancement of left origins of renal artery
              16      Vessel visibility and enhancement of right origins of renal artery
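varclus in SAS uses a divisive, principal-component-based algorithm whose details are beyond this module. As a rough illustration of the underlying idea of grouping like variables by their correlations, here is a Python sketch that clusters variables by single linkage on Pearson correlation. The quest1 to quest4 data and the 0.8 threshold are invented for the example; this is not a reimplementation of varclus and not the CT study data:

```python
import itertools
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cluster_variables(data, threshold=0.8):
    """Single-linkage grouping: variables whose correlation exceeds the
    threshold (directly or through a chain) share a cluster."""
    names = list(data)
    parent = {v: v for v in names}  # union-find forest
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in itertools.combinations(names, 2):
        if pearson(data[a], data[b]) > threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for v in names:
        clusters.setdefault(find(v), []).append(v)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical reviewer scores: quest1/quest2 track each other,
# and quest3/quest4 track each other.
data = {
    "quest1": [1, 2, 3, 4, 5, 6, 7, 8],
    "quest2": [2, 3, 3, 5, 5, 7, 7, 9],
    "quest3": [5, 1, 4, 2, 6, 1, 5, 2],
    "quest4": [6, 2, 4, 3, 6, 2, 5, 3],
}
groups = cluster_variables(data, threshold=0.8)
# groups -> [['quest1', 'quest2'], ['quest3', 'quest4']]
```

Here the two pairs of variables that track each other end up in the same clusters; in practice the choice of threshold (or of a principal-component splitting criterion, as in varclus) determines how many clusters emerge.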


When using the first principal component option, the statistical packages will provide coefficients for each variable in a cluster. These coefficients can then be multiplied by the standardized variables before summing the scores in each cluster. The sums of the scores for each cluster become the new variables for analysis. The authors of this study [2], however, took the sum of the scores of the original variables in each cluster (i.e., the principal component coefficients were not used). They divided the sum of each cluster by the number of original variables in the cluster so that the scale would be the same as the original variables. They used these newly generated six variables to compare the 0.75- and 1.0-sec scans. Such an approach is easy to explain and interpret and, in this example, produced similar results.

The means of these six new variables for the two scanning times are given in Table 7. Various methods can now be used to compare the two scanning times. Here, we will use the Pillai-Bartlett statistic to test the hypothesis that the two scanning times differ for one or more of the six new image quality variables. We perform a separate analysis for each of the three reviewers. The results of the Pillai-Bartlett trace test are as follows: reviewer 1, p = 0.119, reviewer 2, p = 0.769, and reviewer 3, p = 0.907, suggesting that there is insufficient evidence to conclude that the image quality of the two scanning times is different.

TABLE 7: Means (SDs) of Six New CT Image Quality Variables

New Image Quality Variables                      0.75 sec      1.0 sec
Liver and spleen edge sharpness                  6.78 (0.17)   7.09 (0.28)
Renal edge sharpness and abdominal wall motion   5.81 (1.07)   6.28 (1.02)
Portal vein and intrahepatic vessels             7.47 (0.34)   7.33 (0.45)
Celiac axis and common hepatic artery            6.96 (0.30)   7.00 (0.24)
Superior mesenteric vessels and branches         7.10 (0.24)   7.19 (0.18)
Renal artery origin                              4.23 (1.12)   5.04 (1.31)

Note—Means are computed over patients and reviewers. SDs describe the variability among the means of the three reviewers.

Multiple-Variable Logistic Regression Analysis (MRI Breast Lesion Example)

Quantitative border measurements were made from the MRI images of 42 benign lesions and 47 malignant lesions from 89 total patients. The status of the lesions was known from biopsy results. Four border measurements were computed: margin fluctuation (MF), tumor border roughness (TBR), entropy from 2D surface temperature (GST), and a function of the convex hull area (FCHA). The goal of the study was to determine which variable or set of variables best distinguishes benign from malignant lesions. The data, modified for proprietary reasons, can be obtained by downloading them from the ACR Web site where these modules are described.

The means and SDs of the four border measurements for benign and malignant lesions are summarized in Table 8.

TABLE 8: Means (SDs) of Border Measurements in MRI of Breast Lesions

Border Measurements   Benign             Malignant           c Statistic
MF                    0.0015 (0.0011)    0.0010 (0.0008)     0.668
TBR                   8.3404 (10.1768)   22.0823 (15.8907)   0.787
GST                   1.6987 (0.2540)    1.8682 (0.2914)     0.707
FCHA                  0.8540 (0.0549)    0.8899 (0.0425)     0.693

Note—MF = margin fluctuation, TBR = tumor border roughness, GST = entropy from 2D surface temperature, FCHA = a function of the convex hull area.

To suit the goals of this study, we should think of the dependent variable as whether the lesion is benign or malignant, and the independent, or predictor, variables as the four border measurements. We begin the analysis by first assessing the importance of each predictor variable alone, without consideration of the other predictor variables. We use univariate logistic regression, which is a convenient way to model a binary dependent variable as a function of an independent variable [14]. Because the dependent variable is binary, in logistic regression the dependent variable is represented by the natural log of the quantity: the expected value of the dependent variable divided by one minus its expected value; this is called the “logit transformation.” Then the logit transformation of the dependent variable is modeled as a linear function of the predictor variable. The SAS code for the univariate logistic regression analysis with TBR is proc logistic; model l_type = tbr; (where “l_type” stands for lesion type and is coded as 1 = malignant and 0 = benign). Proc logistic in SAS outputs a useful measure called the “c statistic,” which has 1.0 as a maximum value and 0.5 as its effective lower value. For the TBR measure, the value of the c statistic is 0.787. The interpretation, which is analogous to the interpretation of the ROC curve area [15], is as follows: if presented with a randomly chosen benign lesion and a randomly chosen malignant lesion, the probability of correctly distinguishing the two, by calling the lesion with the higher TBR value “malignant” and the lesion with the lower TBR value “benign,” is 0.787, or 78.7%. In Table 8, all the predictors have c statistics greater than 0.5, suggesting that all four border measurements have some ability to distinguish benign and malignant lesions. (Note that SAS does not provide standard errors for the c statistic, so without additional calculations we cannot determine if the c statistics are significantly better than 0.5.)

Discriminant analysis is a powerful multivariate method for separating units (lesions in our example) into two or more populations and allocating units whose population membership is unknown into one of these populations [11]. For the method to work properly, however, the data must follow a multivariate normal distribution. In our MRI breast lesion example we have continuous-type data, but the data do not follow a multivariate normal distribution (this is evident from the first step in assessing the multivariate normality assumption described previously). In other situations, all the predictor variables might not be the continuous type. Some examples of noncontinuous variables that are often used as predictor variables are sex, which is a binary variable; level of pain, which is often rated on an ordinal scale from 1 to 10; and employment status, which is often categorized as employed, homemaker, retired, student, and unemployed.

As an alternative to discriminant analysis, multiple-variable logistic regression is often used [11, 14]. This is an extension of the univariate logistic regression analysis. The dependent variable in the model is again the lesion type, and the border measurements are considered simultaneously as the predictor


variables. As with any statistical modeling, we must be careful not to overfit the model (i.e., include more predictor variables than can be supported by the number of observations in the study). A general rule of thumb with logistic regression analysis is that you need at least 10–15 observations (here, patients) of each type (here, type is patients with a particular lesion pathology) for each predictor variable in the model [16–18]. If we want to assess a model with all four border measurements, then we would need 40–60 patients of each type (i.e., 40–60 with benign lesions and 40–60 with malignant lesions). Our sample size is just adequate for assessing this model.

The SAS code for fitting the model for our example is proc logistic; model l_type = mf tbr gst fcha/backward lackfit;. SAS first fits a model with all four border measurements. Then, because we included the “backward” option, SAS will drop from the model the predictor variable that is contributing the least to the model. SAS will continue to drop predictor variables until the remaining ones are all statistically significant at the 0.05 level. The “lackfit” option tells SAS that we want it to print the results of a test (called the Hosmer and Lemeshow goodness-of-fit test [14]) to see if the model is a good representation of the data. If it is not a good representation of the data, then we will not use the model.

The results of the multiple-variable logistic regression analysis are as follows: The model was first fit with all four border measurements. MF was not statistically significant and contributed least to the model, so it was dropped. Then GST was removed from the model, as well as FCHA.

The final model includes TBR as the only predictor of whether a lesion is benign or malignant. In other words, once you have the TBR value of a lesion, MF, GST, and FCHA do not provide any additional help in distinguishing the lesion as benign or malignant. The p value for the model fit is 0.299, indicating that the model is a reasonable fit for the data. Because we want to use the model in the future to predict the status of lesions of unknown type, we need to examine the model. SAS provides an estimate of the model’s intercept, –1.0958, and the regression coefficient for TBR, 0.0872. From these values, we can estimate the probability that a lesion is malignant by substituting the TBR value into this equation:

Prob(malignant) = 1 / {1 + exp[–(–1.0958 + 0.0872 × TBR)]}.

For example, if the TBR value of an unbiopsied lesion is 30, then, based on this model, the probability that the lesion is malignant is:

1 / {1 + exp[–(–1.0958 + 0.0872 × 30)]} = 0.82, or 82%.

Figure 3 illustrates the probability of a malignant lesion as a function of the TBR value. Clearly, there is considerable overlap in the TBR values of benign and malignant lesions and in their probabilities of being malignant.

As with all statistical modeling, we must remember that the model may perform well for the data used to create the model, but the model may not perform as well with new observations. Thus, before using the model in clinical practice, we must test its performance using different observations. If we have a large sample size, then sometimes we can split the data into a training data set and a testing data set. The training set is used to create the model; the testing set is used to determine how well the model performs. In the MRI breast lesion study, however, the sample size was barely adequate for training.

Another important point is that our model was created on the basis of TBR values between 0.0 and 65.96 (i.e., these are the minimum and maximum TBR values in our sample). We do not know what the relationship is between the probability that a lesion is malignant and TBR values less than 0.0 or TBR values greater than 65.96. Although we can plug any TBR value into our model and get back a value for the probability that the lesion is malignant, this is not advisable. Rather, when using our model to predict whether a lesion is benign or malignant, we should consider only TBR values from the range of TBR values used to create the model.

Discussion

Multivariate statistical methods have many applications in radiology studies. They are particularly useful for controlling the type 1 error rate in a study, and they sometimes provide insight into the multidimensional patterns in the data that would be overlooked with univariate analyses.

As with all statistical analysis, we recommend that an analysis plan be prepared at the start of a study so that the results of the data do not drive the methods used. This can be a particular problem when there are multiple nonsignificant end points. It is sometimes tempting to not report the nonsignificant end points and report only the statistically significant ones. This strategy, however, can lead to serious misinterpretations of the data because the type 1 error rate is not properly controlled. Other good-practice strategies include plotting or otherwise summarizing the raw data so that the results of the statistical analysis can be verified with the raw data, and evaluation of the validity of any assumptions required in the statistical analysis.
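The fitted model from the MRI breast lesion example (intercept –1.0958, TBR coefficient 0.0872) is easy to evaluate directly, and Figure 3 is a plot of exactly this function. A Python transcription of the published equation, offered as an editorial illustration:

```python
import math

def prob_malignant(tbr, intercept=-1.0958, slope=0.0872):
    """Logistic model from the text: 1 / (1 + exp(-(intercept + slope*TBR))).
    As the text cautions, it should be applied only to TBR values inside
    the range used to fit the model (0.0 to 65.96 in this sample)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * tbr)))

p30 = prob_malignant(30)  # the worked example: about 0.82
```

Evaluating the function at the sample minimum and maximum TBR values reproduces the vertical span of the fitted curve in Figure 3.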
[Figure 3: fitted logistic curve of probability that lesion is malignant (vertical axis, 0.0–1.0) versus TBR (horizontal axis, 0–60), with benign lesions plotted at probability 0 and malignant lesions at probability 1.0.]

Fig. 3—Graph shows probability of malignant lesion as function of tumor border roughness (TBR) value. Open circles at bottom of figure show probability of malignant lesion for those lesions that we know to be benign (probability = 0). Solid circles at top of figure show probability of malignant lesion for those lesions that we know are malignant (i.e., probability = 1.0). Set of points in middle of figure represents probability of malignant lesion, based on model, for each lesion in data set. Note considerable overlap in TBR values of benign and malignant lesions and in their probabilities of being malignant.

AJR:185, August 2005 307


Obuchowski

Acknowledgments
I appreciate the contributions of data from several investigators: Radhika Sivaramakrishna, Brian Herts, Richard Brunken, and Caldwell Esselstyn, and the helpful suggestions made on an earlier draft by Craig Beam and Michael Lieber.

References
1. Pepe MS, Urban N, Rutter C, Longton G. Design of a study to improve the accuracy in reading mammograms. J Clin Epidemiol 1997; 50:1327–1338
2. Herts BR, Baker ME, Davros WJ, et al. Helical CT of the abdomen: comparison of image quality between scan times of 0.75 and 1 sec per revolution. AJR 1996; 167:58–60
3. Hand DJ, Taylor CC. Multivariate analysis of variance and repeated measures: a practical approach for behavioural scientists. London: Chapman & Hall, 1987
4. Wright SP. Adjusted p-values for simultaneous inference. Biometrics 1992; 48:1005–1013
5. Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat 1979; 6:65–70
6. Dorfman DD, Berbaum KS, Metz CE. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Invest Radiol 1992; 27:723–731
7. Obuchowski NA. Multireader, multimodality receiver operating characteristic curve studies: hypothesis testing and sample size estimation using an analysis of variance approach with dependent observations. Acad Radiol 1995; 2[suppl 1]:S22–S29, S57–S64, S70–S71
8. Beiden SV, Wagner RF, Campbell G. Components-of-variance models and multiple-bootstrap experiments: an alternative method for random-effects, receiver operating characteristic analysis. Acad Radiol 2000; 7:341–349
9. Srivastava MS. A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Statistics and Probability Letters 1984; 2:263–267
10. Looney SW. How to use tests for univariate normality to assess multivariate normality. The American Statistician 1995; 49:64–70
11. Khattree R, Naik DN. Multivariate data reduction and discrimination with SAS software. Cary, NC: SAS Institute Inc., 2000
12. Olson CL. Comparative robustness of six tests in multivariate analysis of variance. J Am Stat Assoc 1974; 69:894–908
13. Pocock SJ, Geller NL, Tsiatis AA. The analysis of multiple endpoints in clinical trials. Biometrics 1987; 43:487–498
14. Hosmer DW, Lemeshow S. Applied logistic regression. New York, NY: Wiley, 1989
15. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143:29–36
16. Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modeling strategies for improved prognostic prediction. Stat Med 1984; 3:143–152
17. Harrell FE, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treat Rep 1985; 69:1071–1077
18. Smith LR, Harrell FE, Muhlbaier LH. Problems and potentials in modeling survival. In: Grady ML, Schwartz HA, eds. Medical effectiveness research data methods (summary report). Rockville, MD: U.S. Department of Health and Human Services, Agency for Health Care Policy and Research (pub. no. 92-0056), 1992:151–159



Multivariate Statistical Methods

APPENDIX 1: Formula for Constructing a Simultaneous Confidence Interval for the Difference Between Two Populations

The formula for constructing a simultaneous confidence interval for the difference between two populations for the i-th outcome measure is

(x̄1i – x̄2i) ± c √[(1/n1 + 1/n2) sii]    (1)

where x̄1i is the sample mean of the i-th outcome measure for the first population (e.g., control group), x̄2i is the sample mean of the i-th outcome measure for the second population (e.g., intervention group), and n1 and n2 are the sample sizes from the first and second populations. The value of c is given by

c = √{[(n1 + n2 – 2)k / (n1 + n2 – k – 1)] × Fk, n1+n2–k–1(α)}    (2)

where Fk, n1+n2–k–1(α) is the upper (100α)th percentile of the F distribution with numerator degrees of freedom equal to k and denominator degrees of freedom equal to (n1 + n2 – k – 1), and k is the total number of outcome measures. sii is the sample variance for the i-th outcome measure, pooled over the two populations:

sii = [(n1 – 1)s1i + (n2 – 1)s2i] / (n1 + n2 – 2)    (3)

where s1i and s2i are the sample variances for the i-th outcome measure for the two populations.
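As a worked example of formulas (1)–(3), the following sketch computes the interval for assumed summary statistics; all numbers are hypothetical, and the F percentile is taken from a standard table rather than computed.

```python
from math import sqrt

# Worked example of Appendix 1 under assumed numbers: k = 2 outcome
# measures, n1 = n2 = 10 subjects per population. The F percentile
# F_{2,17}(0.05) is approximately 3.59 (from a standard F table).

k, n1, n2 = 2, 10, 10
F = 3.59  # upper 5th percentile of F with (2, 17) df; table value, approximate

# Formula (2): c = sqrt([(n1 + n2 - 2) k / (n1 + n2 - k - 1)] * F)
c = sqrt((n1 + n2 - 2) * k / (n1 + n2 - k - 1) * F)

# Assumed summary statistics for the i-th outcome measure.
xbar1, xbar2 = 12.0, 10.0   # sample means in the two populations
s1, s2 = 4.0, 6.0           # sample variances in the two populations

# Formula (3): variance pooled over the two populations.
s_ii = ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

# Formula (1): simultaneous confidence interval for the difference in means.
half_width = c * sqrt((1 / n1 + 1 / n2) * s_ii)
interval = (xbar1 - xbar2 - half_width, xbar1 - xbar2 + half_width)
print(c, s_ii, interval)
```

Because c grows with the number of outcome measures k, this interval is wider than the corresponding univariate t interval, which is the price paid for simultaneous coverage across all k outcomes.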

APPENDIX 2: Formula for Linear Combination Test

Pocock et al. [13] give the following formula for the linear combination test for outcome variables with unknown, but equivalent, variance-covariance matrix:

J′S⁻¹t / (J′S⁻¹J)^(1/2)    (4)

where J′ is a 1 × k vector of all 1’s (i.e., 1, 1, …, 1), k is the number of outcome measures, S is the estimate of the k × k correlation matrix for the k outcome measures, and t is the k × 1 vector of univariate t statistics for the k outcome measures.

When there are only two outcome measures (i.e., k = 2), the numerator of formula (4) can be written simply as [1 / (1 – r²)] × (t1 × [1 – r] + t2 × [1 – r]), where r is the estimated correlation between the two outcome variables, and t1 and t2 are the univariate t statistic values for the two outcome variables. The denominator is simply the square root of [1 / (1 – r²)] × (2 – 2r).
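A quick numeric check of the k = 2 simplification, with assumed values for t1, t2, and r (the variable names are ours):

```python
from math import sqrt

# Numeric check of the k = 2 case of formula (4) in Appendix 2,
# using assumed univariate t statistics and estimated correlation.

t1, t2, r = 2.0, 1.5, 0.4

# For k = 2, S = [[1, r], [r, 1]] and S^{-1} = [[1, -r], [-r, 1]] / (1 - r^2),
# so J'S^{-1}t and J'S^{-1}J take the closed forms given in the appendix.
inv_factor = 1 / (1 - r ** 2)
numerator = inv_factor * (t1 * (1 - r) + t2 * (1 - r))
denominator = sqrt(inv_factor * (2 - 2 * r))
statistic = numerator / denominator

# Algebraically the same statistic reduces to (t1 + t2) / sqrt(2 * (1 + r)).
simplified = (t1 + t2) / sqrt(2 * (1 + r))
print(statistic, simplified)   # both ≈ 2.0917
```

The reduction makes the behavior of the test transparent: the combined statistic is the sum of the univariate t statistics, shrunk by a factor that grows as the two outcomes become more positively correlated.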

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:
1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005
18. Correlation and Regression, July 2005
19. Survival Analysis, July 2005



Research • Fundamentals of Clinical Research for Radiologists

Decision Analysis and Simulation Modeling for Evaluating Diagnostic Tests on the Basis of Patient Outcomes
Sylvia K. Plevritis1

An imaging test with highest diagnostic accuracy is not necessarily the test of choice in clinical practice. The decision to order a diagnostic imaging test needs to be justified by its impact on downstream health outcomes. Decision analysis is a powerful tool for evaluating a diagnostic imaging test on the basis of long-term patient outcomes when only intermediate outcomes such as test sensitivity and specificity are known. The basic principles of decision analysis and “expected value” decision making for diagnostic testing are introduced. Markov modeling is shown to be a valuable method for linking intermediate to long-term health outcomes. The evaluation of Markov models by algebraic solutions, cohort simulation, and Monte Carlo simulation is discussed. Finally, cost-effectiveness analysis of diagnostic testing is briefly discussed as an example of decision analysis in which long-term health effects are measured both in life-years and costs.

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 21st article in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series, which will ultimately comprise 22 articles, is designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

Received November 17, 2004; accepted after revision November 23, 2004.

1Department of Radiology, LUCAS Center P267, Stanford University, Stanford, CA 94305. Address correspondence to S. K. Plevritis.

Plevritis SK. AJR 2005; 185:581–590

0361–803X/05/1853–581

© American Roentgen Ray Society

The emergence of evidence-based medicine has handed radiologists the challenge of evaluating the impact of a diagnostic imaging test on a patient’s long-term outcome, often measured by overall survival and total health care expenditures [1]. This new challenge represents a significant departure from traditional evaluations of diagnostic examinations in which the main end points are intermediate ones—namely, test sensitivity and specificity. The shift from intermediate technology-specific to long-term patient-specific outcomes is being driven by the fact that a test with the highest diagnostic accuracy may not necessarily be the test of choice in clinical practice [2]. When making the decision to order a diagnostic imaging test, a clinician considers the health outcomes downstream from the imaging examination. For example, the health risks of interventions resulting from false-positive (FP) and false-negative (FN) findings should be compared with the health benefits associated with true-negative (TN) and true-positive (TP) findings. Increasingly, the cost of the diagnostic test, including the downstream costs generated as a result of imaging, is also factored into the decision-making process.

Few radiologists would argue against the importance of measuring the impact of diagnostic tests on long-term outcomes, but many are concerned with the feasibility of evaluating long-term outcomes through traditional clinical trials. Except when evaluating the impact of an imaging test for an acute state that is life-threatening, evaluating the impact of an imaging test in the adult population in terms of overall survival can require follow-up of 10 years or more. In children, even longer follow-up periods could be required. Long follow-up times compete with demands to diffuse promising technologies quickly and increase the risk of delaying technologic innovations. For a disease with a low risk of death, an economically unfeasible sample size may be required to detect a survival benefit due to diagnostic testing.

Linking the intermediate outcomes (such as TPs, TNs, FPs, and FNs) to long-term outcomes (such as survival) without requiring a clinical trial is sometimes possible. This link is often made when existing clinical data (usually collected for different purposes) can be extrapolated to address the problem of interest. Often the data are extrapolated through a number of assumptions that are formulated into a mathematic model in which the link between intermediate and long-term outcomes is expressed in terms of probabilistic events [3]. The Markov model, described later in this article, is an example of the methods commonly used in this extrapolation process. When reliable models can be generated, the opportunity arises to evaluate a variety of hypothetical clinical

AJR:185, September 2005 581



[Figure: four decision-tree panels, A–D, with decision, chance, and terminal nodes.]

Fig. 1—Decision trees.
A, Decision tree shows consequences and outcomes (L1–L12) of three options—namely, "Do Nothing," "Surgery," and "Imaging."
B, Decision tree in A is "rolled back" one layer.
C, Decision tree in B is rolled back one layer.
D, Decision tree in C is rolled back one layer. This tree is fully collapsed, and the main options are expressed in terms of their expected value.

paradigms that would not be economically feasible or practical to analyze experimentally via traditional clinical trials. The process of choosing from a number of hypothetical clinical paradigms by comparing them in terms of model-based probabilistic outcomes is often referred to as decision analysis.

This article will focus on the basic principles of decision analysis and “expected value” decision making. Emphasis is placed on evaluating diagnostic testing on the basis of long-term patient outcomes given only knowledge of the test sensitivity and specificity. Markov models will be briefly introduced because they are typically incorporated into decision analysis models to provide the link between intermediate and long-term outcomes. Cost-effectiveness analysis, which is a type of decision analysis in which the health effects and costs are tracked simultaneously, will also be briefly discussed.




Finally, the major strength and weakness of decision analysis will be summarized.

Decision Analysis
Decision analysis is a deductive reasoning process that enables a decision maker to choose from a well-defined set of options on the basis of the systematic model-based analysis of all the probable outcomes [4–6]. Every outcome has a known probability of occurrence and a numeric value (i.e., life expectancy). The purpose of decision analysis is to quantify each option in terms of its expected (or average) value. A rational decision maker would choose the option that provides the greatest expected value. For example, if the outcome of the decision is measured in terms of life expectancy, the decision maker would choose to maximize the expected value; if the outcome is measured in costs, the decision maker would choose to minimize the expected value.

The critical components underlying decision analysis include clarifying the decision and the value used to measure the success of the decision, identifying the options, formulating every possible outcome from every possible decision, assigning a probability to each possible chance event, and assigning a value to each possible outcome. Once these components are determined, computing the expected values for each option is straightforward.

Consider a generic clinical problem that involves the optional use of a diagnostic imaging test:

A patient presents with clinical symptoms of a life-threatening disease that requires surgery. What should the clinician recommend, knowing that surgery carries a risk of death? If the patient’s probability of having the disease is low relative to the risk of surgery-related fatality, the clinician may recommend “Do Nothing” to avoid the risk of death due to the surgery. If the patient’s probability of disease is high, the clinician may recommend “Surgery” immediately on the premise that the risk of death from the disease is higher than the risk of death from surgery. Now suppose a diagnostic imaging test becomes available with known sensitivity and specificity for the disease of interest. The clinician may choose to order the imaging test and then recommend Do Nothing if the imaging finding is negative or Surgery if the imaging finding is positive. Should the clinician order the imaging test, or make the recommendation of Do Nothing or Surgery, without the findings from the imaging test?

To answer this question, the decision maker, who is the clinician in our example, needs to define a value on which to base the decision. If the value is life expectancy, then the clinician would want to know if ordering the diagnostic test will increase the patient’s life expectancy. Decision analysis reveals how the patient’s life expectancy depends on the choice made by the decision maker and events that are governed by chance—namely, the patient’s probability of having the disease before getting the imaging results (i.e., the pretest probability or disease prevalence), the patient’s life expectancy if the disease is present and untreated, the expected survival gain from a successful surgery, the risk of death from surgery, and the sensitivity and specificity of the imaging test.

Decision Trees
Decision analysis is aided by the use of a decision tree [4, 5]. A decision tree is a graphic model that represents the consequences for each possible decision through a sequence of decision and chance events [7]. A decision tree is constructed with three types of nodes: decision nodes, chance nodes, and terminal nodes, commonly represented as squares, circles, and triangles, respectively. A decision node is a branching point in the tree where several options are available to the decision maker for his or her choosing. A chance node is a branching point from which several outcomes are possible, but they are not available to the decision maker for his or her choosing. Instead, at a chance node, the outcome is randomly drawn from a set of possible outcomes (this is equivalent to saying that they are governed by chance). A chance event could be, for example, that a patient presenting with symptoms for a disease actually has the disease. At a chance node, every outcome is assigned a probability of occurrence, which is often estimated from a clinical trial or observational data. The decision tree is typically drawn by starting at the far left with a decision node and continuing from left to right through a sequence of decision and chance nodes. Every possible pathway through the decision tree ends at the far right with a terminal node. Every terminal node is assigned a value.

A simple decision tree associated with the clinical problem described previously is given in Figure 1A. This decision tree has one decision node that illustrates three possible options: Do Nothing, meaning the patient is sent home; Surgery, meaning the patient undergoes immediate surgery; and Imaging, meaning the patient undergoes a diagnostic imaging test and then surgery if the imaging findings are positive.

The Do Nothing option yields two chance events: the patient has the disease, with probability P(D+), and is assigned a life expectancy of L1 years; or the patient does not have the disease, with probability P(D–), and is assigned a life expectancy of L2 years.

The Surgery option yields four chance events: the patient has the disease with probability P(D+), experiences fatal surgical complications with probability P(O+), and has a life expectancy of L3 years; the patient has the disease with P(D+), undergoes successful surgery with probability P(O–), and has a life expectancy of L4 years; the patient does not have the disease with probability P(D–), experiences complications due to surgery with probability P(O+), and has a life expectancy of L5 years; and the patient does not have the disease with probability P(D–), does not experience complications due to surgery with probability P(O–), and has a life expectancy of L6 years.

The Imaging option yields six chance events. In four chance events, the imaging findings are positive, with probability P(T+), and the patient undergoes surgery; then the patient who has the disease, with conditional probability P(D+|T+), experiences fatal surgical complications with probability P(O+) and is assigned a life expectancy of L7 years; the patient who has the disease, with conditional probability P(D+|T+), has a successful surgery, with probability P(O–), and is assigned a life expectancy of L8 years; the patient who does not have the disease, with conditional probability P(D–|T+), but experiences fatal surgical complications, with probability P(O+), is assigned a life expectancy of L9 years; the patient who does not have the disease, with conditional probability P(D–|T+), has successful surgery, with probability P(O–), and is assigned a life expectancy of L10 years. In two chance events the imaging findings are negative, with probability P(T–); either the patient has the disease, with probability P(D+|T–), and is assigned a life expectancy of L11 years; or the patient does not have the disease, with probability P(D–|T–), and is assigned a life expectancy of L12 years.

All the probabilities populating the decision tree are summarized in Table 1.

To evaluate the Imaging option, the probabilities P(D+|T+), P(D–|T+), P(D+|T–),




TABLE 1: Probability Notation and Base Case Values


Notation Meaning Base Case Value
P(D+) Probability disease is present = prevalence = pretest probability 0.10
P(D–) Probability disease is not present = [1–P(D+)] 0.90
P(T+|D+) Sensitivity = probability of a positive test given disease is present = probability of a true-positive 0.90
P(T–|D+) (1 – sensitivity) = probability of negative test given that the disease is present = probability of a false-negative 0.10
P(T–|D–) Specificity = probability of negative test given that the disease is not present = probability of a true-negative 0.80
P(T+|D–) (1 – specificity) = probability of positive test given that the disease is not present = probability of a false-positive 0.20
P(D+|T+) Probability disease is present given that test is positive = PPV 0.33
P(D–|T+) Probability disease is not present given that test is positive = [1 – PPV] 0.67
P(D+|T–) Probability disease is present given that test is negative = [1 – NPV] 0.01
P(D–|T–) Probability disease is not present given that test is negative = NPV 0.99
P(T+) Probability of a positive test 0.27
P(T–) Probability of a negative test = [1 – P(T+)] 0.73
P(O+) Probability of surgery-related death 0.05
P(O–) Probability of successful surgery = [1 – P(O+)] 0.95
Note—Bold base case values are assigned, nonbold values are derived. PPV = positive predictive value, NPV = negative predictive value.
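The nonbold (derived) entries in Table 1 follow from the three assigned values via Bayes’ theorem; a minimal sketch in Python (the variable names are ours):

```python
# Reproduce the derived (nonbold) entries of Table 1 from the assigned ones
# using Bayes' theorem: prevalence 0.10, sensitivity 0.90, specificity 0.80.

prevalence = 0.10   # P(D+)
sens = 0.90         # P(T+|D+)
spec = 0.80         # P(T-|D-)

p_pos = sens * prevalence + (1 - spec) * (1 - prevalence)   # P(T+)
p_neg = 1 - p_pos                                           # P(T-)
ppv = sens * prevalence / p_pos                             # P(D+|T+)
npv = spec * (1 - prevalence) / p_neg                       # P(D-|T-)

print(round(p_pos, 2), round(ppv, 2), round(npv, 2))   # → 0.27 0.33 0.99
```

Note how low the positive predictive value (0.33) is despite the high sensitivity; at a pretest probability of 0.10, most positive findings are false-positives.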

P(D–|T–), P(T+), and P(T–) must be evaluated. The probability that the patient has the disease given a positive imaging finding, denoted as P(D+|T+), is commonly referred to as the positive predictive value of the test. The probability that the patient does not have the disease given a negative imaging finding, denoted as P(D–|T–), is commonly referred to as the negative predictive value of the test. These probabilities can be derived from the pretest probability of the disease and the sensitivity and specificity of the test using Bayes’ theorem:

P(D+|T+) = P(T+|D+) P(D+) / P(T+)
P(D–|T–) = P(T–|D–) P(D–) / P(T–),

where P(T+|D+) is defined as the sensitivity and P(T–|D–) is defined as the specificity [8]. The probability of a positive and negative test can be computed as:

P(T+) = P(T+|D+) P(D+) + P(T+|D–) P(D–)
P(T–) = 1 – P(T+).

Therefore, incorporating an imaging test into the decision tree simply requires knowledge of the test’s sensitivity and specificity and the patient’s pretest probability of disease.

Expected Value Decision Making
Decision analysis operates on the principle that a rational choice from a set of options is the one with the greatest expected value [4, 5]. It is possible that a “good” decision leads to a “bad” outcome because chance is involved. The likelihood of a bad outcome is minimized when the decision is the one with the greatest expected value. This principle is often referred to as Bayes’ Decision Rule and is credited to the Reverend Thomas Bayes, an 18th century minister, philosopher, and mathematician who formulated Bayes’ theorem.

Computing the expected value of each option is accomplished by “rolling back” or “averaging” the decision tree. The process of rolling back the decision tree for the clinical example illustrated in Figure 1A is shown in Figures 1B–1D. In each figure we progressively roll back the right-most layer of terminal branches to their originating node and assign an expected value to that node, in effect turning the originating node into a new terminal node. If the originating node is a chance node, then the expected value is calculated as the weighted average of the expected values of its possible outcomes, where the weights are the probabilities that each outcome will occur. If the originating node is a decision node, the outcome is the one with the best expected value. This process is continued until the decision node at the far left-most part of the tree is the only remaining decision node in the tree. The decision tree is said to be “fully collapsed.” An example of a fully collapsed tree is given in Figure 1D.

For the Do Nothing option, the life expectancy is P(D+) × L1 + P(D–) × L2 years.

For the Surgery option, the life expectancy is P(D+) × [P(O+) × L3 + P(O–) × L4] + P(D–) × [P(O+) × L5 + P(O–) × L6].

For the Imaging option, the life expectancy is P(T+) × {P(D+|T+) × [P(O+) × L7 + P(O–) × L8] + P(D–|T+) × [P(O+) × L9 + P(O–) × L10]} + P(T–) × [P(D+|T–) × L11 + P(D–|T–) × L12].

To evaluate and compare the life expectancies for each of the three options, the probabilities and life expectancies L1 through L12 must be assigned. Consider the example in which a 60-year-old patient presents with symptoms indicative of a specified disease that has poor prognosis. Probability values for the chance events are given in Table 1. These values can be derived from the following three assumptions: the patient’s pretest probability for the disease is 0.10; the probability of surgery-related death is 0.05; and the diagnostic test has a sensitivity of 0.90 and specificity of 0.80. To compute the expected value of each option, we also need to assign a value to each possible outcome. Table 2 lists all the possible intermediate outcomes (column 1) and their associated life expectancies (column 6). For example, if the clinician recommends Do Nothing and the intermediate outcome is that the patient has the disease (D+), then the patient’s life expectancy is L1 = 65 years. If the patient does not have the disease, his or her life expectancy is 80 years. If the patient experiences operative death, then his or her life expectancy is 60 years. We assume successful surgery is not curative but extends the patient’s life expectancy to 72.5 years. Later we will show how these life expectancies can be estimated with a Markov model that links the intermediate health outcomes to overall survival.




TABLE 2: Transition Probabilities and Long-Term Outcomes


Option and Intermediate Outcomea | Transition Probabilitiesb: p1 = P(DSDn|Aliven–1) | p2 = P(DOCn|Aliven–1) | p3 = P(DSn|Aliven–1) | p4 = P(Aliven|Aliven–1) | Long-Term Outcome Expressed as Life Expectancy (yr)c
Do Nothing
D+ 0.15 0.05 — 0.80 L1 = 65
D– — 0.05 — 0.95 L2 = 80
Surgery
D+, O+ — — 1.00 — L3 = 60
D+, O– 0.03 0.05 — 0.92 L4 = 72.5
D–, O+ — — — 1.00 L5 = 60
D–, O– — 0.05 — 0.95 L6 = 80
Imaging
T+, D+, O+ — — 1.00 — L7 = 60
T+, D+, O– 0.03 0.05 — 0.92 L8 = 72.5
T+, D–, O+ — — — 1.00 L9 = 60
T+, D–, O– — 0.05 — 0.95 L10 = 80
T–, D+ 0.15 0.05 — 0.80 L11 = 65
T–, D– — 0.05 — 0.95 L12 = 80
Note—D+ indicates patient has disease, D– indicates patient does not have disease, O+ indicates patient experienced operative death, O– indicates patient underwent
successful surgery, T+ indicates positive imaging finding, T– indicates negative imaging finding.
aIntermediate outcomes depend on options illustrated in decision tree in Figure 1A.
bProbabilities correspond to Markov model in Figure 5. DSDn, DOCn, and DSn indicate patient enters Disease-Specific Death, Death from Other Causes, and Death from Surgery, respectively, at cycle number n. Aliven and Aliven–1 indicate patient is alive at cycle n and cycle n–1, respectively. Cycle period is 1 year. Dash (—) indicates 0.
cOutput of Markov model when the patient is 60 years old at initiation.
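Although the Markov model itself is described later in the article, the Table 2 life expectancies can already be checked against its transition probabilities. Assuming the annual probability of remaining alive, p4, stays constant across cycles, a simple cohort accumulation (equivalently, the geometric series 1/(1 – p4)) gives the expected years lived from age 60; the function name is ours.

```python
# Check that the life expectancies in Table 2 follow from its transition
# probabilities: with a constant annual probability p4 of remaining alive,
# expected years lived equal the geometric series 1/(1 - p4), added to the
# patient's age of 60 years at initiation.

def life_expectancy(p_alive, age_at_start=60, cycles=10000):
    """Cohort accumulation: sum the expected person-years over 1-year cycles."""
    alive, years = 1.0, 0.0
    for _ in range(cycles):
        years += alive       # fraction of the cohort alive during this cycle
        alive *= p_alive     # fraction surviving into the next cycle
    return age_at_start + years

L1 = life_expectancy(0.80)   # D+, untreated: p4 = 0.80 -> 65 yr
L2 = life_expectancy(0.95)   # D-: p4 = 0.95 -> 80 yr
L4 = life_expectancy(0.92)   # D+, successful surgery: p4 = 0.92 -> 72.5 yr
print(L1, L2, L4)
```

The computed values match column 6 of Table 2 exactly, illustrating the link the article draws between the Markov transition probabilities and the terminal-node life expectancies of the decision tree.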

Given the probabilities assigned to the chance events (Table 1, column 3) and the expected values of each possible outcome (Table 2, column 6), the life expectancies for the options Do Nothing, Surgery, and Imaging are 78.5 years, 78.3 years, and 78.9 years, respectively. Because the maximum life expectancy is associated with the Imaging option, the clinician would recommend surgery on the basis of positive imaging findings.

Factors that were not considered in the decision process that could change the clinician’s recommendation include the invasiveness of the imaging test; quality of life while living with the symptoms; utilities derived from TP, TN, FP, and FN findings on imaging; and the possibility of delaying the surgery. However, the general ideas presented here can be extended to include these factors. In addition, this general approach can be used to consider more complex decisions that may involve more than one imaging test ordered sequentially or in parallel.

Sensitivity Analysis
Sensitivity analysis is a necessary component of decision analysis that is used to evaluate the robustness of the decision to variations in model assumptions. In decision trees, probabilities of chance events and the values at terminal nodes may not be known. Under these circumstances, the values assigned may be reflective of an expert’s best guess. The possibility exists that by varying the dubious model inputs, the expected values will not be affected greatly or will be affected but not enough to change the ranking of the options in order of expected value. Under either of these scenarios, a decision maker would be more confident in implementing the option with the greatest expected value. However, when changing an input affects the ranking of the options, the decision maker would be less certain about proceeding without clarifying the value of that input.

N-way (or multivariate) sensitivity analyses refer to the process of varying N parameters in a model simultaneously while all other parameters remain constant. The simplest and most common example is one-way (or univariate) sensitivity analysis, in which one model parameter is varied in a range between an upper and lower bound, while all the other parameters are kept constant. A series of one-way sensitivity analyses is the easiest way to identify which parameters have the strongest effect on the optimal decision. For the example just given, a one-way sensitivity analysis on the pretest probability is shown in Figure 2 using the remaining parameters in Tables 1 and 2. For P(D+) less than or equal to 0.03, the Do Nothing option has the greatest life expectancy; for P(D+) greater than 0.03 but less than 0.54, the Imaging option has the greatest life expectancy; and for P(D+) greater than or equal to 0.54, the Surgery option has the greatest life expectancy. The point at which the decision shifts from one alternative to another is often referred to as the crossover point or the threshold.

Although a one-way sensitivity analysis is computationally easy, the outcomes may not be representative of realistic clinical situations. For example, changing test sensitivity without changing test specificity is usually not possible. In a two-way sensitivity analysis, two parameters, such as sensitivity and specificity, are varied at the same time, preferably choosing paired values of sensitivity and specificity along a receiver operating characteristic (ROC) curve [9]. A two-way sensitivity analysis on test sensitivity and test specificity is shown in Figure 3. Here the sensitivity and specificity are not varied along an ROC curve. Instead, the sensitivity was varied continuously from 0 to 1, and for each value of the sensitivity, the life expectancy was evaluated at four discrete values of specificity: namely, 0.3, 0.5, 0.7, and 0.9. The op-

AJR:185, September 2005 585


Plevritis

80.0

77.5
Life Expectancy (yr)

75.0

72.5

70.0

67.5

65.0
0.01 0.11 0.21 0.31 0.41 0.51 0.61 0.71 0.81 0.91

Pretest Probability of Disease, P(D+)

Fig. 2—One-way sensitivity analysis shows impact of changes in pretest probability Fig. 3—Two-way sensitivity analysis shows impact of changes in test sensitivity and
of disease on life expectancy for three options, "Do Nothing" (gray line), "Surgery" test specificity on life expectancy for "Imaging" (black lines) option. Life expectancies
(dotted line), and "Imaging" (black line). for "Do Nothing" (gray line) and "Surgery" (dotted line) options are included for com-
parison. Do Nothing option dominates Surgery option.

1.0
Disease-specific
0.9 death
Recommend p4
0.8 p1
Recommend “Imaging”
0.7
“Do Nothing”
Sensitivity

0.6
Alive p2 Death from
0.5
other causes
0.4

0.3

0.2 p3
0.1
Death from
surgery
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Specificity

Fig. 4—Two-way sensitivity analysis illustrates optimal decision as a function of sen- Fig. 5—Markov model with four health states ("Alive," "Disease-Specific Death,"
sitivity and specificity of diagnostic test. "Imaging" option would be recommended for "Death from Other Causes," "Death from Surgery") and four transition probabilities (p1,
values above line, and "Do Nothing" option would be recommended for values below. p2, p3, and p4).

timal decision is either Do Nothing or Imag- than 0.13, when the specificity is 0.7 and the Markov Models for Estimating
ing, depending on the test’s sensitivity and sensitivity is greater than 0.4, when the spec- Life Expectancy
specificity. The Do Nothing option dominates ificity is 0.5 and the sensitivity is greater than Markov models are commonly used in med-
the Surgery option, meaning that the optimal 0.65, and when the specificity is 0.3 and the ical decision analysis for estimating life ex-
choice between Do Nothing and Surgery sensitivity is greater than 0.91. The optimal pectancy [10, 11]. In the previous example, the
would be Do Nothing, as was observed in decision in “ROC Space” (i.e., for all values life expectancy for every possible outcome
Figure 2 for a pretest probability of 0.1. The of sensitivity and specificity) is shown in was known, but this information is usually not
Imaging option dominates the Surgery option Figure 4. Once the specificity is less than available. As discussed in the introduction, the
under the following conditions: when the 0.24, the Do Nothing option is recommended challenge in basing decisions on maximizing
specificity is 0.9 and the sensitivity is greater for all values of sensitivity. life-years lies in finding a model that links the



Decision Analysis and Simulation Modeling

known intermediate health states to survival. A Markov model may be an appropriate tool for establishing this link when it is possible to represent a patient's life history from a known intermediate health state to death through a series of transitions among a finite set of health states that have been observed elsewhere.

A simple Markov model for the clinical example described is composed of four health states: "Alive," "Disease-Specific Death," "Death from Other Causes," and "Death from Surgery." This model is shown in Figure 5. Each oval represents a health state. The arrows represent the possibility of transition from one state to another. The arrow that points back into the health state Alive indicates that the patient can remain in the health state Alive after a given cycle. Transitions between health states occur in a designated time period, known as the cycle period of the model. The cycle period for chronic diseases is typically 1 year, whereas the cycle period for acute diseases is often shorter—that is, months or even days. The probability that the patient will move from one health state to another in a given cycle is referred to as the transition probability. The life expectancy is the average length of time spent in all health states other than death.

The transition probabilities for the Markov model shown in Figure 5 are as follows:

p1 = P(Disease-Specific Death at cycle number n | Alive at cycle number n – 1),

p2 = P(Death from Other Causes at cycle number n | Alive at cycle number n – 1),

p3 = P(Death from Surgery at cycle number n | Alive at cycle number n – 1), and

p4 = P(Alive at cycle number n | Alive at cycle number n – 1),

where p1 + p2 + p3 + p4 = 1.

Note that the health state at cycle number n is conditioned on the health state at cycle number n – 1 and is independent of the health state before cycle number n – 1. This property is the defining property of Markov models of this type, which are referred to as Markov chain models.

The transition probabilities for the example just described are given in Table 2, assuming a cycle period of 1 year. All these transition probabilities can be derived from the following three assumptions: if the patient has the disease, the probability of transitioning from Alive to Disease-Specific Death in 1 year's time is 0.15 if the patient does not undergo surgery and 0.03 if the patient undergoes successful surgery; the probability of transitioning from Alive to Death from Other Causes in 1 year's time is 0.05 if the patient does not experience surgery-related death; and surgery-related death is immediate.

Once the health states, allowed transitions between health states, and transition probabilities are identified, the life expectancy can be calculated using an algebraic solution, a cohort simulation, or a Monte Carlo simulation. All three approaches will be illustrated for estimating the life expectancy L1 in Figure 1A, where a 60-year-old patient presents with clinical symptoms, the decision is made to Do Nothing, and the patient actually has the disease (D+). In this case, p1 = 0.15, p2 = 0.05, p3 = 0, and p4 = 0.80, as shown in Table 2.

Algebraic Solution

If the transition probabilities are constant over time, then a closed-form algebraic solution exists for estimating the life expectancy. In the simple example above, the patient's life expectancy L1 is calculated as follows:

L1 = present age + 1 / (p1 + p2 + p3) = 60 + 1 / (0.15 + 0.05 + 0) = 65 years.

In more complex Markov chain models with numerous transient, recurrent, and absorbing states, a matrix formalism may be necessary to evaluate the model using the closed-form, algebraic approach.

Cohort Simulation

If the transition probabilities are not constant over time, simulating the outcomes of a cohort of patients is commonly implemented. This simulation process is initiated by distributing a cohort among the health states. For the above example, the entire cohort begins in the Alive state. At each cycle the cohort is redistributed among the states, depending on the transition probabilities. Markov cohort simulation for estimating the life expectancy L1 is illustrated in Table 3. The initial cohort size is 10,000. At the start of the simulation, the 10,000 patients are in the Alive state. By the end of the first cycle, p1 × 10,000 = 0.15 × 10,000 = 1,500 patients enter Disease-Specific Death and p2 × 10,000 = 0.05 × 10,000 = 500 patients enter Death from Other Causes, leaving 10,000 – 1,500 – 500 = 8,000 patients in the Alive state for the start of the second cycle. By the end of the second cycle, an additional p1 × 8,000 = 0.15 × 8,000 = 1,200 patients enter the Disease-Specific Death state, and p2 × 8,000 = 0.05 × 8,000 = 400 patients enter Death from Other Causes, leaving 8,000 – 1,200 – 400 = 6,400 in the Alive state. The cumulative number of patients in each of the three states for the first 45 cycles of the Markov process is shown in Table 3. Each row totals 10,000 patients. Life expectancy is calculated as the average amount of time a patient is in the Alive state. For this example, a 60-year-old patient remains in the Alive state for 5 years on average, making his or her life expectancy L1 = 60 + 5 = 65 years.

Monte Carlo Simulation

If complex dependencies exist in the state transition model, an intensive computer simulation procedure called Monte Carlo simulation may be needed to compute the life expectancy. In Monte Carlo simulation, patients traverse the health states one at a time, with a random number generator (RNG) determining what happens to an individual at each cycle of the process. An RNG is a computer algorithm that produces sequences of numbers that on aggregate have a specified probability distribution and individually possess the appearance of randomness.

To estimate L1 in the above example via Monte Carlo simulation, an RNG samples a uniform distribution from 0 to 1. When the RNG produces a number in the range 0–0.05, the patient is assigned to Death from Other Causes. This will happen 5% of the time, which corresponds to the transition probability from Alive to Death from Other Causes. When the RNG produces a number greater than 0.05 but less than or equal to 0.20, the patient is assigned to Disease-Specific Death. This will happen 15% of the time, which corresponds to the transition probability from Alive to Disease-Specific Death. Finally, when the RNG produces a number greater than 0.2 but less than or equal to 1.0, the patient remains in the Alive state, which will happen 80% of the time. In this simple example, only one random number needs to be generated at every cycle. Once the patient enters a death-related state, the life history of that patient is terminated and a new run begins that traces the life history of the next patient. The process is repeated until a large number of runs (typically 10,000) are performed. There is no formula specifying the exact number of runs needed, but the number should increase with


TABLE 3: Markov Cohort Simulation: Distribution of a 10,000-Patient Cohort at End of Each 1-Year Cycle
Cumulative No. in State
Cycle No. Age of Alive Population No. in Alive State Disease-Specific Death Death from Other Causes
0 60 10,000.00 0 0
1 61 8,000.00 1,500.00 500.00
2 62 6,400.00 2,700.00 900.00
3 63 5,120.00 3,660.00 1,220.00
4 64 4,096.00 4,428.00 1,476.00
5 65 3,276.80 5,042.40 1,680.80
6 66 2,621.44 5,533.92 1,844.64
7 67 2,097.15 5,927.14 1,975.71
8 68 1,677.72 6,241.71 2,080.57
9 69 1,342.18 6,493.37 2,164.46
10 70 1,073.74 6,694.69 2,231.56
11 71 858.99 6,855.75 2,285.25
12 72 687.19 6,984.60 2,328.20
13 73 549.76 7,087.68 2,362.56
14 74 439.80 7,170.15 2,390.05
15 75 351.84 7,236.12 2,412.04
16 76 281.47 7,288.89 2,429.63
17 77 225.18 7,331.12 2,443.71
18 78 180.14 7,364.89 2,454.96
19 79 144.12 7,391.91 2,463.97
20 80 115.29 7,413.53 2,471.18
21 81 92.23 7,430.82 2,476.94
22 82 73.79 7,444.66 2,481.55
23 83 59.03 7,455.73 2,485.24
24 84 47.22 7,464.58 2,488.19
25 85 37.78 7,471.67 2,490.56
26 86 30.22 7,477.33 2,492.44
27 87 24.18 7,481.87 2,493.96
28 88 19.34 7,485.49 2,495.16
29 89 15.47 7,488.39 2,496.13
30 90 12.38 7,490.72 2,496.91
31 91 9.90 7,492.57 2,497.52
32 92 7.92 7,494.06 2,498.02
33 93 6.34 7,495.25 2,498.42
34 94 5.07 7,496.20 2,498.73
35 95 4.06 7,496.96 2,498.99
36 96 3.25 7,497.57 2,499.19
37 97 2.60 7,498.05 2,499.35
38 98 2.08 7,498.44 2,499.48
39 99 1.66 7,498.75 2,499.58
40 100 1.33 7,499.00 2,499.67
41 101 1.06 7,499.20 2,499.73
42 102 0.85 7,499.36 2,499.79
43 103 0.68 7,499.49 2,499.83
44 104 0.54 7,499.59 2,499.86
45 105 0.44 7,499.67 2,499.89
Note—Life expectancy is computed as average amount of time a patient remains in Alive state.
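The Table 3 cohort calculation and the Monte Carlo procedure can both be sketched in a few lines of Python. The transition probabilities (p1 = 0.15, p2 = 0.05), the 10,000-patient cohort, and the starting age of 60 follow the article; the fixed random seed is an arbitrary illustrative choice.

```python
# Life expectancy L1 for the Do Nothing / D+ branch, estimated two ways:
# a Markov cohort simulation (as in Table 3) and a Monte Carlo simulation
# (one uniform draw per cycle, as in Table 4).
import random

P_DSD, P_DOC = 0.15, 0.05  # annual transitions out of "Alive"

def cohort_simulation(n=10_000, cycles=45):
    """Redistribute the cohort each cycle; returns Table 3-style rows and L1."""
    alive, dsd, doc = float(n), 0.0, 0.0
    person_years = 0.0
    rows = []
    for cycle in range(1, cycles + 1):
        person_years += alive          # each survivor contributes 1 year per cycle
        dsd += P_DSD * alive           # cumulative disease-specific deaths
        doc += P_DOC * alive           # cumulative deaths from other causes
        alive *= 1.0 - P_DSD - P_DOC   # remaining in "Alive"
        rows.append((cycle, alive, dsd, doc))
    return rows, 60 + person_years / n

def monte_carlo(runs=10_000, seed=7):
    """Trace one patient at a time; life expectancy = average age at death."""
    rng = random.Random(seed)          # seed is an arbitrary choice
    total_age = 0
    for _ in range(runs):
        age = 60
        while True:
            age += 1                   # one cycle = 1 year
            if rng.random() <= P_DOC + P_DSD:  # <= 0.05: DOC; (0.05, 0.20]: DSD
                break                  # death-related state terminates the run
        total_age += age
    return total_age / runs

rows, le_cohort = cohort_simulation()
print(rows[0])   # cycle 1: ~8,000 alive, ~1,500 DSD, ~500 DOC, matching Table 3
print(round(le_cohort, 2), round(monte_carlo(), 2))
```

With constant transition probabilities, both estimates converge on the algebraic answer, 60 + 1/(p1 + p2) = 65 years; the Monte Carlo result differs only by simulation noise, which shrinks as the number of runs grows.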


the complexity of the model to reduce simulation variability in the result.

Six sample runs of a Monte Carlo simulation are shown in Table 4. In each run, the patient is initiated in the Alive state at age 60 and ages 1 year in every cycle. Table 4 shows that in run 1, the patient dies of the disease after 5 years (at age 65); in run 2, the patient dies of other causes after 3 years (at age 63). The runs are repeated 10,000 times. The life expectancy is the average age at death. A valuable output of Monte Carlo simulation is a histogram of age at death, so that measures of variability in the life expectancy are easy to calculate [12, 13].

TABLE 4: Monte Carlo Simulation of a Markov Process

Cycle No.a   Run 1         Run 2         Run 3         Run 4         Run 5        …   Run 10,000
             RNG   State   RNG   State   RNG   State   RNG   State   RNG   State     RNG   State
1            0.63  Alive   0.75  Alive   0.23  Alive   0.93  Alive   0.55  Alive     0.92  Alive
2            0.83  Alive   0.91  Alive   0.32  Alive   0.59  Alive   0.15  DSD       0.64  Alive
3            0.56  Alive   0.03  DOC     0.69  Alive   0.30  Alive                   0.48  Alive
4            0.20  Alive                 0.33  Alive   0.60  Alive                   0.82  Alive
5            0.09  DSD                   0.63  Alive   0.19  DSD                     0.61  Alive
6                                        0.30  Alive                                 0.86  Alive
7                                        0.39  Alive                                 0.86  Alive
8                                        0.83  Alive                                 0.07  DSD
9                                        0.52  Alive
10                                       0.68  Alive
11                                       0.27  Alive
12                                       0.14  DSD

Note—Each run represents life history of a single individual. At each cycle in a given run, a random number generator (RNG) outputs a number from a uniform distribution between 0 and 1. If RNG produces a number ≤ 0.05, patient is assigned to "Death from Other Causes" (DOC); > 0.05 and ≤ 0.20, "Disease-Specific Death" (DSD); or > 0.20 and ≤ 1, "Alive." Once patient enters a death-related state, life history of that patient is terminated and a new run begins until maximum number of runs is completed.
aCycle period = 1 year.

Markov models have much broader applicability than estimating life expectancy. They are used in a variety of fields to represent processes that evolve over time in a probabilistic manner. The article by Kuntz and Weinstein [14] is recommended further reading on Markov modeling in medical decision analysis.

Cost-Effectiveness Analysis

Cost-effectiveness analysis is a type of decision analysis in which both health and economic outcomes are considered simultaneously in making a decision [15, 16]. The decision analysis example described previously focused on maximizing life expectancy (LE). Although maximizing life expectancy is a reasonable value, it may not necessarily be the basis for a preferred decision. If the difference in life expectancy between an existing clinical protocol and a new clinical protocol is small but the difference in costs is large, it may be more prudent to follow the existing protocol and invest health care dollars in another clinical problem for which the incremental life expectancy is higher for the same health care expenditures.

In cost-effectiveness analysis, the expected value is reported as the marginal cost per year of life saved (MCYLS) [17]. When the decision tree is rolled back, the average cost is evaluated in parallel with the life expectancy. Dominant options are ranked in terms of incremental cost-effectiveness ratios.

The value of diagnostic testing is put to the greatest challenge in cost-effectiveness analysis. Often diagnostic testing increases both life expectancy and health care costs. The application of cost-effectiveness analysis to diagnostic testing is introduced in an article by Fryback [18] and discussed in more detail in an article by Singer and Applegate [19]. More general discussions on the role of cost-effectiveness analysis and recommendations for reporting results are found in other articles [20–22].

Summary

Decision analysis is a multifaceted concept. Underlying the decision analytic process is clarification of the decision and values for making a good decision, integration of data from multiple data sources, and mathematic modeling. The necessary steps for any decision analysis are summarized in Appendix 1.

The major strength of decision analysis is that the process offers an explicit and systematic approach to decision making based on uncertainty. The major weakness of decision analysis lies with the decision analyst who uses data to populate a model without understanding the biases in the data and therefore does not fully explore their impact on the decision [23, 24]. This problem is minimized when the decision analyst is fully knowledgeable of both clinical domain–specific and methodology-specific issues.

This article has focused the basic ideas of decision analysis on the problem of evaluating a diagnostic imaging test on the basis of long-term patient outcomes when only the test's sensitivity and specificity are known. Markov models were introduced as a means of linking intermediate to long-term outcomes. Even when the inputs and structure of the decision analysis model may be incompletely supported by data, the decision analysis process itself can be valuable in identifying important areas of uncertainty and directing the investment of resources toward acquiring information needed to address the question of interest. Such analyses may be warranted before resources are committed to large-scale, costly clinical trials.

References
1. Thornbury JR. Why should radiologists be interested in technology assessment and outcomes research? AJR 1994; 163:1027–1030
2. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88–94
3. Ramsey SD, McIntosh M, Etzioni R, Urban N. Simulation modeling of outcomes and cost effectiveness. Hematol Oncol Clin North Am 2000; 14:925–938
4. Weinstein MC, Fineberg HV, Elstein AS, et al. Clinical decision analysis. Philadelphia, PA: Saunders, 1980
5. Pauker SG, Kassirer JP. Decision analysis. N Engl J Med 1987; 316:250–258
6. Sox H, Blatt MA, Higgins MC, Marton KI. Medical decision making. Boston, MA: Butterworths, 1988
7. Fineberg HV. Decision trees: construction, uses, and limits. Bull Cancer 1980; 67:395–404
8. Schulzer M. Diagnostic tests: a statistical review. Muscle Nerve 1994; 17:815–819
9. Metz CE. Basic principles of ROC analysis. Semin Nucl Med 1978; 8:283–298
10. Sonnenberg FA, Beck JR. Markov models in medical decision making: a practical guide. Med Decis Making 1993; 13:322–338
11. Beck JR, Pauker SG. The Markov process in medical prognosis. Med Decis Making 1983; 3:419–458
12. Tambour M, Zethraeus N. Bootstrap confidence intervals for cost-effectiveness ratios: some simulation results. Health Econ 1998; 7:143–147
13. Critchfield GC, Willard KE. Probabilistic analysis of decision trees using Monte Carlo simulation. Med Decis Making 1986; 6:85–92
14. Kuntz KM, Weinstein MC. Life expectancy biases in clinical decision modeling. Med Decis Making 1995; 15:158–169
15. Gold MR, Siegel JE, Russell LB, Weinstein MC, eds. Cost-effectiveness in health and medicine. Oxford, England: Oxford University Press, 1996
16. Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practices. N Engl J Med 1977; 296:716–721
17. Detsky AS, Naglie IG. A clinician's guide to cost-effectiveness analysis. Ann Intern Med 1990; 113:147–154
18. Fryback DG. Technology evaluation: applying cost-effectiveness analysis for health technology assessment. Decisions in Imaging Economics 1990; 3:4–9
19. Singer ME, Applegate KE. Cost-effectiveness analysis in radiology. Radiology 2001; 219:611–620
20. Russell LB, Gold MR, Siegel JE, Daniels N, Weinstein MC. The role of cost-effectiveness analysis in health and medicine. Panel on Cost-Effectiveness in Health and Medicine. JAMA 1996; 276:1172–1177
21. Weinstein MC, Siegel JE, Gold MR, Kamlet MS, Russell LB. Recommendations of the Panel on Cost-effectiveness in Health and Medicine. JAMA 1996; 276:1253–1258
22. Siegel JE, Weinstein MC, Russell LB, Gold MR. Recommendations for reporting cost-effectiveness analyses. Panel on Cost-Effectiveness in Health and Medicine. JAMA 1996; 276:1339–1341
23. Sheldon TA. Problems of using modelling in the economic evaluation of health care. Health Econ 1996; 5:1–11
24. Buxton MJ, Drummond MF, Van Hout BA, et al. Modelling in economic evaluation: an unavoidable fact of life. Health Econ 1997; 6:217–227

APPENDIX 1: Necessary Steps for Any Decision Analysis

1. Identify the clinical problem and targeted patient population.


2. Identify clinical options.
3. Identify the decision maker.
4. Identify the outcomes associated with each clinical option.
5. Identify the value on which the decision will be based.
6. Assign a value to each terminal branch of the decision tree. This step may include additional modeling, such as Markov models, to link
known intermediate health states to long-term outcomes.
7. Assign probabilities to each chance event.
8. Compute the expected value of each decision by averaging out the decision tree.
9. Perform a sensitivity analysis.
10. Report the model assumptions, inputs, and results.
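As a sketch of steps 6–9 applied to the article's three-option example: the terminal life expectancies below (65, 72.5, and 80 years) follow from the transition probabilities stated in the text, but the 5% operative mortality and the 0.90 test sensitivity and specificity are illustrative assumptions rather than values quoted from the article's Table 2, so the printed crossover behavior only qualitatively mirrors Figure 2.

```python
# Decision-tree rollback (steps 6-8) plus a one-way sensitivity analysis
# (step 9) on the pretest probability of disease, P(D+). The operative
# mortality and test characteristics are assumed values for illustration.

L_SICK, L_CURED, L_WELL, AGE = 65.0, 72.5, 80.0, 60.0   # terminal life expectancies
OP_MORT, SENS, SPEC = 0.05, 0.90, 0.90                  # assumptions, not Table 2 values

def surgery_le(le_if_survives):
    # Average over the chance node "operative death vs. survives surgery".
    return OP_MORT * AGE + (1 - OP_MORT) * le_if_survives

def expected_le(option, p):
    """Roll back the tree: probability-weighted average of terminal values."""
    if option == "Do Nothing":
        return p * L_SICK + (1 - p) * L_WELL
    if option == "Surgery":
        return p * surgery_le(L_CURED) + (1 - p) * surgery_le(L_WELL)
    if option == "Imaging":  # operate only on positive imaging findings
        sick = SENS * surgery_le(L_CURED) + (1 - SENS) * L_SICK
        well = (1 - SPEC) * surgery_le(L_WELL) + SPEC * L_WELL
        return p * sick + (1 - p) * well

options = ["Do Nothing", "Surgery", "Imaging"]
for p in (0.01, 0.10, 0.50, 0.90):
    best = max(options, key=lambda o: expected_le(o, p))
    print(f"P(D+) = {p:.2f}: best option is {best}")
# With these inputs the best option shifts from Do Nothing to Imaging to
# Surgery as P(D+) rises, as in the article's one-way sensitivity analysis.
```

Sweeping p on a fine grid locates the crossover points (thresholds) at which the recommended option changes; varying SENS and SPEC instead reproduces the structure of the two-way analysis.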

The reader’s attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005
18. Correlation and Regression, July 2005
19. Survival Analysis, July 2005
20. Multivariate Statistical Methods, August 2005



Research • Fundamentals of Clinical Research for Radiologists

Radiology Cost and Outcomes Studies: Standard Practice and Emerging Methods
William Hollingworth1

Hollingworth W. Radiology cost and outcomes studies: standard practice and emerging methods. AJR 2005;185:833–839

Series editors: Nancy Obuchowski, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 22nd and final article in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series has been designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: G. Scott Gazelle, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

DOI:10.2214/AJR.04.1780

Received November 17, 2004; accepted after revision November 23, 2004.

1Department of Radiology, University of Washington, Harborview Medical Center, 325 Ninth Ave., Box 359960, Seattle, WA 98104. Address correspondence to W. Hollingworth (willh@u.washington.edu).

0361–803X/05/1854–833

© American Roentgen Ray Society

Cost and outcomes research has become an integral part of radiology since the pioneering randomized controlled trials (RCTs) of radiographic screening for lung and breast cancer in the 1960s and 1970s [1, 2]. The impetus for radiologists to become involved in this technology assessment process is likely to continue to increase in the foreseeable future. Medical expenditures are at an all-time high; in 2002 the United States spent just under 14% of its gross domestic product on health care. This equates to about $4,900 per capita annually, more than double the amount spent by other industrialized countries such as Sweden, Australia, and Japan [3]. The source of the increase in spending is certainly multifactorial, including the need to provide care to an aging population, which places ever higher expectations on the capabilities of medicine. However, the public and health care professionals alike perceive that medical technologies also drive expenditure. A survey of health economists revealed that 81% identified technologic change as the primary reason for the increase in health sector spending [4]. Purchases of expensive diagnostic imaging equipment are particularly visible; 68% of respondents to a U.S. community survey thought that the increase in diagnostic procedures played a large or very large role in increasing health care costs [5].

The introduction of noninvasive angiography using MRI or MDCT to replace catheter angiography provides one of several examples in which diagnostic imaging advances have the potential to simultaneously reduce costs and benefit patients [6, 7]. However, this will not always be the case. Newer imaging technology may increase costs for any of the following reasons: if it is an adjunct rather than a replacement for existing imaging methods; if it has a higher unit cost than existing imaging; or if, by making the imaging process more convenient, the threshold for imaging is lowered [8]. In these situations the onus will continue to be on radiologists to provide evidence that newer imaging techniques improve diagnostic and therapeutic decision making and thereby benefit patients.

This article has three objectives: first, to identify factors that, in combination, make radiology cost and outcomes studies unique; second, to review standard methods for measuring the cost and outcomes of diagnostic imaging; and third, to describe emerging methods that will help radiologists conduct and interpret cost and outcomes studies in future years.

What Are the Factors That Make Cost and Outcomes Research in Radiology Unique?

The Gap Between Diagnosis and Outcome

The fundamental distinction between outcomes research in radiology and other areas of medicine, such as surgery and pharmaceutics, is the distance between cause and effect: the chain of events that separates the immediate aim of radiology, which is to make an accurate diagnosis, from the ultimate goal, which is to improve patient health and life expectancy at an affordable cost. The links in this chain have been formalized in the hierarchy originally developed by Fineberg et al. [9] and adapted by others [10, 11]. The first two levels of this hierarchy depend on the capability of the imaging technology to depict normal and abnormal anatomy and function (level 1) and the ability of radiologists to use the images to make accurate diagnoses (level 2). Beyond these initial two levels, the value of diagnostic imaging is dictated by factors that are not under the control of radiology. The referring clinician must be convinced by the imaging results to change the working diagnosis (level 3) and therapy (level 4) for the patient. Effective therapeutic options must be available if the change in therapy is to benefit patients (level

AJR:185, October 2005 833



5). Finally, the net cost of diagnosis and treatment must be justified by improvements in patients’ health (level 6). Failure at any one of the latter four levels will undermine the value of even the most accurate diagnostic test.

The Size of the Study

One upshot of this hierarchy of events is that imaging, particularly when used to screen asymptomatic populations, is likely to directly benefit only a small subgroup of recipients. This is in contrast to therapeutic interventions, in which all patients have the potential to benefit. For example, in many breast cancer screening programs, fewer than 1% of mammograms result in a confirmed case of cancer [12]. The health of the remaining 99% of women is unlikely to be directly affected beyond reassurance provided by a negative result or anxiety raised by false-positive findings. Consequently, most studies of screening are large trials recruiting thousands of patients, or decision analyses based on hypothetical models of diagnostic accuracy and therapeutic effectiveness. Large trials are needed to detect with statistical accuracy health effects in the small proportion of the population with the disease.

The Intrinsic Value of Diagnostic Information

Even diagnostic imaging of symptomatic patients may not radically alter treatment for many recipients. For example, in a study comparing MRI and arthrography for patients with shoulder pain and suspected full-thickness rotator cuff tears, Blanchard et al. [13] found that preimaging management plans changed in 36% and 25% of patients, respectively. Although imaging may not always trigger a change in therapy, diagnostic information may still have intrinsic value. In 1994, Mushlin et al. [14] found that patients with suspected multiple sclerosis became less anxious after a positive MRI diagnosis, even though they faced a

in defining best practice. However, as research methods have evolved there have been a number of landmark publications that have defined a methodologic blueprint for research. The Consolidated Standards of Reporting Trials (CONSORT) statement provides a checklist of items considered essential for the clear presentation of RCT results [16]. Similar guidelines have been developed for nonrandomized studies [17], economic evaluations [18], and decision analysis models [19]. In addition, a number of excellent articles apply general cost and outcomes methods to radiology [20, 21].

The purpose of this section is to briefly recapitulate the standard methodologic issues, with the expectation that readers who require more detail will turn to the citations listed in the text.

Study Design

Blackmore et al. [22] identified 238 radiology cost and outcome studies conducted over a 40-year period. Most studies presented primary data from observational cohort or case-control studies (59%) or RCTs (18%), and the remaining studies used secondary data available in the medical literature to build decision analysis models. RCTs are thought to be the best method of providing unbiased evidence on the costs and effectiveness of alternative imaging technologies [23]. The process of randomly allocating patients to receive one of the two or more putative technologies makes it probable that any differences observed in subsequent outcomes will be truly due to the imaging strategy and not caused by the myriad of individual patient characteristics that confound the interpretation of nonrandomized studies. However, RCTs do have drawbacks and are not necessary to answer all radiology outcomes research questions [24]. Most notably, rigorous RCTs require a substantial commitment of time and money.

ought to be included in the analysis. For example, a recent trial compared the cost of abdominal CT with 120 mL of nonionic contrast versus the same technique with 100 mL of the same contrast material pushed with 40 mL of saline [25]. From the perspective of the hospital and society as a whole, the small cost reduction of the saline flush method is relevant because it might generate substantial savings in the long run. However, from the perspective of third-party insurers, who pay a fixed reimbursement rate for contrast-enhanced CT, the cost reduction is of no immediate relevance or value. Therefore, an explicit statement of the perspective of the study is a vital, although often overlooked [26], part of a cost and outcomes study.

Current guidelines recommend that the default study perspective should be societal [18]. This is the broadest perspective and includes the costs borne by individuals and public and private organizations within society.

Measuring Costs

Table 1 provides examples of the costs and costing methods that might be used for diagnostic technology assessment from the point of view of four commonly encountered perspectives. Importantly, the cost of medical care to society is not equivalent to the charge billed by the provider. Charges incorporate both costs and a profit margin. From the perspective of society, profit merely represents the transfer of money from one member of society (the payer) to another (the provider); no resources are depleted, and society as a whole is neither richer nor poorer. Therefore, charges tend to overestimate cost.

Costs can either be calculated directly using activity-based costing (ABC) methods or indirectly using proxies for cost based on third-party insurer reimbursement rates or cost-to-charge ratios. The ABC method, also
chronic disease with, at that time, few thera- Moreover, often only a select subset of pa- referred to as microcosting, is the more accu-
peutic options. A negative test result may also tients enrolls in trials, making the extrapola- rate and laborious. It is usually reserved for
be beneficial if it reassures the patient that tion of trial results to real-world clinical prac- elements of cost likely to be most influential
nothing is seriously wrong. However, this is tice problematic. Despite these caveats, for for the study results. Nisenbaum et al. [27]
not a predictable effect; indeed, in some pa- the most important questions, RCTs should used ABC methods to calculate the costs of
tients, negative test findings can heighten anx- continue to spearhead the push toward the ra- 17 CT procedures performed at a university
iety about the cause of ongoing symptoms tional use of diagnostic imaging. hospital. Each element of resource use is
[15]. These intrinsic effects emphasize the im- identified, measured, and valued. For exam-
portance of assessing patients’ perceptions of Choosing the Perspective of the Study ple, the CT machine cost per examination is a
their physical and mental health after imaging. Innovations in imaging rarely affect all ele- function of the purchase cost, maintenance
ments of society, such as physicians and insur- and upgrade costs, machine life expectancy,
Standard Methods in Cost and ers, equally. The value of imaging will depend yearly hours of machine operation, and the
Outcomes Research on the viewpoint, or perspective, of the analyst. number of minutes spent imaging each pa-
The diverse nature of cost and outcomes By stating the perspective of the study, the re- tient. Using this detailed approach, a cost for
research makes it difficult to be prescriptive searcher predetermines the relevant costs that all elements of the CT procedure, including

834 AJR:185, October 2005


Radiology Cost and Outcomes Studies

TABLE 1: Costs Under Alternative Perspectives

| Resource Item | Societal | Hospital & Care Provider | Third-Party Insurer | Patient & Family |
|---|---|---|---|---|
| Diagnostic imaging | Cost of equipment, consumables, overhead, and personnel^a | Cost of equipment, consumables, overhead, and personnel^a | Reimbursement rate and administrative costs for covered items | Out-of-pocket expenses (e.g., charge, copayment) |
| Medication | Cost of developing, manufacturing, and marketing the drug^b | Negotiated price of medication | Reimbursement rate and administrative costs for covered drugs | Out-of-pocket expenses (e.g., charge, copayment) |
| Outpatient and office-based therapy | Cost of equipment, consumables, overhead, and personnel^a | Cost of equipment, consumables, overhead, and personnel^a | Reimbursement rate and administrative costs for covered items | Out-of-pocket expenses (e.g., charge, copayment) |
| Hospital admission | Cost of equipment, consumables, overhead, and personnel^c | Cost of equipment, consumables, overhead, and personnel^c | Reimbursement rate and administrative costs for covered items | Out-of-pocket expenses (e.g., charge, copayment) |
| Patient time and money spent receiving care | Cost of transportation, parking, etc.; opportunity cost of time^d | Not included | Not included | Cost of transportation, parking, etc.; opportunity cost of time^d |
| Patient time off work due to illness | Opportunity cost of time^d | Not included | Not included | Opportunity cost of time^d |
| Informal care giving | Opportunity cost of care givers' time^d | Not included | Not included | Opportunity cost of care givers' time^d |

^a In situations in which cost cannot be directly calculated, Medicare reimbursement rates (including both professional and technical components for diagnostic tests) are often used as a proxy for cost [46].
^b In situations in which cost cannot be directly calculated, average wholesale price, which approximates prices in discount pharmacies, is often used as a proxy for cost [46].
^c In situations in which cost cannot be directly calculated, Medicare Prospective Payment System rates or cost-to-charge ratios are often used as a proxy for cost [46].
^d The hourly wage that the patient or care giver would have been earning is often used to estimate the cost of time lost due to illness [46].

consumables (e.g., contrast material and film) and radiologist, technologist, administrative, and overhead (e.g., rent) costs, is developed.

The intricate ABC approach is not always feasible, and simpler methods are often sufficient. For example, the Centers for Medicare and Medicaid Services has made extensive efforts to implement a resource-based relative value scale (RBRVS) of reimbursement. This system provides reimbursement for each radiology procedure based on the perceived complexity and resource utilization required to perform that procedure. One advantage of this system is that it is standardized at a national level. Nevertheless, recent work has indicated that substantial inaccuracies may still exist in reimbursement rates, resulting in poorly (e.g., radiography and interventional) and favorably (e.g., sonography, MR, and CT) reimbursed techniques [28]. Other authors have used cost-to-charge ratios to estimate cost by removing the element of profit in the charges billed for medical procedures [29]. The cost-to-charge ratio is the ratio of annual departmental expenditure to revenue. However, because the profit margin may vary widely among imaging examinations, the devaluation of charges based on uniform departmental-level cost-to-charge ratios provides only a crude estimate of the cost of individual imaging examinations. Therefore, overreliance on reimbursement rates or cost-to-charge ratios may distort the cost analysis. In practice, there is a trade-off between the accuracy and the feasibility of costing methods. Many studies use a combination of ABC methods for key cost elements, such as the initial imaging, and cost proxies for other costs, such as subsequent medications and inpatient and outpatient care.

All cost data should be standardized and updated to reflect current costs. Often, because of the scarcity of cost information, analysts draw on cost data from several years. In these circumstances, historical cost data are inflated to current values using the medical care component of the consumer price index. On a similar theme, current U.S. guidelines recommend that future costs, savings, and health outcomes be discounted at a rate of 3% per year [18]. Therefore, a screening test in 2004 that prevented $1,000 of treatment costs in 2006 would receive credit for saving only $943 (i.e., $1,000 / [1 + 0.03]²). The rationale for discounting is based on evidence that people prefer to have resources now rather than in the future for several reasons, including the opportunity to profitably invest current funds. Controversially, discounting lowers the estimated efficiency of screening interventions, in which costs occur immediately but benefits are delayed.

Choosing the Type of Economic Evaluation and Measuring Outcomes
Although there are four types of economic evaluation commonly defined in the literature (Table 2), most health care studies can be classified as one of two types. Currently, the most prevalent method is cost-effectiveness analysis, accounting for more than 80% of published analyses [30]. The distinguishing feature of cost-effectiveness analysis is that the outcome measure used reflects only a limited aspect of health. This primary outcome can be a clinical measure such as mortality, bone density, or exercise tolerance, or a patient-reported measure such as pain or quality of life. For example, in an RCT comparing coronary interventions guided by intravascular sonography or angiography, Mueller et al. [31] used 2-year major cardiac event-free survival to determine whether either imaging method improved patient outcomes. Cost-effectiveness analysis works well in situations in which imaging is expected to improve one predominant aspect of health. However, if imaging is likely to affect more than one element of health or
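The machine-cost apportionment and discounting arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration: the scanner figures are hypothetical placeholders, while the 3% rate and the $1,000 two-year example are taken from the text.

```python
def equipment_cost_per_exam(purchase_cost, annual_maintenance,
                            life_years, hours_per_year, minutes_per_exam):
    """Straight-line apportionment of machine costs to one examination."""
    annual_cost = purchase_cost / life_years + annual_maintenance
    return annual_cost / hours_per_year * (minutes_per_exam / 60)

def present_value(amount, years_ahead, rate=0.03):
    """Discount a future cost or saving at the recommended 3% per year."""
    return amount / (1 + rate) ** years_ahead

# Hypothetical CT scanner: $1M purchase, $80,000/yr maintenance and
# upgrades, 8-year life, 3,000 operating hours/yr, 15 min per patient
per_exam = equipment_cost_per_exam(1_000_000, 80_000, 8, 3_000, 15)

# From the text: $1,000 of treatment costs averted 2 years from now is
# credited as roughly $943 today
pv = present_value(1_000, 2)
```

In a full ABC analysis, analogous calculations would be repeated for consumables, personnel, and overhead and then summed.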



Hollingworth

TABLE 2: Types of Cost and Outcomes Studies

| Type of Evaluation | Cost Measure | Outcome Measure |
|---|---|---|
| Cost-minimization analysis (CMA) | Dollars | Assumed or known to be the same for both imaging strategies |
| Cost-effectiveness analysis (CEA) | Dollars | Any of various intermediate (e.g., number of cases detected), clinical (e.g., mortality), or patient-reported (e.g., pain) outcomes |
| Cost–utility analysis (CUA) | Dollars | Quality-adjusted life years (QALYs) |
| Cost–benefit analysis (CBA) | Dollars | Dollars |

longevity, then the more inclusive quality-adjusted life year (QALY) outcome measure used in cost-utility analysis is recommended.

Cost-utility analysis measures outcomes by weighting years of life by a factor (Q) that represents the patient's health-related quality of life. Q is anchored at 1 (perfect health) and 0 (a health state considered to be as bad as death) and is estimated for all health states between these extremes. A QALY is simply the number of years that a patient spends in each health state multiplied by the quality of life weight, Q, of that state. For example, a patient who spends 2 years in an imperfect health state, where Q = 0.75, would achieve 1.5 QALYs (0.75 × 2). The quality weight, Q, can be elicited directly from patients using methods such as the visual analogue scale, time trade-off, and standard gamble; these methods have been described in detail elsewhere [21, 32]. Alternatively, in an increasing number of studies, Q is estimated indirectly via a quality of life questionnaire such as the EQ-5D [33] or the Health Utilities Index [34]. The questionnaire asks the patient to categorize current health in various dimensions, for example, physical functioning, pain, and mental health. Every possible combination of questionnaire responses is associated with a quality weight, Q, from a catalog or algorithm provided by the questionnaire creators. The weights in this catalog are based on prior surveys of the general public's preferences for the health states described by the questionnaire. This indirect approach to estimating Q is currently being used in a trial comparing duplex sonography with clinical surveillance after femoral vein bypass [35]. In this trial, imaging influences medical therapy for ischemia or surgical decisions to amputate and therefore affects several aspects of health, including mobility, self-care, and pain. These researchers chose the EQ-5D questionnaire, which incorporates all of these dimensions of health.

The QALY provides a universal outcome measure that could be used in all clinical trials. Therefore, the efficiency of femoral vein sonography from the trial just described could, in theory, be compared with any other medical intervention in which cost-utility analysis data are available. For this reason, current guidelines favor cost-utility analysis as the most useful method for policy makers [18]. However, some authors are skeptical of the QALY method [36], and it is likely that cost-effectiveness analysis will remain a popular method of economic evaluation in the near future.

The benefits of screening, diagnosis, and preventive treatment may influence the entire course of patients' lives. Therefore, cost and outcomes studies should strive to measure the lifetime impact of imaging. However, in prospective studies it is not practical to follow up patients indefinitely. Therefore, analysts often report the primary results after the first few years of follow-up and extrapolate any differences in cost and outcomes data over the remaining life expectancy of patients [37].

Analysis Methods
The incremental cost-effectiveness ratio (ICER) is conventionally used to summarize the relative efficiency of medical procedures. The ICER is calculated as follows:

ICER = (C1 – C0) / (E1 – E0) = ΔC / ΔE

where C1, C0, E1, and E0 are the mean cost and effectiveness of the two imaging strategies being compared, and ΔC and ΔE are the differences between the mean costs and mean effectiveness of the two strategies, respectively.

Therefore, a screening strategy that increases costs by an average of $500 per patient and improves life expectancy by an average of 0.04 QALYs per patient has an ICER of $12,500 per QALY saved. Typically, less cost-effective imaging strategies will have higher, positive, ICER values. However, no consensus exists on an exact threshold that would distinguish efficient from inefficient health care interventions. In reality, this threshold will vary over time and according to many other factors, including the amount of money available to fund health care.

The ICER statistic has several weaknesses. Most important, the meaning of a negative ICER statistic is ambiguous and open to misinterpretation. For example, an efficient imaging strategy that is both cheaper (–$1,000) and more effective (0.1 QALYs) than the strategy with which it is being compared has an ICER of –$10,000. Likewise, an inefficient imaging strategy that is both more expensive ($500) and less effective (–0.05 QALYs) than the strategy with which it is being compared also has the same ICER value, –$10,000. The policy implications of these two scenarios are diametrically opposed, yet the ICER is identical. Furthermore, merely presenting the ICER estimate without quantifying the surrounding confidence interval is of limited value. Unfortunately, however, the ICER has an undefined variance; this complicates even simple statistical tasks such as hypothesis testing and confidence interval calculation [38].

In recognition of these weaknesses, newer methods are emerging, such as the net benefit statistic and cost-effectiveness acceptability curves, and are being used to complement or supplant the ICER statistic in economic analyses. These emerging methods will be discussed in the final section of this article.

It is often difficult to generalize cost-effectiveness results observed in one imaging center to other settings. For example, a survey of 26 Canadian MRI centers concluded that the average operating time per week was 64 hr (range, 25–113 hr) [39]. It would be unreasonable to assume that the cost of MRI equipment per examination is identical for centers at opposite ends of this spectrum. Therefore, sensitivity analysis is frequently used to judge whether study conclusions might be reversed by plausible deviations in parameters, such as the intensity of MRI machine utilization, that underpin cost and efficacy estimates. In the example given, the sensitivity analyst might vary the mean capital cost of MRI by ±60% to simulate the plausible variation in operating hours and to judge whether a particular application of MRI is likely to be efficient even in centers with low patient throughput. Sensitivity analysis takes many forms, including one-way, multiway, and threshold analyses. These methods have been described in detail in a previous article in this series [40].
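The QALY, ICER, and one-way sensitivity calculations described in this section can be sketched as follows. The numeric examples reproduce those in the text, whereas the $50,000-per-QALY threshold and the ±60% range are illustrative assumptions, not figures from the article.

```python
def qalys(years, q):
    """Years of life in a health state weighted by its quality weight Q."""
    return years * q

def icer(delta_cost, delta_effect):
    """Incremental cost-effectiveness ratio, ΔC / ΔE."""
    return delta_cost / delta_effect

# Worked examples from the text
assert qalys(2, 0.75) == 1.5              # 2 years at Q = 0.75
assert icer(500, 0.04) == 12_500          # $12,500 per QALY gained

# The ambiguity of negative ICERs: a dominant strategy (cheaper, more
# effective) and a dominated one (dearer, less effective) share a value
assert icer(-1_000, 0.1) == icer(500, -0.05) == -10_000

# One-way sensitivity analysis: vary the incremental cost by ±60%
# (mirroring the wide range of machine utilization) and check whether
# the strategy stays under a hypothetical $50,000/QALY threshold
robust = all(icer(500 * f, 0.04) <= 50_000 for f in (0.4, 1.0, 1.6))
```

A full sensitivity analysis would vary each uncertain parameter, and combinations of parameters, rather than incremental cost alone.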


Emerging Analytic Methods
Evaluating the Imaging Process from the Patient's Perspective
In many clinical applications there are now a multitude of highly accurate imaging alternatives available. It is frequently impossible to differentiate between two imaging techniques purely on the basis of their impact on patient health or medical care costs. In these circumstances, researchers have begun to formally assess patients' views on the desirability of competing imaging procedures. For example, Blanchard et al. [41] found that 26% of patients undergoing shoulder MRI reported it to be unpleasant or extremely unpleasant compared with 7% undergoing arthrography, although most patients would allow either test to be repeated [41]. Swan et al. [42] developed a method for further quantifying the strength of patient preferences. They report that, on average, patients with peripheral vascular disease would be willing to wait an extra 6 weeks for imaging results and treatment if they could avoid the discomfort and risk of X-ray angiography. By comparison, patients would wait just more than 2 weeks to avoid the MR angiography procedure [42].

Net Benefits
Presenting cost-effectiveness results using the net benefit statistic resolves many of the problems associated with incremental cost-effectiveness ratios [43]. The net benefit statistic is calculated as follows:

NB = λ(ΔE) – (ΔC)

where λ is the amount that society is willing to pay for an improvement in health.

Therefore, continuing the previous example, if society is willing to pay $100,000 per QALY gained, then our hypothetical screening strategy that increased mean QALYs by 0.04 and increased mean costs by $500 would have a net benefit of $3,500 ([$100,000 × 0.04] – $500). Unlike the ICER, the interpretation of the net benefit statistic is clear-cut; a positive value indicates a cost-effective imaging strategy in which the net costs are more than justified by the net benefits, whereas a negative value indicates the opposite. The larger the net benefit statistic, the more cost-effective the imaging strategy and the more highly it should be prioritized. Furthermore, in large samples the mean net benefit statistic is normally distributed; therefore, hypothesis testing and confidence interval calculation are straightforward [43].

One potential limitation of the net benefit approach is that λ, the value that society is willing to pay for improved health, must be explicitly quantified and embedded in the net benefit calculation. In general, λ is not accurately known and will vary from setting to setting. To address this limitation, many authors now present their results across the spectrum of λ values. These values range from $0, implying that society cannot afford or is not willing to pay anything for improved health and will simply choose the cheapest option, through to millions of dollars, implying that society wishes and is able to pay handsomely for even the most meager health improvements. Using resampling or simulation methods [44], the probability that the net benefit statistic is positive (i.e., the intervention is cost-effective) can be calculated for each value of λ and presented as a cost-effectiveness acceptability curve (CEAC).

Cost-Effectiveness Acceptability Curves
The CEAC describes the probability that an imaging intervention is cost-effective at different willingness-to-pay thresholds. Figure 1 shows the information provided by the CEAC from a randomized trial comparing rapid MRI with radiography as the initial imaging test in patients with lower back pain [45]. The primary finding of this trial was that costs were slightly (≈$300), but not statistically significantly, higher in patients initially imaged with rapid MRI and that there was no clinically or statistically important difference in physical function outcomes. In this trial, the ICER alone is difficult to interpret because it is negative and has an undefined confidence interval. The CEAC provides more useful information. In this case, the curve crosses the y-axis, where society places no value on improvements in back-related function, at 0.16 (Fig. 1). This confirms that, on the basis of the trial data, a 16% probability still exists that rapid MRI is the cheapest strategy. Therefore, more data are required to state with certainty that the rapid MRI strategy is more expensive than radiography. As we move right along the x-axis, the probability that rapid MRI is cost-effective increases. This reflects the fact that the more society is willing to pay for improvements in physical function, the more likely it is that the extra cost of rapid MRI will be justified by
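The net benefit statistic and the construction of a CEAC can be sketched as follows. The resampled (ΔE, ΔC) pairs here are invented stand-ins for the bootstrap replicates of trial data that a real analysis would use; only the $100,000 × 0.04 − $500 worked example comes from the text.

```python
import random

def net_benefit(delta_e, delta_c, lam):
    """NB = λ·ΔE – ΔC; a positive value means cost-effective at λ."""
    return lam * delta_e - delta_c

# Worked example from the text: λ = $100,000/QALY, ΔE = 0.04, ΔC = $500
assert net_benefit(0.04, 500, 100_000) == 3_500

def ceac(samples, lambdas):
    """Probability that NB > 0 at each willingness-to-pay value λ,
    given resampled (ΔE, ΔC) pairs (e.g., bootstrap replicates)."""
    return {lam: sum(net_benefit(de, dc, lam) > 0 for de, dc in samples)
            / len(samples) for lam in lambdas}

# Hypothetical replicates: no clear effect on outcomes, about $300 of
# extra cost, both with substantial sampling uncertainty
random.seed(1)
samples = [(random.gauss(0, 0.05), random.gauss(300, 400))
           for _ in range(2_000)]
curve = ceac(samples, [0, 10_000, 25_000, 50_000])
```

Plotting `curve` (probability against λ) yields a curve of the kind shown in Figure 1.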
[Figure 1, graph omitted. Fig. 1 caption: Graph of cost-effectiveness acceptability curve shows that the probability that rapid MRI is cost-effective increases as society is willing to pay more for improvements in physical functioning. y-axis: Probability That Rapid MRI Is Cost-Effective (0.0–1.0); x-axis: Willingness to Pay for Improved Physical Function (λ), $0–$50,000.]


small improvements in function. However, in this example, the probability curve flattens quickly and never rises above 0.50. This happens because the trial data provide no substantive evidence that the rapid MRI strategy is either more or less effective than radiography. Therefore, even if society is willing to pay excessively for improved health, a 50% probability still exists that rapid MRI is not the most effective strategy. This graph informs the decision maker that it is probable, but not certain, that rapid MRI is currently not a cost-effective initial imaging tool for improving the function of patients with lower back pain.

Conclusions
This article provides a starting point for radiologists and allied health professionals who have an interest in conducting or applying the results of health services research. By its very nature, health services research is multispecialty research because the diagnostic information provided by radiology must be combined with the therapeutic expertise of other clinical specialties to improve the health of patients. This fact, coupled with the large sample sizes needed to provide a definitive answer to some screening questions, can make this type of research seem daunting. However, there are now numerous examples where simple observational studies [12, 13] and compact randomized trials [25, 45] have been used to elucidate the links between diagnostic imaging and the ultimate goal of better health for patients. It seems inevitable that the frequency and importance of these cost and outcomes studies will continue to increase in the future.

Acknowledgement
The author thanks Jeffrey G. Jarvik, MD, MPH, for his useful comments on an earlier version of this manuscript.

References
1. Shapiro S, Strax P, Venet L. Evaluation of periodic breast cancer screening with mammography: methodology and early observations. JAMA 1966; 195:731–738
2. Taylor WF, Fontana RS, Uhlenhopp MA, Davis CS. Some results of screening for early lung cancer. Cancer 1981; 47:1114–1120
3. Organisation for Economic Co-Operation and Development. Health at a glance: OECD indicators 2003. Paris, France: OECD, 2003
4. Fuchs VR. Economics, values, and health care reform. The American Economic Review 1996; 86:1–24
5. Lindenthal JJ, Lako CJ, van der Waal MA, Tymstra T, Andela M, Schneider M. Quality and cost of healthcare: a cross-national comparison of American and Dutch attitudes. Am J Manag Care 1999; 5:173–181
6. Budoff MJ, Achenbach S, Duerinckx A. Clinical utility of computed tomography and magnetic resonance techniques for noninvasive coronary angiography. J Am Coll Cardiol 2003; 42:1867–1878
7. U-King-Im JM, Hollingworth W, Trivedi RA, et al. Contrast-enhanced MR angiography vs intra-arterial digital subtraction angiography for carotid imaging: activity-based cost analysis. Eur Radiol 2004; 14:730–735
8. Stevens A, Milne R, Burls A. Health technology assessment: history and demand. J Public Health Med 2003; 25:98–101
9. Fineberg HV, Bauman R, Sosman M. Computerized cranial tomography: effect on diagnostic and therapeutic plans. JAMA 1977; 238:224–227
10. Jarvik JG. The research framework. AJR 2001; 176:873–878
11. Fryback DG, Thornbury JR. The efficacy of diagnostic imaging. Med Decis Making 1991; 11:88–94
12. Brown J, Bryan S, Warren R. Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms. BMJ 1996; 312:809–812
13. Blanchard TK, Bearcroft PW, Constant CR, Griffin DR, Dixon AK. Diagnostic and therapeutic impact of MRI and arthrography in the investigation of full-thickness rotator cuff tears. Eur Radiol 1999; 9:638–642
14. Mushlin AI, Mooney C, Grow V, Phelps CE. The value of diagnostic information to patients with suspected multiple sclerosis. Rochester-Toronto MRI Study Group. Arch Neurol 1994; 51:67–72
15. Lucock MP, Morley S, White C, Peake MD. Responses of consecutive patients to reassurance after gastroscopy: results of self administered questionnaire survey. BMJ 1997; 315:572–575
16. Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001; 285:1987–1991
17. Des Jarlais DC, Lyles C, Crepaz N. Improving the reporting quality of nonrandomized evaluations of behavioral and public health interventions: the TREND statement. Am J Public Health 2004; 94:361–366
18. Siegel JE, Weinstein MC, Russell LB, Gold MR. Recommendations for reporting cost-effectiveness analyses. Panel on Cost-Effectiveness in Health and Medicine. JAMA 1996; 276:1339–1341
19. Weinstein MC, O'Brien B, Hornberger J, et al. Principles of good practice for decision analytic modeling in health-care evaluation: report of the ISPOR Task Force on Good Research Practices–Modeling Studies. Value Health 2003; 6:9–17
20. Sunshine JH, Applegate KE. Technology assessment for radiologists. Radiology 2004; 230:309–314
21. Singer ME, Applegate KE. Cost-effectiveness analysis in radiology. Radiology 2001; 219:611–620
22. Blackmore CC, Black WC, Jarvik JG, Langlotz CP. A critical synopsis of the diagnostic and screening radiology outcomes literature. Acad Radiol 1999; 6[suppl 1]:S8–S18
23. Hunink MG, Krestin GP. Study design for concurrent development, assessment, and implementation of new diagnostic imaging technology. Radiology 2002; 222:604–614
24. Black N. Why we need observational studies to evaluate the effectiveness of health care. BMJ 1996; 312:1215–1218
25. Schoellnast H, Tillich M, Deutschmann HA, et al. Abdominal multidetector row computed tomography: reduction of cost and contrast material dose using saline flush. J Comput Assist Tomogr 2003; 27:847–853
26. Severens JL, van der Wilt GJ. Economic evaluation of diagnostic tests: a review of published studies. Int J Technol Assess Health Care 1999; 15:480–496
27. Nisenbaum HL, Birnbaum BA, Myers MM, Grossman RI, Gefter WB, Langlotz CP. The costs of CT procedures in an academic radiology department determined by an activity-based costing (ABC) method. J Comput Assist Tomogr 2000; 24:813–823
28. Saini S, Seltzer SE, Bramson RT, et al. Technical cost of radiologic examinations: analysis across imaging modalities. Radiology 2000; 216:269–272
29. Subramanian S, Spies JB. Uterine artery embolization for leiomyomata: resource use and cost estimation. J Vasc Interv Radiol 2001; 12:571–574
30. Nixon J, Stoykova B, Glanville J, Christie J, Drummond M, Kleijnen J. The U.K. NHS economic evaluation database: economic issues in evaluations of health technology. Int J Technol Assess Health Care 2000; 16:731–742
31. Mueller C, Hodgson JM, Schindler C, Perruchoud AP, Roskamm H, Buettner HJ. Cost-effectiveness of intracoronary ultrasound for percutaneous coronary interventions. Am J Cardiol 2003; 91:143–147
32. Morimoto T, Fukui T. Utilities measured by rating scale, time trade-off, and standard gamble: review and reference for health care professionals. J Epidemiol 2002; 12:160–178
33. Brooks R. EuroQol: the current state of play. Health Policy 1996; 37:53–72
34. Furlong WJ, Feeny DH, Torrance GW, Barr RD. The Health Utilities Index (HUI) system for assessing health-related quality of life in clinical studies. Ann Med 2001; 33:375–384
35. Kirby PL, Brady AR, Thompson SG, Torgerson D, Davies AH. The Vein Graft Surveillance Trial: rationale, design and methods. VGST participants. Eur J Vasc Endovasc Surg 1999; 18:469–474
36. Schwappach DL. Resource allocation, social values and the QALY: a review of the debate and empirical evidence. Health Expect 2002; 5:210–222
37. Ramsey SD, Berry K, Etzioni R, Kaplan RM, Sullivan SD, Wood DE. Cost effectiveness of lung-volume-reduction surgery for patients with severe emphysema. N Engl J Med 2003; 348:2092–2102
38. O'Brien BJ, Briggs AH. Analysis of uncertainty in health care cost-effectiveness studies: an introduction to statistical issues and methods. Stat Methods Med Res 2002; 11:455–468
39. Rankin RN. Magnetic resonance imaging in Canada: dissemination and funding. Can Assoc Radiol J 1999; 50:89–92
40. Plevritis SK. Decision analysis and simulation modeling for evaluating diagnostic tests on the basis of patient outcomes. AJR 2005; 185:581–590
41. Blanchard TK, Bearcroft PW, Dixon AK, et al. Magnetic resonance imaging or arthrography of the shoulder: which do patients prefer? Br J Radiol 1997; 70:786–790
42. Swan JS, Fryback DG, Lawrence WF, Sainfort F, Hagenauer ME, Heisey DM. A time-tradeoff method for cost-effectiveness models applied to radiology. Med Decis Making 2000; 20:79–88
43. Zethraeus N, Johannesson M, Jonsson B, Lothgren M, Tambour M. Advantages of using the net-benefit approach for analysing uncertainty in economic evaluation studies. Pharmacoeconomics 2003; 21:39–48
44. Fenwick E, O'Brien BJ, Briggs A. Cost-effectiveness acceptability curves: facts, fallacies and frequently asked questions. Health Econ 2004; 13:405–415
45. Jarvik JG, Hollingworth W, Martin B, et al. Rapid magnetic resonance imaging vs radiographs for patients with low back pain: a randomized controlled trial. JAMA 2003; 289:2810–2818
46. Gold M, Siegel J, Russell L, Weinstein M. Cost-effectiveness in health and medicine. New York, NY: Oxford University Press, 1996

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005
18. Correlation and Regression, July 2005
19. Survival Analysis, July 2005
20. Multivariate Statistical Methods, August 2005
21. Decision Analysis and Simulation Modeling for Evaluating Diagnostic Tests on the Basis of Patient Outcomes, September 2005

AJR:185, October 2005 839


Research • Fundamentals of Clinical Research for Radiologists

Meta-Analysis of Diagnostic and Screening Test Accuracy Evaluations: Methodologic Primer

Constantine Gatsonis1
Prashni Paliwal1,2

OBJECTIVE. Interest in evidence-based diagnosis is growing rapidly as diagnostic and screening techniques proliferate. In this article we provide an overview of systematic reviews of diagnostic performance and discuss in detail statistical methods for the most common variant of the problem: meta-analysis of studies in which a pair of estimates of sensitivity and specificity is reported. The need to account for possible variations in threshold for test positivity across studies led to the formulation of the summary ROC (SROC) curve method. We discuss graphical and model-based ways to estimate, summarize, and compare SROC curves, and we present an example from a meta-analysis of data on techniques for staging cervical cancer. We also present a brief survey of the methodologic literature for addressing heterogeneity, correlated data, multiple thresholds per study, and systematic reviews of ROC studies. We conclude with a discussion of the significant methodologic challenges that continue to face investigators in this area of diagnostic medicine research.

CONCLUSION. Systematic reviews of diagnostic performance are a rigorous approach to examining and synthesizing evidence in the evaluation of diagnostic and screening tests. The information from such reviews is needed by clinicians, health policy makers, researchers in diagnostic medicine, developers of diagnostic techniques, and the general public. However, despite progress in study quality and reporting and in methodologic development, major challenges confront investigators undertaking these reviews.

Gatsonis C, Paliwal P
AJR 2006; 187:271–281

Series editors: Craig A. Beam, C. Craig Blackmore, Steven Karlik, and Caroline Reinhold.

This is the 23rd and final article in the series designed by the American College of Radiology (ACR), the Canadian Association of Radiologists, and the American Journal of Roentgenology. The series has been designed to progressively educate radiologists in the methodologies of rigorous clinical research, from the most basic principles to a level of considerable sophistication. The articles are intended to complement interactive software that permits the user to work with what he or she has learned, which is available on the ACR Web site (www.acr.org).

Project coordinator: Bruce J. Hillman, Chair, ACR Commission on Research and Technology Assessment.

Staff coordinator: Jonathan H. Sunshine, Senior Director for Research, ACR.

Keywords: diagnostic accuracy, evidence, meta-analysis, statistical methods, summary ROC curve, systematic reviews

DOI:10.2214/AJR.06.0226

Received February 13, 2006; accepted after revision February 20, 2006.

1Center for Statistical Sciences, Brown University, Box G-H, Providence, RI 02912. Address correspondence to C. Gatsonis (gatsonis@stat.brown.edu).

2Present address: Department of Psychiatry, Yale University, New Haven, CT 06511.

0361–803X/06/1872–271

© American Roentgen Ray Society

The need for systematic reviews of diagnostic and screening tests has grown markedly in recent years as technologic advances have brought forth a vast array of such techniques. Patients, physicians, and policy makers all need information on the reliability and performance of tests and the interpretation of results. In addition, the increased availability of a plethora of diagnostic and screening techniques has meant increased use of tests and a dramatic increase in health care costs.

As evidence-based medicine expands from therapy to diagnosis, the role of systematic reviews acquires added importance [1]. The information from systematic reviews of diagnostic and screening tests is necessary for the following purposes: determination of the proper and efficacious use of diagnostic and screening tests in the clinical setting; decision making about health care policy and financing; evaluation of the performance and status of a diagnostic technique to determine areas for further research, development, and evaluation; and evaluation of the quality and scope of available primary studies of diagnostic and screening techniques and thus development of information necessary for determining directions of future research in diagnostic medicine.

A taxonomy of the important aspects of evaluation of diagnostic and screening tests would distinguish three broad areas of end points: the diagnostic performance of the test, assessed with measures of test accuracy and predictive value; the impact of the test on the process of care, assessed by metrics of the effect of the test on subsequent diagnostic and therapeutic decision making; and the impact of the test on patient-level outcomes, including mortality, morbidity, satisfaction and health-related quality of life, health care utilization, and cost [2–4].

It is also possible, although not formally practiced, to distinguish developmental levels for a technique, following the trajectory from early development to broad dissemination. For example, a four-stage categorization would include stage 1 (discovery), in which the technical parameters and diagnostic crite-

AJR:187, August 2006 271



ria of a technique are established; stage 2 (introduction), in which diagnostic performance is assessed and fine-tuning of the technology is performed in single-institution studies; stage 3 (maturity), in which the technique is evaluated in comparative, multicenter, prospective clinical studies (efficacy); and stage 4 (dissemination), in which the technique is evaluated as used by the community at large (effectiveness) [3].

Appropriate end points can be selected for each developmental level of a technique. In general, however, evaluation of diagnostic performance is a relevant end point for studies at any stage. The most commonly used metric of diagnostic performance, and the one discussed in detail in this primer, is the pair of estimated sensitivity and specificity values for a test. Others include receiver operating characteristic (ROC)–based measures and measures of the predictive value of a test.

This primer focuses exclusively on systematic reviews of the diagnostic performance of tests. We provide a brief description of the main steps in conducting systematic reviews, from formulating the research question through primary study retrieval and data collection to data analysis and interpretation of results. We also discuss statistical methods for deriving summaries of diagnostic performance data and give an example of an application to meta-analysis of the diagnostic accuracy of tests in the detection of lymph node involvement in women with cervical cancer. The article considers methods for meta-analysis of studies in which a single pair of sensitivity and specificity estimates is reported. Extensions of the basic method are described, and a brief guide to the methodologic literature is provided. We summarize our recommendations and discuss methodologic and subject-matter challenges in the last section.

Overview of Systematic Reviews of Diagnostic Accuracy
The conduct of a systematic review of diagnostic test accuracy proceeds through the following major steps [5, 6]:
1. Definition of the objectives of the review.
2. Literature search and retrieval of studies.
3. Assessment of study quality and applicability to the clinical problem at hand.
4. Extraction of data.
5. Statistical analysis.
6. Interpretation of results and development of recommendations.
Each of the six steps in the process involves its own challenges and can be further refined with more detailed flowcharts [7]. We provide a brief description of the tasks involved in each step.

Definition of the Objectives of the Review
A systematic review of diagnostic accuracy begins with defining the clinical context and developing a precise description of the diagnostic question for which test accuracy is to be assessed. This part of the process is similar to the development of the protocol for a primary study. It includes specification of the clinical question giving rise to the potential use of the test or tests under investigation, the technical characteristics of the tests, the conditions under which the tests are interpreted, and the reference information used in the assessment of test accuracy [8]. Because systematic reviews of diagnostic accuracy are called on to inform the use of diagnostic tests in clinical care, comparisons of alternative tests are most valuable.

Literature Search and Retrieval of Studies
Although an extensive literature on search strategies is available for studies of therapy, the corresponding body of literature on diagnostic test evaluation is relatively small. Deville et al. [9] and Bachmann et al. [10] discuss strategies relating to diagnostic and screening tests.

The search for appropriate studies must be comprehensive, objective, and reproducible, and the searcher must consider all available evidence. The search should not simply be for documents in English and should cover publications beyond journals, such as conference proceedings and other reports. Hand searching through publications, reference checking, and searching for unpublished reports often is necessary, especially to assess the extent of publication bias. Finally, it is important to document the process and the outcome of each search.

Assessment of Study Quality and Applicability
The scope of assessment of study quality is broad and not generally well defined. In the context of studies of diagnostic performance, assessment of quality has to consider the important features of the design and execution of the study, including factors such as definition of the research question and clinical context, specification of appropriate patient population, description of the diagnostic techniques under study and their interpretation, detailed accounting of how the reference standard information was defined and obtained, and any other factors that can affect the integrity of the study and the generalizability of the results.

Methods of quality assessment may focus on the absence or presence of key qualities in the study report (checklist approach), use scores developed for this purpose (scale approach), or use the levels-of-evidence methods by which a level or grade is assigned to studies fulfilling a predefined set of criteria. The literature on assessment of the quality of therapy studies is extensive, at least in comparison with the literature on diagnostic test evaluations [11, 12]. Two developments in the diagnostic area are the Standards for Reporting of Diagnostic Accuracy (STARD) checklist for reporting of studies of diagnostic accuracy [4, 13, 14] and the quality assessment tool for diagnostic accuracy (QUADAS) for assessing the quality of studies of diagnostic accuracy [12, 15]. The former may be beneficial in improving the quality of published reports and, indirectly, in improving the quality of primary studies. The latter is a rigorously constructed tool that can be used by investigators undertaking new systematic reviews.

Incorporation of quality assessment results into meta-analysis is a matter of debate. A simple and perhaps draconian approach is to exclude studies of poor quality. A less drastic alternative is to use quality scores as weights in the statistical analysis. However, the exact definition of the weights is often a matter of disagreement, and the statistical rationale for their use is shaky. Another alternative, which we recommend to investigators, is to conduct sensitivity analysis. The goal of sensitivity analysis is to assess the contribution of poor-quality studies to the results of the full meta-analysis. The assessment is made by comparing the results of the full statistical analysis with the results obtained when specific studies are included or excluded. Sensitivity analysis also can be used to assess the effect on diagnostic accuracy of a study characteristic or a combination of study characteristics.

Extraction of Data
In studies of imaging techniques, test results are most commonly reported as binary (yes or no) or ordinal categoric. An example of the latter often used in ROC studies is a five-category scale for degree of suspicion about the presence of a target condition. The categories are commonly described as follows: 1 = definitely normal, 2 = probably normal, 3 = equivocal,

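The sensitivity analysis recommended above can be sketched in a few lines of code. The sketch below pools the log diagnostic odds ratio across studies with and without those judged to be of low quality; all counts and quality labels are invented for illustration, and the unweighted mean of log(DOR) is used only as a crude summary (the SROC methods discussed later in this primer are preferable for formal inference).

```python
import math

# Hypothetical per-study data: (TP, FP, FN, TN, quality), where quality is a
# reviewer judgment (e.g., from a QUADAS-style assessment). Numbers invented.
studies = [
    (45,  5,  5, 45, "high"),
    (38, 10, 12, 40, "high"),
    (20,  2, 15, 13, "low"),   # small, poorly reported study
    (50,  8, 10, 52, "high"),
    (18, 12,  4,  6, "low"),
]

def log_dor(tp, fp, fn, tn):
    """Log diagnostic odds ratio with the usual 0.5 continuity correction."""
    tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    return math.log((tp * tn) / (fp * fn))

def mean_log_dor(rows):
    vals = [log_dor(tp, fp, fn, tn) for tp, fp, fn, tn, _ in rows]
    return sum(vals) / len(vals)

all_studies = mean_log_dor(studies)
high_only = mean_log_dor([s for s in studies if s[4] == "high"])

print(f"mean log(DOR), all studies:       {all_studies:.2f}")
print(f"mean log(DOR), high quality only: {high_only:.2f}")
print(f"shift attributable to low-quality studies: {all_studies - high_only:+.2f}")
```

Comparing the two summaries shows how much the low-quality studies pull the pooled result, which is exactly the comparison the text describes.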



4 = probably abnormal, and 5 = definitely abnormal. In recent years, degree-of-suspicion assessments also have been made on nearly continuous scales, for example, scales from 1 to 100. Continuous test results are typically reported in the evaluation of laboratory tests, such as the concentration of a substance.

A binary test result is typically obtained by dichotomizing a test outcome measured on a continuous scale. The continuous scale can be observed directly, as is the case with many laboratory tests. As an alternative, the scale can be a latent, unobservable one, as is the case with the observer's degree of suspicion in ROC studies. In either case, the binary test result is obtained by application of a threshold for test positivity. The presence of such a threshold is a fundamental theme in the evaluation of diagnostic and screening tests.

In this primer, as in most published work on diagnostic and screening test evaluation, disease status is assumed to be binary. Thus, for a particular threshold of test positivity, the study results can be presented in the familiar two-by-two table showing cross classification of disease status and test outcome (Table 1).

TABLE 1: Two-by-Two Table of Binary Test Results Versus Disease Status

                        True Status (D)
Test Result (T)    Nondiseased   Diseased   Total
Negative                a            c       a + c
Positive                b            d       b + d
Total                 a + b        c + d       N

Although it may seem reasonable to expect that obtaining an appropriate two-by-two table from a published study should be rather straightforward, practical experience suggests that this is not always the case. Investigators need to consider carefully the data report and may also need to contact the authors of the report to obtain the necessary information.

Measures of test performance are defined either conditionally on disease status (sensitivity, specificity) or conditionally on test result (predictive value). Commonly used metrics include test sensitivity = P(T+|D+); specificity = P(T–|D–); positive predictive value = P(D+|T+); and negative predictive value = P(D–|T–), where P(…) is the probability of the event in parentheses, T is the test result, and D is the true disease status. In addition, studies may report other metrics, such as the diagnostic odds ratio (OR): (sens × spec) / [(1 – sens)(1 – spec)]; the positive likelihood ratio: LR+ = P(T+|D+) / P(T+|D–) = sens / (1 – spec); and the negative likelihood ratio: LR– = P(T–|D+) / P(T–|D–) = (1 – sens) / spec. See also the recent article in the AJR by Weinstein et al. [16].

This primer is concerned mainly with meta-analysis of studies reporting estimates of pairs of sensitivity and specificity. The methods discussed in the next section assume the availability of a single two-by-two table from each study. However, the results of some studies are reported with more than one threshold of test positivity and even more than one definition of disease status. It is important for investigators to record all the information on alternative thresholds reported in retrieved studies and to determine which of the thresholds of test positivity is the most relevant for the purposes of the systematic review. The methods for combining data when several thresholds are used in each study are beyond the scope of this primer but are discussed briefly later in the Other Methods section.

Statistical Analysis
Because binary test outcomes are defined on the basis of an explicit or implicit threshold for test positivity, it follows that measures of binary test performance depend on the particular threshold used to generate the binary test outcomes. This dependence is a fundamental aspect of diagnostic test evaluation. In the case of test sensitivity and specificity, dependence on the threshold induces a tradeoff between the two quantities as the threshold for positivity is moved across all possible values. The curve of all pairs of sensitivity and specificity values achieved by moving the threshold across its possible range is the ROC curve [17, 18].

Comparison of tests on the basis of ROC curves takes into consideration the actual curves and is aided by summary measures that have been proposed in the literature. The area under the curve (AUC) is the most commonly used summary and can be interpreted as average sensitivity for the test, taken over all specificity values. Strictly speaking, the AUC is equal to the probability that if a pair of diseased and nondiseased subjects is selected at random, the diseased subject will be ranked correctly by the test. Other summaries of the ROC curve include partial areas under the curve, values of sensitivity corresponding to selected values of specificity (and vice versa), and optimal operating points, defined according to specific criteria. ROC analysis and other statistical methods for diagnostic test evaluation are described in textbooks by Zhou et al. [19] and Pepe [20] and in chapters by Toledano et al. [21] and Toledano [22].

A digression to ROC analysis is necessary to highlight the role of the positivity threshold and its consequences. A direct implication of this issue in meta-analysis of sensitivity and specificity estimates is that the method has to account for the possibility of different thresholds across studies. The use of simple or weighted averages of sensitivity and specificity to draw statistical conclusions is not methodologically defensible. A simple example to illustrate this point is a meta-analysis of three studies with the sensitivity and specificity estimates described in Figure 1. The estimated sensitivity and specificity pairs are (0.1, 0.9), (0.8, 0.8), and (0.9, 0.1). The average pair is (0.6, 0.6). Clearly, the (0.6, 0.6) pair does not represent these data in any useful way; thus, a simple averaging of sensitivity and specificity is not an adequate approach.

"Average" values of sensitivity and specificity sometimes are used as descriptive summaries of the observed data. Typically, this approach would be the case when the observed variability in one or both of the two quantities is small.

Interpretation of Results
Interpretation of the findings from a meta-analysis of diagnostic performance must address the relevance of the results to the four general aims stated earlier. That is, this section of the report should highlight the specific ways in which the data provide information about the proper use of the particular test, preferably in comparison with alternative techniques; discuss how the findings can be used to make decisions about health care policy and financing; summarize the quality of the available studies, pointing to areas in which more research is needed; and provide information about possible areas of improvement in the performance of the techniques under review.

Statistical Methods for Meta-Analysis of Sensitivity and Specificity Data
Summary ROC (SROC) Curve for a Single Test
Our focus is on meta-analyses in which each study contributes a two-by-two table of data, on the basis of which a pair of estimates of sensitivity and specificity can be obtained. To introduce statistical notation, the ith study (i = 1, …, I) contributes data in the format shown in Table 2. With the notation of Table 2, the estimates of sensitivity and 1 – specificity from the ith study are TPRi = di / ni1 and FPRi = bi / ni0, where

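The definitions above translate directly into code. The sketch below computes the common accuracy metrics from a single two-by-two table laid out as in Table 1; the counts are invented for illustration.

```python
# Hypothetical counts laid out as in Table 1: rows are test result,
# columns are true status (a, b = nondiseased; c, d = diseased).
a, c = 80, 10   # test negative: a nondiseased, c diseased
b, d = 20, 90   # test positive: b nondiseased, d diseased

sens = d / (c + d)              # P(T+ | D+)
spec = a / (a + b)              # P(T- | D-)
ppv  = d / (b + d)              # P(D+ | T+)
npv  = a / (a + c)              # P(D- | T-)
lr_pos = sens / (1 - spec)      # positive likelihood ratio
lr_neg = (1 - sens) / spec      # negative likelihood ratio
dor = (sens * spec) / ((1 - sens) * (1 - spec))  # diagnostic odds ratio

print(f"sensitivity={sens:.2f} specificity={spec:.2f} "
      f"LR+={lr_pos:.2f} LR-={lr_neg:.3f} DOR={dor:.1f}")
```

A useful check is that the diagnostic odds ratio equals (a × d) / (b × c) computed directly from the raw counts.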



TPR is the true-positive rate and FPR is the false-positive rate.

Fig. 1—Graph shows that averaging sensitivities and specificities can be misleading. TPR = true-positive rate, FPR = false-positive rate.

The display of paired estimates of sensitivity and specificity in ROC coordinates (FPR, TPR) is a key step in the process of statistical analysis. Such plots ideally should include error bars for each of the two estimates. However, the bars often make the plot rather busy. An additional plot to consider is a forest plot, which shows the sensitivity and specificity estimates of each study side by side and may also include the numerators and denominators used to construct the estimates (Fig. 2).

Simple derivation of an SROC curve—An easy way to construct a graphical summary of (FPR, TPR) estimates was introduced by Moses and colleagues in 1993 [23]. In this approach, the original data are first transformed into new variables S and D, defined as follows for the ith study:

Si = logit(TPRi) + logit(FPRi)
Di = logit(TPRi) – logit(FPRi),

where logit(a) = log [a / (1 – a)]. The next step is to fit a linear regression model of the form

Di = a + bSi + error.

The fitted model provides a value of D for each value of S. In the final step, the D and S pairs are transformed back into ROC coordinates to obtain an SROC curve.

TABLE 2: Format for ith Study

                        True Status (D)
Test Result (T)    Nondiseased   Diseased   Total
Negative               ai            ci
Positive               bi            di
Total                 ni0           ni1        ni

The transformed variable D is the log of the diagnostic odds ratio estimated from each individual primary study in the meta-analysis. The variable S has a less straightforward interpretation. A little algebra shows that S increases when the probability of a positive test result increases in both the diseased and nondiseased populations. Hence, S can be interpreted as a proxy for the test positivity threshold operating in the particular study. This way of constructing an SROC curve is roughly based on an implicit assumption that the variation in the diagnostic odds ratio across studies is a function of the threshold for test positivity.

The foregoing model can be easily extended to incorporate covariates measuring study characteristics or group characteristics of the participants in the individual primary studies. The linear model would then have the following form:

Di = a + b0Si + b1X1 + b2X2 + … + bkXk + error,

where the X variables can be suitably defined to represent characteristics of the study design, the test technology, and study participant characteristics as used in subgroup analyses. A model with appropriately defined indicator variables X can also be used to compare tests.

SROC summaries—In analogy with the usual ROC curve, a natural summary of the SROC is the AUC. However, the choice of the exact limits for defining the area is a matter of some debate. In particular, some authors prefer to compute the area only over the range of the observed FPR values to avoid the inherent uncertainties of extrapolating beyond the range of the observed data. Other authors support the use of a partial area over a range of FPR values of interest in the context of the particular test. In this primer we report the full AUC estimates because of their simplicity, intuitive interpretation, and avoidance of arbitrary choices of limits of FPR values.

Another global summary of the SROC curve is the so-called Q* ("Q-star") statistic, which measures the value of TPR at the point where the curve intersects the x + y = 1 diagonal line. This is the point on the curve where sensitivity equals specificity. For a symmetric curve, this value is also the point at which the curve is closest to the ideal point (FPR = 0, TPR = 1).

In addition to the global summary measures, the SROC curve can be used to estimate TPR for each fixed value of FPR and, conversely, FPR for each fixed value of TPR. Standard errors of the estimates can be obtained using the delta method. We include such estimates in the analysis of the cervical cancer data (Fig. 5).

SROC summaries can be used to compare the performance of alternative diagnostic and screening tests for a particular diagnostic question. These comparisons are relatively straightforward when statistical independence can be assumed to hold, as when SROC curves of alternative tests are derived from separate sets of studies or from overlapping sets of studies in which test results were not correlated. However, the situation is technically more complex when test results within a study are correlated, as is the case when a paired design has been used to compare tests. We discuss this issue later, in Other Methods.

SROC properties and limitations—The shape of the SROC curve derived from the foregoing linear regression model depends on the values of the linear model parameters a and b [24]. The special case of b = 0 corresponds to the situation in which the true diag-

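The transformation, fit, and back-transformation described above can be sketched end to end. In the sketch below the (TPR, FPR) pairs are generated on a known symmetric SROC (constant log diagnostic odds ratio of 2) so the recovered intercept and slope can be checked; a real meta-analytic data set would simply replace them. The back-transformed curve uses logit(TPR) = a/(1 − b) + [(1 + b)/(1 − b)]logit(FPR), which follows algebraically from the definitions of S and D; the full AUC (trapezoidal rule) and Q* are computed as described in the text.

```python
import math

logit = lambda p: math.log(p / (1 - p))
expit = lambda x: 1 / (1 + math.exp(-x))

# Hypothetical (TPR, FPR) pairs from a set of primary studies, generated here
# on a known symmetric SROC with log-DOR = 2 so the fit can be verified.
fprs = [0.05, 0.10, 0.20, 0.35, 0.50]
tprs = [expit(2 + logit(f)) for f in fprs]

# Moses transformation: S is a proxy for threshold, D is the log-DOR.
S = [logit(t) + logit(f) for t, f in zip(tprs, fprs)]
D = [logit(t) - logit(f) for t, f in zip(tprs, fprs)]

# Ordinary least-squares fit of D = a + b*S.
n = len(S)
s_bar, d_bar = sum(S) / n, sum(D) / n
b = sum((s - s_bar) * (d - d_bar) for s, d in zip(S, D)) / \
    sum((s - s_bar) ** 2 for s in S)
a = d_bar - b * s_bar

def sroc(fpr):
    """Back-transformed SROC curve implied by the fitted line D = a + b*S."""
    return expit(a / (1 - b) + (1 + b) / (1 - b) * logit(fpr))

# Full AUC by the trapezoidal rule over a fine FPR grid.
grid = [i / 1000 for i in range(1, 1000)]
auc = sum((sroc(x2) + sroc(x1)) / 2 * (x2 - x1)
          for x1, x2 in zip(grid, grid[1:]))

# Q*: TPR at the point where the curve crosses sensitivity = specificity,
# found by bisection on sroc(f) - (1 - f).
lo, hi = 1e-6, 1 - 1e-6
for _ in range(60):
    mid = (lo + hi) / 2
    if sroc(mid) < 1 - mid:   # still below the sens = spec line
        lo = mid
    else:
        hi = mid
q_star = sroc((lo + hi) / 2)

print(f"a={a:.3f} b={b:.6f} AUC={auc:.3f} Q*={q_star:.3f}")
```

Because the input pairs were generated with a constant log diagnostic odds ratio, the fit recovers b near 0 (a symmetric curve) and a near the generating value of 2.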



nostic odds ratio is assumed to be constant across all studies. In this case, the SROC curve is symmetric along the x + y = 1 diagonal line. If b ≠ 0, the curve is not symmetric. Indeed, it turns out that when | b | > 1, the SROC curve derived from the linear regression model has a counterintuitive property: according to the curve, the sensitivity of the test decreases as the FPR increases. Estimated values of b greater than 1 or less than –1 indicate that the simple linear regression model is not adequate for constructing an SROC curve.

Fig. 2—Forest plot of CT sensitivity and specificity estimates and their confidence intervals. LAG = lymphangiography.

SROC curve computations based on the linear regression model are a simple and useful method for developing such curves. There are, however, potentially important technical difficulties to overcome if the results of this approach are used to draw formal statistical inferences. First, the presence of sampling error in the variable S on the right-hand side of the linear model may affect the magnitude of the estimates of b and its SE. The sampling error may increase the uncertainty in the estimate of b, leading erroneously to the conclusion that the SROC is symmetric. Second, the linear model uses summaries from the two-by-two tables of the individual studies and ignores the statistical precision of these summaries. Unfortunately, the precision of TPR and FPR estimates is somewhat complex because it depends not only on overall sample size but also on the sample sizes for diseased and nondiseased subjects in the study. Hence, simple weighting by sample size is not sufficient. In addition, the left-hand-side and right-hand-side variables in the linear model have their own estimates of statistical precision, making it difficult to decide on a single weight for the particular study. Third, the linear model does not account for the presence of correlations in the data, such as those resulting from the use of paired designs within individual primary studies.

Fig. 3—Observed true-positive rates (TPR) and false-positive rates (FPR) for three imaging techniques. LAG = lymphangiography.

Binary regression for SROC analysis—Because of the methodologic difficulties described, it is prudent for investigators to consider the use of alternative approaches to estimating SROC parameters for purposes of formal statistical inference. An early such approach predated the linear regression method and used the bivariate normal distribution of the estimates of sensitivity and specificity from each study, with a linear relation between the true values of sensitivity and specificity to account for the effect of threshold [25].

A streamlined alternative to the linear regression model is to use a variant of logistic regression, which models directly the data in each two-by-two table [26–28]. If Y is the binary test

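The counterintuitive behavior for | b | > 1 is easy to verify numerically. In the back-transformed form of the Moses line, logit(TPR) = a/(1 − b) + [(1 + b)/(1 − b)]logit(FPR), the coefficient of logit(FPR) is negative whenever | b | > 1, so the "curve" slopes downward. The parameter values below are arbitrary and chosen only to show the contrast.

```python
import math

logit = lambda p: math.log(p / (1 - p))
expit = lambda x: 1 / (1 + math.exp(-x))

def sroc(fpr, a, b):
    # SROC implied by the fitted line D = a + b*S, rewritten as
    # logit(TPR) = a/(1-b) + (1+b)/(1-b) * logit(FPR).
    return expit(a / (1 - b) + (1 + b) / (1 - b) * logit(fpr))

grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

well_behaved = [sroc(f, a=2.0, b=0.5) for f in grid]   # |b| < 1: TPR rises with FPR
pathological = [sroc(f, a=2.0, b=1.5) for f in grid]   # |b| > 1: TPR falls as FPR rises

print("b = 0.5:", [round(t, 3) for t in well_behaved])
print("b = 1.5:", [round(t, 3) for t in pathological])
```

With b = 0.5 the curve is increasing, as an ROC-type curve must be; with b = 1.5 the implied slope of logit(TPR) in logit(FPR) is (1 + 1.5)/(1 − 1.5) = −5, and the computed values decrease monotonically, confirming that such fits should not be used as SROC curves.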



result (yes = 1, no = 0) and D the binary disease status for an individual patient in a given study, the form of the model is as follows:

logit P[Y = 1] = (θ – αD) exp(–βD).

The binary regression model is intuitively based on the usual conceptualization of the binary test outcome resulting from a positivity threshold (denoted here by θ). In other words, the binary test outcome is obtained by dichotomizing a continuous variable that has different distributions for diseased and nondiseased subjects. The parameter α measures the distance between the centers of the diseased and nondiseased populations, and the parameter β measures (on the log scale) the ratio of the SDs in the two populations. The mathematic details of the model and its relation to the linear model approach are sketched in Appendix 1.

The use of binary regression allows investigators to avoid key difficulties associated with the linear model approach, notably the errors-in-variables problem and the need to account for differences in sample size across studies. As shown in Appendix 1, it is possible to translate the findings of binary regression analysis into the linear model parametrization. However, the SROC curves obtained from binary regression analysis always lead to values of the slope between –1 and 1 and hence avoid the counterintuitive properties of curves with | b | > 1 obtained from the linear model. Binary regression models can be fitted with standard software, such as Proc NLMixed in SAS [29]. The SAS code for fitting a binary regression model using Proc NLMixed is in Appendix 2.

… metastasis in patients with cervical cancer [30]. This systematic review was conducted to compare the performance of three imaging techniques: lymphangiography (LAG), CT, and MRI. The published report describes how the problem was formulated, how the relevant studies were identified and reviewed, and how the diagnostic performance data were extracted. Briefly, studies were located with a MEDLINE literature search combined with hand searching of bibliographies from retrieved articles. Included studies had histologic confirmation of cervical cancer, uniformly appropriate reference standard information, and evidence of blinding in study design. In addition, included studies had a minimum sample size of 20 patients, reported criteria for test positivity, and presented sufficient data to complete the necessary two-by-two table.

In our example we included data from 42 studies, 13 of which evaluated LAG, 19 evaluated CT, and 10 evaluated MRI. Nine studies evaluated more than one test, but this feature of the data is ignored for the purposes of this analysis. The pairs of observed values of sensitivity and specificity are presented in ROC coordinates in Figure 3. We are not using exactly the same set of studies presented in the published paper, and hence the results of this example may differ from those in the article, particularly in the case of the LAG evaluation.

SROC curves were derived separately for each test by both the binary and the linear regression methods. The results of the binary regression fit are presented in detail and are followed by summary tables from the linear regression fit. The latter are included for com-

The scale parameter is not statistically different from zero for all three techniques. Instead of assuming it is zero and plotting the SROC curves as symmetric, we used the estimated value of β to derive the plots in Figures 3 and 4. The SROC curves are superimposed on the observed data in Figure 4. The SROC curves with superimposed 95% confidence intervals for TPR and FPR at three points are shown in Figure 5.

The SROC curve for LAG stays consistently below the curves of the other two techniques. The MRI curve dominates the CT curve, and its summary AUC and Q* estimates dominate those of the other two techniques. However, only one of the paired comparisons of the AUC estimates (LAG vs MRI) is statistically significant. A comparison of the confidence intervals for TPR and FPR also shows overlap at each of the three points chosen in Figure 5. We conclude that although there is a trend for MRI to have better performance than CT and LAG, only the AUC of MRI is statistically different from that of LAG.

For comparative purposes, we present the numeric results from the linear regression fit of the SROC curve (Table 4). The actual curves and summary estimates of AUC and Q* are close but not identical to those derived from the binary regression analysis. For a more detailed view of the comparison, we converted the SROC equation from the binary regression to the form that would be obtained from a linear regression fit, using the formulas in Appendix 1. Table 5 shows the results for CT.

Other Methods
parison purposes. For each test, the binary re- The SROC method is limited in two impor-
Example: Meta-Analysis of Cervical Cancer gression model assumed common location tant respects. First, the statistical framework
Staging Data (α) and scale (β) parameters across the stud- does not consider the presence of random
To illustrate the SROC method we use ies but a separate threshold value for each variation between studies. This fixed-effects
data from a meta-analysis of diagnostic im- study. Table 3 summarizes the results from framework implicitly assumes that the uni-
aging tests in the detection of lymph node the binary regression fit. verse of all studies to which inferences apply
is only the specific studies used in the meta-
analysis and that in addition to sampling vari-
ation within studies, the only other possible
variation can be explained by study-level co-
variates. As a result of its assumptions, a
TABLE 3: Estimates of Summary Receiver Operating Characteristic Curve fixed-effects approach to meta-analysis is
Parameters, Area Under the Curve (AUC), and Q* Statistic for Each generally expected to provide artificially
Technique (Binary Regression Model) more precise results than an approach that
Technique α (Location) SE (α) β (Scale) SE (β) AUC SE (AUC) Q* provides a fuller account of variability in the
LAG –1.965 0.365 –0.500 0.639 0.719 0.054 0.677 data [31]. The second important limitation of
CT –3.380 0.737 –0.591 0.399 0.839 0.025 0.769
the specific fixed-effects approach is that it
ignores correlations in the data within studies.
MRI –3.349 0.501 0.1 0.376 0.933 0.021 0.862
In this section, we briefly discuss statistical
Note—LAG = lymphangiography. methods based on hierarchical models de-

276 AJR:187, August 2006

Meta-Analysis of Diagnostic and Screening Test Accuracy

Fig. 4—Estimated SROC curves and original data points for three imaging techniques. TPR = true-positive rate, FPR = false-positive rate, LAG = lymphangiography.

Fig. 5—Summary Receiver Operating Characteristic curves with confidence intervals for selected (TPR, FPR) points. TPR = true-positive rate, FPR = false-positive rate, LAG = lymphangiography.
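Curves like those in Figures 4 and 5 are traced directly from the fitted (α, β) pair. The sketch below, in Python rather than the authors' SAS, uses the Appendix 1 relation logit(TPR) = c0 + c1 logit(FPR) with c0 = –α exp(–β/2) and c1 = exp(–β); the trapezoidal AUC and the bisection search for Q* (the point on the curve where sensitivity equals specificity) are our own illustrative choices, so the numbers will not exactly reproduce Tables 3–5.

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def sroc_tpr(fpr, alpha, beta):
    """TPR implied at a given FPR by the binary regression parameters,
    via logit(TPR) = c0 + c1*logit(FPR) (see Appendix 1)."""
    c0 = -alpha * math.exp(-beta / 2.0)
    c1 = math.exp(-beta)
    return expit(c0 + c1 * logit(fpr))

def sroc_auc(alpha, beta, n=2000):
    """Trapezoidal area under the SROC curve, anchoring the ends at (0,0) and (1,1)."""
    xs = [i / n for i in range(1, n)]
    ys = [sroc_tpr(x, alpha, beta) for x in xs]
    xs = [0.0] + xs + [1.0]
    ys = [0.0] + ys + [1.0]
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))

def sroc_qstar(alpha, beta):
    """Q*: the TPR value where the curve crosses TPR = 1 - FPR, by bisection."""
    lo, hi = 1e-9, 1.0 - 1e-9
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sroc_tpr(mid, alpha, beta) > 1.0 - mid:
            hi = mid
        else:
            lo = mid
    return 1.0 - (lo + hi) / 2.0

# MRI parameters from Table 3 (alpha = -3.349, beta = 0.1)
print(round(sroc_auc(-3.349, 0.1), 3), round(sroc_qstar(-3.349, 0.1), 3))
```

The same three functions, applied to each row of Table 3, would trace all three curves of Figure 4.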

Hierarchical Summary ROC Analysis

The binary regression model is the building block for a hierarchical model describing the full range of variation in the data. In particular, the hierarchical model differentiates within-study from between-studies variability and systematic from random variability. For example, a model for the cervical cancer data accounts for two levels of variability. In level 1, within-study variation is modeled by binary regression. In level 2, between-studies variation is modeled by distributions of the threshold and location parameters. The mean of the distribution of the parameters may depend on study-level covariates (e.g., test type).

A hierarchical model can be fitted with fully bayesian methods [27] or likelihood-based approximations as implemented in the Proc NLMixed procedure of SAS [28]. A Hierarchical SROC (HSROC) curve can be derived by use of the population means of the parameters. In addition to providing a full account of the variability in the data, the hierarchical model accounts implicitly for correlations within studies. If information exists for such correlations, it can be included explicitly by suitable extensions of the model. In particular, such formulations are useful for modeling data from studies conducted with paired designs.

An alternative way to build hierarchical models for diagnostic accuracy data is to consider a variant of the Kardaun approach and use the bivariate asymptotic normal distribution of the estimates of sensitivity and specificity from each study [32]. Although this approach has been used to derive "average" estimates of sensitivity and specificity, a practice criticized earlier in this article, it is easy to modify the model to derive SROC curves.

Meta-Analysis with Multiple Thresholds from Individual Studies

In the SROC methods discussed earlier, it is assumed that a single two-by-two table is obtained from each study. If multiple thresholds for test positivity are used in the primary studies, ordinal regression methods and their hierarchical formulations can be used to perform the statistical analysis [33].

Meta-Analysis of ROC Data

The choice of suitable statistical methods for combining data from ROC studies depends on the type of data considered. If the full ROC data are available—for example, the complete two-by-five table of disease status by test results when a five-point ordinal categoric scale is used—then ordinal regression methods can be used. It is not necessary for all studies to use the same number of categories in reporting of test results [33, 34].

If the emphasis is on meta-analysis of summaries of the ROC curve, the appropriate methods have to be tailored to the specific summary. For meta-analysis of estimates of the AUC from independent studies, McClish [35] describes weighted average estimators, Zhou [36] describes a generalized estimating equation approach, and Hellmich et al. [37] describe a bayesian method. A hierarchical model for such data can be constructed in a straightforward manner with the asymptotic distribution of the estimate of the AUC for the first level of the model and proceeding as in the HSROC model for the other levels. Because the distributions involved are all normal, the process of fitting and checking such models is fairly routine [31, 38].

Discussion

As interest in evidence-based diagnosis increases, so does the demand for information from systematic reviews of studies of diagnostic accuracy. The information from such reviews is a key ingredient for all subsequent evaluation of diagnostic techniques. Because empiric studies of test outcomes can be prohibitively difficult to conduct in practice, research synthesis and modeling of health outcomes and costs often remain the only viable options. For such undertakings, the information from meta-analysis of diagnostic performance is crucial.
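The weighted-average idea for pooling independent AUC estimates, mentioned above in connection with McClish [35], can be sketched generically. This is a plain fixed-effects inverse-variance average on hypothetical numbers, not McClish's exact estimator:

```python
import math

def pool_auc(aucs, ses):
    """Fixed-effects inverse-variance average of independent AUC estimates.
    Each study is weighted by 1/SE^2; the pooled SE is sqrt(1 / sum of weights)."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    pooled = sum(w * a for w, a in zip(weights, aucs)) / total
    return pooled, math.sqrt(1.0 / total)

# hypothetical AUC estimates and standard errors from three independent studies
pooled, se = pool_auc([0.86, 0.91, 0.88], [0.04, 0.03, 0.05])
print(round(pooled, 3), round(se, 3))
```

As the fixed-effects caveat above warns, the pooled SE here is smaller than any single-study SE; a random-effects or hierarchical analysis would generally yield wider intervals.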

Gatsonis and Paliwal

TABLE 4: Estimates of Summary Receiver Operating Characteristic Curve Parameters, Area Under the Curve (AUC), and Q* Statistic for Each of Three Techniques (Linear Regression Model, Unweighted)

Technique   α       SE (α)   β       SE (β)   AUC     SE (AUC)   Q*
LAG         2.006   0.308    0.299   0.276    0.779   0.031      0.731
CT          2.788   0.360    0.219   0.118    0.861   0.024      0.801
MRI         3.508   0.609    0.255   0.187    0.916   0.028      0.852

Note—LAG = lymphangiography.

TABLE 5: Comparison of Binary and Linear Regression Summary Receiver Operating Characteristic Analyses for CT Data

Analysis              α       SE (α)   β       SE (β)   AUC     SE (AUC)   Q*
Linear (unweighted)   2.788   0.360    0.219   0.118    0.861   0.024      0.8012
Linear (weighted)     2.606   0.329    0.141   0.114    0.854   0.028      0.7863
Binary regression     2.409   0.410    0.288   0.183    0.839   0.025      0.7694

Note—AUC = area under the curve.

Meta-analysis of accuracy evaluations is not as streamlined or easy to perform and summarize as meta-analysis of therapy evaluations. A key difference is the nature of the summary measure. In therapy studies, the summary can be as simple as an overall success rate with appropriately quantified variability and uncertainty. In diagnostic accuracy studies, however, the summary is a curve (or several curves if patient subsets are considered). Comparisons of curves are inherently more complex and nuanced than comparisons of means or proportions. Thus, systematic reviews of diagnostic accuracy present the research community with a challenging set of questions about how best to summarize the information and how to use it in analysis and decision making. For example, the methodology for incorporation of SROC curves in modeling outcomes and costs is not fully developed, and practical experience in this type of analysis is relatively scarce. In most published modeling exercises, the sensitivity and specificity of tests are assumed to be a single pair of numbers.

Two major determinants of the success of systematic reviews of diagnostic accuracy are the availability of relevant studies of adequate quality and the development of a consensus around the methods for such reviews. In recent years, the quality of diagnostic and screening test evaluations has improved, but the hill still seems steep [14, 39–41]. In the same period, the methods for systematic review of diagnostic accuracy have progressed and matured. Evidence of methodologic progress is the growing list of published work and the formation of the Cochrane Diagnostic Reviews initiative [42] late in 2003. The researchers involved in this initiative are at work preparing the methodologic infrastructure for performing diagnostic accuracy reviews and including them in a new division of the Cochrane Library.

Despite progress in study quality and reporting and in methodologic development, major challenges confront investigators venturing into the world of systematic reviews of diagnostic and screening tests. The following is a partial list of challenges:

• The literature contains many small studies, which are usually retrospective and of uncertain quality.

• The detail and accuracy of reporting on study methods and results vary greatly. It is often impossible to determine key study characteristics, such as study cohort, technical aspects of the techniques involved, and definition of gold standard information.

• Even for relatively tightly defined clinical questions, multiple sources of heterogeneity among studies are operating, threshold differences being only one. It is therefore important for the review to explore such sources of variation and to use appropriate statistical techniques.

• An important source of heterogeneity not addressed in this article is heterogeneity due to observer. Empiric data suggest that within-study observer variability can be of the same order of magnitude as variability across studies. Hierarchical modeling can be a powerful framework for incorporating observer variability in the analysis of individual studies [43, 44]. However, detailed data on observer variability are not usually reported, making it necessary for investigators to contact the authors of studies if such an analysis is to be undertaken.

• As is the case with most technology, diagnostic and screening techniques evolve rapidly. In the absence of a consensus on a framework for diagnostic technology assessment, there is risk of increasing the heterogeneity in a systematic review by inclusion of studies that clearly do not reflect the current state of a technique. By contrast, such a framework is in place for the evaluation of therapy. In that context, a systematic review, for example, would not combine estimates of effects reported in phase 1 and 2 studies with those reported in phase 3 studies.

• Particular forms of bias exist within many primary studies [45]. The effect of such within-study bias on systematic reviews has to be considered. Methods for handling bias within the primary studies need to be developed.

In confronting methodologic and practical challenges, investigators conducting systematic reviews of diagnostic accuracy are likely to find colleagues and collaborators. The era of evidence-based diagnosis is here to stay.

Acknowledgments

We thank the editors for inviting us to prepare this review and the referees for their comments and suggestions.

References

1. Knottnerus JA, ed. The evidence base of clinical diagnosis. London, UK: BMJ Books, 2002
2. Thornbury JR. Clinical efficacy of diagnostic imaging: love it or leave it. AJR 1994; 162:1–8
3. Gatsonis C. Design of evaluations of imaging technologies: development of a paradigm. Acad Radiol 2000; 7:681–683
4. Gatsonis C. Do we need a checklist for reporting the results of diagnostic test evaluations? Acad Radiol 2003; 10:599–600
5. Irwig L, Tosteson AN, Gatsonis CA, et al. Guidelines for meta-analyses evaluating diagnostic tests. Ann Intern Med 1994; 120:667–676
6. Irwig L, Macaskill P, Glasziou P, Fahey M. Meta-analytic methods for diagnostic test accuracy. J Clin Epidemiol 1995; 48:119–130
7. Pai M. Systematic reviews of diagnostic test evaluations: what's behind the scenes? ACP J Club 2004; 141:11–13


8. Gatsonis C, McNeil B. Collaborative evaluation of diagnostic tests: experience of the Radiologic Diagnostic Oncology Group. Radiology 1990; 175:571–575
9. Deville WL, Bezemer PD, Bouter LM. Publications on diagnostic test evaluation in family medicine journals: an optimal search strategy. J Clin Epidemiol 2000; 53:65–69
10. Bachmann LM, Coray R, Estermann P, Ter Riet G. Identifying diagnostic studies in MEDLINE: reducing the number needed to read. J Am Med Inform Assoc 2002; 9:653–658
11. Moher D, Jadad AR, Tugwell P. Assessing the quality of randomized controlled trials: current issues and future directions. Int J Technol Assess Health Care 1996; 12:195–208
12. Whiting P, Rutjes AW, Reitsma JB, Bossuyt PM, Kleijnen J. The development of QUADAS: a tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Methodol 2003; 3:25
13. Bossuyt P, Reitsma J, Bruns D, et al. The STARD statement for reporting studies of diagnostic accuracy: explanation and elaboration. Clin Chem 2003; 49:7–18
14. Bossuyt P, Reitsma J, Bruns D, et al. Towards complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. BMJ 2003; 326:41–44
15. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: a systematic review. Ann Intern Med 2004; 140:189–202
16. Weinstein S, Obuchowski N, Lieber M. Clinical evaluation of diagnostic tests. AJR 2005; 184:14–19
17. Hanley J. Receiver operating characteristic (ROC) methodology: the state of the art. Crit Rev Diagn Imaging 1989; 29:307–335
18. Obuchowski N. ROC analysis. AJR 2005; 184:364–372
19. Zhou XH, Obuchowski N, McClish D. Statistical methods in diagnostic medicine. New York, NY: Wiley, 2002
20. Pepe M. The statistical evaluation of medical tests for misclassification and prediction. New York, NY: Oxford University Press, 2003
21. Toledano AY, Herman BA. Case study: evaluating accuracy of cancer diagnostic tests. In: Beam C, ed. Biostatistical applications in cancer research. Boston, MA: Kluwer, 2002:219–232
22. Toledano AY. Cancer diagnostics: statistical methods. In: Beam C, ed. Biostatistical applications in cancer research. Boston, MA: Kluwer, 2002:183–218
23. Moses LE, Littenberg B, Shapiro D. Combining independent studies of a diagnostic test into a summary ROC curve: data-analytic approaches and some additional considerations. Stat Med 1993; 12:1293–1316
24. Walter S. Properties of the SROC for diagnostic test data. Stat Med 2002; 21:1237–1256
25. Kardaun JWPF, Kardaun OJWF. Comparative diagnostic performance of three radiological procedures for the detection of lumbar disk herniation. Methods Inf Med 1990; 29:12–22
26. Rutter C, Gatsonis C. Regression methods for meta-analysis of diagnostic test data. Acad Radiol 1995; 2:S48–S56
27. Rutter C, Gatsonis C. A hierarchical regression approach to meta-analysis of diagnostic test accuracy evaluations. Stat Med 2001; 20:2865–2884
28. Macaskill P. Empirical Bayes estimates generated in a hierarchical summary ROC analysis agreed closely with those of a full Bayesian analysis. J Clin Epidemiol 2004; 57:925–932
29. SAS/STAT 9.1 user's guide. Cary, NC: SAS Institute, 2004
30. Scheidler J, Hricak H, Yu KK, Subak L, Segal MR. Radiological evaluation of lymph node metastases in patients with cervical cancer: a meta-analysis. JAMA 1997; 278:1096–1101
31. Normand SL. Tutorial in biostatistics: meta-analysis—formulating, evaluating, combining, and reporting. Stat Med 1999; 18:321–359
32. Reitsma J, Glas A, Rutjes A, Scholten R, Bossuyt P, Zwinderman A. Bivariate analysis of sensitivity and specificity produces informative summary measures in diagnostic reviews. J Clin Epidemiol 2005; 58:982–990
33. Dukic V, Gatsonis C. Meta-analysis of diagnostic test accuracy assessment studies with varying number of thresholds. Biometrics 2003; 59:936–946
34. Kester ADM, Buntinx F. Meta-analysis of ROC curves. Med Decis Making 2000; 20:430–439
35. McClish DK. Combining and comparing area estimates across studies or strata. Med Decis Making 1992; 12:274–279
36. Zhou X. Empirical Bayes combination of estimated areas under ROC curves using estimating equations. Med Decis Making 1996; 16:24–28
37. Hellmich M, Abrams KR, Sutton AJ. Bayesian approaches to meta-analysis of ROC curves. Med Decis Making 1999; 19:252–264
38. DuMouchel W, Normand SL. Computer modeling and graphical strategies for meta-analysis. In: Stangle D, Berry D, eds. Meta-analysis in medicine and health policy. New York, NY: Dekker, 2000
39. Beam C, Sostman HD, Zheng J-Y. Status of clinical MR evaluations 1985–1988: baseline and design for further assessments. Radiology 1991; 180:265–270
40. Black WC. How to evaluate the radiology literature. AJR 1990; 154:17–22
41. Cooper LS, Chalmers TC, McCally M, Berrier J, Sacks HS. The poor quality of early evaluations of magnetic resonance imaging. JAMA 1988; 259:3277–3280
42. Cochrane reviews of diagnostic test accuracy. The Cochrane Collaboration Web site. Available at: www.cochrane.org/newslett/ccnews31-lowres.pdf. Accessed May 31, 2006
43. Gatsonis CA. Random effects models for diagnostic accuracy data. Acad Radiol 1995; 2:S14–S21
44. Ishwaran H, Gatsonis C. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Can J Stat 2000; 28:731–750
45. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999; 282:1061–1066



APPENDIX 1: Binary Regression Model

For a single study, the model can be described as follows.


Let Yij represent the test result (1 = positive, 0 = negative) and Dij the true disease status on the jth individual in the ith study. In our notation,
we code D = 1/2 if diseased and –1/2 if nondiseased. The binary regression model is based on the assumption that the response arises from
the discretization of an underlying continuous latent variable with threshold θi. The latent variable follows logistic distributions for diseased
and nondiseased subjects, and the two distributions can be distinguished by a location parameter (αi) and a scale parameter (βi). The diagnostic
performance of the test in the ith study is a function of the location and the scale parameters. Formally,

logitP[Yij = 1 | Dij] = (θi – αiDij)exp(–βiDij)

The binary regression model is closely related to the usual ROC model and implies that for the ith study:

logit(FPRi) = (θi + αi / 2)exp(βi / 2)

logit(TPRi) = (θi – αi / 2)exp(–βi / 2)
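The two equations above can be checked numerically. The following is our own minimal Python sketch, not part of the original appendix; note that with the D = ±1/2 coding used here, negative values of α, as in Table 3, correspond to TPR > FPR:

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def operating_point(theta, alpha, beta):
    """(FPR, TPR) implied by one study's threshold theta and the shared
    location (alpha) and scale (beta) parameters, per the equations above:
    logit(FPR) = (theta + alpha/2) exp(beta/2),
    logit(TPR) = (theta - alpha/2) exp(-beta/2)."""
    fpr = expit((theta + alpha / 2.0) * math.exp(beta / 2.0))
    tpr = expit((theta - alpha / 2.0) * math.exp(-beta / 2.0))
    return fpr, tpr
```

Varying theta with alpha and beta held fixed sweeps out the study-specific operating points along one SROC curve.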

If the location and scale parameters are assumed to be constant across all studies, the model reduces to a relation between the true-positive
rate (TPR) and false-positive rate (FPR) that is similar to the relation postulated in the model described by Moses et al. [23]. In particular:

logit(TPRi) = c0 + c1 logit(FPRi)

or, in the (D, S) coordinates of the linear model,

Di = 2c0 / (c1 + 1) + [(c1 – 1) / (c1 + 1)] Si

where c0 = –α exp(–β/2) and c1 = exp(–β). It is clear that c1 is greater than 0, and that b, which is equal to (c1 – 1) / (c1 + 1), takes values between –1 and 1.
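The conversion just described is easy to verify numerically. Below is a Python sketch of the mapping from (α, β) to the Moses-model intercept and slope; the function name and the test values are ours:

```python
import math

def moses_parameters(alpha, beta):
    """Map binary regression (alpha, beta) to the Moses intercept a and slope b,
    via c0 = -alpha*exp(-beta/2) and c1 = exp(-beta), as in the text."""
    c0 = -alpha * math.exp(-beta / 2.0)
    c1 = math.exp(-beta)
    a = 2.0 * c0 / (c1 + 1.0)
    b = (c1 - 1.0) / (c1 + 1.0)
    return a, b

# because c1 > 0 for every real beta, the slope b falls strictly inside (-1, 1)
for beta in (-2.0, -0.5, 0.0, 0.5, 2.0):
    a, b = moses_parameters(-1.965, beta)
    assert -1.0 < b < 1.0
```

When β = 0 the two distributions have equal spread, c1 = 1, and the mapping reduces to a = –α with b = 0, i.e., a symmetric SROC curve.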
An SROC curve and its summary measures can be estimated from the binary regression model. In addition, study-level and subject-level co-
variates can be easily incorporated, resulting in models of the form:

logitP[Yij = 1 | Dij, Xi] = (θi – αiDij – γXi)exp(–βiDij),

which corresponds to simultaneously fitting several SROC curves to subsets of the data.
The large number of parameters in the binary regression model creates identifiability problems without additional assumptions. For example,
the model without covariates has three parameters for each table; hence, it is not identifiable for a single table. However, with suitable assump-
tions, such as the one leading to the analogue of the Moses model, the binary regression model can be made identifiable. Other assumptions
about the parameters allow the exploration of heterogeneity across studies. For example, studies may have different location parameters (thus
different overall accuracies) but the same scale parameter and the same threshold. Such exploration of heterogeneity is rather limited within the
fixed-effects type of approach we present in this article. More elaborate exploration of heterogeneity requires the use of hierarchical models.




APPENDIX 2: Software for Fitting a Binary Regression Model

SAS Code for Binary Regression Model (for CT)

data binreg1;
input study test n_pos n_tp dis dis1;
cards;
1 1 10 8 1 0.5
………………………………………………
42 3 24 2 0 -0.5
;
run;

/* keep only the CT studies (test = 2) */
data final; set binreg1; if test=2;

/* create an indicator variable for each study */
if study=1 then s1=1; else s1=0;
...............
if study=42 then s42=1; else s42=0;
run;

proc nlmixed data=final maxiter=5000 cov;
/* a = location (alpha), b = scale (beta); t18-t36 are the CT study thresholds */
parms a=0 b=0 t18=0 t19=0 t20=0 t21=0 t22=0 t23=0 t24=0 t25=0 t26=0 t27=0 t28=0 t29=0 t30=0 t31=0 t32=0 t33=0 t34=0 t35=0 t36=0;
logitp=(t18*s18+t19*s19+t20*s20+t21*s21+t22*s22+t23*s23+t24*s24+t25*s25+t26*s26+t27*s27+t28*s28+t29*s29+t30*s30+t31*s31+t32*s32+t33*s33+t34*s34+t35*s35+t36*s36-a*dis1)/exp(b*dis1);
p=exp(logitp)/(1+exp(logitp));
model n_tp~binomial(n_pos,p);
run;
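For readers outside SAS, the likelihood that Proc NLMixed maximizes can be written compactly in other languages. The Python sketch below is ours (the data layout and names are assumptions, not the authors' code); any general-purpose optimizer could then minimize neg_log_lik over [a, b, theta_1, ..., theta_k]:

```python
import math

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def neg_log_lik(params, studies):
    """Negative binomial log-likelihood of the binary regression model
    logit p = (theta_i - a*dis1) / exp(b*dis1), with dis1 = +0.5 for the
    diseased row and -0.5 for the nondiseased row of each study's table.
    params = [a, b, theta_1, ..., theta_k]; studies[i] is a list of
    (n, y, dis1) rows: n subjects, y positive test results."""
    a, b = params[0], params[1]
    thetas = params[2:]
    nll = 0.0
    for theta, rows in zip(thetas, studies):
        for n, y, dis1 in rows:
            p = expit((theta - a * dis1) / math.exp(b * dis1))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logs
            nll -= y * math.log(p) + (n - y) * math.log(1.0 - p)
    return nll

# one toy study: 73/100 positives among diseased, 27/100 among nondiseased
studies = [[(100, 73, 0.5), (100, 27, -0.5)]]
```

With these toy counts, parameter values near (a = -2, b = 0, theta = 0) yield a lower negative log-likelihood than the chance-level model (a = 0), mirroring what the SAS fit does at scale across all studies of one technique.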

The reader's attention is directed to earlier articles in the Fundamentals of Clinical Research series:

1. Introduction, which appeared in February 2001
2. The Research Framework, April 2001
3. Protocol, June 2001
4. Data Collection, October 2001
5. Population and Sample, November 2001
6. Statistically Engineering the Study for Success, July 2002
7. Screening for Preclinical Disease: Test and Disease Characteristics, October 2002
8. Exploring and Summarizing Radiologic Data, January 2003
9. Visualizing Radiologic Data, March 2003
10. Introduction to Probability Theory and Sampling Distributions, April 2003
11. Observational Studies in Radiology, November 2004
12. Randomized Controlled Trials, December 2004
13. Clinical Evaluation of Diagnostic Tests, January 2005
14. ROC Analysis, February 2005
15. Statistical Inference for Continuous Variables, April 2005
16. Statistical Inference for Proportions, April 2005
17. Reader Agreement Studies, May 2005
18. Correlation and Regression, July 2005
19. Survival Analysis, July 2005
20. Multivariate Statistical Methods, August 2005
21. Decision Analysis and Simulation Modeling for Evaluating Diagnostic Tests on the Basis of Patient Outcomes, September 2005
22. Radiology Cost and Outcomes Studies: Standard Practice and Emerging Methods, October 2005

