Any screen. Any time. Anywhere.
Activate the eBook version of this title at no additional charge.

Expert Consult eBooks give you the power to browse and find content,
view enhanced images, share notes and highlights—both online and offline.

Unlock your eBook today.


1. Visit expertconsult.inkling.com/redeem
2. Scratch off your code
3. Type code into “Enter Code” box
4. Click “Redeem”
5. Log in or Sign up
6. Go to “My Library”

Scan this QR code to redeem your eBook through your mobile device: [Place Peel Off Sticker Here]

It’s that easy!

For technical assistance:


email expertconsult.help@elsevier.com
call 1-800-401-9962 (inside the US)
call +1-314-447-8200 (outside the US)
Use of the current edition of the electronic version of this book (eBook) is subject to the terms of the nontransferable, limited license granted on
expertconsult.inkling.com. Access to the eBook is limited to the first individual who redeems the PIN, located on the inside cover of this book,
at expertconsult.inkling.com and may not be transferred to another party by resale, lending, or other means.
Practical Guide to the Evaluation
of Clinical Competence
2nd Edition

Practical Guide to the Evaluation
of Clinical Competence

2nd Edition

Eric S. Holmboe, MD, MACP, FRCP


Senior Vice President, Milestones Development and Evaluation
Accreditation Council for Graduate Medical Education
Chicago, Illinois;
Professor Adjunct
Yale University
New Haven, Connecticut;
Adjunct Professor of Medicine
Feinberg School of Medicine, Northwestern University
Chicago, Illinois

Steven J. Durning, MD, PhD


Professor of Medicine and Pathology
Department of Medicine
Uniformed Services University of the Health Sciences
Bethesda, Maryland

Richard E. Hawkins, MD, FACP


Vice President, Medical Education Outcomes
American Medical Association
Chicago, Illinois
1600 John F. Kennedy Blvd.
Ste 1800
Philadelphia, PA 19103-2899

PRACTICAL GUIDE TO THE EVALUATION OF CLINICAL COMPETENCE, ED. 2
ISBN: 978-0-323-44734-8

Copyright © 2018 Eric Holmboe, Richard Hawkins, and Steven Durning. Published by Elsevier Inc. All rights reserved.
For chapter 2 (Dr. Brian Clauser): Copyright © 2018, NBME. Published by Elsevier Inc. All Rights Reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

With respect to any drug or pharmaceutical products identified, readers are advised to check the most current information provided (i) on procedures featured or (ii) by the manufacturer of each product to be administered, to verify the recommended dose or formula, the method and duration of administration, and contraindications. It is the responsibility of practitioners, relying on their own experience and knowledge of their patients, to make diagnoses, to determine dosages and the best treatment for each individual patient, and to take all appropriate safety precautions.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Previous edition copyrighted 2008 by Mosby, an imprint of Elsevier Inc.

Library of Congress Cataloging-in-Publication Data

Names: Holmboe, Eric S., editor. | Durning, Steven J., editor. | Hawkins,
Richard E., editor.
Title: Practical guide to the evaluation of clinical competence / [edited by]
Eric S. Holmboe, Steven J. Durning, Richard E. Hawkins.
Description: 2nd edition. | Philadelphia, PA : Elsevier, [2018] | Includes
bibliographical references and index.
Identifiers: LCCN 2016048388 | ISBN 9780323447348 (pbk. : alk. paper)
Subjects: | MESH: Clinical Competence | Educational Measurement--methods |
Education, Medical, Graduate--standards | Competency-Based
Education--methods
Classification: LCC R837.A2 | NLM W 18 | DDC 616--dc23 LC record available at https://lccn.loc.gov/2016048388

Executive Content Strategist: James Merritt


Senior Content Development Specialist: Rae Robertson
Publishing Services Manager: Patricia Tannian
Project Manager: Stephanie Turza
Design Direction: Patrick Ferguson

Printed in the United States of America

Last digit is the print number: 9 8 7 6 5 4 3 2 1


Preface

Assessment of health professionals across the continuum of medical education and practice is essential for advancing high-quality and safe care for patients and the public. Assessment of clinical competence is a core element of professionalism and underlies effective professional self-regulation; it is essential for fulfilling our professional obligation to assure the public that the graduates of medical education training programs are truly prepared to enter the next stage of education and/or practice. Despite substantial attention to the quality and safety of healthcare over the past 20 years, major deficiencies and concerns persist in healthcare fields. The transformation of medical education, and the education of all healthcare professionals, is appropriately seen as part of the solution. Effective assessment is a vital component of this transformation. First and foremost, medicine is a service profession. As medical educators, it is vital we develop and use high-quality assessment methods and systems in order to fulfill a primary obligation to the public and patients we serve. Furthermore, effective assessment provides the necessary data for robust feedback and guidance to support professional growth and development. Learners are entitled to no less; without assessment and feedback the attainment of mastery, the ultimate goal of outcomes-based education, is nearly impossible.

It has been nearly 10 years since the publication of the first edition of this book, and much has changed during this period. Competency-based medical education (CBME) models are now being implemented to varying degrees across the globe in an effort to drive better outcomes of education and by extension healthcare. The philosophical underpinnings of CBME are informing curricular and programmatic assessment changes, accreditation and certification approaches, and the credentialing of healthcare professionals. CBME has highlighted the importance of leveraging more traditional methods of assessment while creating substantial pressure and defining the need to advance other methods of assessment, especially in the workplace. Fully implemented, CBME frameworks embrace holistic and constructivist approaches to assessment; successful assessment programs will need to incorporate a diverse range of educational and assessment theories and methods.

We are pleased to be able to share changes and advances in assessment that have occurred since 2008. Many readers let us know that one of the main benefits of the first edition was the practical suggestions in each chapter that could be implemented in training programs. We have attempted to stay true to that philosophy by adding more supplemental material and new chapters on assessing clinical reasoning in the workplace, work-based procedural assessment, and feedback. All other chapters have undergone extensive revision to be up-to-date and practical.

The three of us have spent much of our professional lives thinking, learning, and then teaching about assessment. Like many of you, much of our initial learning was through trial and error, occurring as a result of being assigned positions of responsibility in determining the competence of students and residents in internal medicine. We have also had the privilege to work within national organizations involved in the assessment of physicians across the continuum. Assessment is not routinely seen by physicians and other health professionals as a welcome activity, especially when it comes from an external entity. Yet without assessment, feedback is almost impossible and continuous professional growth is difficult. We hope that by sharing part of our own journey through this textbook we can help the reader address important assessment challenges they are facing in their own work context and also contribute to larger conversations around assessment as a mechanism to improve healthcare quality and safety.

The primary purpose of this book is to provide a practical guide to developing assessment programs using a systems lens. No single assessment method is sufficient to determine something as complex as clinical competence. Educators will need to develop programs of assessment by choosing the optimal combination of methods, based on the best evidence available, for their local context. This book has been organized around the various assessment methods and instruments and how individuals with responsibilities for assessment can apply these methods and instruments in their own setting. We have provided an overview of key educational theories where applicable to help the reader understand how best to use the assessment method and its purpose. Each chapter provides information on the strengths and weaknesses of the assessment method, along with information about specific tools. Many chapters provide examples of assessment instruments along with suggestions on faculty development and effective implementation of the assessment method. Each chapter also contains an annotated bibliography of helpful articles for additional reading.

The first chapter provides an overview of basic assessment principles with a focus on the rise and impact of competency-based approaches to achieve outcomes. Chapter 2 provides a useful primer on key theories and aspects of psychometrics, a discipline that remains essential to effective assessment. Chapter 3 explores the evolving approaches to the use of rating scales, a common component of assessment forms and surveys, highlighting the importance of appropriate frameworks and anchors. Direct observation in the workplace, especially of clinical skills, is the focus of Chapter 4, with multiple practical suggestions on how to better prepare faculty in this essential assessment skill. Chapter 5 explores the assessment of clinical skills with standardized patients, another form of direct observation in controlled settings. Chapter 6 provides an extensive overview of the effective use of the traditional written, standardized tests of medical knowledge and clinical reasoning, still an essential part of an assessment program. However, the need for high-quality assessment of clinical reasoning in the workplace has grown in importance with the recognition of the persistent and pernicious problem of diagnostic and therapeutic errors in clinical practice. This is the focus of Chapter 7, a new chapter for this edition. Another new addition, Chapter 8, covers the assessment of procedural competence in the workplace, another growing area of interest for medical educators in an era of patient safety concerns.

Chapter 9 addresses the importance of assessing evidence-based practice, an essential competency in a time of rapidly expanding medical knowledge and growing use of clinical decision support at the point of care. Chapter 10 has been extensively revised and now focuses on the multiple ways to assess performance in clinical practice using quality and safety measures. The growing use of these measures is now an established part of medical practice across the globe. Chapter 11 provides guidance on the effective use of multisource feedback, an approach essential to patient-centered care and interprofessional practice.

Chapter 12 is a complement to Chapter 5, covering the growing field of simulation outside standardized patients. Simulation, depending on the discipline, should increasingly become a standard component of an assessment program. Chapter 13 is a new chapter on practical approaches to feedback. This chapter was added because no assessment system can be fully effective without robust feedback.

The final three chapters help the reader “put it all together.” Portfolios, covered in Chapter 14, offer a comprehensive approach to supporting an assessment program. The chapter provides practical advice on how to design and implement portfolios. Chapter 15 provides a systematic approach to working with the dyscompetent learner, i.e., the learner in difficulty. These learners require an assessment program and systematic approach using multiple assessment methods. The final chapter, Chapter 16, covers the important role of programmatic evaluation as part of an effective educational program. Newer concepts and approaches to programmatic assessment are provided.

Effective assessment requires a multifaceted approach using a combination of assessment methods. This is the rationale behind the organization and design of this book. Effective assessment also depends upon collaboration among a team of faculty and other educators; thus any change to an assessment system must include not only buy-in from others, but also the investment to train educators to use assessment methods and tools effectively. In a CBME system, this must also include the learners as “active agents” in their own learning and assessment. Interprofessional faculty, program leaders, and learners need to work together to co-create and co-produce assessment to maximize educational, and ultimately, clinical outcomes.

It is essential to remember the true assessment instrument is the individual using it, not the instrument itself. Assessment tools are only as good as the individual using them. If done well, assessment can have a profoundly positive effect on patients, learners, and faculty. That has not changed since 2008 and likely never will. Nothing can be more satisfying than knowing each and every one of your graduates is truly ready to move to the next career level. The public expects no less, and we should expect no less from ourselves. In that spirit, we welcome comments from you, the reader, on how we can improve upon this book.

Eric S. Holmboe
Steven J. Durning
Richard E. Hawkins
Contributors

John R. Boulet, PhD
Vice President, Research and Data Resources
Foundation for Advancement of International Medical Education and Research
Educational Commission for Foreign Medical Graduates
Philadelphia, Pennsylvania

Carol Carraccio, MD
Vice President, Competency Based Assessment Programs
American Board of Pediatrics
Chapel Hill, North Carolina

Brian E. Clauser, EdD
Vice President, Center for Advanced Assessment
National Board of Medical Examiners
Philadelphia, Pennsylvania

Daniel Duffy, MD
Landgarten Chair of Medical Leadership
Department of Internal Medicine
Oklahoma University School of Community Medicine
Tulsa, Oklahoma

Steven J. Durning, MD, PhD
Professor of Medicine and Pathology
Department of Medicine
Uniformed Services University of the Health Sciences
Bethesda, Maryland

Michael L. Green, MD
Professor of Medicine
Department of Internal Medicine
Associate Director for Student Assessment
Teaching and Learning Center
Yale University School of Medicine
New Haven, Connecticut

Stanley J. Hamstra, PhD
Vice President, Milestones Research and Evaluation
Accreditation Council for Graduate Medical Education
Chicago, Illinois

Richard E. Hawkins, MD, FACP
Vice President, Medical Education Outcomes
American Medical Association
Chicago, Illinois

Eric S. Holmboe, MD, MACP, FRCP
Senior Vice President, Milestones Development and Evaluation
Accreditation Council for Graduate Medical Education
Chicago, Illinois;
Professor Adjunct
Yale University
New Haven, Connecticut;
Adjunct Professor of Medicine
Feinberg School of Medicine, Northwestern University
Chicago, Illinois

William Iobst, MD
Vice Dean and Vice President for Academic Affairs
Professor of Medicine
Geisinger Commonwealth School of Medicine
Scranton, Pennsylvania

Jennifer R. Kogan, MD
Professor of Medicine
Assistant Dean, Faculty Development
Director of Undergraduate Education, Department of Medicine
Perelman School of Medicine at the University of Pennsylvania
Philadelphia, Pennsylvania

Jocelyn M. Lockyer, PhD
Professor of Community Health Sciences
Senior Associate Dean of Education
Cumming School of Medicine
University of Calgary
Calgary, Alberta, Canada

Melissa J. Margolis, PhD
Senior Measurement Scientist
National Board of Medical Examiners
Philadelphia, Pennsylvania

Neena Natt, MD
Associate Professor
Vice Chair Education
Division of Endocrinology, Diabetes, Metabolism, Nutrition
Mayo Clinic
Rochester, Minnesota

Patricia S. O’Sullivan, EdD
Director, Research and Development in Medical Education
Center for Faculty Educators, School of Medicine
Professor of Medicine
University of California San Francisco
San Francisco, California

Louis N. Pangaro, MD, MACP
Professor and Chair
Department of Medicine
Uniformed Services University of the Health Sciences
Bethesda, Maryland

Joan M. Sargeant, PhD
Professor
Faculty of Medicine
Division of Medical Education
Department of Community Health and Epidemiology
Dalhousie University
Halifax, Nova Scotia, Canada;
Adjunct Professor
School of Education
Acadia University
Wolfville, Nova Scotia, Canada

Ross J. Scalese, MD
Associate Professor of Medicine
Director of Educational Technology Development
Michael S. Gordon Center for Research in Medical Education
University of Miami Miller School of Medicine
Miami, Florida

David B. Swanson, PhD
Vice President of Academic Affairs
American Board of Medical Specialties
Chicago, Illinois;
Professor (Honorary) of Medical Education
University of Melbourne
Victoria, Australia

Olle ten Cate, PhD
Professor of Medical Education
Center for Research and Development of Education
University Medical Center Utrecht
Utrecht, the Netherlands
Acknowledgments

In memory of my incredibly supportive parents, Dr. Kenneth C. and Mrs. Bette M. Holmboe. All my love and appreciation to my wife and best friend, Eileen Holmboe, and my two amazing children who bring so much joy, Ken and Lauren.
Eric S. Holmboe

To my wife of 25 years, Kristen, and my two wonderful sons, Andrew and Daniel, for their love and support. To my parents and my in-laws for their wisdom and encouragement.
Steven J. Durning

Much love and gratitude to my mother, Jacqueline Hawkins, and my partner, Margaret Jung, for their support and encouragement.
Richard E. Hawkins

Dedication

We wish to acknowledge the talent and dedication of the authors whose effort and expertise resulted in this book. We also wish to thank the countless trainees and faculty that we have worked with over the years who continue to inspire and challenge us.
Eric S. Holmboe, Steven J. Durning, Richard E. Hawkins
Contents

1 Assessment Challenges in the Era of Outcomes-Based Education, 1
Eric S. Holmboe, Olle ten Cate, Steven J. Durning, and Richard E. Hawkins

2 Issues of Validity and Reliability for Assessments in Medical Education, 22
Brian E. Clauser, Melissa J. Margolis, and David B. Swanson

3 Evaluation Frameworks, Forms, and Global Rating Scales, 37
Louis N. Pangaro, Steven J. Durning, and Eric S. Holmboe

4 Direct Observation, 61
Jennifer R. Kogan and Eric S. Holmboe

5 Direct Observation: Standardized Patients, 91
John R. Boulet, Neena Natt, and Richard E. Hawkins

6 Using Written Examinations to Assess Medical Knowledge and Its Application, 113
David B. Swanson and Richard E. Hawkins

7 Assessing Clinical Reasoning in the Workplace, 140
Eric S. Holmboe and Steven J. Durning

8 Workplace-Based Assessment of Procedural Skills, 155
Stanley J. Hamstra

9 Evaluating Evidence-Based Practice, 165
Michael L. Green

10 Clinical Practice Review, 184
Eric S. Holmboe and Daniel Duffy

11 Multisource Feedback, 204
Jocelyn M. Lockyer

12 Simulation-Based Assessment, 215
Ross J. Scalese

13 Feedback and Coaching in Clinical Teaching and Learning, 256
Joan M. Sargeant and Eric S. Holmboe

14 Portfolios, 270
Patricia S. O’Sullivan, Carol Carraccio, and Eric S. Holmboe

15 The Learner With a Problem or the Problem Learner? Working With Dyscompetent Learners, 288
William Iobst and Eric S. Holmboe

16 Program Evaluation, 303
Richard E. Hawkins and Steven J. Durning
Video Contents

4 Direct Observation
4.1 Medical Interviewing: Level 1 History Taking
4.2 Medical Interviewing: Level 2 History Taking
4.3 Medical Interviewing: Level 3 History Taking
4.4 Physical Examination: Level 1
4.5 Physical Examination: Level 2
4.6 Physical Examination: Level 3
4.7 Counseling: Level 1
4.8 Counseling: Level 2
4.9 Counseling: Level 3
4.10 How Faculty Should Conduct an Effective Observation
4.11 Medical Interviewing: Level 1
4.12 Medical Interviewing: Level 2
4.13 Medical Interviewing: Level 3
4.14 Physical Examination: Level 1
4.15 Physical Examination: Level 2
4.16 Physical Examination: Level 3
4.17 Informed Decision Making: Level 1
4.18 Informed Decision Making: Level 2
4.19 Informed Decision Making: Level 3

13 Feedback and Coaching in Clinical Teaching and Learning
13.1 An Evidence-Based 4-Stage Model for Facilitating Reflective Feedback and Coaching for Change: R2C2
1
Assessment Challenges in the Era of
Outcomes-Based Education
ERIC S. HOLMBOE, MD, MACP, FRCP, OLLE TEN CATE, PHD,
STEVEN J. DURNING, MD, PHD, AND RICHARD E. HAWKINS, MD, FACP

CHAPTER OUTLINE

The Rise of Competency-Based Medical Education
Outcomes and Competency-Based Medical Education
A Brief History of Assessment
Drivers of Change in Assessment
Accountability and Quality Assurance
Quality Improvement Movement
Technology
Psychometrics
Qualitative Assessment and Group Process
Framework for Assessment
Dimension 1: Competencies
Dimension 2: Levels of Assessment
Miller’s Pyramid
The Cambridge Model
Dimension 3: Assessment of Progression
Criteria for Choosing a Method
Elements of Effective Faculty Development
Overview of Assessment Methods
Traditional Measures
Methods Based on Observation
Simulation
Work
New Directions in Assessment
Milestones
Entrustable Professional Activities
Combining Milestones and Entrustable Professional Activities
Entrustable Professional Activities – Competencies – Skills
Entrustable Professional Activities Across the Continuum and Nested Entrustable Professional Activities
Entrustment Decision Making as Assessment
Systems of Assessment (See Chapter 16.)
Conclusion
Acknowledgment
References
Appendix 1.1: Developing an Entrustable Professional Activity
Appendix 1.2: Entrustable Professional Activities, Competencies, and Milestones: Pulling It All Together

The Rise of Competency-Based Medical Education

Despite major biomedical and technical advances, medical care across the globe continues to suffer from pernicious quality and safety gaps that result in substantial harm and ineffective care for too many patients each year.1,2 In 2001, the Institute of Medicine codified the six aims of quality: care that is effective, efficient, safe, patient centered, timely, and equitable.3 More recently, the triple aim of quality in patient experience (defined by the six aims), health of a population, and cost stewardship has become the overarching driving framework for the United States and other health care systems.4 Yet data from multiple sources, such as the Organization for Economic Cooperation and Development (OECD), the World Health Organization (WHO), and the Commonwealth Fund (CMWF), demonstrate persistent problems in morbidity and mortality that are amenable to better and safer health care delivery.5 Although a number of factors contribute to this state of affairs, many medical educators and policymakers accept the premise that the medical education enterprise bears some responsibility through insufficient preparation of trainees for 21st-century practice.6 In conjunction with these concerns about health care quality and safety has been the growing focus on the outcomes of education. Specifically, educators are now most concerned with the abilities of a graduate rather than whether a trainee simply completes a prescribed educational program.7 These and other factors have led to the global spread of outcomes-based medical education using
competencies as a foundational outcomes framework for educational programs.7–11

In 1978, McGaghie and colleagues described a rationale for an approach to medical education founded on the acquisition of defined competencies. “The intended output of a competency-based programme,” they wrote, “is a health professional who can practise medicine at a defined level of proficiency, in accord with local conditions, to meet local needs.”8 Educational leaders and policymakers worldwide produced multiple reports lamenting that medical education systems were not producing physicians with the abilities needed to meet the complexities of modern practice, leading to the realization that reforms in undergraduate, graduate, and continuing medical education were urgently needed. In the United States, several recent reviews call attention to the inadequate preparation of our graduates to practice effectively in our evolving health care systems.12–14 This context and other factors ultimately led to the development of competency frameworks in several countries as part of initiatives to implement competency-based medical education (CBME) to achieve better educational and clinical care outcomes. The first iteration of the Canadian Medical Education Directions for Specialists (CanMEDS) Roles by the Royal College of Physicians and Surgeons of Canada was produced in 1996.15,16 Recognizing similar needs and issues, the Accreditation Council for Graduate Medical Education, the American Board of Medical Specialties, the Institute of Medicine, the General Medical Council of the United Kingdom, the Royal Australasian College of Surgeons, the Dutch College of Medical Specialties, and other national professional entities produced competency frameworks.17–21 Two key features of these competency projects stand out. One is a redefinition of the doctor to include many more important and relevant abilities and constructs beyond medical knowledge and technical skill that had been dominating training in the previous decades. The other feature is the intention to better monitor doctors in training and to ensure they meet predefined competency standards upon graduation to unsupervised practice.7,22

Since the publication of the first edition of this book in 2008, a number of major reports and initiatives have sought to move CBME toward broader implementation. The International CBME Collaborators, a group of medical educators and leaders convened by the Royal College of Physicians and Surgeons of Canada, produced a series of articles on the history, concepts, and challenges to implementation of competency-based medical education, including needed changes to assessment, across the continuum of medical training.15,16,23–25 In the same year, Frenk and a group of international leaders published an influential position paper in The Lancet on the need to accelerate transformation in medical education, grounded in the principles of CBME.6 Finally, on the 100th anniversary of the Flexner report (1910), the Carnegie Foundation released recommendations for medical education that embraced many of the key principles and goals of CBME.9 All of these reports have highlighted the critical need for better assessment.

The primary purpose of this second edition is to provide practical guidance to educators and program leaders on the “front lines” for building and implementing better programs and systems of assessment using the best evidence and information available. Assessment is fundamental and essential for effective learning and for achieving both desired educational and clinical outcomes. CBME is part of the latest phase of what should be a continuous commitment to improve educational programs and by extension the quality and safety of care patients and populations receive. This introductory chapter will present an overview of the drivers of change in the assessments used during clinical education, frameworks for such assessment, criteria for choosing assessment methods, elements of an effective faculty development effort, and the new concepts of competencies, milestones, and entrustable professional activities now being used to facilitate change and improvement in medical education. Before moving on to fundamental issues of assessment in a CBME world, we will first review some key definitions and elements of CBME.

Outcomes and Competency-Based Medical Education

A focus on the educational process has now shifted to an emphasis on what a physician is able to actually do at the end of training and at important junctures during the training process. Competencies have become a primary mechanism for defining the educational outcomes. Outcomes-based education starts with a specification of the competencies expected of a physician, and these requirements drive the content and structure of the curriculum, the selection and deployment of teaching and learning methods, the site of training, and the nature of the teachers. Assessment plays a central role in determining whether students and residents have actually achieved the competencies that have been specified and whether the educational program has been efficacious. CBME highlights the importance of integrating curriculum and assessment; they should not be independent activities but rather inform each other as part of an overall educational system and program of assessment. This change in thinking and the need to assess the diverse competencies of the physician have been important factors in the development of new methods of assessment, especially work-based assessments covered in detail throughout this book.

CBME is an outcomes-focused approach to and philosophy of designing the explicit developmental progression of health care professionals to meet the needs of those they serve. Among its fundamental characteristics (Box 1.1) is a shift in emphasis away from time-based programs based solely on exposure to experiences such as clinical rotations in favor of an emphasis on needs-based graduate outcomes, authenticity, and learner-centeredness.11,26 As defined by Frank and colleagues, CBME is “an outcomes-based approach to the design, implementation, assessment, and evaluation of medical education programs, using an organizing framework of competencies.”11 Although outcomes
are now the primary driver, that does not mean educational structures and processes are not important. The famous Donabedian equation for quality, Structure × Process = Outcomes, highlights that good outcomes depend on effective structures and processes.27 However, we are also learning that the relationship between structure and process is quite complex and nonlinear in its actual execution.28 Chapter 16 provides helpful guidance on how to embrace complexity as part of program design and evaluation. Assessment is a critical part of the complex interaction between structure and process in an educational program.

• BOX 1.1 Fundamental Characteristics of Competency-Based Medical Education

• Graduate outcomes in the form of achievement of predefined desired competencies are the goals of competency-based medical education (CBME) initiatives. These are aligned with the roles graduates will play in the next stage of their careers.
• These predefined competencies are derived from the needs of patients, learners, and institutions and organized into a coherent guiding framework.
• Time is a resource for learning, not the basis of progression of competence (i.e., time spent on a ward is not the marker of achievement).
• Teaching and learning experiences are sequenced to facilitate an explicitly defined progression of ability in stages.
• Learning is tailored to the learner’s individual progression in some manner.
• Numerous direct observations and focused feedback contribute to effective learner development of expertise.
• Assessment is planned, systematic, systemic, and integrative.

Assessment is an essential activity (i.e., process) that can be used to demonstrate outcomes of interest. This is not a new insight—assessment has always been critically important in any educational endeavor. However, the problems with assessment in medical education, and in general all of health professions education, have been long-standing and persistent, such as lack of direct observation of learner performance and meaningful feedback, overreliance on testing for assessment of medical knowledge, lack of attention to other essential competencies that address our graduates’ abilities to function effectively in our health care systems such as interprofessional teamwork and quality improvement, and ineffective use of assessment methods and tools by faculty, to name just a few. In this introductory chapter, we will first explore fundamental issues in assessment, followed by recent attempts to more effectively operationalize competencies through milestones and entrustable professional activities, and then close with the importance of creating a program of assessment. Throughout this chapter we will refer the reader to other chapters in the book to help them create and revise their own program of assessment.

A Brief History of Assessment

Through the early 1950s, physicians were assessed in limited ways.29 Medical knowledge was evaluated with essays and other open-ended question formats that were graded by an instructor. Clinical skill and judgment were tested using an oral examination that often required the student to go to the bedside, gather patient information, and present it along with a diagnostic list and treatment plan to one or more examiners who asked questions. Because these were the only generally accepted methods available, they were applied to most assessment problems even if they were not completely suitable to the task. That may have been acceptable in a time when supervisors had much more control over the health care process and had natural checks of everything learners reported. Over the past decades health care has become too complex to warrant this type of “on-the-fly,” ad hoc approach. For example, lengths of stay in hospitals have dropped dramatically and faculty have multiple competing responsibilities.

From that point to the present, there have been extensive changes in the way assessment is conducted. Methods have proliferated, as have the requirements for their appropriate use. Much progress has been made in the assessment of medical knowledge with a variety of written and computer-based techniques offering reliable and valid results (see Chapter 6). In the last few decades, considerable gains have been made in defining and enhancing the psychometric qualities of objective structured clinical examinations (OSCEs), particularly related to their use in high-stakes examinations (see Chapter 5). However, assessment in the context of learners caring for patients in clinical units (i.e., wards, operating theater, ambulatory clinic) has lagged to some degree, especially in the areas of clinical skills, interprofessional teamwork, and quality and safety of care.24,30

Equally important, the methods that have been developed to support clinical education often rely on faculty who are inexperienced in their use, do not share common standards or shared mental models of the competencies of importance, and have not been trained to apply them in a consistent fashion. In addition, faculty now experience substantial time pressures, more learner and patient handoffs, higher degrees of comorbidity among hospitalized patients, and increasing personal clinical responsibilities. Perhaps more concerning are recent findings that one of the principal drivers of faculty assessment relates to their own clinical skills, with a number of studies highlighting important deficiencies in practicing physician clinical skills such as medical interviewing, physical examination, and communication skills.31,32 Finally, many of the faculty are also being asked to assess and judge competencies, such as care coordination, patient safety, and use of information technology, areas in which they themselves were never formally trained. Compounding this state of affairs has been the lack of effective faculty development approaches and models to address these new clinical and educational methods.33

Drivers of Change in Assessment

The increased public focus on the medical education enterprise is important; medical education should always be in service of individual patients and the public. Using a service
logic can help educators develop assessment programs that meet public, patient, and learner needs.34 Many programs globally are implementing curricular changes that embrace competencies and outcomes, supported by improvements in technology, psychometrics, and evolving work-based assessment approaches that increasingly incorporate more qualitative techniques and systematic judgment.

Accountability and Quality Assurance

The movement to competency-based medical education has been accompanied by significant efforts to enhance the accountability of physicians.3 Motivated by the need to improve quality and safety, and in part by high-profile cases such as those involving Michael Swango in the United States and Harold Shipman in the United Kingdom in the 1990s, the public has continued to pressure medicine to increase its level of oversight and eliminate the “bad apples.”35,36 Medical educators are also more keenly aware that too many trainees graduate with substantial deficiencies in foundational knowledge and clinical skills and more recently have become aware of deficiencies in competencies important to succeed in our health care systems.12–14,37 Effective quality assurance depends on robust assessment programs and is critically important to ensure that graduates of medical education programs are truly ready for promotion to the next stage and ultimately unsupervised practice. Promoting trainees who lack competence erodes, if not destroys, the trust between the medical profession and the public.

Quality Improvement Movement

At the same time, there has been a variety of efforts focused on continuously improving the quality of health care.4,27,38–41 These efforts have relied on methods devised by workers in the field of quality management science and, in some cases, used successfully in industry for over 60 years to drive continuous improvement in health care and now increasingly in medical education programs. Central to quality improvement is assessment—it is very hard to improve without meaningful measurement and data. It offers a means of identifying those whose overall performance is well below standard and also identifying areas for improvement for those who are generally performing adequately, helping to drive the continuous quality improvement process. These developments have helped to fuel the creation of several new methods of assessment and to increase the use of other methods already available. For example, the milestones initiative, an attempt to better describe competencies in narrative, developmental terms in the United States, uses the principles of continuous quality improvement as part of its foundation to improve graduate medical education. The milestones initiative can be viewed through the lens of “action- or practice-based research” to learn and develop evidence over time.42 There is no single “holy grail” of assessment. All assessments have strengths and weaknesses, and programs need to build in ongoing evaluation of their assessment activities. (See Chapter 16.)

Technology

Over the past 50 years, the availability of increasingly sophisticated technology has changed the testing of medical knowledge and judgment in fundamental ways.43,44 The introduction of the computer heralded an era of large-scale testing by encouraging the use of multiple-choice questions (MCQs), the answers to which could be scanned by machine, turned into scores, and then reported in an efficient and objective fashion.

More recently, the intelligence of the computer has improved assessment in two ways:
1. On the one hand, it has enabled the application of significant psychometric advances to the assessment of medical knowledge. Specifically, the computer’s intelligence has improved efficiency by allowing the selection of questions that are targeted to the ability of particular examinees. Sequential testing and adaptive testing permit gains in efficiency and precision. (A toy sketch of this selection logic appears after this list.)
2. On the other hand, it has improved the assessment of higher cognitive abilities, including clinical reasoning, by permitting the use of interactive item formats that more closely simulate the types of judgments physicians need to make in practice. (See Chapter 6.)
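To make adaptive selection concrete, the short Python sketch below picks each next question by maximizing item information at the current ability estimate. It is purely illustrative: the item bank, its parameters, the step-based ability update, and the response rule are all invented for the example; operational computerized adaptive tests use large calibrated item banks with maximum-likelihood or Bayesian ability estimation.

import math

# Hypothetical item bank: each item has 2PL discrimination (a) and
# difficulty (b) parameters. All values are invented for illustration.
ITEM_BANK = [
    {"id": 1, "a": 1.2, "b": -1.0},
    {"id": 2, "a": 0.8, "b": 0.0},
    {"id": 3, "a": 1.5, "b": 0.5},
    {"id": 4, "a": 1.0, "b": 1.2},
]

def p_correct(theta, item):
    # 2PL probability that an examinee of ability theta answers correctly.
    return 1.0 / (1.0 + math.exp(-item["a"] * (theta - item["b"])))

def information(theta, item):
    # Fisher information of the item at theta (higher = more precision gained).
    p = p_correct(theta, item)
    return item["a"] ** 2 * p * (1.0 - p)

def next_item(theta, administered):
    # Adaptive step: choose the unused item that is most informative at theta.
    candidates = [i for i in ITEM_BANK if i["id"] not in administered]
    return max(candidates, key=lambda i: information(theta, i))

def update_theta(theta, correct, step=0.5):
    # Crude update: move up after a correct answer, down after an incorrect one.
    # (Real adaptive tests re-estimate theta from the full response pattern.)
    return theta + step if correct else theta - step

# Simulate a short adaptive test for an examinee whose true ability is 0.7.
theta_est, administered = 0.0, set()
for _ in range(3):
    item = next_item(theta_est, administered)
    administered.add(item["id"])
    correct = p_correct(0.7, item) > 0.5  # toy deterministic response rule
    theta_est = update_theta(theta_est, correct)
    print(f"item {item['id']}: correct={correct}, theta estimate={theta_est:+.1f}")

Because each item is administered where it is most informative, fewer items are needed to reach a given level of score precision; this is the mechanism behind the efficiency gains described above.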
Although the impact of technology on assessment of clinical skills has been slower to develop, advances in simulation and computer technology have led to the development of approaches and tools that recreate aspects of the clinical encounter with considerable fidelity. These methods have a growing impact on assessment, especially in the area of procedural skills, where mastery models are beginning to gain traction.45–48

Finally, technology, especially through smartphone and tablet applications, is beginning to change the way assessment data is obtained and processed. For example, tools designed for assessment through direct observation are increasingly being converted into smartphone applications.46,47 Learning management systems, increasingly used by programs, are also beginning to incorporate mobile apps into their platforms.49 These portable applications hold substantial promise to reduce the data collection burden while guiding the assessment activity of the faculty to attend to critical competencies.

Psychometrics

At the same time that the technology has improved, there have been significant advances in psychometrics, the basic science of assessment. Classical test theory, prominent from the turn of the 20th century, has gradually given way to measurement models based on strong assumptions about test items and examinees. The family of item response theory models now makes it possible to produce equivalent scores even when examinees take tests made up of different
questions.50 They also support the computer-based administration of examinations that are tailored to the ability level of individual test-takers; this allows tests to be shortened by as much as 40%.51 The ability to shorten tests has cost and validity implications; less test material exposure decreases the likelihood that future examinees are familiar with examination content.52 Generalizability theory makes it possible to identify how much error is associated with different facets of measurement (e.g., raters, patients).53 Based on this information, assessments can be prospectively designed to make the best use of resources, such as faculty time, while maintaining the reliability of the results.
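For readers who want the underlying notation, the two ideas just described can be written compactly. The following is a textbook illustration rather than a formula reproduced from this chapter: a two-parameter logistic item response model, and a one-facet (persons by raters) generalizability decomposition.

\[
P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}
\qquad
\sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}
\]

Here \(\theta\) is examinee ability, \(a_i\) and \(b_i\) are the discrimination and difficulty of item \(i\), and the observed score for a person rated by a rater is partitioned into person, rater, and residual (person-by-rater plus error) variance components. Estimating the relative size of these components is what allows an assessment to be designed prospectively, for example by trading more raters for fewer cases while holding overall reliability constant.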
In addition to these major developments, there have been a number of other advances. For example, there are a variety of systematic methods available for setting standards on tests and for identifying when test questions are biased against particular groups of examinees.2,54,55 Test development methods have gotten better, as have the means for judging whether particular items are working properly. Overall, these advances have improved both the quality and efficiency of assessment.

Qualitative Assessment and Group Process

Although advances in psychometrics have clearly helped to improve assessment in medical education and will remain a core science for assessment, many have noted limitations of the traditional psychometric approach in today’s complex clinical and educational environment.56 Often referred to as “qualitative” or “narrative” assessment, use of the written word has grown in importance. For example, many of the new smartphone apps contain natural language processing capability that allows for the capture of narrative assessment and feedback through dictation. Milestones, discussed in more detail later, are more robust narrative descriptors of stages of development, bringing both quantitative and qualitative aspects of measurement more closely together.48

Group process, commonly through entities called clinical competency committees, has also become an important part of the assessment process and programs. Effective group process can lead to better judgments around competence.57–59 Finally, qualitative research techniques have been shown to have value in judging aggregate assessment information, such as that contained within a portfolio (see Chapter 14). Again, a rigorous approach to application of qualitative research techniques and principles helps to enhance the reliability and validity of judgments.60–62

Framework for Assessment

As methods of assessment have proliferated, so has the need to use them efficiently and to combine them into a system of assessment. Developing, implementing, and sustaining effective systems for the assessment of clinical competence in medical school, residency, and fellowship programs require consideration of what competencies need to be assessed, how to best assess them, and the level of the trainee being assessed. Consequently, a three-dimensional framework for structuring an assessment system can help medical educators make better judgments about learner development. Along the first dimension are the competencies that need to be assessed, along the second is the level of assessment required, and along the third is the trainees’ stage of development.
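As a rough illustration of how such a three-dimensional blueprint might be represented, the Python sketch below maps a (competency, level of assessment, stage of development) cell to a candidate method. The entries are invented examples for illustration, not recommendations from this chapter.

# Toy blueprint: competency x level (Miller) x stage (Dreyfus) -> method.
# All mappings below are hypothetical examples.
BLUEPRINT = {
    ("medical knowledge", "knows", "novice"): "multiple-choice examination",
    ("communication", "shows how", "competence"): "standardized patient OSCE",
    ("patient care", "does", "proficiency"): "direct observation in clinic",
}

def suggest_method(competency: str, level: str, stage: str) -> str:
    """Look up a candidate assessment method for one cell of the blueprint."""
    return BLUEPRINT.get((competency, level, stage), "no method mapped yet")

print(suggest_method("communication", "shows how", "competence"))
# -> standardized patient OSCE

The point of the structure is simply that methods are chosen cell by cell rather than program-wide: the same competency may call for different methods at different levels and stages.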
tions of the traditional psychometric approach in today’s
complex clinical and educational environment.56 Often Dimension 1: Competencies
referred to as “qualitative” or “narrative” assessment, use
of the written word has grown in importance. For exam- As shown in Table 1.1, there are several schemes for describ-
ple, many of the new smartphone apps contain natural ing the knowledge, skills, and attributes of the physi-
language processing capability that allow for the capture cian.16–19 The CanMEDS model, which was developed and
of narrative assessment and feedback through dictation. recently updated by the Royal College of Physicians and
Milestones, discussed in more detail later, are more robust Surgeons in Canada, describes the competencies in terms
narrative descriptors of stages of development, bringing of the roles of a physician. Good Medical Practice, which
both quantitative and qualitative aspects of measurement was created by the General Medical Council in the United
more closely together.48 Kingdom, describes the elements of good practice. In the

TABLE
1.1 The Competencies of Physicians as Described by Four Organizations

CanMEDS GMC ACGME/ABMS IOM


Medical expert Good clinical care Medical knowledge Employ evidence-based practice
Communicator Maintaining good medical practice Interpersonal and communication skills Work in interdisciplinary teams
Collaborator Teaching and training Patient care Provide patient-centered care
Appraising and assessing
Leader Relationships with patients Professionalism —
Systems-based practice
Health advocate Working with colleagues Practice-based learning and improvement Apply quality improvement
Scholar Probity Systems-based practice Utilize informatics
Professional Health — —

ABMS, American Board of Medical Specialists; ACGME, Accreditation Council for Graduate Medical Education; CanMEDS, Canadian Medical Education Direc-
tions for Specialists; GMC, General Medical Council (UK); IOM, Institute of Medicine.
United States, two influential groups developed a set of core competencies. The Accreditation Council for Graduate Medical Education (ACGME) and the American Board of Medical Specialties (ABMS) adopted six general competencies in 2001. These competencies constitute the educational outcomes framework for residency and fellowship training, as well as maintenance of certification programs throughout a physician’s career in the United States. The Institute of Medicine (IOM) has recommended five core skills, or competencies, that create a framework for evaluating performance and stimulating the reform of education. They are intended to improve professional education and practice with a goal of enhancing the safety and quality of health care. Although there are some differences among the schemes, there is also significant overlap in these descriptions of a physician.

These competencies are intended as the first step in identifying key educational outcomes that should inform the learning objectives, assessment, and curriculum of graduate training programs, adapted to the content, education, and practice of the particular specialty/subspecialty. As we will see later, milestones and entrustable professional activities (EPAs) are concepts, specified and adapted by specialties, that can facilitate the implementation of competency-based programs. The data produced by the assessment of these competencies serve as a basis for judging the quality of the trainees and their training, as well as supporting the continuous improvement of both.

TABLE 1.1 The Competencies of Physicians as Described by Four Organizations

CanMEDS: Medical expert; Communicator; Collaborator; Leader; Health advocate; Scholar; Professional
GMC: Good clinical care; Maintaining good medical practice; Teaching and training, appraising and assessing; Relationships with patients; Working with colleagues; Probity; Health
ACGME/ABMS: Medical knowledge; Interpersonal and communication skills; Patient care; Professionalism; Systems-based practice; Practice-based learning and improvement
IOM: Employ evidence-based practice; Work in interdisciplinary teams; Provide patient-centered care; Apply quality improvement; Utilize informatics

ABMS, American Board of Medical Specialties; ACGME, Accreditation Council for Graduate Medical Education; CanMEDS, Canadian Medical Education Directions for Specialists; GMC, General Medical Council (UK); IOM, Institute of Medicine.

Dimension 2: Levels of Assessment

The multifaceted nature of the competencies makes it apparent that no single method could provide a sufficient basis for making judgments about students or residents. In an organized approach to this problem, Miller proposed a classification scheme that stratifies assessment methods based on what they require of the trainee. Often referred to as Miller’s pyramid, it is composed of four levels: knows, knows how, shows how, and does.63

Miller’s Pyramid

Knows. This is the lowest level of the pyramid and it contains methods that assess what a trainee “knows” in an area of competence. Forming the base of the pyramid, knowledge represents the foundation upon which clinical competence is built. An MCQ-based examination composed of questions focused on ethics and principles of patient confidentiality would provide an assessment of what a trainee “knows” about professionalism.

Knows how. To function as a physician, a good knowledge base is necessary but insufficient. It is important to know how to apply this knowledge in the acquisition of data, the analysis and interpretation of findings, and the development of management plans. For example, a method that poses a moral dilemma, asks trainees to reason through it, and evaluates the sophistication of their moral thinking would provide a “knows how” assessment of professionalism.

Shows how. Although trainees may know and know how, they may not be able to integrate these skills into a successful performance with patients. Consequently, certain assessment methods require the trainee to show how they perform with patients. For example, a standardized patient presenting with an ethical challenge would offer the trainee an opportunity to “show how” he or she would respond to a professionalism challenge.

Does. No matter how good traditional assessment methods become, there remains the concern that what happens in a controlled testing environment does not generalize directly or predict what happens in practice. The highest level of Miller’s pyramid therefore focuses on methods that provide an assessment of routine performance. For example, the development and use of a critical incident system, such as the one currently used in some medical schools, offers an assessment of what students actually do in terms of professionalism.

Miller’s pyramid is a useful framework for considering differences and similarities among assessment methods. However, the fact that it is a pyramid might imply to some that methods addressing the higher levels are better, or conversely that the larger area occupied by the base of the pyramid implies that knowledge assessment is most important. Instead, superior methods are those best aligned with the purpose of the assessment. For example, if an assessment of foundational medical knowledge is needed, a method associated with that level (e.g., multiple-choice questions) is likely better than a method associated with another level (e.g., standardized patients). Recently Cruess and colleagues argued to add “Is” to the top of the pyramid to recognize the importance of professional formation, but it is not yet clear where this fits into an assessment program.64

The Cambridge Model

As physicians near the end of training and enter practice, external forces come to play a very large role in performance. The Cambridge Model, a variation on Miller’s pyramid, proposes that performance in practice (the highest level of the pyramid) is influenced by two large forces beyond competence.65 Systems-related factors, such as government programs, clinical microsystems (i.e., the clinical units where learners care for patients), institutional care delivery practices, patient expectations, and guidelines, among other factors, strongly influence what physicians do. Similarly, factors related to the individual physician such as state of mind, physical and mental health, and relationships with peers and family have a significant effect. Consequently, assessment becomes more difficult because it is harder to disentangle the effects of the context (e.g., context specificity; see Chapter 7) of care from the competence of the individual physician. Here, a focus on health care processes and outcomes as a measure of what a physician “does” can provide a robust assessment of a physician’s ability to integrate multiple competencies within a complex social context. However, processes and outcomes are still impacted by
CHAPTER 1 Assessment Challenges in the Era of Outcomes-Based Education 7

impact the measurement of processes of care. Finally, availability of specific services may also impact outcomes.

Dimension 3: Assessment of Progression

Acquiring competence is not an overnight process. Trainees progress through a series of stages that begin in undergraduate medical education and continue throughout their careers. Educators must be able to recognize when a trainee has attained sufficient knowledge, skills, and attitudes to enter the next stage, and this requires appropriate standards and benchmarks for the transition. Hubert and Stuart Dreyfus have created a developmental model of learning applicable to the health professions that proposes five stages of educational development (Table 1.2).17,66

The characteristics of learners and the steps they must go through to acquire competence will change over the five stages of development. Necessarily, the methods of assessment applied at each developmental level will likely also evolve. For example, at the level of the novice, an MCQ-based knowledge test might be most appropriate, but a standardized patient–based examination might be better suited to learners who are in the competence or proficiency stages. It is important to realize that learners are typically at different stages depending on the content and context of the task being assessed. For example, a resident may be seen as proficient in working up a patient with chest pain but be at the advanced beginner level in counseling a patient regarding end-of-life care. Likewise, many students achieve competence with regard to medical knowledge, or perhaps communication skills, before they acquire the same level in more challenging systems-based practice domains such as care coordination or cost-conscious care delivery. Ultimately, work-based assessment will need to predominate, especially for ongoing professional development in both training and practice. Educators need to recognize this developmental sequence when designing an assessment system, and it will be critical to ensure that the chosen method is suitable to the task.

TABLE 1.2 The Stages of Learning as Proposed by Dreyfus

1. Novice
   Method of learning (teaching style): Instruction (instructor); breaks skill into context-free, discrete tasks, concepts, and rules.
   Learning steps: Recognizes the context-free features; knows rules for determining actions based on these features.
   Learner characteristics: Learning occurs in a detached, analytic frame of mind.

2. Advanced beginner
   Method of learning (teaching style): Practice (coach); experiences coping with real situations; points out new aspects of material; teaches rules and reasoning techniques for action.
   Learning steps: Recognizes relevant aspects based on experience that makes sense of the material; learns maxims about actions based on new material.
   Learner characteristics: Learning occurs in a detached, analytic frame of mind.

3. Competence
   Method of learning (teaching style): Apprenticeship (facilitator); develops a plan or chooses a perspective that separates "important" from "ignored" elements; demonstrates that rules and reasoning techniques for choosing are difficult to come by; role models are also emotionally involved in making decisions.
   Learning steps: Volume of aspects is overwhelming; performance is exhausting; sense of what's important is lacking; stands alone making correct and incorrect choices; coping becomes frightening, discouraging, elating.
   Learner characteristics: Learner is emotionally involved in the task and its outcome; too many subtle differences for rules, so the student must decide in each case; makes a mistake, then feels remorse; succeeds, then feels elated; emotional learning builds competence.

4. Proficiency
   Method of learning (teaching style): Apprenticeship (supervisor); gains more specific experience with outcomes of one's decisions; applies rules and maxims to decide what to do.
   Learning steps: Rules and principles are replaced by situational discrimination; emotional responses to success or failure build intuitive responses that replace reasoned ones.
   Learner characteristics: Learner immediately sees the goal and salient features; learner reasons how to get to the goal by applying rules and principles.

5. Expertise
   Method of learning (teaching style): Independence (mentor); experiences multiple, small random variations; observes other experts or experiences nonrandom simulations.
   Learning steps: Gains experience with increasingly subtle variations in situations; automatically distinguishes situations requiring one response from those requiring another; working through the cases must emotionally matter.
   Learner characteristics: Immediately sees the goal and what must be done to achieve it; builds on previous learning experiences.

From Dreyfus HL: On the Internet. Thinking in Action Series. New York, Routledge, 2001.
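The idea that method choice should track developmental stage can be made concrete with a small sketch. This is purely illustrative: the stage names follow Table 1.2, the novice and competence/proficiency pairings echo the examples given in this section (an MCQ-based test for the novice, a standardized patient–based examination for the middle stages), and the remaining entries are hypothetical placeholders rather than a validated taxonomy.

```python
# Illustrative pairing of Dreyfus stages with plausible assessment methods.
# Only the novice and competence/proficiency pairings come from the text;
# the rest are hypothetical placeholders.
SUITABLE_METHODS = {
    "novice": ["MCQ-based knowledge test"],
    "advanced beginner": ["MCQ-based knowledge test", "case-based discussion"],
    "competence": ["standardized patient-based examination"],
    "proficiency": ["standardized patient-based examination", "work-based assessment"],
    "expertise": ["work-based assessment"],
}

def suggest_methods(stage: str) -> list[str]:
    """Return illustrative assessment methods suited to a developmental stage."""
    try:
        return SUITABLE_METHODS[stage]
    except KeyError:
        raise ValueError(f"Unknown stage: {stage!r}") from None

print(suggest_methods("novice"))      # ['MCQ-based knowledge test']
print(suggest_methods("competence"))  # ['standardized patient-based examination']
```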
Criteria for Choosing a Method

Decisions about which method of assessment to use in a particular circumstance have traditionally rested on validity and reliability. Validity is the degree to which the inferences based on the results of an assessment are correct. Valid inferences regarding a particular test score or assessment result depend on the reliability of these outcomes, and reliability is a component in more "modern" concepts of validity, such as those of Kane and Messick discussed in Chapter 2 (which provides more detail about validity and psychometric theory).

For purposes of assessment in medical education, van der Vleuten added educational effect, feasibility, and acceptability as factors to be considered in choosing a method of assessment. This combination of factors is often referred to as the utility index and represented by the equation Validity × Reliability × Educational Effect × Cost Effectiveness × Acceptability = Utility.67 Utility is a useful concept as programs choose and implement assessment methods. It is also important to note that utility is a multiplicative construct: if any one of the terms, or variables, is zero, then utility by definition is zero.
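Because the factors multiply rather than add, a single worthless factor cannot be offset by strength elsewhere. A minimal arithmetic sketch makes this explicit; the factor values below are invented solely to illustrate the point, since none of these quantities is truly measurable on a 0-to-1 scale in practice.

```python
# Illustrative only: the utility index multiplies its factors, so a zero in
# any one factor zeroes the whole product. All values below are invented.
def utility(validity, reliability, educational_effect, cost_effectiveness, acceptability):
    return validity * reliability * educational_effect * cost_effectiveness * acceptability

print(utility(0.9, 0.8, 0.7, 0.6, 0.5))  # ~0.15: moderate overall utility
print(utility(0.9, 0.8, 0.7, 0.6, 0.0))  # 0.0: a wholly unacceptable method has no utility
```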
In terms of educational effect, van der Vleuten and Schuwirth argue that trainees will work hard in preparation for an assessment.61 Consequently, the method should direct them to study in the most relevant way. For example, if an educational objective is for trainees to know the differential diagnoses for a particular chief complaint, then assessment using extended matching questions will likely induce better learning than assessment based on standardized patients.

Feasibility is the extent to which an assessment method is affordable and efficient. Although high-fidelity simulations might be a good way to assess procedural competence, the use of a method such as direct observation of procedural skills (DOPS), which is based on faculty observation, is likely to be more feasible in most graduate training settings.68

Acceptability is the degree to which the trainees and faculty believe that the method produces valid results. This factor will influence the motivation of faculty to use the method and enhance the trainees' trust of the results. It is important that educational leaders not underestimate trainee knowledge and understanding of assessment and their ability to participate in decisions regarding assessment practices.

More recently, an international group of assessment experts led by Norcini updated the concept of utility.69 Validity, acceptability, and educational effect were retained as separate categories, and for validity the importance of coherence (a body of evidence that hangs together to support the results for a specific purpose) was highlighted. Reliability was essentially split into two new categories: reproducibility and consistency (i.e., repeatability) and equivalence (assessment yields equivalent results across space and time). Catalytic effect was added to highlight the important role of assessment in driving future learning forward (through assessment data and feedback). Finally, the last new category was feasibility, namely, that assessment should be practical, realistic, and sensible.69

In addition to the factors highlighted in these two versions of criteria for good assessment, it is important to consider how a particular method fits into the overall system for assessment. The same method can (and arguably should) be used to assess more than one competency. For example, peer assessment can provide a measure of both professionalism and interpersonal skills. Likewise, two different methods can be used to capture information on the same competency, thereby increasing confidence in the results. For example, patient care can be assessed using both the mini-CEX (clinical evaluation exercise) and monthly ratings by attending physicians.

Educational effect, catalytic effect, feasibility, and acceptability are not easily quantifiable, nor is the relationship among methods of assessment in a system. However, these factors, plus reliability and validity, should be weighed interactively when considering selection of a particular method.

Elements of Effective Faculty Development

Faculty members play a particularly critical role in assessment in the clinical setting because it is often based on observation. And by faculty we mean, at a minimum, any health professional who participates in an assessment system. Recall that Miller placed "does," meaning the care of actual patients, at the tip of the pyramid. Envision the pyramid as a spear: at the tip of that spear are patients. Using this metaphor helps faculty appreciate the central role of observation in both ensuring trainee competence (at a minimum) and guaranteeing that patients receive high-quality, safe care in the context of training.32 Most important is the fact that the actual measurement instrument is the faculty member, not the assessment tool. We cannot emphasize enough, and will repeat throughout the book, that assessment in the workplace is essential and relies on informed, expert judgment.

Assessment methods and tools are only as good as the individuals using them. Although there has been substantial progress in creating many new methods and tools, significantly less attention has been paid to the development of approaches to training faculty in how to use them most effectively. This omission continues to occur despite repeated studies over time demonstrating significant problems with the quality of faculty assessments.31,70–72 Chapter 4 will cover in greater detail key issues in observation and rater cognition. There are three significant reasons faculty training is urgently needed.

First, to perform quality assessment, faculty members must possess sufficient knowledge, skill, and attitudes in the competency targeted by the assessment. For example, the decline of clinical skills teaching in the workplace was noted by George Engel73 in 1976 and has resulted in many of today's educators failing to acquire the high level of clinical skills needed for effective care and teaching. This likely limits the degree to which they can validly assess clinical performance, and recent research adds evidence to the importance of the faculty's own underlying clinical skills.31

Second, competencies will evolve and change over time. Witness the birth of the competencies of practice-based learning and improvement and systems-based practice, and
more recently the change of the manager role in CanMEDS to leader.74 The majority of today's faculty never received, during their own training, any formal instruction in many of the competencies and subcompetencies now needed for modern practice. Many faculty learn new knowledge and skills alongside their trainees.33

Finally, assessment is a core tenet of professionalism for medical educators. Too often, faculty members view it as someone else's job, especially when a negative performance appraisal is involved (see Chapter 15). Faculty development reinforces the importance of assessment and provides medical educators the opportunity to develop common standards for performance.

To make effective use of the methods of assessment, educational institutions must commit the necessary resources for faculty development. However, too often faculty development translates into a single project or a brief workshop. If faculty development is to be truly successful, medical educators need to embrace new strategies that embed faculty development in real-time teaching and clinical activities. For example, Hemmer and colleagues embed faculty frame-of-reference training into formal evaluation sessions for students.75 Faculty development, like quality improvement and maintenance of competence, must become a continuous process and be appropriately rewarded. As noted earlier, the quality and safety of patient care depend on it.

Medical educators must also end their quest for the holy grail of assessment: the perfect rating form imbued with special powers to solve all measurement needs. Assessment is hard work and requires a multifaceted approach. Landy and Farr, in a landmark article in the performance appraisal field over 35 years ago, pleaded with researchers to redirect development efforts from the search for the perfect rating form to training the assessors.76 Researchers in this field subsequently developed numerous rater training approaches that can lead to better assessments. Chapter 4 provides guidance on a number of practical faculty training methods.

Milestones and EPAs, described later, require special consideration. Using EPAs and milestones for curriculum development and assessment requires a shift in thinking by faculty and an infrastructure to support new assessment practices. Both individual faculty and committees must get acquainted with, and become experienced in, entrustment decision making for EPAs and its conditions.77 Training in the dimensions to be used in assessment and in the criteria for decisions is needed, and specific tools related to EPA-based assessment, such as video recording, are now being developed. Above all, sufficient and adequate supervision and feedback are key to entrustment decisions, and this requires longitudinal mentorship.78,79 This does not necessarily mean huge investments of time in mentoring, but it does mean efficient use of any encounter that mentors and mentees have, for the benefit of learning. Group process will also likely enhance the effectiveness of milestones and EPAs as part of an assessment system, and faculty will need training in effective group process.57

Overview of Assessment Methods

Traditional Measures

Traditional measures will continue to play an important role in the assessment of clinical proficiency (see Chapters 5 and 6). Specifically, written methods such as MCQs, along with standardized patient examinations, will be foundational components of assessment programs for the near future, especially in undergraduate medical education. All of these methods can be improved, and work on each must continue.

Methods Based on Observation

Even though assessment has been woven through the basic science curriculum, historically it has not been as well integrated with clinical education (see Chapters 3, 4, 7, and 11). Nonetheless, assessment methods based on the observation of routine encounters in the clinical setting offer a rich and feasible target for assessment. Continued refinement of the methods themselves is needed, as is faculty development, which is a key to their successful use. Furthermore, the opportunity for educational feedback as part of these methods is probably as important as their assessment potential.

Simulation

Improvements in technology have spurred the development of a series of simulators that recreate reality with high fidelity (see Chapters 5 and 12). The use of simulation in assessment is growing, but much of the technology remains expensive, and several developments are needed before widespread adoption and use. Researchers will need to continue to focus on identifying appropriate scoring methods, optimizing the generalizability of scores, and ensuring their relevance to performance in practice.80 Particularly in the area of procedural skills, however, these methods offer the ability to test under a variety of conditions without concern for harm to patients. Some evidence is accruing that mastery-based approaches combined with simulation-based deliberate practice can translate into improved patient care and outcomes.45,81,82 Educators will confront difficult decisions requiring them to balance the cost and variable fidelity of individual simulation methods against the potential risks to patients (and trainees) in deciding how best to assess procedural skills.83

Work

The assessment of physicians' performance at work (mostly the "does" level of Miller's pyramid) is the area of assessment undergoing the most change and development (see Chapters 3, 4, 7, 8, 9, 10, 11, and 14). Although learners may try to "perform" when under direct observation ("shows how"), most learners acclimate quickly, and even if what the faculty observe is "best behavior," there is still much utility in the
assessment and in ensuring the patient receives safe, effective, patient-centered care.31 The day-to-day performance of physicians is being used increasingly in the settings of continuous quality improvement and physician accountability. Assessment in this context is a matter of identifying the basis for the judgments (e.g., outcomes, process of care), deciding how the data will be gathered, and avoiding threats to validity and reliability (e.g., patient mix, patient complexity, attribution, and numbers of patients).84 The patient is also playing a much greater role in work-based assessment, predominantly through patient experience surveys.85 In addition, patient-reported outcome measures (PROMs) are being increasingly used by health systems to judge functional outcomes for patients (see Chapter 10). Although substantial research is now occurring in quality and safety measures, patient experience surveys, and PROMs, much work remains to be done. However, given that this is ultimately what patients and the public most care about, educational programs need to embrace work-based assessments as part of an overall assessment program.

New Directions in Assessment

Implementation of competency-based medical education models has been very challenging for many programs across the educational continuum.11,29 One reason has been the difficulty of translating the language and concepts of competencies into educational practices and assessments. As a result, two new approaches, milestones and EPAs, have arisen and continue to evolve as mechanisms to potentially facilitate more effective implementation of outcomes-based education using competency frameworks. Although both of these newer approaches are grounded in robust educational theory, it is important for the reader to recognize that we are in the very early days of determining the utility, including validity, and the impact of both milestones and EPAs on educational and clinical outcomes. Although early research is encouraging, much work remains to be done. However, given that both milestones and EPAs are becoming part of national systems of assessment,29 we provide some background in this chapter to help guide the reader in evaluating and exploring these concepts for their own assessment program.

Milestones

The ACGME competency framework has always been inspired by the five "Dreyfus stages of development of skill" (novice, advanced beginner, competent, proficient, and expert), first described in 1986,17,26 but it was only several years later that superimposing the stages on the framework as milestones was suggested.26 Milestones were adopted to facilitate the assessment of learners in the workplace and to facilitate curricular change.86 They are concrete behavioral descriptions, aligned with the five developmental steps, that assist faculty in the assessment of medical trainees using a logical trajectory of professional development within competencies and subcompetencies. Developed as benchmarks for effective assessment, ACGME milestones were written for all US postgraduate medical disciplines and published in the Journal of Graduate Medical Education in March 2013 and March 2014.87 Specialty milestones are the framework programs use for semiannual reports on resident progress. Table 1.3 shows, as an example, one of the 21 milestone sets of the pediatric competencies.88 By 2014, all specialties had described milestones for their programs,87 and every US resident must now be regularly evaluated against all competencies of the specialty using these milestones. Early research using national data for all emergency medicine and internal medicine programs in the United States demonstrates encouraging findings for some aspects of validity.89,90 Milestones also have been reported to be helpful for earlier identification of residents having difficulty, for better feedback to residents and fellows, and for development of better assessment approaches, and to be a useful framework for faculty development.91

TABLE 1.3 Example of ACGME Milestone Descriptions with One of the 21 Competencies of Pediatric Training

Competency: Demonstrate humanism, compassion, integrity, and respect for others, based on the characteristics of an empathetic practitioner.

Level 1: Sees the patients in a "we versus they" framework and is detached and not sensitive to the human needs of the patient and family.
Level 2: Demonstrates compassion for patients in selected situations (e.g., tragic circumstances, such as unexpected death) but has a pattern of conduct that demonstrates a lack of sensitivity to many of the needs of others.
Level 3: Demonstrates consistent understanding of patient- and family-expressed needs and has a desire to meet those needs on a regular basis; is responsive in demonstrating kindness and compassion.
Level 4: Is altruistic and goes beyond responding to expressed needs of patients and families; anticipates the human needs of patients and families and works to meet those needs as part of skills in daily practice.
Level 5: Is a proactive advocate on behalf of individual patients, families, and groups of children in need.

ACGME, Accreditation Council for Graduate Medical Education.
From Carraccio C, Benson B, Burke A, et al: Pediatrics milestones. J Grad Med Educ. 2013;5(1 Suppl 1):59–73.
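Milestone reporting lends itself to a simple record structure. The sketch below is only an illustration of the idea: the field names are hypothetical, and the whole-number five-level scale simply mirrors the column headings of Table 1.3.

```python
from dataclasses import dataclass

@dataclass
class MilestoneRating:
    """One subcompetency rating in a semiannual milestone report (illustrative only)."""
    subcompetency: str
    level: int  # 1 through 5, mirroring the Table 1.3 column headings

    def __post_init__(self) -> None:
        if not 1 <= self.level <= 5:
            raise ValueError("milestone level must be between 1 and 5")

report = [MilestoneRating("Humanism, compassion, integrity, and respect for others", 3)]
print(report[0].level)  # 3
```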
In the 2015 edition of CanMEDS, milestones were also introduced, defined as "descriptions of the abilities expected of a trainee or physician at a defined stage of professional development" for each of the "enabling competencies" under the seven CanMEDS competency roles, to guide learners and educators in determining whether learners are "on track."15

Entrustable Professional Activities

The concept of EPAs was introduced in 2005.92 Following a publication in Academic Medicine in 2007,93 it has attracted substantial attention among postgraduate programs in the United States, Canada, and other countries. Since then, EPAs have been proposed for numerous programs and constitute an emerging basis for judging readiness for residency entry in the United States and Canada.94 The most recent elaborate description of the concept, and of how to use it in workplace training and assessment, can be found in Guide 99 of the Association for Medical Education in Europe.49

An EPA can be defined as a unit of professional practice that can be fully entrusted to a trainee as soon as he or she has demonstrated the necessary competence to execute this activity unsupervised. In contrast with competencies, EPAs are not a quality of a trainee but a part of work that must be done. Table 1.3 shows a typical competency, not related to a specific task, whereas an EPA would be a concrete task that requires that and often other competencies. More specifically defined, EPAs are part of essential professional work in a given context: they must require adequate knowledge, skill, and attitude, generally acquired through training; must lead to recognized output of professional labor; should usually be confined to qualified personnel; should be independently executable; should be executable within a time frame; should be observable and measurable in the trainee's process and outcome, leading to a conclusion ("well done" or "not well done"); and should reflect one or more of the competencies to be acquired (see Appendix 1.1).92

Much of the work done in health care can be captured by tasks or responsibilities that must be entrusted to individuals. EPAs require a practitioner to possess and integrate multiple competencies simultaneously from several domains, such as content expertise, skills in collaboration, communication, management, and so forth. Conversely, each domain of competence is relevant to many different activities. Combining competencies (or domains of competence) and EPAs in a matrix reveals which competencies in particular a trainee must achieve before being trusted to perform an EPA.94 The two-dimensional matrix in Table 1.4 provides specifications that are helpful for assessment and feedback, for individual development, and for grounding entrustment decisions. This makes assessment based on EPAs a holistic or synthetic approach, rather than an attempt to evaluate competencies analyzed in great detail as stand-alone qualities of learners.95 EPAs are not an alternative to competencies; they constitute a different dimension, with the purpose of grounding competencies in clinical practice.

TABLE 1.4 Overview of EPAs–Competencies Matrix
(Columns run across EPA 1 through EPA 6; a ● marks a competency that is essential to a given EPA.)

Competency 1: ● ● ● ●
Competency 2: ● ● ●
Competency 3: ● ● ● ●
Competency 4: ● ●
Competency 5: ● ● ● ● ●
Competency 6: ●
Competency 7: ● ● ●

EPA, Entrustable professional activity.
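Such a matrix maps naturally onto a simple data structure in an assessment information system. The sketch below is a minimal illustration, not a prescribed implementation: the first EPA and its three CanMEDS roles come from the uncomplicated-delivery example in this section, while the second entry and its domain mapping are hypothetical placeholders.

```python
# Minimal sketch of an EPAs-competencies matrix (cf. Table 1.4).
# The delivery EPA and its three CanMEDS roles follow the example in the
# text; the second entry's mapping is a hypothetical placeholder.
EPA_MATRIX: dict[str, set[str]] = {
    "conducting an uncomplicated delivery": {"medical expert", "communicator", "collaborator"},
    "providing preoperative assessment": {"medical expert", "communicator"},
}

def required_competencies(epa: str) -> set[str]:
    """Competency domains a trainee must achieve before being trusted with this EPA."""
    return EPA_MATRIX[epa]

print(sorted(required_competencies("conducting an uncomplicated delivery")))
# ['collaborator', 'communicator', 'medical expert']
```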
EPAs have now been identified for many graduate medical education programs, including obstetrics/gynecology, pediatrics, internal medicine, family medicine, psychiatry, hematology and oncology, and pulmonary and critical care.96–102 An example of an EPA is conducting an uncomplicated delivery. This activity, performed by family physicians and obstetrics-gynecology specialists, needs to be entrusted to a trainee at some point in his or her training, as the trainee eventually will need to conduct it without supervision. It requires specific knowledge, skills, and behaviors; proficiency is acquired through training; and it is directly observable and reflects competencies. As this activity particularly reflects the CanMEDS roles of medical expert, communicator, and collaborator, it exemplifies how EPAs integrate competencies. Other examples of EPAs are providing preoperative assessment, managing care of patients with acute common diseases across multiple care settings, providing palliative care, managing common infections in nonimmunosuppressed and immune-compromised populations, conducting a family education session about schizophrenia, conducting a risk assessment, serving as the primary admitting pediatrician for previously well children suffering from common acute problems, pharmacologically managing an anxiety disorder, providing end-of-life care for older adults, and offering office-based counseling in developmental and behavioral pediatrics. A comprehensive set of EPAs should cover the core of a profession. Each EPA should be well described and include, next to an informative title, specifications and limitations; a listing of required competencies; elaboration of required experience, knowledge, and skills; suggestions for assessment; and an expiration date showing when the practitioner should no longer be assumed to be competent in the EPA after a period of non-practice.94 See Appendix 1.2.

Linked to the EPA construct is the purpose of entrustment decision making. This process serves to acknowledge ability, provide permission to act with limited supervision, and enable duties in health care practice. True competency-based medical education grants certification as soon as competence is adequately demonstrated, irrespective of the time in training, and this requires a personalized and flexible approach to training programs. EPAs allow for making entrustment decisions for separate units of professional practice, resulting in a more gradual, legitimate participation in professional communities of practice103 rather than a full license to practice on the last day of training.94 Certification for EPAs is not a dichotomous process. As trust increases, the level of supervision can decrease. A model of five levels of supervision, entrustment, and permission has been proposed for postgraduate training, shown in Box 1.2.93,104

• BOX 1.2 Five Levels of Supervision and Permission

1. Be present and observe, but not permitted to perform the entrustable professional activity (EPA).
2. Be permitted to act under direct, proactive supervision, present in the room.
3. Be permitted to act under indirect, reactive supervision, readily available to enter the room.
4. Be permitted to act without qualified supervision in the vicinity, with distant supervision or clinical oversight; basically acting unsupervised.
5. Be permitted to supervise junior trainees regarding the EPA.

Combining Milestones and Entrustable Professional Activities

Although the implementation of milestones and EPAs, on top of competencies, may feel to critics like another burden for programs and individual teachers,105 some authors have suggested combining the two. Eric Warm, program director of the University of Cincinnati internal medicine residency training program, has simply equated the five milestone levels of competencies (see Table 1.3) with the five supervision levels of EPAs (Box 1.2). Faced with the need to regularly report on milestones for all residents, he asks clinicians to estimate the trainees' readiness for direct supervision, indirect supervision, or unsupervised practice. This serves efficiency and conceptual elegance. Taking this approach one step further, the Dreyfus model,66 the broadly used RIME model (reporter-interpreter-manager-educator106 [see Chapter 3]), the milestones approach,107 and levels of supervision can all be aligned, as shown in Fig. 1.1.

• Fig. 1.1 Using milestones to determine an appropriate level of supervision for an entrustable professional activity (EPA). [The figure maps an example EPA, "Provide telephone advice and management of patients," against the six ACGME competency domains and milestone levels 1 through 5, with supervision levels ranging from "observe only" through direct and indirect supervision to oversight and "aspirational/provide supervision."]

The model can be extended with more detailed representations of behavior and supervision,66,108 but the core idea is that of alignment of frameworks. Moving from milestone or supervision level 3 to 4 can be viewed as passing the threshold that allows for clinical oversight only. It does not qualify a trainee to stop developing but would allow for a formal recognition of ability, permission, and duty to enact the EPA, sometimes called a Statement of Awarded Responsibility (STAR)93 or a summative entrustment decision (Table 1.5).

TABLE 1.5 Alignment of Various Models of Development

Milestone level 1. Dreyfus stage: Novice. Learner behavior: doing what is told, rule driven. RIME stage: Reporter. Transition to practitioner: introduction to clinical practice. Supervision and permission: observation, no enactment.
Milestone level 2. Dreyfus stage: Advanced beginner. Learner behavior: comprehension. RIME stage: Reporter/Interpreter. Transition to practitioner: guided clinical practice. Supervision and permission: act under direct, proactive supervision.
Milestone level 3. Dreyfus stage: Competent. Learner behavior: application to common practice. RIME stage: Interpreter/Manager. Transition to practitioner: early independence. Supervision and permission: act under indirect, reactive supervision.
Milestone level 4. Dreyfus stage: Proficient. Learner behavior: application to uncommon practice. RIME stage: Manager/Educator. Transition to practitioner: full unsupervised practice. Supervision and permission: clinical oversight.
Milestone level 5. Dreyfus stage: Expert. Learner behavior: experienced clinician. RIME stage: Educator. Transition to practitioner: aspirational growth after graduation. Supervision and permission: provide supervision to others.

Given this alignment, an example may be given. Suppose a pediatric residency program has an EPA called "Provide telephone advice and management of patients" (taken from Jones and colleagues109). In the EPAs–competencies matrix, it has been determined that the most important domains of competence are medical knowledge, interpersonal and communication skills, and practice-based learning and improvement. Let us assume that for each of these domains, milestones have been described. A trainee must be assessed to determine whether indirect supervision (i.e., without a supervisor present in the room) is justified. If the trainee meets the expected behavior at milestone level 3 in all of
the three most relevant domains, that decision seems justified. If the trainee does not yet show the behavior or skill expected at level 3 in any one of the competencies, closer supervision will be required. In the terminology of the RIME model, the learner would be evaluated as an adequate interpreter and beginning manager. Table 1.5 shows this relationship.49
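The decision rule in this example, indirect supervision once milestone level 3 is reached in every domain the matrix flags as most relevant, can be written out in a few lines. The sketch below is illustrative only: the domain names follow the telephone-advice example, the ratings are invented, and no program would reduce an entrustment decision to a single automated check.

```python
# Illustrative check, not a real decision aid: is indirect supervision
# justified for the telephone-advice EPA? The threshold follows the
# example in the text (milestone level >= 3 in every relevant domain).
RELEVANT_DOMAINS = [
    "medical knowledge",
    "interpersonal and communication skills",
    "practice-based learning and improvement",
]

def indirect_supervision_justified(milestone_levels: dict[str, int]) -> bool:
    return all(milestone_levels.get(domain, 0) >= 3 for domain in RELEVANT_DOMAINS)

trainee = {  # hypothetical ratings on the five-level milestone scale
    "medical knowledge": 3,
    "interpersonal and communication skills": 4,
    "practice-based learning and improvement": 2,
}
print(indirect_supervision_justified(trainee))  # False: one domain is still below level 3
```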
The model can also be used in reverse order. Clinical educators may start with an intuitive gut feeling that a trainee is ready for indirect supervision, based on his or her experience in various settings. A quick check of the important competency domains may then confirm this, supporting the conclusion that the trainee meets milestone level 3 of the relevant competencies. Warm and colleagues reported success in organizing regular assessments of all internal medicine residents of a large Cincinnati program by scoring on an entrustment–supervision scale, assuming alignment with milestone scales.110

Entrustable Professional Activities – Competencies – Skills

Although EPAs are units of work and competencies are descriptors of personal qualities and abilities, in common language educators tend to call "physical examination" a competency. Strictly speaking, it is not the physical examination itself but the ability to perform a physical examination that is the skill, and that ability is a feature of the learner or professional. It would also be correct to say that that skill, on a more detailed level, requires manual skills, visual skills, auditory skills, and even time management and communication skills. If a learner possesses these skills, or competencies if one would call them such, he or she may be granted the trust to do the physical examination without supervision. Simply put, health professionals require an integrated set of abilities (i.e., competencies) to effectively execute the clinical activity (i.e., the EPA).

Entrustable Professional Activities Across the Continuum and Nested Entrustable Professional Activities

Activities can be small or large. There is no easy answer to the "right" breadth of EPAs and consequently to the number of EPAs. If the question is "What is the scope of responsibility that is covered when an EPA is entrusted to a trainee for indirect supervision?" then clearly big differences can arise depending on the stage of training of the trainee in question. The first EPA that may be entrusted to a junior medical student could be "measuring blood pressure." If we consider this a unit of professional practice or activity that one can trust a trainee to complete without being checked by a supervisor, then it is a true EPA (Fig. 1.2).

• Fig. 1.2 Entrustable professional activities (EPAs) nested in other EPAs. [The figure shows successively broader nested EPAs: a medical student measuring and reporting blood pressure; an intern completing and reporting a physical and history; a junior resident managing an uncomplicated ambulant patient; and a senior resident running a regular outpatient clinic.] Modified from Ten Cate O, Chen HC, Hoff RG, et al: Curriculum development for the workplace using Entrustable Professional Activities (EPAs): AMEE Guide No. 99. Med Teach. 2015;37(11):983–1002.

Clearly, however, at a later stage this responsibility is part of a full standard physical examination, which is a more logical activity for entrustment for advanced medical students. The full standard physical examination, in turn, can be included in a broader EPA of a standard outpatient consultation that also includes the history. In technical terminology, smaller EPAs are nested within larger EPAs.49

Among the Utrecht University undergraduate EPAs, one is "the clinical consultation," to be entrusted to every medical graduate before graduation for indirect supervision. This is a relatively broad EPA, as it requires neurologic, ENT, gynecologic, psychiatric, and other history and physical examination skills. In the Utrecht curriculum, students are entrusted with a focused "ENT clinical consultation" at an earlier stage, and likewise for other specialties during a dedicated clerkship. Only in the final year do all these smaller EPAs lead to full trust in the broad EPA of "the clinical consultation," to be signed off separately in a subinternship for indirect supervision.

For EPA-based evaluation, it is therefore adequate to design EPAs for a particular course within the educational continuum, for example, EPAs for undergraduate education to be mastered before entering residency,111 EPAs for the end of training in geriatrics,112 or EPAs for a fellowship.113 This does not mean that the EPAs are only mastered at the end of that training period. Indeed, key to competency-based training is that EPAs may be mastered, and the trainee awarded a decrease in supervision and an increase in autonomy, as soon as he or she demonstrates the required competence.

Entrustment Decision Making as Assessment

Evaluating trainees with a focus on EPAs potentially has the benefit that it aligns with the daily practice of clinician thinking, much of which naturally focuses on whether clinical activities are carried out well, both trainees' own activities and those of others. Patient satisfaction and the rewarding experience of success in diagnostic reasoning and therapeutic actions are important drivers for monitoring competence and quality of care.114 From this, a series of recommendations can be derived.

First focus on EPA execution, and then look at competencies. The primary focus of EPA-based assessment is whether a job is being done well. In many cases, there is little need to make detailed evaluations of all competencies involved. When learners do not perform optimally, an analysis of the reasons and causes is most useful. The EPAs–competencies matrix and milestone descriptions may help to identify weaknesses and coach learners to improve.

Distinguish three benchmarks, or frames of reference, for assessment. Educators often struggle in assessing because of a lack of clear standards, benchmarks, or an appropriate frame of reference (see Chapters 3 and 4). Often educators struggle using criterion-referenced judgments (e.g., appropriate and effective care of the patient with multiple chronic conditions) versus norm-referenced judgments: For example, is Robert as good as Jane at this stage of training? Is a learner scored higher because she made great progress after a previous observation? Or because she exerted extraordinary effort? Or because the learner performs as well as you do as a faculty member (normative and self as frame of reference)? There are basically three benchmarks for assessment and feedback provision: (1) comparison with standards of professional practice (what is the expected performance?), (2) comparison with other trainees (how well do others do at similar stages of training?), and (3) comparison with past development (how did the trainee progress since last time?). Naturally, in competency-based training, the first must eventually be the benchmark for certification, but learners are usually evaluated with a mix of benchmarks. Assessors should be clear in expressing how the benchmarks are defined, based on the context and purpose of assessment, and those being assessed should be aware of the expectations for their performance.

Frame the assessment as a developmental entrustment decision. Trusting a learner to work unsupervised, even if only occasionally, requires a broader view than simply measuring a skill in an examination. The question in the back of one's head is: Would I trust this trainee to execute this EPA tomorrow morning with a critical patient (or maybe a relative of yours) without a qualified professional present? This question about trust includes features of the trainee other than knowledge and skill, such as his or her discernment of his or her own limitations, willingness to ask for help if needed, conscientiousness in carrying out clinical tasks, and truthfulness in communications to staff.115 Although leniency bias is a known and common problem of workplace-based assessments,116,117 cautious entrustment decisions could lead to a stringency bias, but this has not been documented in the literature.

Core to the definition of trust is the "acceptance of risk and vulnerability, based on a positive expectation of the intentions and behavior of the other."118 Trust in clinical trainees is an area of investigation that is sure to get more attention in the coming years, including the role of intuition, gut feelings, and heuristics in making entrustment decisions.119,120 When trusting a trainee, the context needs to be taken into account. Risks may vary and range from breaching confidentiality or hurting and confusing the patient to neglecting critical information, overestimating ability, inadequately performing diagnostic testing, and applying the wrong therapies and recommendations.

Align scales with the supervision recommendation. Alignment of the constructs of clinical practice and assessment of learners is likely to lead to enhanced reliability. Weller and colleagues used a scale of "supervisor required in the theatre suite – supervisor required in hospital – supervisor not required" for the evaluation of anesthesiology residents121; George and colleagues used a scale (called the Zwisch scale) for surgical residents with "show and tell – active help – passive help – supervision only" signifying the required role of the supervisor during surgery.46,121,122 Both groups report increased reliabilities when using these scales compared with a traditional one. The scale in Fig. 1.1 and Box 1.2 is a more general representation of this idea, but more detail may be added, depending on the stage of training, the setting, or the specialty.
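Because these entrustment–supervision scales all order the same underlying construct, the generic five-level scale of Box 1.2 can be encoded once and reused. The sketch below is an illustration only; the equating of milestone level with supervision level follows the alignment proposed in Table 1.5, and nothing about the data structure itself is prescribed.

```python
from enum import IntEnum

class Supervision(IntEnum):
    """Generic five-level entrustment-supervision scale (cf. Box 1.2, Table 1.5)."""
    OBSERVE_ONLY = 1       # present and observe; no enactment
    DIRECT = 2             # proactive supervision, supervisor in the room
    INDIRECT = 3           # reactive supervision, supervisor readily available
    OVERSIGHT = 4          # distant supervision or clinical oversight
    SUPERVISE_OTHERS = 5   # may supervise junior trainees in the EPA

def supervision_for_milestone(level: int) -> Supervision:
    """Equate milestone level with supervision level, per the Table 1.5 alignment."""
    return Supervision(level)

print(supervision_for_milestone(3).name)  # INDIRECT
```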
Distinguish ad hoc entrustment decisions from summative entrustment decisions. Ad hoc entrustment decisions happen every day in the moment; they are usually made by individual supervisors and pertain to immediate permission for the trainee to act. Summative entrustment decisions, grounded in more systematic observation, lead to permission to act under a specified level of supervision, comparable with the driver's license that formalizes permission to drive unsupervised from that point onward but that may need to be reviewed at some later point in time.78 Ad hoc entrustment is without long-term consequences but may stimulate development and evaluation of trainee readiness for summative decisions. Conversely, a summative entrustment decision is a general statement that must be documented, awards a higher level of responsibility for future actions, and should be recognizable by third parties. Both are important in EPA-based curricula. The ad hoc decision experiences of a supervisor may be documented in the trainee's portfolio (Was this a justified decision? If not, why not? Would the observer recommend a summative entrustment decision soon?). Summative decisions may be informed by multiple ad hoc decisions supplemented with information gathered through other channels (multisource feedback, knowledge assessment, skills assessment). Summative entrustment decisions should be multisource decisions based on the summation of smaller elements of information. Summative entrustment decisions for level 4 of supervision may take the form of certifications, STARs (statements of awarded responsibility92), or digital badges that may be accessible to the outside world.123 As these should signify current competence, a summative entrustment decision for level 4 of an EPA should potentially be retracted if the individual does not maintain the practice of the EPA, either within training or after training.93

Use multiple sources of information to support entrustment decisions. Although ad hoc decisions to trust a trainee are usually taken by individuals and are very much situated in time and place, summative entrustment decisions must be grounded in multiple identifiable sources of information. The sources of information that inform entrustment decisions are not necessarily different from other workplace-based assessment and may be subsumed under the five categories outlined in Table 1.6.78,93

TABLE 1.6 Sources of Information to Support Summative Entrustment Decisions

Knowledge testing. Examples: written or e-tests, case-based discussions, observed teaching.
Short practice observations. Examples: mini-CEX, DOPS, handoffs, video, and other.*
Long practice observations. Examples: multisource feedback, review of shifts.
Simulation tests. Examples: OSCE, OSATS,† standardized patient tests.
Work product evaluation. Examples: EHR entries, presentations, papers, reports, event analysis.

*From Gigerenzer G: Gut Feelings. The Intelligence of the Unconscious. New York, Penguin Group, 2007, pp 1–280.
†Can also be used as a direct observation tool.
DOPS, Direct observation of procedural skills; EHR, electronic health record; mini-CEX, clinical evaluation exercise; OSATS, objective structured assessment of technical skill; OSCE, objective structured clinical examination.

Systems of Assessment (See Chapter 16.)

As the section on milestones and EPAs clearly highlights, regardless of whether your program decides to utilize these concepts, all medical education programs need a robust assessment program using a multifaceted array of assessment methods embedded in an effective educational system. The movement toward outcomes-based education and assessment presents many challenges for medical educators. Educational leaders will need to integrate traditional and new assessment methods into their educational programs to ensure that individual trainees meet important educational and professional objectives and to inform continued quality improvement of their programs. Assessment approaches must be clearly aligned with educational objectives and congruent with teaching and learning methods. Assessment should be closely intertwined with instructional activities to optimize efficient use of resources and to consolidate learning. The assessment system will need to include multiple methods to capture each of the general competencies and ideally to provide for the assessment of different aspects of each competency by different methods. Program and clerkship directors will need to prepare the assessors, through the implementation of robust faculty development programs, and inform and engage trainees for the assessment system to succeed.

Beyond the performance of individual trainees, the assessment system will need to support the continuous collection and analysis of aggregate data to provide feedback regarding the quality of the educational program. This includes information from more traditional assessment methods, such as program-level subscores on MCQ examinations or aggregate case-level data from clinical skills examinations, as well as composite scores or ratings from newer methods such as multisource feedback, computer simulation–based exercises, and work-based assessments. It also involves collection and analysis of clinical information, such as compliance with evidence-based health care processes or patient health outcomes, that can provide the impetus for curricular change or feedback on the quality of educational interventions. Establishing such a connection, at least at the institutional level, will facilitate the conduct of needed research to elucidate the relationships between educational activities and health care practices and outcomes. Milestones and EPAs were created to facilitate this integration and connection.

In addition to compiling aggregate data within programs to inform quality improvement initiatives, assessment
systems will need to enable information gathering regarding the performance of program graduates. As with concurrent measures, educational leaders will need to access and incorporate into their assessment systems information about the future competence and performance of program graduates to guide quality improvement efforts. Some information, such as licensure actions, in-training or board certification examination scores, or program director ratings, may not be difficult to obtain. Obtaining other sources of information, such as specific performance measures or clinical data, to provide additional feedback regarding educational program quality will require more effort. The formation of collaborative projects and networks linking professional and clinical outcomes across the spectrum of education and practice will facilitate understanding and incorporation of information critical to the continuous quality improvement of educational programs.

Conclusion

Public and professional pressure to increase accountability and quality improvement in clinical care has resulted in important changes in medical education and assessment. Delineation of essential physician competencies and widespread implementation of outcomes-based medical education, to varying degrees, have led to a critical review of the quality and methods used in the assessment of competence and performance. Advances in technology and psychometrics have supported continued refinement of traditional assessment modalities and the development of new approaches. Educational leaders now face difficult challenges in developing and integrating assessment programs embedded within an effective system and overall educational program. They must understand the psychometric properties of various assessment tools; consider their relevance to trainee level, as well as to instructional methods and educational objectives; and then balance these factors against program culture and resource availability in deciding what methods to use in their assessment system. Educators also need to understand the evolving science of work-based assessment, such as quality and safety measures, patient experience surveys, and PROMs. Finally, the use of qualitative assessments and judgment techniques, combined with group process, is also growing in importance for assessment programs. The chapters that follow are intended to help guide educational leaders in designing their assessment programs and systems to support evaluation of individual trainees and continuous quality improvement of their educational programs for the benefit of the trainee, the program, and, most important, patients and the public.

Acknowledgment

The authors wish to sincerely thank Dr. John Norcini for donating content from the first edition to this chapter. We are very appreciative of his graciousness and of his contributions to medical education.

References

1. Institute of Medicine. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy of Health Sciences; 1999.
2. National Patient Safety Foundation: Free From Harm: Accelerating Patient Safety Improvement Fifteen Years After To Err Is Human. Available at: http://www.npsf.org/?page=freefromharm.
3. Institute of Medicine. Crossing the Quality Chasm. Washington, DC: National Academy Press; 2001.
4. Berwick DM, Nolan TW, Whittington L. The triple aim: care, health, and cost. Health Aff (Millwood). 2008;27(3):759–769.
5. Mossialos E, Wenzl M, Osborn R, et al. International Profiles of Health Care Systems, 2014. Australia, Canada, Denmark, England, France, Germany, Italy, Japan, The Netherlands, New Zealand, Norway, Singapore, Sweden, Switzerland, and the United States: The Commonwealth Fund; 2015. Available at: http://www.commonwealthfund.org/publications/fund-reports/2015/jan/international-profiles-2014.
6. Frenk J, Chen L, Bhutta ZA, et al. Health professionals for a new century: transforming education to strengthen health systems in an interdependent world. Lancet. 2010;376(9756):1923–1958.
7. Harden RM, Crosby JR, Davis M. An introduction to outcome-based education. Med Teach. 1999;21(1):7–14.
8. McGaghie WC, Miller GE, Sajid AW, et al. Competency-based curriculum development in medical education: an introduction. Geneva: World Health Organization; 1978.
9. Cooke M, Irby DM, O'Brien BC. Educating Physicians. A Call for Reform of Medical School and Residency. San Francisco: Jossey-Bass; 2010.
10. Frank JR, Mungroo R, Ahmad Y, et al. Toward a definition of competency-based education in medicine: a systematic review of published definitions. Med Teach. 2010a;32(8):631–637.
11. Frank JR, Snell LS, ten Cate O, et al. Competency-based medical education: theory to practice. Med Teach. 2010b;32(8):638–645.
12. Crosson FJ, Leu J, Roemer BM, et al. Gaps in residency training should be addressed to better prepare doctors for a twenty-first-century delivery system. Health Aff (Millwood). 2011;30(11):2412–2418.
13. Skochelak SE. A decade of reports calling for change in medical education: what do they say? Acad Med. 2010;85(suppl 9):S26–S33.
14. MedPAC. Graduate medical education financing: focusing on educational priorities. In Report to the Congress: Aligning Incentives in Medicare. Washington, DC: MedPAC; 2010:103–128.
15. Frank JR, Jabbour M, Tugwell P, et al. Skills for the new millennium: report of the societal needs working group, CanMEDS 2000 Project. Ann R Coll Phys Surg Can. 1996;29:206–216.
16. Frank JR, ed. The CanMEDS 2005 Physician Competency Framework. Better Standards. Better Physicians. Better Care. Ottawa: The Royal College of Physicians and Surgeons of Canada; 2005.
17. Batalden P, Leach D, Swing S, et al. General competencies and accreditation in graduate medical education. Health Aff (Millwood). 2002;21(5):103–111.
18. General Medical Council: Good Medical Practice. 2013. Available at: http://www.gmc-uk.org/static/documents/content/Good_medical_practice_-_English_1015.pdf.
19. Institute of Medicine. Health Professions Education: A Bridge to Quality. Washington, DC: The National Academies Press; 2003.
20. Royal Australasian College of Surgeons: Nine RAC Competencies. 2015. Available at: http://www.surgeons.org/becoming-a-surgeon/surgical-education-training/competencies/.
21. Ten Cate O. Medical education in the Netherlands. Med Teach. 2007;29(8):752–757.
22. Association of Medical Education in Europe. Education Guide No 14: Outcome-based Education. Dundee: AMEE; 1999.
23. Iobst WF, Sherbino J, ten Cate O, et al. Competency-based medical education in postgraduate medical education. Med Teach. 2010;32(8):651–656.
24. Holmboe ES, Sherbino J, Long DM, et al. The role of assessment in competency-based medical education. Med Teach. 2010;32(8):676–682.
25. Campbell C, Silver I, Sherbino J, et al. Competency-based continuing professional development. Med Teach. 2010;32(8):657–662.
26. Carraccio C, Englander R, Van Melle E, et al. Advancing competency-based medical education: a charter for clinician-educators. Acad Med. 2016;91(5):645–649.
27. Donabedian A. An Introduction to Quality Assurance in Health Care. New York: Oxford University Press; 2003.
28. Durning SJ, Lubarsky S, Torre D, et al. Considering "nonlinearity" across the continuum in medical education assessment: supporting theory, practice, and future research directions. J Contin Educ Health Prof. 2015;35(3):232–243.
29. Norman GR. Research in medical education: three decades of progress. BMJ. 2002;324:1560–1562.
30. Kogan JR, Holmboe ES. Realizing the promise and importance of performance-based assessment. Teach Learn Med. 2013;25(suppl 1):S68–S74.
31. Kogan JR, Hess BJ, Conforti LN, et al. What drives faculty ratings of residents' clinical skills? The impact of faculty's own clinical skills. Acad Med. 2010;85(suppl 10):S25–S28.
32. Kogan JR, Conforti LN, Iobst WF, et al. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 2014;89:721–727.
33. Wong BM, Holmboe ES. Transforming academic faculty to better align educational and clinical outcomes. Acad Med. 2016;91(4):473–479.
34. Holmboe ES, Batalden P. Achieving the desired transformation: thoughts on next steps for outcomes-based medical education. Acad Med. 2015;90(9):1215–1223.
35. Stewart JB. Blind Eye: How the Medical Establishment Let a Doctor Get Away With Murder. New York: Simon and Shuster; 1999.
36. The Final Report of the Shipman Inquiry. 2005. http://webarchive.nationalarchives.gov.uk/20090808154959/http://www.the-shipman-inquiry.org.uk/6r_page.asp.
37. Reilly BM. Physical examination in the care of medical inpatients: an observational study. Lancet. 2003;362(9390):1100–1105.
38. Nelson EC, Batalden PB, Godfrey MM. Quality by Design: A Clinical Microsystems Approach. San Francisco: Jossey-Bass; 2007.
39. Ogrinc GS, Headrick LA. Fundamentals of Health Care Improvement. A Guide to Improving Your Patients' Care. Oakbrook Terrace, IL: Joint Commission Resources; 2008.
40. Von Korff M, Gruman J, Schaefer J, et al. Collaborative management of chronic illness. Ann Intern Med. 1997;127:1097–1102.
41. Batalden M, Batalden P, Margolis P, et al. Coproduction of healthcare service. BMJ Qual Saf. 2016;25(7):509–517.
42. Holmboe ES, Yamazaki K, Edgar L, et al. Reflections on the first 2 years of milestone implementation. J Grad Med Educ. 2015;7(3):506–511.
43. Bunderson CV, Inouye DK, Olsen JB. The four generations of computerized educational measurement. In: Linn RL, ed. Educational Measurement. Washington, DC: American Council on Education; 1989.
44. Norcini JJ. Computers in physician licensure and certification: new methods of assessment. J Educ Computing Res. 1994;10:161–171.
45. Griswold-Theodorson S, Ponnuru S, Dong C, et al. Beyond the simulation laboratory: a realist synthesis review of clinical outcomes of simulation-based mastery learning. Acad Med. 2015;90(11):1553–1560.
46. George BC, Teitelbaum EN, Meyerson SL, et al. Reliability, validity, and feasibility of the Zwisch scale for the assessment of intraoperative performance. J Surg Educ. 2014;71(6):e90–e96.
47. Foundation for Excellence in Women's Healthcare: MyTIPreport. 2016. Available at: https://mytipreport.org/.
48. Spickard 3rd A, Ridinger H, Wrenn J, et al. Automatic scoring of medical students' clinical notes to monitor learning in the workplace. Med Teach. 2014;36(1):68–72.
49. Ten Cate O, Chen HC, Hoff RG, et al. Curriculum development for the workplace using Entrustable Professional Activities (EPAs): AMEE Guide No. 99. Med Teach. 2015;37(11):983–1002.
50. Hambleton RK, Swaminathan H. Item Response Theory: Principles and Applications. Dordrecht: Kluwer; 1985.
51. Green BF. Adaptive testing by computer. In: Ekstrom RB, ed. Principles of Modern Psychological Measurement. San Francisco: Jossey-Bass; 1983:5–12.
52. American Board of Internal Medicine: A Vision for Certification in Internal Medicine in 2020. Available at: http://transforming.abim.org/assessment-2020-report/.
53. Brennan RL. Generalizability Theory. New York: Springer-Verlag; 2001.
54. Norcini JJ. Standard setting. In: Dent JA, Harden RM, eds. A Practical Guide for Medical Teachers. Edinburgh: Churchill Livingston; 2005:293–301.
55. Berk RA, ed. Handbook of Methods for Detecting Test Bias. Baltimore: Johns Hopkins Press; 1982.
56. Hodges BD, Lingard L. A Question of Competence. Reconsidering Medical Education in the Twenty-First Century. New York: Cornell University Press; 2012.
57. Hauer KE, Ten Cate O, Boscardin CK, et al. Ensuring resident competence: a narrative review of the literature on group decision making to inform the work of clinical competency committees. J Grad Med Educ. 2016;8(2):156–164.
58. Holmboe ES, Edgar L, Padmore J, et al. Clinical competency committees & use of milestones in residency. Guide to Medical Education in the Teaching Hospital. 5th ed. Philadelphia: Association for Hospital Medical Education; 2015.
59. Gaglione MM, Moores L, Pangaro L, et al. Does group discussion of student clerkship performance at an education committee affect an individual committee member's decisions? Acad Med. 2005;80(suppl 10):S55–S58.
60. Battistone MJ, Milne C, Sande MA, et al. The feasibility and acceptability of implementing formal evaluation sessions and using descriptive vocabulary to assess student performance on a clinical clerkship. Teach Learn Med. 2002;14(1):5–10.
61. Van der Vleuten CPM, Schuwirth LWT. Assessing professional competence: from methods to programmes. Med Educ. 2005;39(3):309–317.
18 CHA P T ER 1 Assessment Challenges in the Era of Outcomes-Based Education

62. van der Vleuten CP, Schuwirth LW, Driessen EW, et al. A 83. Ziv A, Wolpe RP, Small SD, et al. Simulation-based medical
model for programmatic assessment fit for purpose. Med Teach. education: an ethical imperative. Acad Med. 2003;78:783–788.
2012;34(3):205–214. 84. Norcini JJ. Current perspectives in assessment: the assessment of
63. Miller G. The assessment of clinical skills/competence/perfor- performance at work. Med Educ. 2005;39:880–889.
mance. Acad Med. 1990;65(suppl):S63–S67. 85. Agency for Healthcare Quality and Research: CAHPS Toolkit.
64. Cruess RL, Cruess SR, Steinert Y. Amending Miller’s pyra- Available at: http://www.ahrq.gov/cahps/index.html.
mid to include professional identity formation. Acad Med. 86. Holmboe ES, Edgar L, Hamstra S: The Milestones Guidebook.
2016;91(2):180–185. Available at: www.acgme.org.
65. Rethans JJ, Norcini JJ, Barón-Maldonado M, et al. The relation- 87. Swing SR, Beeson MS, Carraccio C, et al. Educational mile-
ship between competence and performance: implications for stone development in the first 7 specialties to enter the next
assessing practice performance. Med Educ. 2002;36:901–909. accreditation system. J Grad Med Educ. 2013;5(1):98–106.
66. Dreyfus HL. On the Internet: Thinking in Action. New York: 88. Carraccio C, Benson B, Burke A, et al. Pediatrics milestones. J
Routledge; 2001. Grad Med Educ. 2013;5(1 suppl 1):59–73.
67. Van Der Vleuten CP. The assessment of professional compe- 89. Hauer KE, Clauser J, Lipner RS, et al. The internal medi-
tence: developments, research and practical implications. Adv cine reporting milestones: cross-sectional description of initial
Health Sci Educ Theory Pract. 1996;1(1):41–67. implementation in U.S. residency programs. Ann Intern Med.
68. Higgins R, Cavendish S. Modernising Medical Careers foun- 2016;165(5):356–362.
dation programme curriculum competencies: will all rotations 90. Beeson M, Holmboe E, Korte R, et al. Initial validity analy-
allow the necessary skills to be acquired? The consultants’ pre- sis of the emergency medicine milestones. Acad Emerg Med.
dictions. Postgrad Med J. 2006;82(972):684–687. 2015;22(7):838–844.
69. Norcini J, Anderson B, Bollela V, et al. Criteria for good assess- 91. Holmboe ES, Yamazaki K, Edgar L, et al. Reflections on the
ment: consensus statement and recommendations from the first 2 years of milestone implementation. J Grad Med Educ.
Ottawa 2010 Conference. Med Teach. 2011;33(3):206–214. 2015;7(3):506–511.
70. Herbers Jr JE, Noel GL, Cooper GS, et al. How accurate are 92. ten Cate O. Entrustability of professional activities and compe-
faculty evaluations of clinical competence? J Gen Intern Med. tency-based training. Med Educ. 2005;39(12):1176–1177.
1989;4:202–208. 93. ten Cate O, Scheele F. Competency-based postgraduate train-
71. Noel GL, Herbers Jr JE, Caplow MP, et al. How well do inter- ing: can we bridge the gap between theory and clinical practice.
nal faculty members evaluate the clinical skills of residents? Ann Acad Med. 2007;82(6):542–547.
Intern Med. 1992;117:757–765. 94. Englander R, Flynn T, Call S, et al. Toward defining the founda-
72. Kroboth FJ, Hanusa BH, Parker S, et al. The inter-rater reli- tion of the MD degree: core entrustable professional activities
ability and internal consistency of a clinical evaluation exercise. for entering residency. Acad Med. 2016;91(10):1352–1358.
J Gen Intern Med. 1992;7:174–179. 95. Pangaro L, ten Cate O. Frameworks for learner assessment
73. Engel GL. Editorial: are medical schools neglecting clinical in medicine: AMEE Guide No. 78. Med Teach. 2013;35(6):
skills? JAMA. 1976;236(7):861–863. e1197–e1210.
74. Frank JR, Snell L, Sherbino J (Eds): The Draft CanMEDS 2015 96. Scheele F, Caccia N, Van Luijk S, et al. BOEG-Better Educa-
Framework Physician Competency Framework. 2015. Available at: tion for Obstetrics and Gynaecology. A National Competency-Based
http://www.royalcollege.ca/portal/page/portal/rc/common/ Curriculum for Obstetrics & Gynaecology. Utrecht: Netherlands
documents/canmeds/framework/canmeds2015_framework_ Association for Gynaecology and Obstetrics; 2013:1–61.
series_IV_e.pdf. 97. Gilhooly J, Schumacher DJ, West DC, et al. The promise
75. Hemmer PA, Dadekian GA, Terndrup C, et al. Regular formal and challenge of entrustable professional activities. Pediatrics.
evaluation sessions are effective as frame-of-reference training 2014;133(suppl):S78–S79.
for faculty evaluators of clerkship medical students. J Gen Intern 98. Caverzagie KJ, Cooney TG, Hemmer PA, et al. The develop-
Med. 2015;30(9):1313–1318. ment of entrustable professional activities for internal medicine
76. Landy FJ, Farr JL. Performance rating. Psychol Bull. 1980;87:72–107. residency training: a report from the Education Redesign Com-
77. ten Cate O. Nuts and bolts of entrustable professional activities. mittee of the Alliance for Academic Internal Medicine. Acad
J Grad Med Educ. 2013;5(1):157–158. Med. 2015;90(4):479–484.
78. ten Cate O. Trust, competence, and the supervisor’s role in post- 99. Shaughnessy AF, Sparks J, Cohen-osher M, et al. Entrustable
graduate training. BMJ. 2006;333(7571):748–751. professional activities in family medicine. J Grad Med Educ.
79. ten Cate O, Hart D, Ankel F, et al. Entrustment decision-mak- 2013;5(1):112–118.
ing in clinical training. Acad Med. 2016;91(2):191–198. 100. Schultz K, Griffiths J, Lacasse M. The application of entrust-
80. Boulet JR, Swanson DB. Psychometric challenges of using sim- able professional activities to inform competency decisions in
ulations for high-stakes assessment. In: Dunn D, ed. Simulators a family medicine residency program. Acad Med. 2015;90(7):
in Critical Care Education and Beyond. Philadelphia: Lippincott 888–897.
Williams and Wilkins; 2004:119–130. 101. Boyce P, Spratt C, Davies M, et al. Using entrustable profes-
81. McGaghie WC, Barsuk JH, Cohen ER, et al. Dissemination sional activities to guide curriculum development in psychiatry
of an innovative mastery learning curriculum grounded in training. BMC Med Educ. 2011;11:96.
implementation science principles: a case study. Acad Med. 102. Fessler HE, Addrizzo-Harris D, Beck JM, et al. Entrustable
2015;90(11):1487–1494. professional activities and curricular milestones for fellowship
82. Barsuk JH, Cohen ER, Potts S, et al. Dissemination of a training in pulmonary and critical care medicine: report of a
simulation-based mastery learning intervention reduces cen- multisociety working group. Chest. 2014;146(3):813–834.
tral line-associated bloodstream infections. BMJ Qual Saf. 103. Lave J, Wenger E. Situated Learning. Legitimate Peripheral Par-
2014;23(9):749–756. ticipation. Edinburgh: Cambridge University Press; 1991.
CHAPTER 1 Assessment Challenges in the Era of Outcomes-Based Education 19

104. ten Cate O, Snell L, Carraccio C. Medical competence: the 114. Crossley J, Johnson G, Booth J, et al. Good questions, good
interplay between individual ability and the health care environ- answers: construct alignment improves the performance of
ment. Med Teach. 2010;32(8):669–675. workplace-based assessment scales. Med Educ. 2011;45(6):
105. Norman G, Norcini J, Bordage G. Competency-based education: 560–569.
milestones or millstones?. J Grad Med Educ. 2014;6:1–6 (March). 115. Kennedy TJT, Regehr G, Baker GR, et al. Point-of-care assess-
106. Pangaro L. A new vocabulary and other innovations for ment of medical trainee competence for independent clinical
improving descriptive in-training evaluations. Acad Med. work. Acad Med. 2008;83(suppl 10):S89–S92.
1999;74(11):1203–1207. 116. Albanese M. Challenges in using rater judgements in medical
107. Hicks PJ, Schumacher DJ, Benson BJ, et al. The pediatrics mile- education. J Eval Clin Pract. 2000;6(3):305–319.
stones: conceptual framework, guiding principles, and approach 117. Govaerts MJB, van der Vleuten CPM, Schuwirth LWT, et al.
to development. J Grad Med Educ. 2010;2(3):410–418. Broadening perspectives on clinical performance assessment:
108. Chen HC, van den Broek WES, ten Cate O. The case for use rethinking the nature of in-training assessment. Adv Health Sci
of entrustable professional activities in undergraduate medical Educ Theory Pract. 2007;12(2):239–260.
education. Acad Med. 2015;90(4):431–436. 118. Earle TC. Trust in risk management: a model-based review of
109. Jones MD, Rosenberg A, Gilhooly JT, et al. Perspective: compe- empirical research. Risk Anal. 2010;30(4):541–574.
tencies, outcomes, and controversy–linking professional activi- 119. Gigerenzer G. Gut Feelings. The Intelligence of the Unconscious.
ties to competencies to improve resident education and practice. New York: Penguin Group; 2007:1–280.
Acad Med. 2011;86(2):161–165. 120. Gigerenzer G, Gaissmaier W. Heuristic decision making. Annu
110. Warm EJ, Mathis BR, Held JD, et al. Entrustment and map- Rev Psychol. 2011;62:451–482.
ping of observable practice activities for resident assessment. J 121. Weller JM, Misur M, Nicolson S, Morris J, Ure S, Crossley J,
Gen Intern Med. 2014;29(8):1177–1182. et al. Can I leave the theatre? A key to more reliable workplace-
111. Englander R, Flynn T, Call S, et al: Core Entrustable Professional based assessment. Br J Anaesth. 2014;112(March):1083–1091.
Activities for Entering Residency - Curriculum Developers Guide [Inter- 122. DaRosa DA, Zwischenberger JB, Meyerson SL, et al. A theory-
net]. Washington, DC, 2014. Available at http://www.aamc.org. based model for teaching and assessing residents in the operating
112. Leipzig RM, Sauvigné K, Granville LJ, et al. What is a geriatri- room. J Surg Educ. 2012;70(1):24–30.
cian? American Geriatrics Society and Association of Directors 123. Mehta NB, Hull AL, Young JB, et al. Just imagine: new
of Geriatric Academic Programs End-of-Training Entrustable paradigms for medical education. Acad Med. 2013;88(10):
Professional Activities for Geriatric Medicine. J Am Geriatr Soc. 1418–1423.
2014;62(5):924–929.
113. Rose S, Fix OK, Shah BJ, et al. Entrustable professional activi-
ties for gastroenterology fellowship training. Gastrointest Endosc.
2014;80(1):16–27.

Appendix 1.1
Developing an Entrustable Professional Activity
1. Title:

2. Specification and limitations

3. Most relevant domains of competencies (use the competency framework of your program; if you do not use a competency framework, work with the framework with which you are most familiar)

4. Required experience, KSA, and behaviors for entrustment

5. Assessment information to assess progress and inform summative decision

6. At which level of supervision, and at which stage of training, is entrustment to be reached?

7. Expiration date

• Fig. 1.3 Template for the description of an entrustable professional activity.

KSA, Knowledge, skills, and abilities.

Appendix 1.2
Entrustable Professional Activities, Competencies, and Milestones: Pulling It All Together
(Template grid: the elements of an entrustable professional activity description, connected to the competency domains on which the activity draws, with each domain assessed on milestone levels 1 through 5.)

Entrustable Professional Activity:
1. Title
2. Specification and limitations
3. Most relevant domains of competencies (connect to the pertinent domains to the right)
4. Required experience, KSA, and behaviors for entrustment
5. Assessment information to assess progress and inform summative decision
6. Expiration date

Competency Domains (each assessed on Milestones 1 to 5): Patient Care; Medical Knowledge; Professionalism; Interpersonal Skills and Communication; Practice-Based Learning and Improvement; Systems-Based Practice.

Supervision anchors for the milestone levels: 1 = Observe only; 2 = Direct supervision; 3 = Indirect supervision; 4 = Oversight only; 5 = Aspirational/provide supervision.
2
Issues of Validity and Reliability for Assessments in Medical Education

BRIAN E. CLAUSER, EDD, MELISSA J. MARGOLIS, PHD, AND DAVID B. SWANSON, PHD

CHAPTER OUTLINE

Historical Context

Kane's View of Validity

Scoring
  Example I: A Multiple-Choice Examination
  Example II: Performance Assessment
  Example III: Workplace-Based Assessment

Generalization
  Generalizability Theory
  Example I: A Multiple-Choice Examination
  Example II: Performance Assessment
  Example III: Workplace-Based Assessment

Extrapolation
  Example I: A Multiple-Choice Examination
  Example II: Performance Assessment
  Example III: Workplace-Based Assessment

Decision/Interpretation
  Example I: A Multiple-Choice Examination
  Example II: Performance Assessment
  Example III: Workplace-Based Assessment

Conclusion

Annotated Bibliography

References

A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.
—LEE CRONBACH

The purpose of this chapter is to provide an overview of the concepts of validity and reliability as they apply to assessment in medical education. The discussion begins with a brief history of validity theory and a description of how the conceptualization of validity has changed. Michael Kane's approach to validity, in which the validation process is viewed as a structured argument in support of the intended interpretations made based on test scores, will be the main structural focus of the chapter. Kane's approach is important because the view that the validation process is one of collecting evidence to construct a coherent argument in support of the intended interpretations leads to a notable conclusion: there is no such thing as a valid test! The score from any given test could be used to make a variety of decisions in different contexts and with different examinee populations; evidence to support the validity of one type of interpretation in one context with one population may or may not support the validity of a different interpretation in a different context with a different population. This point will be discussed in greater detail later; it is introduced here because it is central to the understanding of the argument to be made throughout this chapter.

Within the components of Kane's validity framework, various medical education assessment contexts will provide a structure for examples of the types of evidence that might be collected to create a validity argument. The discussion of reliability will be presented within the context of generalizability theory, and the generalizability of scores will be considered in the context of the overall validity argument. It is hoped that this chapter will provide the reader with a greater understanding of issues that are central to validity and reliability as these concepts pertain to assessment in medical education.

Historical Context

Practically speaking, the history of test theory as we know it begins with Charles Spearman around the turn of the 20th century. Spearman's interest was not in assessment but in the psychological study of intelligence. Most of the basic equations from classical test theory were developed by Spearman to aid his research on the presence of a common (g) factor shared by most if not all tests of mental proficiency.1–4 These equations are all dependent on Karl Pearson's mathematical formulation of the correlation coefficient.5
This groundwork laid the foundation for a science of testing that expanded explosively during the First World War. The US Army had a monumental personnel problem: tens of thousands of recruits had to be placed in jobs. Testing provided a potentially effective and efficient means of determining appropriate job placements.6 This effort established psychological testing in the United States and, not surprisingly, the science of testing was used in an effort to boost industrial efficiency after the war. In both the military and industrial contexts, the question of interest was, "How well do these tests predict performance on the job?" Evidence to justify the use of the test naturally conformed to the approach established by Spearman and took the form of a correlation between the test scores and an independent assessment of job performance.

The explosion in placement testing did much to define the view of validity during the period from 1920 through 1950. Correlational evidence, referred to as criterion validity, was the standard during this period; in his 1951 chapter in the first edition of Educational Measurement, Edward Cureton defined validity "in terms of the correlation between the actual test scores and the 'true' criterion score."7

As a practical matter, criterion validity has obvious utility. In placement testing, it has clear relevance to the interpretation of the score and it provides an objective basis for comparing multiple assessments available for a given purpose. However, the strength of this approach is less apparent for applications outside placement testing. One problem is that an obvious and practical criterion may not be available. No clear and objective external criterion is likely to exist for an achievement test. If such a criterion is identified, the test developer would need to provide validity evidence to support the use of the criterion.8

Questions about the appropriateness of criterion validity as a primary evaluation of assessments of academic achievement led to the development of procedures for assessing content validity. The purpose of such evidence is to establish that the content of the test reasonably represents the domain of interest. This type of evidence clearly is necessary but not sufficient to establish the validity of interpretations for an achievement test. As Messick pointed out, evidence that the test is domain relevant provides no direct support for inferences based on the test scores.9

During the period after the Second World War, interest in personality testing pushed researchers to continue to consider the types of evidence required to support the use of these new instruments. Neither criterion nor content validity models provided a particularly good fit to these tests. It was in this context that Cronbach and Meehl introduced the idea of construct validity.10 In describing the issues that led to their formulation of construct validity, Cronbach commented in the second edition of Educational Measurement:11

  The rationale for construct validation (Cronbach and Meehl, 1955) developed out of personality testing. For a measure of, for example, ego strength, there is no uniquely pertinent criterion to predict, nor is there a domain of content to sample. Rather, there is a theory that sketches out the presumed nature of the trait. If the test score is a valid manifestation of ego strength, so conceived, its relations to other variables conform to the theoretical expectations.

This approach to validation greatly expanded the types of evidence that could be considered in evaluating an assessment. For example, in the context of achievement testing, construct validation might argue for collecting evidence to demonstrate that examinees with advanced training in the topic area outperform those with less training.

The 1950s brought two other important changes in the conceptualization of validity. First, Campbell and Fiske introduced the multitrait–multimethod matrix.12 The matrix provided correlational evidence about the relative strength of relationship between different traits measured by a single method and measures of the same trait using different methods. In the context of personality testing, examples of traits may have included extroversion and aggression; methods may have included individual examiner-administered assessments and group-administered paper-and-pencil assessments. Campbell and Fiske's matrix provided an empirical means of assessing the extent to which scores are impacted by otherwise irrelevant characteristics of the assessment method or format (signaled by relatively higher correlations between different traits measured by the same method compared with the same trait measured by different methods). This method effect relates to what was later to become known as construct-irrelevant variance, a concept that will be addressed in more detail later in this chapter. The second important change in the conceptualization of validity came when Loevinger focused attention on the proposed interpretation of test scores.13 This represented an important shift in perspective from consideration of the relationship between the construct the test was designed to measure and the test score to consideration of the correspondence between what is measured by the test and the proposed interpretations of the test score.

By the publication of the third edition of Educational Measurement, Messick was able to present a unified theory of validity.9 Rather than being defined as "the correlation between the actual test scores and the 'true' criterion score,"7 validity now was viewed as the "… degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of interpretations and actions based on test scores."9 Messick's model built on the contributions of his predecessors; following Cronbach and Meehl10 and Loevinger,13 he emphasized the need to specify the intended meaning and use of the test score before validation. Consistent with Cronbach and Meehl and Campbell and Fiske,12 Messick emphasized the importance of considering alternative hypotheses such as the impact of construct-irrelevant variance. Additionally, like these predecessors, Messick argued that the process of validation would involve an extended program of research.
Messick's formulation is consistent with previous frameworks for validity, although he introduces a change in emphasis. In particular, he placed increased emphasis on evaluating the consequences of the testing program. He believed that both the actual and potential social consequences of a test must be evaluated. Considering as an example a test for medical licensure, at a minimum this requirement leads to examination of consequences such as the test's impact on what teachers choose to teach and what learners choose to learn. More broadly, consequential validity would require consideration of the test's impact on the availability of medical practitioners both for the community at large and, perhaps, specifically for underserved communities. Messick's view of consequential validity went beyond these considerations; his views additionally required consideration of the impact that such an examination might have on the entrance of minority candidates into the profession. This broad definition of consequential validity emphasizes the importance of test developers and administrators accepting responsibility for their actions. The definition takes the validation process beyond the scientific evaluation of the assessment into the arena of social and political values.

By 1999, the place of consequences within the sphere of validity was sufficiently well established that it was included as one of the five sources of validity evidence referenced in the Standards for Educational and Psychological Testing.14 Table 2.1 briefly describes these five sources of evidence. They are presented here to provide continuity with previous secondary sources describing Messick's concept of validity theory. It is, however, important to remember that both Messick's unified theory of validity and the Standards emphasize that these are not different types of validity. Rather, they are different sources of evidence, each of which may be more or less important in providing support for a specific score interpretation.

The history of validity theory should make it clear that the definition of validity has expanded over time. The emphasis also has changed as the focus of testing has changed. Criterion validity (evidence based on relations to other variables) has not been replaced; this type of evidence remains essential in evaluating admissions and employment tests. Similarly, content validity represents an important source of evidence in support of tests of achievement. The history of validity is a history of both an expansion in meaning and a shift in emphasis. Recently, Kane has introduced an additional shift in perspective by representing validity as an argument in support of the proposed interpretations of a test score.8,15 As with previous stages in the evolution of validity theory, Kane's view does not deny the importance of the evidence and perspectives that have been discussed during the last half century; those readers familiar with Messick's writing on validity will find that Kane provides a shift in perspective, not a rejection of the basic arguments. That shift in perspective does have one important characteristic: it highlights the fact that the collection of evidence in support of the interpretations of test scores must form a structured and coherent argument that leads from the test administration to the interpretation. That structured argument is only as strong as its weakest component.
TABLE 2.1  Sources of Validity Evidence Highlighted in the Standards for Testing

Content: This includes evidence relating the test content to the domain that is intended to be represented by the test score. It may include evidence about the development of the test specifications (e.g., practice analysis); the match between the specifications and the actual content; aspects of the domain that are not represented on the test or are underrepresented; characteristics of the tasks/test questions. Note: As Kane points out, Messick viewed content as playing a limited role in score validity because it does not directly provide evidence for the inferences to be made based on test scores.

Response Process: This includes evidence about how examinees formulate their responses. Evidence could come from interviews or think-aloud studies, direct observation of performance, or examination of recorded evidence such as drafts produced in the process of completing a final version of an essay. When human judges are used for scoring, response process evidence will also focus on the extent to which the process used by judges is consistent with the intended score interpretations.

Internal Structure: This source of evidence includes analytic results to support the conclusion that the structure of the response data is consistent with the intended test design. This may require evidence that the test is unidimensional or evidence that identified components of the test measure distinct (but likely related) characteristics. Another common type of evidence related to internal structure focuses on identifying tasks/items that display differential performance for members of identifiable examinee subgroups (e.g., men and women) after the groups are matched by the proficiency of interest. This matching is typically done based on the total test score.

Relationship to Other Variables: This source of evidence may include correlations with relevant criteria external to the test such as other assessments measuring the same or related constructs or more direct measures of the criteria of interest such as direct measures of performance on the job or in an educational program. This source of evidence may also include measures of convergence/divergence. Examples might include evidence that measures of the same trait assessed with different methods correlate more highly than measures of different traits assessed with the same method. Similarly, there may be theoretical grounds for expecting certain personality characteristics or proficiencies to display patterns of convergence or divergence. Results that are in line with the theory may support the credibility of the scores.

Consequences: This includes intended benefits such as protection of the public from practitioners who are not qualified to practice or identification of students who would benefit from remediation; issues related to fairness and other societal values such as impact on minority candidates; and unanticipated outcomes such as the misuse of scores. Note: The broad range of evidence related to consequences may include considerations such as misuse of scores. This clearly goes well beyond the evidence and arguments related to the intended interpretation of test scores.
Kane's View of Validity

Implicit in the interpretation of a test score is a series of assertions and assumptions that support that interpretation. For example, the interpretation of a passing score on a medical licensing examination requires the assumption that the test was administered under standardized conditions and that the examinee did not have prior access to the test material. If the examinee cheated, no interpretation can be made about the score regardless of other characteristics of the test. Interpretation of the test score requires assumptions about the precision of the score; if the test score is not reproducible, there is no basis for making an interpretation. Interpretation of the score assumes that the test measures some relevant aspect of the overall set of knowledge, skills, and abilities required for the practice of medicine. It also assumes that the cut-score has been established in a way that supports the interpretation. If any one of these assumptions is unfounded, the strength of the others is of little relevance.

Kane provides a structure for this validity argument that outlines four links in the inferential chain from the test administration to the final decision or interpretation.8,15 He labels these four components scoring, generalization, extrapolation, and decision. Support for the scoring component of the overall argument includes evidence that the test was administered properly, examinee behavior was captured correctly, and scoring rules were appropriate and were applied accurately and consistently. The generalization component of the argument requires evidence that the observations were appropriately sampled from the universe of test items, clinical encounters, and the like. Generalization also requires evidence that the sample of observations was large enough to produce scores with an acceptable level of precision. Broadly speaking, this stage in the argument asks the question: Is the test reliable? The extrapolation component of the argument requires evidence that the observations represented by the test score were relevant to the target proficiency or construct measured by the test. This requires a demonstration that the observations were relevant to the interpretation and that the scores were not unduly influenced by sources of variance that are irrelevant to the intended interpretation. The decision component of the argument requires evidence in support of any theoretical framework necessary for score interpretation or evidence in support of decision rules. For tests with a cut-score, this evidence would include support for the procedure used to establish that cut-score. Again, the score user can have confidence in an interpretation only if there is evidence for each component of the overall argument. The types of evidence required will vary with the purpose and characteristics of the assessment.

Table 2.2 provides examples of some of the kinds of questions that arise at each stage of the argument. The questions are provided as examples and are not intended as an exhaustive list. The next sections will describe these four aspects of the validity argument. Within each section, details will be provided for three types of assessments that span a range of the types of assessments currently used in medical education: a multiple-choice examination, a performance assessment, and a workplace-based assessment. Multiple-choice examinations are ubiquitous in medical education from selection to medical school, through classroom assessment, to licensure and certification (Swanson & Hawkins, this volume). Performance assessments have a long history in medical education, with objective structured clinical examinations and standardized patient-based examinations being in common use both within medical schools and as part of licensure testing.16 Workplace-based assessments are becoming an increasingly important part of assessment within medical schools and particularly as part of residency training.17

TABLE 2.2  Questions Associated With Each of the Four Components of Kane's Argument-Based Approach to Validity

Scoring:
1. Were the observations made or stimulus materials administered under standardized conditions?
2. Were the scores recorded accurately?
3. Were the scoring algorithms applied correctly?
4. Were appropriate security procedures implemented?

Generalization:
1. What are the sources of measurement error that contribute to the observed scores on the assessment?
2. How similar would scores be across replications of the measurement procedure?
3. How similar would classification decisions be across replications of the measurement procedure?
4. To what extent are test forms constructed using a systematic process?

Extrapolation:
1. To what extent do the scores correspond to real-world proficiencies of interest?
2. Are there factors that interfere with assessment of the proficiencies of interest?
3. Do scores predict real-world outcomes of interest?
4. Are there artificial aspects of the testing conditions that impact the scores?

Decision:
1. Was the standard established through implementation of a defensible and properly implemented procedure?
2. Do examinees identified for remediation improve to meet the standard or benefit more from a remediation program than would those who were not identified?
Scoring

The scoring component of the validity argument must provide evidence that assessment data have been collected appropriately and scored accurately. This will include consideration of a variety of types of evidence such as the extent to which the stated conditions of standardization have been implemented, the accuracy of the scoring process, and the choice and implementation of scaling procedures. As with each of the four components of the validity argument, the specifics of the evidence that will be relevant to the scoring aspect of the argument will vary with the characteristics of the test.

Example I: A Multiple-Choice Examination

Standardized tests have been developed to provide the strongest possible evidence for the scoring and generalization components of the validity argument. Adherence to the conditions of standardization ensures that the data are collected in the same manner for all examinees. Factors such as the time allowed for the examination, seating, lighting, and the quality of the stimulus materials are controlled. To the extent that administration procedures require documentation of the violation of these conditions and annotation of score reports, the score user will have confidence in the conditions under which the test responses have been collected. Similarly, professionally administered and scored tests routinely will have quality control steps built into the scoring process. "Key validation"—statistical analyses of examinee responses designed to verify that the keyed answer is correct—provides evidence that the scoring rules have been applied accurately. This step includes examining the proportion of examinees receiving credit for each item and comparing the probability of a correct score on the item for examinees at different proficiency levels.

One important consideration for scores from high-stakes tests is security. Examinees with low proficiency may be motivated to cheat and may attempt to do so in any number of ways. When items are reused from one administration to another, it is possible for examinees testing earlier to steal (i.e., remember, copy, photograph) items and make them available to individuals testing on a later date. When computerized examinations are administered on a continuous basis, this threat to validity may be increased. Evidence about the size of the item pool and the frequency with which items are reused will support the user's confidence that prior exposure has not threatened the integrity of the score. For computer-based tests, encryption of test items at all times except when they are displayed on the screen may provide additional confidence in the security of the test material.
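Returning to the "key validation" step described above, the following is a minimal sketch of the kind of item-level screen such analyses involve. It is illustrative only: the 0/1 response matrix, the function name, and the flagging thresholds are assumptions made for this example, not the operational rules of any testing program.

```python
# Illustrative key-validation screen; thresholds and data layout are assumptions.
import numpy as np

def flag_suspect_items(scores, min_p=0.05, min_item_rest_r=0.0):
    """scores: examinees x items array of 0/1 item scores."""
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    flagged = []
    for j in range(scores.shape[1]):
        item = scores[:, j]
        if item.std() == 0:                 # no variance: everyone right or everyone wrong
            flagged.append(j)
            continue
        p = item.mean()                     # proportion of examinees receiving credit
        rest = totals - item                # total score on the remaining items
        r = np.corrcoef(item, rest)[0, 1]   # item-rest (discrimination) correlation
        # A very low p, or a negative discrimination (stronger examinees tending
        # to "miss" the item), is the classic signature of a mis-keyed answer.
        if p < min_p or r < min_item_rest_r:
            flagged.append(j)
    return flagged
```

Statistics of this kind identify candidates for expert review of the key; they do not by themselves establish that an item is mis-keyed.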
Example II: Performance Assessment

The reproducibility of the stimulus material and scoring procedures is, as previously noted, a strength of standardized tests comprising multiple-choice items. Relatively little effort is required to be satisfied that two examinees assigned to the same test form but sitting at different computers are seeing the same items and that those items are being scored in the same way. The same is not necessarily true for performance assessments such as tests using standardized patients or other formats that require humans to present and/or score the assessment. Adding the human element creates the possibility that two standardized patients trained to portray the same scenario may perform in a less-than-perfectly-standardized manner; the same standardized patient may not portray the same scenario in the same way on two different occasions. The scoring phase of the validity argument will need to include evidence that standardized patients are trained to an acceptable standard, and it also will require evidence that standardized patients are monitored over time to ensure both interpatient and intrapatient consistency. Similar issues arise with scoring for these tests; whether the scores are produced by a standardized patient or content expert, it will be necessary to assess the accuracy of the process. Again, this aspect of testing must be verified before testing begins and must continue to be monitored over time. It also is important to remember that collecting evidence of a high level of rater agreement during a small-scale pilot administration should not replace collecting the same evidence once the test is being administered operationally.

In addition to verifying that the overall error rate is low, both in standardized patient portrayal and in the scoring process, it will be important to provide evidence that there are not significant interactions between examinee characteristics and standardized patients' performance or scoring. For example, the examinee's gender or ethnicity should have no impact on the way the scenario is portrayed and scored. If significant interactions are found suggesting that examinees of otherwise equal proficiency are likely to receive better scores if they are, for example, male rather than female, this would be a serious threat to valid score interpretation. This type of effect is more serious than random error in portrayal or scoring because random errors tend to average out across encounters; systematic effects do not.

Security issues also may be important with performance assessments. If the test is used to make important decisions, examinees may attempt to improve their scores by gaining prior access to test information. In most circumstances, performance assessments (and particularly standardized patient–based tests) are administered on multiple occasions. This creates the opportunity for examinees who have completed the examination to share information with others who will test in the future; in most situations, prior knowledge about the specific tasks that will appear on a test should be expected to influence examinee scores.18 This threat to validity is analogous to the problem associated with the reuse of material on tests comprising multiple-choice items, but in the case of performance assessments it is much more difficult to produce large banks of test "items." When tests are administered during a relatively short period, sequestering examinees to prevent the sharing of information may provide evidence that this threat to validity has been controlled. With standardized patient–based tests, an additional threat to security exists in that standardized patients themselves may share information with examinees before the test administration.
Example III: Workplace-Based Assessment

Evaluation of a clinician or trainee through direct observation involves taking another step away from the completely standardized stimulus material of the multiple-choice examination and the partially standardized conditions that exist in performance assessment to the relatively uncontrolled conditions of observation on the ward or clinic. To support the interpretations of scores produced in this setting, it will be necessary to produce evidence that different evaluators working in different settings are, in fact, assessing the same construct in the same way. One means of providing such evidence would be to carefully define the characteristics of performance to be rated. A combination of careful specification of what is being rated and thorough training of evaluators may provide reasonable support for the assertion that individuals are being assessed on the same construct.

Even with careful specification of the rating elements and evaluator training, it will be important to collect evidence that demonstrates that evaluators are in fact assessing the same constructs; think-aloud or other interview-based procedures may provide evidence regarding the specific attributes the evaluators are considering.

Generalization

This stage of the argument focuses on the relationship between the observed scores and the associated universe scores or true scores. Both universe scores and true scores are conceptualizations; the universe score represents the score that an examinee would receive if it were possible for that examinee to respond to all items representing the universe of acceptable observations (i.e., if the examinee responded to all items in the domain). The true score is a closely related concept representing the mean score that the examinee would receive if he or she completed an unlimited number of randomly equivalent (parallel) forms of the test. (The observed score is the score that is actually recorded when an examinee completes a specific test form.) The details of these definitions and the related theories are beyond the scope of this chapter; the interested reader is referred to Gulliksen1 and Lord and Novick19 for a detailed discussion of classical test theory and to Cronbach and associates20 and Brennan21 for discussions of generalizability theory.

Two kinds of evidence are required for this stage of the argument. First, it is necessary to show that the sample of items presented or observations made of the examinee are representative of the domain to which the score is to be generalized. Second, it is necessary to demonstrate that the sampling is sufficiently extensive to prevent the observed scores from being unduly influenced by sampling error. The extent to which the sample is representative will depend on the procedures used for test construction (data collection for workplace-based assessments); the adequacy of the sampling can be examined directly through a well-developed set of theory-based statistical procedures.

The samples will be representative to the extent that data collection follows specified rules. In some cases, random selection from a specified domain will be appropriate, whereas in other cases stratified sampling will be preferred. In some contexts, rules for the range of conditions under which observations may (or must) be made will replace the sampling of stimulus material.
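As a concrete illustration of stratified selection, the sketch below fills a form by drawing the number of items each blueprint category calls for. The categories, counts, and item identifiers are invented for the example; operational form assembly typically honors many additional constraints (difficulty targets, item exposure, and the like).

```python
# Illustrative stratified form assembly against a table of specifications.
# Categories, counts, and item IDs are invented for the example.
import random

def assemble_form(pool, blueprint, seed=None):
    """pool: category -> list of item IDs; blueprint: category -> number required."""
    rng = random.Random(seed)
    form = []
    for category, n_needed in blueprint.items():
        if len(pool[category]) < n_needed:
            raise ValueError(f"pool too small for category {category!r}")
        form.extend(rng.sample(pool[category], n_needed))  # random draw within stratum
    return form

pool = {"cardiology": ["c1", "c2", "c3", "c4", "c5"],
        "pulmonology": ["p1", "p2", "p3"]}
blueprint = {"cardiology": 3, "pulmonology": 2}
print(assemble_form(pool, blueprint, seed=1))  # a five-item toy "form"
```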
Far and away, the most developed aspect of test theory relates to evaluation of reliability; conceptually this methodology is designed to assess the relationship between observed scores and true scores or universe scores. The most common index of this relationship is the reliability coefficient; this coefficient represents the correlation between the observed test scores from two equivalent forms of the test. The square root of this value represents the correlation between observed scores and true scores on the test. In the classical test theory framework, the reliability coefficient also is directly related to the standard error of measurement, which represents the distribution of observed scores around a given true score.

A wide variety of approaches have been developed to estimate the relationship between observed scores and true scores. The usefulness of these procedures will depend on how one conceptualizes the meaning of "a replication of the measurement procedure."22 Because the specific set of items and the specific time and date on which the test was administered rarely are central to how the scores are to be interpreted, replication and reliability are often described in terms of the correspondence between scores achieved on two forms of a test. The value of this standard rests on the assumption that the characteristic to be measured has not changed between administrations.

When it is unlikely that relevant conditions of testing remain constant across occasions, it may be more appropriate to conceptualize a replication so that occasion is held constant. In the practical sense in which the test is administered twice on the same day, replication on the same occasion is open not only to the effects of fatigue but also to the effects of practice leading to increased familiarity with the format. In the literal sense of replication on the same occasion (in which two forms are administered simultaneously), these effects are absent but actual replication is not possible; only a conceptual or theoretical replication can exist.
For a test based on multiple-choice items, the definition of replication must include consideration of occasion and the selection of items. For more complex testing formats, the definition of replication similarly will be more complex. Consider, for example, an essay examination. In this instance, the stimulus will be standardized but a replication reasonably may involve a different set of essay prompts to which an examinee will respond on a different occasion. Additionally, the responses may be scored by a different set of judges and the judges may evaluate the material on different occasions. The definition of a replication therefore will depend on which features are considered fixed and which are considered random. In this context, the definition of fixed and random variables is guided by the desired interpretation of scores. In the event that score interpretation assumes that judgments were made by a specific group of experts and that all examinees were judged by the same experts, judges will be a fixed facet in the design. If it is sensible to view the specific judges as sampled from a larger group of similarly acceptable judges, the judges should be viewed as a random facet. Similarly, if the score interpretations assume a specific set of test items or other stimulus materials, then this facet is fixed; otherwise, if the stimuli may be viewed as sampled from a larger domain, items must be viewed as a random facet. Random facets will vary from one replication to the next; fixed facets will not vary.

The appropriate methodology for examining the relationship between observed scores and true scores or estimating the standard error of measurement will depend on the complexity of the data collection design. When practicality allows, actually repeating the measurement procedure will provide a sound basis for assessing the relationship of interest. The correlation between scores produced across replications will provide an appropriate estimate of the reliability of the test. Again, the square root of this value will represent the correlation between observed and true scores, and the well-known formula

σe = σX √(1 − rXX′)

provides an estimate of the standard error of measurement. (In this formula, σX represents the standard deviation of the observed scores, σe represents the standard error, and rXX′ represents the reliability of the test.)
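A worked instance, using invented numbers, may help fix the formula's meaning. Suppose the observed-score standard deviation is 15 and the estimated reliability is 0.84 (neither value comes from the chapter):

```latex
% Worked example with invented values (sigma_X = 15, r_XX' = 0.84):
\sigma_e = \sigma_X\sqrt{1 - r_{XX'}} = 15\sqrt{1 - 0.84} = 15\sqrt{0.16} = 15(0.4) = 6
```

Under the usual normal-error assumption, roughly two-thirds of an examinee's observed scores would fall within 6 points of the true score, a statement that is often more interpretable for score users than the reliability coefficient itself.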
In many circumstances, replication will not be practical; for example, candidates for licensure cannot be called upon to retest under the same high-stakes conditions after they have completed and passed an examination. Numerous procedures are available to evaluate test scores based on a single administration of an examination. Over a century ago, Spearman and Brown introduced the first of these procedures based on the correlation between split halves (e.g., even- and odd-numbered items) of an examination.4,23 KR2024 and coefficient alpha25 estimate a value equal to the average of all possible split halves. These procedures provide estimates of reliability based on the strength of relationship between items in a single test form. They work on the assumption that the strength of relationship (covariance) between item n and item m (n ≠ m) on a single test form will provide a good approximation of the strength of relationship between item "n" on test form 1 and item "m" on test form 2.

Coefficient alpha and the Kuder-Richardson formulas are useful tools for collecting evidence about the generalization of test scores. Unfortunately, they have become a kind of knee-jerk response to the question of score reliability. Too often, researchers seem to view estimating reliability as a requirement that allows them to report a coefficient that a journal editor will demand rather than as an opportunity to better understand the characteristics of their assessment. When applying these procedures, two important considerations arise. First and foremost, the evaluator must again ask the question about what is meant by a replication of the measurement procedure. These procedures are appropriate when generalization is viewed in terms of replication across items (or test forms) with all other conditions of measurement held constant. A second important consideration in applying these procedures is related to the assumptions used in their derivation. The central assumption in interpreting coefficient alpha (or KR20) is that, on average, the strength of relationship between any two items on a single test form is equal to that between any two items on different forms of the test. When this assumption is violated, the results may misrepresent the actual reliability of the assessment substantially. Typically, this violation will result in an overestimate of reliability. Consider, for example, the case in which a passage describing a clinical scenario is followed by several questions. It is common that the strength of relationship between questions associated with the same passage will be greater than the strength of relationship between items from different passages. Because the scenarios typically will be different from one test form to the next, the average relationship between items across test forms will be best approximated by the relationship between items from different scenarios on a single test.
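The computation of these single-administration coefficients is simple; the judgment lies in deciding whether the items are plausible replications of one another. The following minimal sketch (the function name and data layout are assumptions made for the example) computes coefficient alpha, with the testlet caveat from the text noted in the comments.

```python
# Illustrative coefficient alpha from an examinees x items score matrix.
# For dichotomous (0/1) items this reduces to KR-20.
import numpy as np

def coefficient_alpha(scores):
    """scores: 2-D array, rows = examinees, columns = items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    sum_item_vars = scores.var(axis=0, ddof=1).sum()  # sum of the item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of the total scores
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

# Caveat from the text: when several items share a clinical passage (a testlet),
# their within-passage covariances inflate alpha relative to the across-form
# reliability actually of interest. One common check is to rescore each testlet
# as a single "item" and recompute alpha at the passage level.
```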
Generalizability Theory

Classical test theory divides observed scores into two components: true score and error. Because an examinee's true score is defined as uncorrelated with error, it follows that observed-score variance is composed of true-score variance and error variance. Generalizability theory expands this framework to divide the overall variance into multiple components. Consider as an example the simple testing situation in which examinees respond to essay prompts that are evaluated by raters. To study the generalizability of the results, a researcher collected data for a group of examinees; all examinees responded to the same prompts, and all responses were evaluated by the same raters. In the framework of generalizability theory, essay prompts and raters become distinguishable sources of error variance. As in the classical test theory framework, it is possible to take data from a single administration, estimate the reliability (or generalizability) of the test, and project the expected reliability of the test with differing numbers of essay prompts. However, because generalizability theory provides a means of making explicit the error contributed by variability both in essay prompts and raters, this framework makes it possible to further project how the reliability of the test would change if the number of raters evaluating performance on each prompt also was varied.
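A projection of this kind reduces to simple arithmetic once variance components are in hand. The sketch below computes a generalizability coefficient for different numbers of prompts and raters; the component values are invented for illustration and are not drawn from any real study.

# Invented variance components from a persons x prompts x raters G study
vc = {"p": 0.25, "t": 0.10, "r": 0.05,
      "pt": 0.30, "pr": 0.06, "tr": 0.02, "ptr": 0.40}

def projected_g(n_prompts, n_raters):
    # Relative error variance when generalizing over prompts and raters
    error = (vc["pt"] / n_prompts + vc["pr"] / n_raters
             + vc["ptr"] / (n_prompts * n_raters))
    return vc["p"] / (vc["p"] + error)

for n_t, n_r in [(4, 1), (8, 1), (4, 2), (8, 2)]:
    print(f"{n_t} prompts x {n_r} raters: G = {projected_g(n_t, n_r):.2f}")
# 4 x 1: 0.52;  8 x 1: 0.63;  4 x 2: 0.62;  8 x 2: 0.73

With these invented values, doubling the prompts and doubling the raters buy similar gains, which is exactly the kind of trade-off such projections are meant to expose.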
The previous paragraphs provide an overview of approaches for evaluating the generalizability of test scores. Again, a full discussion of these issues is well beyond the scope of this chapter. The following sections provide additional specificity about generalizability considerations within the three assessment contexts considered previously.

Example I: A Multiple-Choice Examination

The focus of the generalization stage of the validity argument is on the extent to which scores will be comparable across replications of the assessment procedure. In the context of standardized multiple-choice-based assessments, the interpretation of scores typically will require that they are comparable across multiple test forms. For example, a licensing or certifying examination would lose credibility if examinees could expect widely varying scores based on which test form they were assigned.

Viewed from a generalizability theory framework, this part of the argument will require several types of evidence. First, it will be necessary to demonstrate that the sampling procedure used for test construction supports the creation of comparable test forms. The simplest case of the construction of multiple forms would be based on random selection of items from an available pool of acceptable items. This is conceptually simple, but it is unusual for standardized tests. A more common approach would be to select items to meet the constraints of a table of specifications or test "blueprint." In this case, items may be randomly selected from each of a number of content categories (see Table 2.3 for an example using a hypothetical 200-item multiple-choice test in internal medicine). When different item formats are included on the test, the table of specifications may specify the number of items from each combination of format and category. A common variation on this theme is to write items for a new form of the test to meet the specifications of the previous form. When systematic differences exist in the test construction procedure across forms, estimation of the correlation between scores on multiple forms based on generalizability analysis of a single form will be inappropriate.

When systematic test construction procedures are used, multiple-choice-based tests typically will have a reasonably simple data collection design: examinees will typically be the focus of the measurement procedure (referred to as the object of measurement in generalizability theory terminology), and the sampling of items will represent a potential source of measurement error. With this simple design, three sources of variance (referred to as variance components) can be estimated: a person variance component, which is conceptually equivalent to true-score variance in classical test theory; an item variance component, which represents the variability in item difficulty; and a person-by-item variance component, which represents residual variance not explained by the other two effects. The person-by-item variance component divided by the number of items will represent the error variance when comparisons are being made between examinees who have completed the same test form; when comparisons are being made between examinees who have completed different test forms, the definition of error variance is more complicated. If the test forms are constructed through a process that approximates random sampling from an undifferentiated item pool and there is no formal procedure to adjust the scores for difficulty differences, the appropriate error variance will be the sum of the item variance component and the person-by-item variance component, both divided by the number of items. When statistical equating procedures are used, the impact of the item variance component may be reduced.
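As a concrete illustration of the blueprint-driven approach, the sketch below assembles a form by drawing the required number of items at random from each content category. It is a minimal sketch only: the item pool, the identifiers, and the use of just two category counts from Table 2.3 (collapsed across clinical tasks) are all hypothetical.

import random
from collections import defaultdict

def build_form(pool, blueprint, seed=None):
    # Randomly sample the required number of items from each content category
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item_id, category in pool:
        by_category[category].append(item_id)
    form = []
    for category, n_needed in blueprint.items():
        form.extend(rng.sample(by_category[category], n_needed))
    return form

# Two rows of the Table 2.3 blueprint, collapsed across clinical tasks
blueprint = {"Cardiovascular disorders": 30, "Respiratory disorders": 25}
# A hypothetical pool of 100 vetted items per category
pool = [(f"{cat[:6]}-{i:03d}", cat) for cat in blueprint for i in range(100)]

print(len(build_form(pool, blueprint, seed=1)))  # 55 items meeting the constraints

Because the category structure is identical on every form built this way, the content facet is fixed across forms, which matters for the variance component discussion that follows.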
TABLE 2.3  Sample Blueprint for a 200-Item Multiple-Choice Test in Internal Medicine
Number of Questions per Clinical Task

Disease Category/Organ System* | Making a Diagnosis | Making Therapeutic Decisions | Preventing Disease | Using Diagnostic Studies | Total
Cardiovascular disorders | 10 | 9 | 5 | 6 | 30
Dermatologic disorders | 4 | 2 | 2 | 2 | 10
Endocrine and metabolic disorders | 7 | 6 | 3 | 4 | 20
Gynecologic disorders | 3 | 3 | 2 | 2 | 10
Hematologic disorders | 3 | 3 | 1 | 3 | 10
Immunologic disorders | 3 | 3 | 2 | 2 | 10
Mental disorders | 4 | 3 | 1 | 2 | 10
Musculoskeletal disorders | 8 | 6 | 2 | 4 | 20
Neurologic disorders | 6 | 4 | 2 | 3 | 15
Nutritional and digestive disorders | 8 | 9 | 4 | 4 | 25
Renal, urinary, and male reproductive disorders | 6 | 3 | 2 | 4 | 15
Respiratory disorders | 8 | 9 | 4 | 4 | 25
Total | 70 | 60 | 30 | 40 | 200

*Items related to infectious and neoplastic diseases are included in the affected organ system.

When items are sampled from fixed content categories, the analysis becomes more complicated. In this situation, there are variance components for persons (p); content categories (c); items nested in content categories (i:c); persons by content categories (p × c); and persons by items nested in content categories (p × i:c). In this case, the c component will not contribute to measurement error because this structure is fixed across test forms. Similarly, because the categories are fixed, the p × c variance component will contribute to universe or true score variance. The p × i:c component will contribute to error, and when comparisons are made across forms, the i:c component will contribute to measurement error. The impact of this latter component again will be mitigated to the extent that test forms are constructed or equated to be statistically equivalent. This stratification process typically will yield a smaller standard error and a larger generalizability coefficient than analysis without stratification; this is one reason that coefficient alpha is referred to as a lower-bound estimate of reliability. It should be noted, however, that in practice the difference in coefficients typically is modest.

The error variance estimates produced using generalizability theory provide a basis for estimating the standard error of measurement for the test; these are useful for providing confidence intervals around scores. Generalizability coefficients also may be produced as the ratio of the universe score variance divided by the sum of the universe score variance and the error variance. Although these indices are commonly reported, caution is required because they will be sensitive to the specific sample of examinees used in the estimation. Consider, for example, estimation of such an index for one of the steps of the United States Medical Licensing Examination; if the coefficient is estimated based on the relatively homogeneous group of US graduates taking the test for the first time, it may be several points lower than if it is estimated based on all examinees completing the test. In contrast, the standard error of measurement tends to be more stable across groups, making it a more interpretable and useful index of precision.
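The bookkeeping described above is straightforward once variance component estimates are available. The sketch below uses invented component values, together with the chapter's hypothetical 200-item length, to compute the two error variances and the corresponding coefficients; none of the numbers come from a real examination.

# Invented variance components from a person x item G study
var_p, var_i, var_pi = 0.040, 0.015, 0.250
n_items = 200

rel_error = var_pi / n_items            # all examinees saw the same form
abs_error = (var_i + var_pi) / n_items  # randomly sampled forms, no equating

g_same_form = var_p / (var_p + rel_error)
g_diff_forms = var_p / (var_p + abs_error)
sem = rel_error ** 0.5                  # standard error of measurement

print(f"G (same form):       {g_same_form:.3f}")   # about 0.970
print(f"G (different forms): {g_diff_forms:.3f}")  # about 0.968
print(f"SEM:                 {sem:.4f}")           # about 0.0354 on this score scale

With 200 items the item facet contributes little, so the two coefficients barely differ in this sketch; on much shorter forms the gap widens.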
Example II: Performance Assessment

The logic of the argument described in the previous example holds in the context of performance assessments. To draw conclusions from analyses based on a single administration of the assessment, the rules employed in test construction must guarantee that there will not be systematic differences in test forms. The logistic realities of test delivery may make this more difficult when the items are people, but clearly generalization across test forms will be threatened if the tasks (e.g., standardized patients) on one form are systematically different than those on another. Important differences could include changes in the types of problems portrayed, as well as changes in the level of experience and training of the patients.

The generalizability of standardized tests composed of multiple-choice items is relatively easy to evaluate, and even the simpler classical test theory models provide adequate tools for most situations. The complexity of performance assessments, however, makes evaluation of the generalizability of scores a more difficult matter. Consider a test in which examinees rotate through a set of stations and at each station they interact with a patient and complete a patient note. The notes then are scored by a group of raters. When examinees complete the same set of stations and notes are rated by the same set of raters, variance components can be estimated for persons, stations, raters, persons by stations, persons by raters, stations by raters, and persons by stations by raters. (This is a relatively infrequent occurrence for large-scale objective structured clinical examinations. Even if all examinees rotate through the same set of stations, multiple “circuits” with different standardized patients portraying case roles and different raters grading performance are commonly used, and this adversely affects precision.)16 The evaluator will need to determine which of these components contributes to measurement error in the specific context. Interaction terms that include the person and station effect almost always will contribute to measurement error—regardless of the intended score interpretation—because the generalization argument is about the extent to which the score from this test form is comparable to the score from a similarly constructed test form. By contrast, generalization over raters may or may not be important. If the test is administered in a context in which the same group of raters rates all examinees, and if there is no intention to draw inferences about how the examinees may have performed with other raters, then raters are considered a fixed facet in the design. In this case, the rater and station-by-rater variance components will not contribute to measurement error and the person-by-rater component will contribute to universe score variance. In most circumstances, however, users of test scores will want to draw inferences that extend beyond the group of raters scoring an examinee’s performance, and these variance components are best viewed as contributing to measurement error (often substantially if the typical examinee is scored by a small number of raters).

Up to this point, it should be clear that when a facet in the design is considered fixed, the scores will have a smaller error variance and a higher level of generalizability. The evaluator may be tempted to try to increase the generalizability of scores by considering facets fixed. This strategy is without merit; it gives an encouraging answer to the wrong question.
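The sketch below makes that temptation visible with invented variance components for a fully crossed persons x stations x raters design: declaring raters fixed mechanically raises the coefficient, but the higher number answers only the narrower question of how reproducible scores are with these particular raters.

# Invented variance components for a persons x stations x raters design
vc = {"p": 0.30, "s": 0.10, "r": 0.05,
      "ps": 0.40, "pr": 0.08, "sr": 0.02, "psr": 0.45}
n_s, n_r = 8, 2  # stations per form, raters per station (hypothetical)

# Raters treated as random: generalizing over both stations and raters
err_random = vc["ps"] / n_s + vc["pr"] / n_r + vc["psr"] / (n_s * n_r)
g_random = vc["p"] / (vc["p"] + err_random)

# Raters treated as fixed: the person-by-rater component moves into
# universe-score variance, and rater-linked terms drop out of the error
univ_fixed = vc["p"] + vc["pr"] / n_r
err_fixed = vc["ps"] / n_s + vc["psr"] / (n_s * n_r)
g_fixed = univ_fixed / (univ_fixed + err_fixed)

print(f"G, raters random: {g_random:.2f}")  # about 0.72
print(f"G, raters fixed:  {g_fixed:.2f}")   # about 0.81, but a narrower claim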
alizability of scores resulting from assessments using perfor-
Example III: Workplace-Based Assessment mance tasks or workplace observation is that the conditions
of observation and scoring are more difficult to standardize.
When examinees are observed in a practice setting, the gen- This argues for increasing the structure of the assessment, but
eralization portion of the validity argument may be prob- this process requires careful thought. The decision to imple-
lematic. Although there may be explicit rules controlling the ment a less structured assessment instead of one that is more
sampling of observations, the logistics of workplace-based highly structured (e.g., a clinical rather than a multiple-
assessment could make it likely that the environmental fac- choice examination) is based on the perceived need to more
tors and patient characteristics are more similar from one directly assess the construct of interest. The problem lies in
observation to another within versus between examinees. the fact that changing the scoring procedure may increase the
This may lead to an overly optimistic report on the generaliz- standardization of the assessment by altering what is being
ability of scores. In this setting, the scores will be influenced assessed; the focus of the assessment, therefore, may shift in
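A small simulation shows one way this optimism arises; the design and all variance values are invented. Each examinee is observed six times, always within a single context (one site and patient mix), so the context effect travels with the person and is indistinguishable from true proficiency in the analysis.

import numpy as np

rng = np.random.default_rng(1)
n_examinees, n_obs = 300, 6

person = rng.normal(0.0, 0.5, (n_examinees, 1))   # true person effect
context = rng.normal(0.0, 0.5, (n_examinees, 1))  # one shared context per examinee
noise = rng.normal(0.0, 1.0, (n_examinees, n_obs))
scores = person + context + noise

def cronbach_alpha(x):
    # Coefficient alpha over the n_obs observation columns
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

print(f"apparent reliability: {cronbach_alpha(scores):.2f}")  # about 0.75

# If the intent is to generalize across contexts, the context effect is error;
# with the simulation's known variances the honest coefficient is roughly half
true_g = 0.25 / (0.25 + 0.25 + 1.0 / n_obs)
print(f"generalizing across contexts: {true_g:.2f}")  # about 0.38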
It usually is the case that the generalizability of scores will decrease as the type of assessment changes from a highly structured format—such as a professionally developed multiple-choice test—to a performance assessment or workplace-based observation. There are two reasons for this. First, it is possible to sample from the domain of interest more widely and efficiently with multiple-choice items because it takes relatively little time to respond to them and they are inexpensive to score. Second, both the sampling of content and the scoring can be more highly standardized with multiple-choice assessments so that the contribution of these factors to measurement error can be markedly reduced.

The potential to sample more widely reduces the impact of the examinee-by-item interaction, as well as the effect of any higher-order interaction terms (including residual variance). There is a widely held view that the examinee-by-item interaction term in the typical person-by-item design represents “content specificity,” or the tendency for physician knowledge to be highly problem specific. The pervasive nature of the effect is well documented: the examinee-by-item (or case) interaction term is routinely the largest single source of error variance. It is, however, less clear whether this term represents content specificity or other sources of uncontrolled variability in the design. There is relatively little research investigating how consistently examinees respond to the same items or cases on different occasions. To the extent that the effect of interest actually is content specificity, examinees completing the same multiple-choice items or completing the same performance task on multiple occasions would receive highly consistent scores. There is some evidence from outside the domain of medical assessment to suggest that scores may not be highly reproducible across occasions. Similarly, there is evidence that the generalizability of test scores can be improved by building test forms to consistently sample from fixed content categories; however, the absolute magnitude of this improvement generally is small.

As noted previously, a second reason for the lower generalizability of scores resulting from assessments using performance tasks or workplace observation is that the conditions of observation and scoring are more difficult to standardize. This argues for increasing the structure of the assessment, but this process requires careful thought. The decision to implement a less structured assessment instead of one that is more highly structured (e.g., a clinical rather than a multiple-choice examination) is based on the perceived need to more directly assess the construct of interest. The problem lies in the fact that changing the scoring procedure may increase the standardization of the assessment by altering what is being assessed; the focus of the assessment, therefore, may shift in the direction of proficiencies that are more easily quantified and away from its original intent. This is not to argue against making every effort to structure the assessment; the key is to structure the assessment with a careful eye on the intended interpretation of the scores. Inevitably, it will be necessary to strike a balance between the generalizability of scores and the extent to which one can extrapolate from those scores to the actual proficiencies of interest. The next section examines the extrapolation phase of the argument.