Validity: one word with a plurality of meanings

DOI 10.1007/s10459-016-9716-3
Brian Hodges
Received: 13 March 2016 / Accepted: 26 September 2016 / Published online: 1 October 2016
© Springer Science+Business Media Dordrecht 2016
Abstract Validity is one of the most debated constructs in our field; debates abound about what is legitimate and what is not, and the word continues to be used in ways that are explicitly disavowed by current practice guidelines. The resultant tensions have not been well characterized, yet their existence suggests that different uses may maintain some value for the user that needs to be better understood. We conducted an empirical form of Discourse Analysis to document the multiple ways in which validity is described, understood, and used in the health professions education field. We created and analyzed an archive of texts identified from multiple sources, including formal databases such as PubMed, ERIC, and PsycINFO as well as the authors’ personal assessment libraries. An iterative analytic process was used to identify, discuss, and characterize emerging discourses about validity. Three discourses of validity were identified. Validity as a test characteristic is underpinned by the notion that validity is an intrinsic property of a tool and can, therefore, be seen as content and context independent. Validity as an argument-based evidentiary-chain emphasizes the importance of supporting the interpretation of assessment results with ongoing analysis, such that validity does not belong to the tool/instrument itself; the emphasis is on process-based validation (emphasizing the journey instead of the goal). Validity as a social imperative foregrounds the consequences of assessment at the individual and societal levels, be they positive or negative. The existence of different discourses may explain, in part, results observed in recent systematic reviews that highlighted discrepancies and tensions between recommendations for practice and the validation practices that are actually adopted and reported. Some of these practices, despite contravening accepted validation ‘guidelines’, may nevertheless respond to different and somewhat unarticulated needs within health professions education.
Introduction
Purpose
The primary purpose of this study was to use discourse analysis to identify the different
ways in which the term ‘validity’ is used within health professions education. Further, we
aimed to determine who participates in each discourse (and how) and to gain some
understanding of the potential consequences of emphasizing one discourse over another.
The objective of this research was not to generate one unifying conception or definition of
validity, nor to label one conceptualization or discourse as the right one. Rather, we
examined the literature on and about validity with the objective of identifying underlying,
implicit, and sometimes disavowed conceptions that surround the construct of validity.
Method
Design
Our research is based on discourse theory (Mills 2004) and thus employs a methodology of
Discourse Analysis. Our approach is primarily what Hodges et al. (2008) have labeled
‘empirical discourse analysis’—having a primary focus on ways validity is constructed in
text and language in the health professions education literature.
Data (Archive)
Discourse analysis begins with identification of an archive—the textual and other materials
that are analyzed to identify discourses of interest (Hodges et al. 2008). We predominantly
restricted our study archive to sources from health professions education because the
purpose was to explore how the construct is variably used by people working in that field.
Recognizing that the word validity is used in many disciplinary fields with a multiplicity of
meanings, however, we used secondary sources as necessary to generate a more complete
understanding of health professions education sources that made reference to or conceptually relied on other literatures. The archive assembled for this research was constructed
in a four-step process and is comprised of English and French published peer-reviewed
articles and books on the topic of validity and assessment.
Step 1 All authors of this paper identified five to six key papers on validity and
assessment in the field of health professions education from their personal collection that
they considered to be ‘very important’ in terms of framing assessment practices. One
author (CSO) reviewed the texts critically to identify emerging language and concepts
about validity.
Step 2 The references from this collection of papers were examined to expand the
archive to include secondary sources. One author (CSO) reviewed these texts in order to
further identify emerging discourses.
Step 3 A more formal search of the health professions education literature was conducted using PubMed, ERIC, and PsycINFO/PsycLit, with the assistance of an academic
librarian. The goal of this search was to identify a larger set of articles, published
between 1995 and 2013, on the topic of ‘validity of assessment’ in health professions
education. The review was conducted using the truncated key words valid* and assess*
with the intention of increasing the breadth and coverage of the archive rather than with
the aim of undertaking a traditional systematic review. In other words, the goal was not
to be comprehensive with this search or to set very precise inclusion/exclusion criteria.
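A search of this shape can be expressed as a database query. The following is an illustrative sketch only; the PubMed-style field tags ([tiab] for title/abstract, [dp] for date of publication) are assumptions for the purpose of illustration, not the authors’ reported search string:

```
valid*[tiab] AND assess*[tiab] AND 1995:2013[dp]
```

Equivalent truncation operators exist in ERIC and PsycINFO, though the field labels differ by platform.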
Procedure
For each document identified, NVivo 9 was used to facilitate the coding of ways in which
the concept of validity was described. Particularly helpful in identifying different conceptualizations of validity were statements indicating what one ‘should’ or ‘must’ do
(deontic modals) to claim validity, statements about the consequences and implications of
such claims, and statements of truth (i.e., seemingly self-evident statements) about validity.
Analysis
The analysis was iterative, building the discourse table sequentially at each successive stage, and it was informed by the team’s collective and individual expertise (described below) and the inherent perspectives that result. As elements of discourse (key words, concepts, arguments, associated individuals and institutions) were more clearly identified, they were sorted into patterns that eventually led to the construction of Table 1. Throughout this process there were frequent discussions with the whole research team, and the discourses and their associated elements developed iteratively and consensually. An effort was made to identify various aspects of each emerging discourse: a core conceptualization, characteristics, things made possible by each discourse, observations about which institutions appeared to have authority or to hold power in relation to particular discourses, and finally, ways in which individuals appeared to be participating in each discourse. Though much more speculative, we also looked for indications of consequences that might result from emphasizing one discourse over another in practice. This process took place over several months, with underlying and identified discourses being revisited, discussed, and refined until consensus among the research team was achieved.
The authors have various backgrounds and expertise which informed the analysis of the
data. The primary author, CSO, was trained in measurement and assessment. BH is a
clinician and educator who is well-versed in discourse analysis. MY and KE brought a
cognitive psychology perspective to the study of assessment and validity. KE and BH are
senior health professions education researchers who have dedicated much of their academic effort to studying issues of assessment as well as contributing to various local,
national, and international committees charged with guiding assessment practices. Both
have an extensive track record of developing assessment protocols for medical training.
MY and CSO bring more novice yet complementary perspectives.
Table 1 Summary of the three discourses of validity

Validity as a test characteristic
- Definition: the degree to which the test actually measures what it purports to measure
- Characteristics: validity is a goal or a gold seal of approval
- Validity is viewed as: static
- Focus of evidence is on: individual tools can be considered valid, and the validity can generalize to the tool format («MCQs are valid»)
- Things made possible: the quest for the holy grail of assessment; one tool that is more valid than the others
- Validation occurs: a posteriori (mainly)
- Validation data focused on: psychometric

Validity as an argument-based evidentiary-chain
- Definition: the evidence presented to support or refute the meaning or interpretation assigned to assessment results
- Characteristics: validity is a journey on which one embarks to provide evidence supporting the interpretation of scores
- Validity is viewed as: fluid
- Focus of evidence is on: defensible interpretation of scores
- Things made possible: validation approaches and standards
- Validation occurs: a posteriori (mainly)
- Validation data focused on: mostly psychometric

Validity as a social imperative
- Definition: a bird’s-eye view of assessment that foregrounds broader individual and societal issues
- Characteristics: validity and validation are matters of social accountability
- Validity is viewed as: built-in
- Focus of evidence is on: individual and societal impact of assessment
- Things made possible: holistic and a priori consideration of the societal impact of assessment
- Validation occurs: a priori (mainly)
- Validation data focused on: mostly expert judgement
Results
Three different discourses were identified in the archival materials studied: (1) Validity as
a test characteristic, (2) Validity as an argument-based evidentiary-chain and (3) Validity
as a social imperative. We do not consider these to be the only possible conceptualizations
of validity, but these were the discourses that were observed most strongly in our analysis.
Further, we do not believe these discourses to be mutually exclusive given that we witnessed instances where more than one discourse was present in a given publication.
Nevertheless, as analysis proceeded, these three discourses were the most distinguishable.
Validity as a test characteristic is underpinned by the notion that validity is an intrinsic property of a tool and can, therefore, be seen as content and context independent. Validity as an argument-based evidentiary-chain emphasizes the importance of supporting the interpretation of assessment results with ongoing analysis, such that validity does not belong to the tool/instrument itself. The emphasis is on process-based validation (emphasizing the journey instead of the goal). Validity as a social imperative foregrounds the consequences of assessment at the individual and societal levels, be they positive or negative. Detailed descriptions follow, but see Table 1 for a summary of the discourses.
Validity as a test characteristic
In this discourse, validity is often defined as ‘‘the degree to which the test actually mea-
sures what it purports to measure’’ (Anastasi 1988, p. 28). Taken literally and at face value,
such a definition treats claims of validity as statements about a characteristic of the tool
itself (i.e., something that inherently belongs to the tool). As such, validity as a property
inherent in a tool spans domains of content, context, and time. Often found associated with
this discourse is a concept illustrated by the statement: ‘‘a test is valid for anything with
which it correlates’’ (Guilford 1946, p. 429). Once a tool is branded as ‘valid’ using this
discourse, the tool is treated as though it retains that quality indefinitely.
Validity as a test characteristic, therefore, can be thought of as a ‘gold seal of approval’.
As an example, those employing this discourse would say that Multiple Choice Questions
(MCQs) are a valid measure of knowledge, full stop. Thus they can be defended as the go-to format when creating written exams. Here are a few examples of claims about tools that
are framed as having achieved a ‘gold seal’ of validity:
The Job Descriptive Index (JDI), a validated survey, was used to measure levels of
job satisfaction. (Graeff et al. 2014, p. 15; emphasis by authors)
Evidence has shown the JSE [Jefferson Scale of Empathy] to be a valid and reliable
measure of empathy in medical students and physicians in the context of healthcare.
(Van Winkle et al. 2013, p. 219; emphasis by authors)
When one considers validity to be a quality of a tool, the door is opened for the possibility
that for any given domain (knowledge, skills, professionalism, etc.) there could be one
superior tool that could be shown to be the most valid. Thus, this discourse makes possible
the quest for ‘holy grails’ of assessment, a quest to identify the ‘best’ assessment tools, independent of the content or ability to be measured. For example, it is written that,
MCQ testing is the most efficient form of written assessment, being both reliable and
valid by broad coverage of content. (McCoubrie 2004, p. 711; emphasis by authors)
Interestingly, we found that the discourse of validity as a test characteristic is judged
harshly by some who argue that this view is ‘antiquated’, ‘controversial’, or ‘lacking in
value’.
We often read about ‘‘validated instruments.’’ This conceptualization implies a
dichotomy—either the instrument is valid or it is not. This view is inaccurate. First,
we must remember that validity is a property of the inference, not the instrument.
Second, the validity of interpretations is always a matter of degree. An instrument’s
scores will reflect the underlying construct more accurately or less accurately but
never perfectly. (Cook and Beckman 2006, p. 166e10)
When considering categories of validity, it is understood that validity evidence exists
to various degrees, but there is no threshold at which an assessment is said to be
valid. (Beckman et al. 2004, p. 973)
However, the continued presence of this discourse of validity in the health professions
education literature suggests that it may fill a pragmatic need for individuals who require
‘off-the-shelf solutions’ (e.g., educators and administrators who do not have the desire,
knowledge, or resources to create ‘valid’ assessment programs, tools, or approaches de
novo). Using the discourse of validity as a test characteristic permits the possibility of
‘found’ solutions to overcome the challenges associated with assessing students and future
professionals using tools reported to be of high quality in a context of limited resources or
limited psychometric expertise.
… Developing a valid and reliable assessment of competence is not easy to achieve
with the resources available at the university level. (Roberts et al. 2006, p. 542; emphasis by authors)
…it is important that valid instruments are created so administrators can better assess
the educational needs of prospective physicians, their practices, and patient outcomes… (Schulman and Wolfe 2000, p. 107; emphasis by authors)
We can speculate about some of the effects of emphasizing validity as a test characteristic
over other discourses of validity. Validity as an immutable property of a test has the
potential to create a false sense of security for assessment practitioners, who may never
feel the need to question or re-evaluate an instrument’s ‘gold seal’. Using an MCQ exam
format, for example, because it is said to be ‘valid’ (McCoubrie 2004) without consideration for item-writing guidelines such as those put forward in Haladyna et al. (2002), or the nature of the context in which the MCQs are used, is an example of such a
blind spot. Similarly, choosing to put in place an assessment approach like OSCEs or
MMIs without proper blueprinting strategies or without vigilance for problematic scenarios
may defeat the intended purpose of achieving a contextually meaningful assessment (Eva
and Macala 2014). Moreover, using tests beyond their original contexts or for purposes
other than the originally studied uses can have consequences. Gould (1996) eloquently illustrates the dangers of employing ‘validated’ tests indiscriminately in his book The
Mismeasure of Man in which he documents how the IQ test, based on the premise that
intelligence can be quantified in a single, decontextualized score, has been used in novel
contexts (such as for immigration purposes) to draw inappropriate conclusions such as
labelling of entire ethno-cultural groups as ‘less intelligent’ based on test scores.
Validity as an argument-based evidentiary-chain
would apply only to one instance of its use. Subsequent usages and the results generated
would require repeated validation.
For those employing the discourse of validity as an argument-based evidentiary-chain,
there is an analogy to the scientific method in that ‘validation’ is used to provide evidence
to support or refute the use and interpretation of data/scores. There is also recognition that
conclusions may change as evidence continues to accumulate. It is thus characterized by
some as a journey on which one embarks after having defined the assessment purpose(s).
The goal is to collect as much evidence as possible in a validation process, to identify variables/factors that inform the degree to which data produced by a particular test are valid, and also to understand and set limits on claims of validity.
Users of this discourse aim to create assessment strategies that are based on theories and
then evaluate whether the observed results show evidence of expected manifestations of the underlying theory. The use of script concordance testing (SCT) is an example of a theory-based assessment strategy (Charlin et al. 2000), and the validation practices put in place for SCT often reflect the premise that experts have more developed ‘illness scripts’ and thus
should perform better on the ill-defined tasks reflected in the SCT than more novice
examinees. While there is considerable evidence in that regard, the recent scrutiny of SCTs
(e.g., Lineberry et al. 2013) highlights the ‘argumentative’ nature of validation practices.
Thus, this discourse places validation approaches and standards at the forefront, and two
authors—Messick (1995) and Kane (2006)—are very frequently cited as anchor authorities
for this discourse. Validity as an argument-based evidentiary-chain thus creates a set of
rules and regulations to be applied if one wants to speak authoritatively about the quality of
the scores generated by an assessment.
Methods for evaluating the validity of results from psychometric assessments derive
from theories of psychology and educational assessment, and there is extensive
literature in these disciplines. (Cook and Beckman 2006)
This discourse appears to be strongly associated with formalized assessment institutions such as the Standards for Educational and Psychological Testing (AERA et al. 1999), the
Educational Testing Service, and others that regulate validity and validation practices.
Moreover, these institutions legitimize the role of highly qualified people who apply and
enforce the recommended practices.
In an attempt to establish a unified approach to validity, the American Psychological
Association published standards that integrate emerging concepts. These standards
readily translate to medical practice and research and provide a comprehensive
approach for assessing the validity of results derived from psychometric instruments.
(Cook and Beckman 2006, p. 166.e8)
An apparent consequence of over-emphasizing this discourse of validity is that validation can become a never-ending process (Bertrand and Blais 2004):
‘‘validity and assessment validation and revision is a never-ending cycle’’ (Beckman et al.
2009, p. 188). In addition, there appear to be no clear rules about how to weigh the
different forms of evidence for different score interpretations. Thus, one can remain
engaged in a continuous quality assurance process with the need to interpret and
incorporate each new piece of validity evidence collected and each new score
interpretation (Cook et al. 2015). Finally, one further consequence of emphasizing this
discourse is that little consideration may be given to content experts who feel that they
‘know’ what defines good performance in practice but who may become undervalued in
the assessment process because their judgement does not seem relevant or reliable in the
Discussion
This study has focused on the use of the term “validity” within the specific context of health professions education and assessment. Independent of the underlying conceptualization, it is clear that validity is a commonly used, and highly loaded, term in health
professions education. This can be seen through considerable recent work that has
extensively critiqued reported validation practices both in the health professions education
literature and in the more general educational psychology literature (Cook et al. 2013, 2014).
The purpose of our discourse analysis was to document some of this diversity in order to help clarify underlying tensions or unarticulated assumptions. Validity seems to
signify, for some, that a tool can or even should be used since it has met a certain gold
standard. For others, it seems to speak to the process put in place to ensure the appro-
priateness of the score interpretation. For a third group, validity seems to be about the
considerations for the role and value of assessment for learners and society with a focus on
minimizing unintended consequences.
The power relations and benefits gained by adopting each of these discourses may explain, in part, the observed discrepancies between the processes that are sometimes adopted in published validation practices and the processes recommended by the guidelines generated and endorsed by testing organizations. To be credible to those employing
Conclusion
Validity has several different meanings in health professions education. What these
meanings have in common, perhaps, is an implicit understanding that validity, in some
form, is and should be at the heart of any discussion about assessment development and
quality monitoring. It is likely that the discourses observed in this study arise in relation to
usages of the concept of validity in a number of parallel disciplines and fields that influence
health professions education. It may be that changes in which discourse is seen as legitimate or dominant can be traced to changing relationships between the health professions and other fields. Having mapped some of the dominant discourses within health professions education, future work in elucidating these relationships would be of value. We believe that our study suggests that taking up validation practices requires a better understanding of the diverse conceptions of validity and that each may fill very different and sometimes disavowed needs for health professions educators, researchers, administrators, and their organizations. As such, our recommendation for those employing
Acknowledgments The authors would like to thank Catherine Côté for her help with data management in
NVivo, Tim Dubé, Ph.D. and anonymous reviewers for feedback on a previous version of this manuscript.
Funding for this project was provided by the Société des médecins de l’Université de Sherbrooke Research
Chair in Medical Education, held by Christina St-Onge.
References
AERA, APA, & NCME (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education). (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Andreatta, P. B., & Gruppen, L. D. (2009). Conceptualising and classifying validity evidence for simulation.
Medical Education, 43(11), 1028–1035. doi:10.1111/j.1365-2923.2009.03454.x.
Beckman, T. J., Ghosh, A. K., Cook, D. A., Erwin, P. J., & Mandrekar, J. N. (2004). How reliable are
assessments of clinical teaching? A review of the published instruments. Journal of General Internal
Medicine, 19(9), 971–977. doi:10.1111/j.1525-1497.2004.40066.x.
Beckman, T. J., Mandrekar, J. N., Engstler, G. J., & Ficalora, R. D. (2009). Determining reliability of
clinical assessment scores in real time. Teaching and Learning in Medicine, 21(3), 188–194.
Berendonk, C., Stalmeijer, R. E., & Schuwirth, L. W. T. (2013). Expertise in performance assessment:
Assessors’ perspectives. Advances in Health Sciences Education, 18(4), 559–571.
Bertrand, R., & Blais, J.-G. (2004). Modèles de mesure: L’apport de la théorie des réponses aux items. Retrieved from https://books.google.com/books?hl=fr&lr=&id=3hPlCHaA7DoC&pgis=1.
Charlin, B., Roy, L., Brailovsky, C., Goulet, F., & van der Vleuten, C. (2000). The script concordance test: A tool to assess the reflective clinician. Teaching and Learning in Medicine, 12(4), 189–195.
Cizek, G. J., Bowen, D., & Church, K. (2010). Sources of validity evidence for educational and psychological tests: A follow-up study. Educational and Psychological Measurement, 70(5), 732–743. doi:10.1177/0013164410379323.
Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68(3), 397–412. doi:10.1177/0013164407310130.
Cook, D. A., & Beckman, T. J. (2006). Current concepts in validity and reliability for psychometric
instruments: Theory and application. The American Journal of Medicine, 119(2), 166.e7–166.e16.
doi:10.1016/j.amjmed.2005.10.036.
Cook, D. A., Brydges, R., Ginsburg, S., & Hatala, R. (2015). A contemporary approach to validity arguments: A practical guide to Kane’s framework. Medical Education, 49(6), 560–575.
Cook, D. A., Brydges, R., Zendejas, B., Hamstra, S. J., & Hatala, R. (2013). Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Academic Medicine: Journal of the Association of American Medical Colleges, 88(6), 872–883. doi:10.1097/ACM.0b013e31828ffdcf.
Cook, D. A., Zendejas, B., Hamstra, S. J., Hatala, R., & Brydges, R. (2014). What counts as validity
evidence? Examples and prevalence in a systematic review of simulation-based assessment. Advances
in Health Sciences Education: Theory and Practice, 19(2), 233–250. doi:10.1007/s10459-013-9458-4.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.,
pp. 443–507). Washington, DC: American Council on Education.
Crossley, J., Humphris, G., & Jolly, B. (2002). Assessing health professionals. Medical Education, 36,
800–804.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621–694).
Washington, DC: American Council on Education.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education,
37, 830–837.
Eva, K. W., & Macala, C. (2014). Multiple mini-interview test characteristics: ’Tis better to ask candidates to recall than to imagine. Medical Education, 48(6), 604–613. doi:10.1111/medu.12402.
Gieryn, T. F. (1983). Boundary-work and the demarcation of science from non-science: Strains and interests
in professional ideologies of scientists. American Sociological Review, 48(6), 781–795.
Gould, S. J. (1996). The mismeasure of man. New York: WW Norton & Company.
Graeff, E. C., Leafman, J. S., Wallace, L., & Stewart, G. (2014). Job satisfaction levels of physician assistant
faculty in the United States. The Journal of Physician Assistant Education, 25(2), 15–20.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6(4),
427–438.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing
guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–334.
Hodges, B. D. (2003). Validity and the OSCE. Medical Teacher, 25(3), 250–254.
Hodges, B. D., Kuper, A., & Reeves, S. (2008). Discourse analysis. BMJ (Clinical Research Ed.), 337, a879.
doi:10.1136/bmj.a879.
Huddle, T. S., & Heudebert, G. R. (2007). Taking apart the art: The risk of anatomizing clinical competence. Academic Medicine: Journal of the Association of American Medical Colleges, 82(6), 536–541. doi:10.1097/ACM.0b013e3180555935.
Kane, M. (2006). Content-related validity evidence in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131–153). Mahwah, NJ: Lawrence Erlbaum Associates Publishers.
Kuper, A., Reeves, S., Albert, M., & Hodges, B. D. (2007). Assessment: Do we need to broaden our
methodological horizons? Medical Education, 41, 1121–1123.
Lineberry, M., Kreiter, C. D., & Bordage, G. (2013). Threats to validity in the use and interpretation of script
concordance test scores. Medical Education, 47(12), 1175–1183. doi:10.1111/medu.12283.
Lingard, L. (2009). What we see and don’t see when we look at ‘‘competence’’: Notes on a god term.
Advances in Health Sciences Education, 14, 625–628.
McCoubrie, P. (2004). Improving the fairness of multiple-choice questions: A literature review. Medical
Teacher, 26(8), 709–712.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.
Mills, S. (2004). Discourse. London: Routledge.
Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2012). The
COSMIN checklist manual. Amsterdam: VU University Medical. doi:10.1186/1471-2288-10-22.
Norman, G. (2004). Editorial—The morality of medical school admissions. Advances in Health Sciences
Education, 9(2), 79–82. doi:10.1023/B:AHSE.0000027553.28703.cf.
Norman, G. (2015). Identifying the bad apples. Advances in Health Sciences Education, 20(2), 299–303.
doi:10.1007/s10459-015-9598-9.
Portney, L. G. (2000). Validity of measurements. In Foundations of clinical research: Applications to practice (2nd ed., Chap. 6). Upper Saddle River, NJ: Prentice Hall.
Roberts, C., Newble, D., Jolly, B., Reed, M., & Hampton, K. (2006). Assuring the quality of high-stakes undergraduate assessments of clinical competence. Medical Teacher, 28(6), 535–543. doi:10.1080/01421590600711187.
Schulman, J. A., & Wolfe, E. W. (2000). Development of a nutrition self-efficacy scale for prospective
physicians. Journal of Applied Measurement, 1(2), 107–130.
Schuwirth, L. W. T., & van der Vleuten, C. (2012). Programmatic assessment and Kane’s validity perspective. Medical Education, 46(1), 38–48. doi:10.1111/j.1365-2923.2011.04098.x.
Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8. doi:10.1111/j.1745-3992.1997.tb00585.x.
Swanson, D. B., & Roberts, T. E. (2016). Trends in national licensing examinations in medicine. Medical
Education, 50(1), 101–114. doi:10.1111/medu.12810.
Van Der Vleuten, C. P. M., Schuwirth, L. W. T., Scheele, F., Driessen, E. W., & Hodges, B. (2010). The
assessment of professional competence: Building blocks for theory development. Best Practice and
Research: Clinical Obstetrics and Gynaecology, 24(6), 703–719. doi:10.1016/j.bpobgyn.2010.04.001.
Van Winkle, L. J., La Salle, S., Richardson, L., Bjork, B. C., Burdick, P., Chandar, N., et al. (2013).
Challenging medical students to confront their biases: A case study simulation approach, 23(2),
217–224.
Wools, S., & Eggens, T. (2013). Systematic review on validation studies in medical education assessment.
In AERA annual meeting 2013. San Francisco.