
Adv in Health Sci Educ (2017) 22:853–867

DOI 10.1007/s10459-016-9716-3

Validity: one word with a plurality of meanings

Christina St-Onge1 · Meredith Young2 · Kevin W. Eva3 · Brian Hodges4

Received: 13 March 2016 / Accepted: 26 September 2016 / Published online: 1 October 2016
© Springer Science+Business Media Dordrecht 2016

Abstract Validity is one of the most debated constructs in our field; debates abound about
what is legitimate and what is not, and the word continues to be used in ways that are
explicitly disavowed by current practice guidelines. The resultant tensions have not been
well characterized, yet their existence suggests that different uses may maintain some
value for the user that needs to be better understood. We conducted an empirical form of
Discourse Analysis to document the multiple ways in which validity is described, under-
stood, and used in the health professions education field. We created and analyzed an
archive of texts identified from multiple sources, including formal databases such as
PubMED, ERIC and PsycINFO as well as the authors’ personal assessment libraries. An
iterative analytic process was used to identify, discuss, and characterize emerging dis-
courses about validity. Three discourses of validity were identified. Validity as a test
characteristic is underpinned by the notion that validity is an intrinsic property of a tool
and could, therefore, be seen as content and context independent. Validity as an argument-
based evidentiary-chain emphasizes the importance of supporting the interpretation of
assessment results with ongoing analysis such that validity does not belong to the tool/
instrument itself. The emphasis is on process-based validation (emphasizing the journey
instead of the goal). Validity as a social imperative foregrounds the consequences of
assessment at the individual and societal levels, be they positive or negative. The existence
of different discourses may explain—in part—results observed in recent systematic
reviews that highlighted discrepancies and tensions between recommendations for practice and the validation practices that are actually adopted and reported. Some of these practices, despite contravening accepted validation 'guidelines', may nevertheless respond to different and somewhat unarticulated needs within health professional education.

Correspondence: Christina St-Onge, Christina.St-Onge@USherbrooke.ca
1 Université de Sherbrooke, Sherbrooke, Canada
2 McGill University, Montreal, Canada
3 University of British Columbia, Vancouver, Canada
4 University of Toronto, Toronto, Canada

Keywords Assessment · Discourse analysis · Health profession education · Validation · Validity

Introduction

Validity is generally considered a beacon of quality assessment, informing the choice of concepts measured, assessment tools developed, analytic approaches used, and interpretation of assessment results (AERA et al. 1999). That is, "Validity is the sine qua non of assessment, as without evidence of validity, assessments in medical education have little or no intrinsic meaning" (Downing 2003). As such, validity is often used to 'attest' to the
quality of tools and to justify the use of assessments in health professions education where
stakes are high from admissions to entry into practice. In other words, because decisions
about what is considered valid impact upon individuals and society, validity has become a
‘god term’ (Lingard 2009) that is used rhetorically to convince others that particular
instruments, analytic procedures, or test scores collected meet high standards of quality.
There is clear inconsistency, however, in the way the term ‘validity’ is used within
Health Professions Education (HPE), as documented by several recent reviews (Cook et al.
2013, 2014). Authors who point to such variability commonly assume (at least implicitly)
that it is driven by ignorance regarding modern theories and approaches to validity. It is
conceivable, however, that the variability in validation practices could arise, in part, from
different conceptualizations of validity that provide value to those who work in health
professions education in different ways. Health professions education is an applied field
full of practitioners that is continuously informed by many different disciplines, including
educational psychology, measurement, sociology and experimental psychology, all of
which contribute to the methodological and conceptual richness of the field. This ‘rich-
ness’, however, may lead to different priorities that are supported by differing conceptu-
alizations of validity, thereby generating multiple interpretations and understandings that
have the potential to create confusion, miscommunication, and conflict.
The co-existence of implicit and different conceptualizations of validity is problematic
because important decisions are made every day within the health professions that rely
heavily on the ‘quality’ of assessment scores (e.g., individuals’ access to a career of their
choice, legal recourse when faced with perceived fault, and most generally and impor-
tantly, the quality of care received by patients). Notions of the ‘defensibility’ of assessment
scores and the ‘accuracy’ of such scores rest strongly on arguments regarding the validity
of the scores produced. One way forward, therefore, is to make explicit the similarities and
differences between conceptualizations of validity in order to determine how they influ-
ence assessment processes and what forms of ‘evidence’ of validity are put forward as
valued by different groups.


Purpose

The primary purpose of this study was to use discourse analysis to identify the different
ways in which the term ‘validity’ is used within health professions education. Further, we
aimed to determine who participates in each discourse (and how) and to gain some
understanding of the potential consequences of emphasizing one discourse over another.
The objective of this research was not to generate one unifying conception or definition of
validity, nor to label one conceptualization or discourse as the right one. Rather, we
examined the literature on and about validity with the objective of identifying underlying,
implicit, and sometimes disavowed conceptions that surround the construct of validity.

Method

Design

Our research is based on discourse theory (Mills 2004) and thus employs a methodology of
Discourse Analysis. Our approach is primarily what Hodges et al. (2008) have labeled
‘empirical discourse analysis’—having a primary focus on ways validity is constructed in
text and language in the health professions education literature.

Data (Archive)

Discourse analysis begins with identification of an archive—the textual and other materials
that are analyzed to identify discourses of interest (Hodges et al. 2008). We predominantly
restricted our study archive to sources from health professions education because the
purpose was to explore how the construct is variably used by people working in that field.
Recognizing that the word validity is used in many disciplinary fields with a multiplicity of
meanings, however, we used secondary sources as necessary to generate a more complete
understanding of health professions education sources that made reference to or concep-
tually relied on other literatures. The archive assembled for this research was constructed
in a four-step process and comprises English and French published peer-reviewed
articles and books on the topic of validity and assessment.
Step 1 All authors of this paper identified five to six key papers on validity and
assessment in the field of health professions education from their personal collection that
they considered to be ‘very important’ in terms of framing assessment practices. One
author (CSO) reviewed the texts critically to identify emerging language and concepts
about validity.
Step 2 The references from this collection of papers were examined to expand the
archive to include secondary sources. One author (CSO) reviewed these texts in order to
further identify emerging discourses.
Step 3 A more formal search of the health professions education literature was conducted using PubMED, ERIC, and PsycINFO/PsycLit, with the assistance of an academic librarian. The goal of this search was to identify a larger set of articles, published
between 1995 and 2013, on the topic of ‘validity of assessment’ in health professions
education. The review was conducted using the truncated key words valid* and assess*
with the intention of increasing the breadth and coverage of the archive rather than with
the aim of undertaking a traditional systematic review. In other words, the goal was not
to be comprehensive with this search or to set very precise inclusion/exclusion criteria, but rather to capture as many conceptualizations of validity as possible. To be considered for the archive, an article had to address the topic of validity in the context of
assessment in the field of health professions education.
Step 4 The abovementioned search was complemented by texts from 2013 to 2015
flagged by Table of Contents alerts set by the principal author. One author (CSO)
reviewed all titles and abstracts retrieved from the literature search to assess the
inclusion/exclusion of papers with these goals in mind.
The complete archive contained 68 peer-reviewed articles (16 from the educational
psychology literature), two books (from educational psychology; one French), one research
report (from educational psychology) and seven commentaries or editorials (from the field
of health professions education).

Procedure

For each document identified, NVivo 9 was used to facilitate the coding of ways in which
the concept of validity was described. Particularly helpful in identifying different con-
ceptualizations of validity were statements indicating what one ‘should’ or ‘must’ do
(deontic modals) to claim validity, statements about the consequences and implications of
such claims, and statements of truth (i.e., seemingly self-evident statements) about validity.

Analysis

The analysis was iterative, building the discourse table sequentially at each successive stage, and was informed by the team's collective and individual expertise (described below) and the inherent perspectives that result. As elements of discourse
(key words, concepts, arguments, associated individuals and institutions) were more
clearly identified, they were sorted into patterns that eventually led to the construction of
Table 1. Throughout this process there were frequent discussions with the whole research
team, and the discourses and their associated elements developed iteratively and consen-
sually. An effort was made to identify various aspects of each emerging discourse: a core
conceptualization, characteristics, things made possible by each discourse, observations
about which institutions appeared to have authority or to hold power in relation to par-
ticular discourses, and finally, ways in which individuals appeared to be participating in
each discourse. Though much more speculative, we also looked for indications of con-
sequences that might result from emphasizing one discourse over another in practice. This
process took place over several months with underlying and identified discourses being re-
visited, discussed, and refined until consensus among the research team was achieved.
The authors have various backgrounds and expertise which informed the analysis of the
data. The primary author, CSO, was trained in measurement and assessment. BH is a
clinician and educator who is well-versed in discourse analysis. MY and KE brought a
cognitive psychology perspective to the study of assessment and validity. KE and BH are
senior health professions education researchers who have dedicated much of their aca-
demic effort to studying issues of assessment as well as contributing to various local,
national, and international committees charged with guiding assessment practices. Both
have an extensive track record of developing assessment protocols for medical training.
MY and CSO bring more novice yet complementary perspectives.


Table 1 Summary of the three discourses

Definition
- Validity as a test characteristic: The degree to which the test actually measures what it purports to measure
- Validity as an argument-based evidentiary-chain: The evidence presented to support or refute the meaning or interpretation assigned to assessment results
- Validity as a social imperative: A bird's eye view of assessment that foregrounds broader individual and societal issues

Characteristics
- Test characteristic: Validity is a goal or a gold seal of approval
- Argument-based evidentiary-chain: Validity is a journey on which one embarks to provide evidence supporting the interpretation of scores
- Social imperative: Validity and validation are matters of social accountability

Validity is viewed as…
- Test characteristic: Static
- Argument-based evidentiary-chain: Fluid
- Social imperative: Built-in

Focus of evidence is on…
- Test characteristic: Individual tools can be considered valid, and the validity can generalize to the tool format ("MCQs are valid")
- Argument-based evidentiary-chain: Defensible interpretation of scores
- Social imperative: Individual and societal impact of assessment

Things made possible
- Test characteristic: The quest for the holy grail of assessment; one tool that is more valid than the others
- Argument-based evidentiary-chain: Validation approaches and standards
- Social imperative: Holistic and a priori consideration for societal impact of assessment

Validation occurs…
- Test characteristic: A posteriori (mainly)
- Argument-based evidentiary-chain: A posteriori (mainly)
- Social imperative: A priori (mainly)

Validation data focused on…
- Test characteristic: Psychometric
- Argument-based evidentiary-chain: Mostly psychometric
- Social imperative: Mostly expert judgement

Results

Three different discourses were identified in the archival materials studied: (1) Validity as
a test characteristic, (2) Validity as an argument-based evidentiary-chain and (3) Validity
as a social imperative. We do not consider these to be the only possible conceptualizations
of validity, but these were the discourses that were observed most strongly in our analysis.
Further, we do not believe these discourses to be mutually exclusive given that we wit-
nessed instances where more than one discourse was present in a given publication.
Nevertheless, as analysis proceeded, these three discourses were the most distinguishable.
Validity as a test characteristic is underpinned by the notion that validity is an intrinsic
property of a tool and could, therefore, be seen as content and context independent.
Validity as an argument-based evidentiary-chain emphasizes the importance of supporting
the interpretation of assessment results with ongoing analysis such that validity does not
belong to the tool/instrument itself. The emphasis is on process-based validation (em-
phasizing the journey instead of the goal). Validity as a social imperative foregrounds the
consequences of assessment at the individual and societal levels, be they positive or
negative. Detailed descriptions follow, but see Table 1 for a summary of the discourses.


Validity as a test characteristic

In this discourse, validity is often defined as "the degree to which the test actually measures what it purports to measure" (Anastasi 1988, p. 28). Taken literally and at face value,
such a definition treats claims of validity as statements about a characteristic of the tool
itself (i.e., something that inherently belongs to the tool). As such, validity as a property
inherent in a tool spans domains of content, context, and time. Often found associated with
this discourse is a concept illustrated by the statement: "a test is valid for anything with which it correlates" (Guilford 1946, p. 429). Once a tool is branded as 'valid' using this
discourse, the tool is treated as though it retains that quality indefinitely.
Validity as a test characteristic, therefore, can be thought of as a ‘gold seal of approval’.
As an example, those employing this discourse would say that Multiple Choice Questions
(MCQs) are a valid measure of knowledge, full stop. Thus they can be defended as the go-
to format when creating written exams. Here are a few examples of claims about tools that
are framed as having achieved a ‘gold seal’ of validity:
The Job Descriptive Index (JDI), a validated survey, was used to measure levels of
job satisfaction. (Graeff et al. 2014, p. 15 -emphasis by authors)
Evidence has shown the JSE [Jefferson Scale of Empathy] to be a valid and reliable
measure of empathy in medical students and physicians in the context of healthcare.
(Van Winkle et al. 2013, p. 219 -emphasis by authors)
When one considers validity to be a quality of a tool, the door is opened for the possibility
that for any given domain (knowledge, skills, professionalism, etc.) there could be one
superior tool that could be shown to be the most valid. Thus, this discourse makes possible
the quest for ‘holy grails’ of assessment, a quest to identify the ‘best’ assessment tools,
independent of content or ability to be measured. For example, it is written that,
MCQ testing is the most efficient form of written assessment, being both reliable and
valid by broad coverage of content. (McCoubrie 2004, p. 711 -emphasis by authors)
Interestingly, we found that the discourse of validity as a test characteristic is judged
harshly by some who argue that this view is ‘antiquated’, ‘controversial’, or ‘lacking in
value’.
We often read about "validated instruments." This conceptualization implies a
dichotomy—either the instrument is valid or it is not. This view is inaccurate. First,
we must remember that validity is a property of the inference, not the instrument.
Second, the validity of interpretations is always a matter of degree. An instrument’s
scores will reflect the underlying construct more accurately or less accurately but
never perfectly. (Cook and Beckman 2006, p. 166e10)
When considering categories of validity, it is understood that validity evidence exists
to various degrees, but there is no threshold at which an assessment is said to be
valid. (Beckman et al. 2004, p. 973)
However, the continued presence of this discourse of validity in the health professions
education literature suggests that it may fill a pragmatic need for individuals who require
‘off-the-shelf solutions’ (e.g., educators and administrators who do not have the desire,
knowledge, or resources to create ‘valid’ assessment programs, tools, or approaches de
novo). Using the discourse of validity as a test characteristic permits the possibility of
‘found’ solutions to overcome the challenges associated with assessing students and future
professionals using tools reported to be of high quality in a context of limited resources or
limited psychometric expertise.
… Developing a valid and reliable assessment of competence is not easy to achieve
with the resources available at the university level. (Roberts et al. 2006, p. 542 -
emphasis by authors)
…it is important that valid instruments are created so administrators can better assess
the educational needs of prospective physicians, their practices, and patient out-
comes… (Schulman and Wolfe 2000, p. 107 -emphasis by authors)
We can speculate about some of the effects of emphasizing validity as a test characteristic
over other discourses of validity. Validity as an immutable property of a test has the
potential to create a false sense of security for assessment practitioners, who may never
feel the need to question or re-evaluate an instrument’s ‘gold seal’. Using an MCQ exam
format, for example, because it is said to be ‘valid’ (McCoubrie 2004) without
consideration for item-writing guidelines such as those put forward by Haladyna et al. (2002), or the nature of the context in which the MCQs are used, is an example of such a
blind spot. Similarly, choosing to put in place an assessment approach like OSCEs or
MMIs without proper blueprinting strategies or without vigilance for problematic scenarios
may defeat the intended purpose of achieving a contextually meaningful assessment (Eva
and Macala 2014). Moreover, using tests beyond their original contexts or for purposes
other than the originally studied uses can have consequences. Gould (1996) eloquently
illustrates the dangers of employing 'validated' tests indiscriminately in his book The
Mismeasure of Man in which he documents how the IQ test, based on the premise that
intelligence can be quantified in a single, decontextualized score, has been used in novel
contexts (such as for immigration purposes) to draw inappropriate conclusions such as
labelling of entire ethno-cultural groups as ‘less intelligent’ based on test scores.

Validity as an argument-based evidentiary-chain

When this discourse is used, validity is framed as an argument-based evidentiary-chain and defined as "the evidence presented to support or refute the meaning or interpretation assigned to assessment results" (Downing 2003, p. 830). Though validity in this discourse
does sometimes focus on particular tools (as in the previous discourse), validity itself is
seen as highly contextual. The focus is on the valid interpretation of scores that can be
achieved via a validation process used to verify that there is sufficient evidence in each
administration of a test to support the interpretation of the assessment results in relation to
the underlying theory/expectations. Here, the adjective ‘validated’ never appears while the
noun ‘validation’ and the verb ‘to validate’ are common. This reflects the notion that it is
not the quality of the tool that is judged but rather the appropriateness of the uses of the
tool, and the interpretations and conclusions drawn from the examinees’ performance or
assessment scores given the way the assessment process was implemented. For example, a
validation process for a certification exam might aim to document that clinical simulations
were created carefully, implemented authentically, and standardized, as well as ensuring that candidates who have acquired the sought-after competences pass and that candidates who have not mastered those competences fail the exam. Importantly, however, these sources of evidence and the determination of validity of the certification examination would apply only to one instance of its use. Subsequent uses and the results generated
would require repeated validation.
For those employing the discourse of validity as an argument-based evidentiary-chain,
there is an analogy to the scientific method in that ‘validation’ is used to provide evidence
to support or refute the use and interpretation of data/scores. There is also recognition that
conclusions may change as evidence continues to accumulate. It is thus characterized by
some as a journey on which one embarks after having defined the assessment purpose(s).
The goal is to collect as much evidence as possible in a validation process and to identify
variables/factors that inform the degree to which data produced by a particular test are
valid but also to understand and set limits on claims of validity.
Users of this discourse aim to create assessment strategies that are based on theories and
then evaluate if the observed results show evidence of expected manifestations of the
underlying theory. The use of script concordance testing (SCT) is an example of a theory-
based assessment strategy (Charlin et al. 2000) and the validation practices put in place for
SCT often reflect the premise that experts have more developed ‘illness scripts’ and thus
should perform better on the ill-defined tasks reflected in the SCT than more novice
examinees. While there is considerable evidence in that regard, the recent scrutiny of SCTs
(e.g., Lineberry et al. 2013) highlights the ‘argumentative’ nature of validation practices.
Thus, this discourse places validation approaches and standards at the forefront, and two
authors—Messick (1995) and Kane (2006)—are very frequently cited as anchor authorities
for this discourse. Validity as an argument-based evidentiary-chain thus creates a set of
rules and regulations to be applied if one wants to speak authoritatively about the quality of
the scores generated by an assessment.
Methods for evaluating the validity of results from psychometric assessments derive
from theories of psychology and educational assessment, and there is extensive
literature in these disciplines. (Cook and Beckman 2006)
This discourse appears to be strongly associated with formalized assessment institutions such as the Standards for Educational and Psychological Testing (AERA et al. 1999), the Educational Testing Service, and others that regulate validity and validation practices.
Moreover, these institutions legitimize the role of highly qualified people who apply and
enforce the recommended practices.
In an attempt to establish a unified approach to validity, the American Psychological
Association published standards that integrate emerging concepts. These standards
readily translate to medical practice and research and provide a comprehensive
approach for assessing the validity of results derived from psychometric instruments.
(Cook and Beckman 2006, p. 166.e8)
An apparent consequence of an over-emphasis on this discourse of validity is that the validation process becomes never-ending (Bertrand and Blais 2004): "validity and assessment validation and revision is a never-ending cycle" (Beckman et al. 2009, p. 188). In addition, there appear to be no clear rules about how to weigh the
different forms of evidence for different score interpretations. Thus, one can remain
engaged in a continuous quality assurance process with the need to interpret and
incorporate each new piece of validity evidence collected and each new score
interpretation (Cook et al. 2015). Finally, one further consequence of emphasizing this
discourse is that little consideration may be given to content experts who feel that they
‘know’ what defines good performance in practice but who may become undervalued in
the assessment process because their judgement does not seem relevant or reliable in the
evidentiary-chain according to the more formalized validation frameworks. In other words,
the professionals who are responsible for regulating their own profession may experience
lessened capacity to determine contextually appropriate and important strategies for
assessment if too narrow a lens is placed on what counts as evidence of validity. Time and
resources are required to collect data to support an evidentiary-chain of validation. This
may prove difficult for those working at the front lines in clinical practice and education
settings. They may also face resistance when employing assessments that are seen as
‘subjective’ or ‘observational’ for which it is more difficult to assemble the mass of
evidence that can be gathered for more point-in-time testing strategies.

Validity as a social imperative

Validity as a social imperative is an emerging discourse with several components that, when taken individually, may seem familiar to most readers. This discourse is newer and informed by different kinds of expertise, perspectives, and stakeholders (administrators, researchers, policy analysts, etc.) than the first two discourses. Validity as a social imperative appeared in our archive as a socially driven perspective on assessment that includes
calls for deliberate consideration for the consequences of assessment at both individual and
societal levels. This discourse appears to be characterized by taking a ‘bird’s eye view’ of
assessment that foregrounds broader individual and societal issues and that goes beyond
the sole consideration of specific tools.
Giving attention to the consequences of a test is not necessarily unique to this emerging
discourse and can be found in writing by Cureton (1951), Cronbach (1971), and Messick
(1995), authors who are more commonly associated with the evidentiary-chain notion of
validity described above. To some degree, validity as a social imperative may be an
outgrowth of validity as an evidentiary-chain. Our argument for identifying it as a discrete
discourse arises from the observation that those who employ validity as a social imperative
seem to foreground social consequences of assessment throughout assessment develop-
ment and validation processes. By contrast, when included in the evidentiary-chain dis-
course, consequences of assessment are just one of many variables, usually a minor one if
considered at all. This discourse of validity as a social imperative also expands the idea of
consequences of assessment beyond learners to include impacts at a more macro societal
level.
For example,
…in the course of selecting the 10 % who are worthy of admission (and hence
guaranteed an esteemed and well-paid place in society), we are telling the other 90 %
that they are unworthy; that they are not good enough, that they have personal
failings. (Norman 2004)
For organisations that accredit and regulate medical education, ensuring that grad-
uates are fit to practise is essential to assuring patient safety. This is not a new issue,
nor is it unique to any particular country or health care system. Over the last decade,
however, it has become more visible because of several changes in medical edu-
cation and practice, all of which directly affect the delivery of safe health care.
(Swanson and Roberts 2016)
Because those who adopt this discourse of validity tend to align themselves with a
programmatic perspective on the purpose of assessment, proponents of this discourse seem
to give less attention to post hoc analyses. Writers today place greater emphasis on conceptual planning of an assessment strategy, prioritizing a purposeful, a priori approach to assessment (selecting tools and strategies before administration so as to minimize unintended consequences) over analytic checking of the quality of assessment results a posteriori (practices that focus on identifying and addressing issues after administration). For example, there is also a shift in emphasis from data generated by
individual assessment tools to the way in which testing data are combined.
A combination of (near-) perfect instruments may result in a weaker programme than
a carefully combined set of perhaps less perfect components. In other words, it is not
only the quality of the building blocks that is relevant, but also the ways in which
they are combined. (Schuwirth and van der Vleuten 2012, p. 39)
This is not to say that post hoc psychometric analytic data are not important. However, they
seem unable to fully capture the complexity of assessment strategies favored by proponents of
this discourse, and as such, they must be considered in the context of striving for balance
between the costs of having learners submitted to the assessment and the potential for
benefits to be experienced by both society and the practitioner.
[…] the psychometric approach is considered to be too reductionist (Huddle and
Heudebert 2007) for the assessment of higher order competencies, such as the ability
to work in a team, professional behaviour, and self reflection which are increasingly
deemed to be essential for medical professionals but cannot be meaningfully
assessed detached from the authentic context (Kuper et al. 2007). (Berendonk et al.
2013, p. 560)
From the vantage point of a broad social-science perspective, traditional OSCE
validity research has been a bit narrow. … OSCEs are complex social events that are
highly contextual, potently formative and heavily influenced by sociological vari-
ables such as relations of power, economics and culture. (Hodges 2003, p. 253)
Test design is a compromise between measurement rigour and practicality. (Crossley
et al. 2002, p. 801)
Such emphasis prioritizes assessment programs that, from the student perspective, favour learning and, from society's perspective, place trainees in contexts in which they demonstrate the skills needed to meet society's needs. In essence, this discourse shifts the focus of
attention from the properties of the tool or the validation process to the desired purpose of
assessment for the learner and for society.
Assessment provides evidence that a learner has acquired knowledge and skills
within a field of instruction. Progression along the continuum of knowledge and
skills may require more complex, rigorous or differing types of assessment evidence.
(Andreatta and Gruppen 2009, p. 1029)
The educational value of assessment is easily underestimated. The nature and content
of assessment strongly influences the learning strategies that students adopt because
most learners are adept at spotting and meeting the requirements of an assessment.
(Crossley et al. 2002, p. 800)
Individuals who employ this discourse often work in training programs and organizations
that claim ownership of a complete, comprehensive program of assessment aligned with
the curriculum and the expectations of society for future practice (and in some cases
certification) of graduates prospectively (through improving assessment practices) rather
than retrospectively (through identifying weaknesses). The rise of Entrustable Professional
Activities (EPAs) is only one example of assessment that foregrounds consideration for
student learning and the eventual care provided to society. Moreover, this discourse
promotes concerted developments in assessment, thus creating the need for evaluation
committees or directors that guide programmatic assessment development rather than
leaving the selection of tools and approaches to the idiosyncratic views/abilities of those
responsible for individual courses, or in the hands of those who measure the validity of data
generated.
My own view is that it is surely not acceptable to make life changing decisions about
students’ future using instruments that simply look good (6). But that is one opinion
against another. (Norman 2015, pp. 300, 301)—emphasis by authors
It has been suggested in the literature that an overemphasis on validity as a social imperative may "not only muddy the waters for most educators, it may actually lead to less attention to the intended and unintended consequences of test use" (Shepard 1997, p. 13).
Typically, data related to this form of validity come from a variety of sources and may
include qualitative as well as quantitative information. Because there are no or very few
measurement or psychometric models that can be used to aggregate complex and often
longitudinal programs of assessment, there is the potential of neglecting the more
traditional, but nevertheless useful psychometric monitoring a posteriori, particularly at the
level of the quality of individual tools. If one attends only to the ‘big picture’ and loses
sight of monitoring the quality of data generated by specific assessment tools (e.g., item
analysis, reliability, pass rates, etc.), blind spots may develop around ‘unreliable’
examinations or unfair pass/fail cut scores. While in this discourse psychometric analysis is
not the central approach to ascertain validity, its use may support the goal of purposeful
assessment when used appropriately. It has been argued that robust approaches to the use
of qualitative data could be useful within this discourse (Kuper et al. 2007; Van Der
Vleuten et al. 2010); however, methods to do so remain in their infancy.

Discussion

This study has focused on the use of the term "validity" within the specific context of health professions education and assessment. Independent of the underlying conceptualization, it is clear that validity is a commonly used (and highly loaded) term in health
professions education. This can be seen through considerable recent work that has
extensively critiqued reported validation practices both in the health professions education
literature and in the more general educational psychology literature (Cook et al. 2013, 2014).
The purpose of our discourse analysis was to characterize some of this diversity in order to help clarify underlying tensions or unarticulated assumptions. Validity seems to
signify, for some, that a tool can or even should be used since it has met a certain gold
standard. For others, it seems to speak to the process put in place to ensure the appro-
priateness of the score interpretation. For a third group, validity seems to be about the
considerations for the role and value of assessment for learners and society with a focus on
minimizing unintended consequences.
The power relations and benefits gained by adopting each of these discourses may
explain—in part—the observed discrepancies between the processes that are sometimes
adopted in published validation practices and the processes recommended by the guide-
lines generated and endorsed by testing organizations. To be credible to those employing
the discourse of argument-based evidentiary-chain, one is expected to use a specific set of
terminology and refer to what are called ‘modern’ theories of validity, as illustrated by the
multiple reviews on the subject (Cizek et al. 2008, 2010; Cook et al. 2014, 2015; Wools
and Eggens 2013). Thus, one way of participating in this discourse is as a skill holder or a
person who has mastered the appropriate language and skills and, as such, lays claim to the
rules and regulations of validation (usually in reference to Kane and Messick). Such
individuals have the power to recommend and apply validation approaches. In other words,
this discourse makes certain specialized jobs possible for people who have a specific skill
set and knowledge base. It also appears to imply that only a select group of people have the
expertise to tackle a validation task. Recent systematic reviews document and critique a
‘lag’ in uptake of ‘modern validity theories’ (Cook et al. 2013, 2014). Where this discourse
is dominant, it is often implied that highly trained individuals are required to ensure the
ongoing validation of assessment tools. Consequently, when ‘novices’—with limited
formal training—are called upon to develop and monitor assessment programs, they may
be perceived as ‘outsiders’ or ‘impostors’ because they do not use the same language, may
not adeptly use sanctioned theories of validity, and may put forward evidence during a
validation process that is not conventional or convincing to a ‘professional’.
With new emerging discourses and conceptualizations of validity—such as validity as a
social imperative—comes the possibility of different roles and, as such, calls into question
who is now allowed (or not allowed) to judge what validity means (or which discourse is
considered legitimate) and ultimately what is considered ‘appropriate’ assessment. Users
of this third discourse might include policy makers, teachers, and curriculum and/or
assessment program specialists. Those who employ this discourse often do so by advo-
cating for those being tested (aiming to make sure that no harm is experienced by the
learners from the assessment) or for society in general (by making sure that the programs
of assessment are aligned with programs’ values and society’s needs) thus conferring on
this discourse an ethical quality. Clear boundary setting is a characteristic of dominant
scientific fields as argued by Gieryn (1983). In addition, shifts in culture, specific rules, or
power relations tend to be met with some resistance. As such, it is not surprising that some
individuals would react strongly to the thought of shifting the focus of validation, as
exemplified by the following review received about this work at the development stage
(i.e., in response to a grant application): "From a psychometric perspective, I do not believe that [investigating conceptualizations of validity] can add anything new to our knowledge. Validity is not based on sociolinguistic issues; it is based on empirical data and modern/traditional psychometric methods."
While we did not encounter anyone laying claim to validity as a test characteristic, its
clear and continued presence in the literature suggests both that many accept this per-
spective and that it fulfills a need. We could see in our analysis two major roles emerging
in association with the discourse of validity as a test characteristic: consumers and pro-
ducers. In other words, this discourse favours a consumerism philosophy in which pro-
ducers (whether individuals or organizations) provide (and market) products (tests/
assessment strategies) to consumers who need ‘validated’ tools. Consumers can be indi-
viduals (such as professors using a ‘shelf exam’) or collectives (such as a society that
accepts a licensure exam as the gatekeeper of a profession). The consumers need to rely on
external sources to provide them with ‘pre-validated’ assessment tools and may not be able
or willing independently to question the validity of the data generated by those tools that
they have taken up. Adopting this form of validity discourse may also be attractive to
producers (individual researchers, developers, or organizations) when an assessment tool
carries their name and then can be used to demonstrate scholarly impact, branded, or
copyrighted for commercial purposes. This entrepreneurial dimension may create an
incentive to promote the discourse of validity as a test characteristic. Institutions may also
gain credibility or power from using this discourse when they can put together the
appropriate ingredients to yield sellable products (i.e., ‘validated assessment tools’). As
indicated in our results, this conceptualization of validity does seem to answer a pragmatic
need given that individuals might not have all the resources to create assessment de novo to
meet the highest standards of quality. Another or additional explanation is that the dis-
course was ‘imported’ from the clinical sciences in which the classical model of validity
(or the validity trinity of content, construct, and criterion validity) is still present in the
literature (Mokkink et al. 2012; Portney 2000) and in which we see validity as the property
of a clinical assessment tool.
Limitations of this work include that we chose to focus our discourse analysis on a
specific context—the scientific literature of health professions education. This literature is
quite extensive, perhaps in part because of the strong tradition in the health professions of
employing high stakes assessment from admissions to licensing and consequently placing
emphasis on validity. However, the concept of validity is also used in a wide variety of
other disciplinary fields as well as in lay contexts. The discourses identified here may or
may not be present in other fields and may or may not be taken up in other fields in the
ways we have described. Limiting our archive to published literature favoured the emer-
gence of discourses that are both relevant to the more scholarly issues in the field and
accepted broadly enough to have made it through a peer review process. There may be
other more marginal discourses, or discourses used by lay people or practitioners even
within the health professions that are not represented in the published literature but that
may, nevertheless, influence assessment development and monitoring practices. Inevitably
our analyses were informed by the team’s collective and individual backgrounds, expe-
riences, and expertise because they provided the lenses through which data were consid-
ered, which in this case included individuals with measurement or cognitive psychology
training, and junior and senior individuals with assessment experience and expertise.
Finally, our analysis has only hinted at the ways these discourses work in relation to power
and the legitimacy of individuals and institutions who align with the different discourses.
Further research is needed to better understand the potential impact of taking up different
discourses on the development and monitoring of assessment.

Conclusion

Validity has several different meanings in health professions education. What these
meanings have in common, perhaps, is an implicit understanding that validity, in some
form, is and should be at the heart of any discussion about assessment development and
quality monitoring. It is likely that the discourses observed in this study arise in relation to
usages of the concept of validity in a number of parallel disciplines and fields that influence
health professions education. It may be that changes in which discourse is seen as legit-
imate or dominant can be traced to changing relationships between the health professions
and other fields. Now that some of the dominant discourses within health professions education have been mapped, future work elucidating these relationships would be of value. We
believe that our study suggests that taking up validation practices requires a better
understanding of the diverse conceptions of validity and that each may fill very different
and sometimes disavowed needs for health professions educators, researchers, adminis-
trators, and their organizations. As such, our recommendation for those employing
concepts of validity is to explicitly describe one’s conceptualizations before making
statements of truth about the worth and/or appropriateness of assessment tools and pro-
grams. Doing so will not eliminate the limitations of the discourse one uses, nor will it
avoid the implications for validity that would be relevant had a different discourse been
adopted. It will, however, increase the likelihood that the notion of validity is not adopted
free of critical reflection and might, therefore, help bridge the discrepancies and tensions that currently affect the field.

Acknowledgments The authors would like to thank Catherine Côté for her help with data management in
NVivo, Tim Dubé, Ph.D. and anonymous reviewers for feedback on a previous version of this manuscript.
Funding for this project was provided by the Société des médecins de l’Université de Sherbrooke Research
Chair in Medical Education, held by Christina St-Onge.

References
AERA, APA, & NCME (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education), Joint Committee on Standards for Educational and Psychological Testing. (1999). Standards for educational and psychological testing. Washington, DC: AERA.
Anastasi, A. (1988). Psychological testing (6th ed.). New York: Macmillan.
Andreatta, P. B., & Gruppen, L. D. (2009). Conceptualising and classifying validity evidence for simulation.
Medical Education, 43(11), 1028–1035. doi:10.1111/j.1365-2923.2009.03454.x.
Beckman, T. J., Ghosh, A. K., Cook, D. A., Erwin, P. J., & Mandrekar, J. N. (2004). How reliable are
assessments of clinical teaching? A review of the published instruments. Journal of General Internal
Medicine, 19(9), 971–977. doi:10.1111/j.1525-1497.2004.40066.x.
Beckman, T. J., Mandrekar, J. N., Engstler, G. J., & Ficalora, R. D. (2009). Determining reliability of
clinical assessment scores in real time. Teaching and Learning in Medicine, 21(3), 188–194.
Berendonk, C., Stalmeijer, R. E., & Schuwirth, L. W. T. (2013). Expertise in performance assessment:
Assessors’ perspectives. Advances in Health Sciences Education, 18(4), 559–571.
Bertrand, R., & Blais, J.-G. (2004). Modèles de Mesure: L’Apport de la Théorie des Réponses aux Items
(Vol. 2004). Retrieved from https://books.google.com/books?hl=fr&lr=&id=3hPlCHaA7DoC&pgis=1.
Charlin, B., Roy, L., Brailovsky, C., Goulet, F., & van der Vleuten, C. (2000). The script concordance test:
A tool to assess the reflective clinician. Teaching and Learning in Medicine, 12(4), 189–195.
Cizek, G. J., Bowen, D., & Church, K. (2010). Sources of validity evidence for educational and psycho-
logical tests: A follow-up study. Educational and Psychological Measurement, 70(5), 732–743. doi:10.
1177/0013164410379323.
Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and
psychological tests. Educational and Psychological Measurement, 68(3), 397–412. doi:10.1177/
0013164407310130.
Cook, D. A., & Beckman, T. J. (2006). Current concepts in validity and reliability for psychometric
instruments: Theory and application. The American Journal of Medicine, 119(2), 166.e7–166.e16.
doi:10.1016/j.amjmed.2005.10.036.
Cook, D. A., Brydges, R., Ginsburg, S., & Hatala, R. (2015). A contemporary approach to validity argu-
ments: A practical guide to Kane’s framework. Medical Education, 49(6), 560–575.
Cook, D. A., Brydges, R., Zendejas, B., Hamstra, S. J., & Hatala, R. (2013). Technology-enhanced simu-
lation to assess health professionals: A systematic review of validity evidence, research methods, and
reporting quality. Academic Medicine: Journal of the Association of American Medical Colleges,
88(6), 872–883. doi:10.1097/ACM.0b013e31828ffdcf.
Cook, D. A., Zendejas, B., Hamstra, S. J., Hatala, R., & Brydges, R. (2014). What counts as validity
evidence? Examples and prevalence in a systematic review of simulation-based assessment. Advances
in Health Sciences Education: Theory and Practice, 19(2), 233–250. doi:10.1007/s10459-013-9458-4.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.,
pp. 443–507). Washington, DC: American Council on Education.
Crossley, J., Humphris, G., & Jolly, B. (2002). Assessing health professionals. Medical Education, 36,
800–804.
Cureton, E. E. (1951). Validity. In E. F. Lindquist (Ed.), Educational measurement (1st ed., pp. 621–694).
Washington, DC: American Council on Education.
Downing, S. M. (2003). Validity: On the meaningful interpretation of assessment data. Medical Education,
37, 830–837.
Eva, K. W., & Macala, C. (2014). Multiple mini-interview test characteristics: 'Tis better to ask candidates to
recall than to imagine. Medical Education, 48(6), 604–613. doi:10.1111/medu.12402.
Gieryn, T. F. (1983). Boundary-work and the demarcation of science from non-science: Strains and interests
in professional ideologies of scientists. American Sociological Review, 48(6), 781–795.
Gould, S. J. (1996). The mismeasure of man. New York: WW Norton & Company.
Graeff, E. C., Leafman, J. S., Wallace, L., & Stewart, G. (2014). Job satisfaction levels of physician assistant
faculty in the United States. The Journal of Physician Assistant Education, 25(2), 15–20.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6(4),
427–438.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing
guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–334.
Hodges, B. D. (2003). Validity and the OSCE. Medical Teacher, 25(3), 250–254.
Hodges, B. D., Kuper, A., & Reeves, S. (2008). Discourse analysis. BMJ (Clinical Research Ed.), 337, a879.
doi:10.1136/bmj.a879.
Huddle, T. S., & Heudebert, G. R. (2007). Taking apart the art: The risk of anatomizing clinical competence.
Academic Medicine: Journal of the Association of American Medical Colleges, 82(6), 536–541. doi:10.
1097/ACM.0b013e3180555935.
Kane, M. (2006). Content-related validity evidence in test development. In S. M. Downing & T. M. Hala-
dyna (Eds.), Handbook of test development (pp. 131–153). Mahwah, NJ: Lawrence Erlbaum Associates
Publishers.
Kuper, A., Reeves, S., Albert, M., & Hodges, B. D. (2007). Assessment: Do we need to broaden our
methodological horizons? Medical Education, 41, 1121–1123.
Lineberry, M., Kreiter, C. D., & Bordage, G. (2013). Threats to validity in the use and interpretation of script
concordance test scores. Medical Education, 47(12), 1175–1183. doi:10.1111/medu.12283.
Lingard, L. (2009). What we see and don't see when we look at "competence": Notes on a god term.
Advances in Health Sciences Education, 14, 625–628.
McCoubrie, P. (2004). Improving the fairness of multiple-choice questions: A literature review. Medical
Teacher, 26(8), 709–712.
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8.
Mills, S. (2004). Discourse. London: Routledge.
Mokkink, L. B., Terwee, C. B., Patrick, D. L., Alonso, J., Stratford, P. W., Knol, D. L., et al. (2012). The
COSMIN checklist manual. Amsterdam: VU University Medical. doi:10.1186/1471-2288-10-22.
Norman, G. (2004). Editorial—The morality of medical school admissions. Advances in Health Sciences
Education, 9(2), 79–82. doi:10.1023/B:AHSE.0000027553.28703.cf.
Norman, G. (2015). Identifying the bad apples. Advances in Health Sciences Education, 20(2), 299–303.
doi:10.1007/s10459-015-9598-9.
Portney, L. G. (2000). Validity of measurements. In Foundations of clinical research: Applications to practice (2nd ed., Chap. 6). Upper Saddle River, NJ: Prentice Hall.
Roberts, C., Newble, D., Jolly, B., Reed, M., & Hampton, K. (2006). Assuring the quality of high-stakes
undergraduate assessments of clinical competence. Medical Teacher, 28(6), 535–543. doi:10.1080/
01421590600711187.
Schulman, J. A., & Wolfe, E. W. (2000). Development of a nutrition self-efficacy scale for prospective
physicians. Journal of Applied Measurement, 1(2), 107–130.
Schuwirth, L. W. T., & van der Vleuten, C. (2012). Programmatic assessment and Kane’s validity per-
spective. Medical Education, 46(1), 38–48. doi:10.1111/j.1365-2923.2011.04098.x.
Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Mea-
surement: Issues and Practice, 16(2), 5–8. doi:10.1111/j.1745-3992.1997.tb00585.x.
Swanson, D. B., & Roberts, T. E. (2016). Trends in national licensing examinations in medicine. Medical
Education, 50(1), 101–114. doi:10.1111/medu.12810.
Van Der Vleuten, C. P. M., Schuwirth, L. W. T., Scheele, F., Driessen, E. W., & Hodges, B. (2010). The
assessment of professional competence: Building blocks for theory development. Best Practice and
Research: Clinical Obstetrics and Gynaecology, 24(6), 703–719. doi:10.1016/j.bpobgyn.2010.04.001.
Van Winkle, L. J., La Salle, S., Richardson, L., Bjork, B. C., Burdick, P., Chandar, N., et al. (2013).
Challenging medical students to confront their biases: A case study simulation approach, 23(2),
217–224.
Wools, S., & Eggens, T. (2013). Systematic review on validation studies in medical education assessment.
In AERA annual meeting 2013. San Francisco.
