The Routledge Handbook of Second Language Acquisition and Language Testing
Edited by Paula Winke and Tineke Brunfaut
Publisher: Routledge

Validity in Language Assessment
Carol A. Chapelle
Published online: 28 Dec 2020
https://www.routledgehandbooks.com/doi/10.4324/9781351034784-3

How to cite: Chapelle, C. A. (2020). Validity in language assessment. In P. Winke & T. Brunfaut (Eds.), The Routledge Handbook of Second Language Acquisition and Language Testing. Routledge. https://www.routledgehandbooks.com/doi/10.4324/9781351034784-3


Validity in Language Assessment


Carol A. Chapelle

Background
The field of SLA is largely concerned with describing second language knowledge and explain-
ing how it develops over time, but at least the latter goal requires empirical research. Writing
about empirical research in SLA, Ellis (2005) observed that there had been “little real progress
in achieving the second goal because of a general failure to address how learners’ L2 knowledge
can be measured” (p. 142). Because assessment of learners’ language knowledge,
performance, or some aspect of ability is central to second language acquisition research, valida-
tion should be of central importance for the credibility of research results. The need for careful
attention to assessment methods has heightened over the past decade with the open science initia-
tives that have made research instruments publicly available and are encouraging the dissemi-
nation of research results among SLA researchers and beyond.
But how can a researcher validate assessment practices when the subject of validation in
language assessment itself includes a complex set of practices, statistical methods, frameworks,
and theoretical perspectives? A look for guidance from within the language assessment literature
results in at least a dozen books with almost as many frameworks, perspectives, and even more
pieces of advice for justifying the validity of test use. Moreover, their overarching concern is the
use of language assessments for decision-making about individuals or programs in, for example,
educational or professional contexts. Researchers in SLA can detect a dissonance between their
interests and those of language testers, who tend to work with the broadly defined constructs,
statistical techniques requiring large numbers of test takers, and considerations of the social con-
sequences of decision making. Moreover, the field of language testing does not offer a con-
sensus validation method that can be readily applied by anyone willing to read a single book.
These issues raise questions about the use of conceptions of validity and validation in language
testing for investigating validity in SLA research. How can an SLA researcher begin to assess the
relevance of validation practices used in language assessment?
In this chapter I offer a basis for addressing this question by outlining some areas of general
agreement in language assessment about validation. Underlying this chapter is the perspective
stated by Douglas (2001):

In both language testing and SLA studies, practitioners make inferences about the state
of a learner’s interlanguage, i.e., the abilities underlying language performance, based on
systematic observation of an elicited performance. It follows from this that both language
tests and SLA elicitation devices should be held to similar standards of validity.
(Douglas, 2001, p. 442)

The process of making inferences based on performance describes what users do with both
tests and assessments, and therefore I use these terms interchangeably in this chapter. In the
first part of this chapter I introduce the foundational concepts in educational and psychological
measurement that underlie the various presentations of validity in language assessment. I then
identify key issues for applying current validation practices in SLA research, and I offer advice
for practice.

Foundations of Validity
Common ground about validation in language testing today stems from a paper by Messick
(1989) that has been influential in educational and psychological measurement as well as in
language testing. Despite the fact that pre-Messick concepts, expressions, and perspectives still
appear in language testing, for the most part, Messick’s presentation of validity is accepted in lan-
guage testing today. Messick’s paper, which covers a range of issues from philosophy of science
to empirical research design and the values inherent in assessment, is useful for understanding
validity in language testing.
Messick defined validity as a unitary concept: “an overall evaluative judgment
of the degree to which empirical evidence and theoretical rationales support the adequacy and
appropriateness of interpretations and actions based on test scores” (Messick, 1989, p. 13).
Messick’s unitary view of validity contrasts with the idea that there are three or more types
of validity. Messick explained, “because content- and criterion-related evidence contribute to
score meaning, they have come to be recognized as aspects of construct validity. In a sense,
then, this leaves only one category, namely, construct-related evidence” (Messick, 1989, p.
20). Messick’s distinction between “types of validity” and “types of evidence” is disregarded
by some language testers (Weir, 2005). Nevertheless, the distinction is important because,
as Messick explained, the types-of-validities approach leads the validation process astray by
prompting researchers to see a test as having a certain type of validity rather than seeing a need
to present theoretical rationales and evidence for interpretations and uses of a test, in accord-
ance with his definition.
Construct validity is central to Messick’s definition of validity, because validation seeks evi-
dence for the construct meaning of the test score. From an assessment perspective, the technical
sense of the word construct is a meaningful interpretation of performance consistency. “The key
point is that in educational and psychological measurement inferences are drawn from scores, a
term used here in the most general sense of any coding or summarization of observed consist-
encies on a test, questionnaire, observation procedure, or other assessment device” (Messick,
1989, p. 14). Scores are interpreted to have meanings such as implicit grammatical knowledge,
vocabulary ability, or working memory, for example. Conducting validation research requires the
construct the test is intended to assess to be defined in sufficient detail to be used to guide devel-
opment of assessments and to formulate hypotheses about how those assessments will perform.
Validation research serves as an empirical check on how well the construct theory accounts for,
or explains, results (Cronbach & Meehl, 1955; Cronbach, 1971).
Beyond construct validity, validation needs to take into account issues of relevance and utility,
value implications, and the social consequences of testing. These three other facets of validity are
not considered types of validity, but rather additional aspects of test interpretation and use that
researchers or language test developers need to include in a validity investigation. Messick’s four-
faceted definition expanded validity inquiry beyond the technical concerns normally associated
with quantitative research methodology (Campbell & Fiske, 1959) and moved validity into the
domains of values and social practice.
By thinking outside the technical box, Messick engaged the philosophy of science in his dis-
cussion of validity, which in turn ignited debates over ontological, epistemological, and axiologi-
cal aspects of validation in educational and psychological assessment (e.g., Borsboom, 2006;
Markus & Borsboom, 2013). Put simply, the ontological debate centers on whether the pro-
cess of testing seeks the truth about the individuals tested, or evidence about useful constructs.
The epistemological question raised is whether inferences from a given test can be proven to
be valid by demonstrating statistically that the response data fit the measurement model of the
construct, or whether inferences must be supported through a variety of evidence to be used in a
judgment about validity. The axiological dimension centers on whether the process of validation
should be considered value-neutral (as the truth/proof side holds) or value-laden (as the
utility/evidence side maintains). Messick’s use of the expression “evaluative judgment” signals
his affiliation with the utility/evidence side of the debate, which is largely the position taken in
the field of language testing.
Messick’s definition of validity has played a significant role in educational and psychologi-
cal testing as demonstrated by its overall acceptance by the Joint Committee of the American
Educational Research Association, the American Psychological Association, and the National Council
on Measurement in Education, which is responsible for producing and updating the Standards
for Educational and Psychological Testing (AERA, APA, & NCME, 2014), referred to as the
Standards hereafter. The Standards expresses a consensus view of these three professional organ-
izations, which arguably affect practice throughout the world (Zumbo, 2014), including the inter-
national field of language testing.
Bachman’s (1990) seminal work, Fundamental Considerations in Language Testing, drew
heavily on these central threads of Messick’s discussion of validity, and eventually, Messick’s
perspectives took root in language testing (e.g., McNamara & Roever, 2006). However, with
only a few exceptions (e.g., Chapelle, 1994, 1998; Douglas, 2001), Messick’s influence went
unnoticed in SLA. Even in language testing, Messick’s validity framework was recognized as
too complex to guide validation practices. Instead, Bachman’s work set into motion two decades
of framework building in language testing with attempts to put into practice the broad concerns
laid out by Messick (Bachman & Palmer, 1996, 2010; Norris, 2008; Weir, 2005). Perhaps the best
known, Bachman and Palmer’s (1996) test usefulness framework has been used to guide valida-
tion studies (Chapelle et al., 2003) and is still used to introduce concepts in language testing to
teachers.
Like language testers, researchers in other areas of educational and psychological measure-
ment are interested in applying Messick’s ideas (Newton & Shaw, 2014). To address this need,
argument-based validity has been developed primarily by Kane (1992, 2001, 2006, 2013) as a
pragmatic approach for putting into practice Messick’s perspective on validity. According to
Kane, “to validate an interpretation or use of measurements is to evaluate the rationale, or argu-
ment, for the claims being made, and this in turn requires a clear statement of the proposed
interpretations and uses and critical evaluation of these interpretations and uses” (Kane, 2006, p.
17). Kane outlined a process of validation that begins with a set of claims that the tester wants to
make about test interpretation and use, rather than with a set of criteria or qualities that have been
set by someone else. The process continues with research required to support the inferences that
need to be made to draw the conclusions, which serve as claims in the validity argument. Finally,
the results of research are interpreted in view of the degree of support they offer their respective
inferences, and an overall judgment is made about the validity of the claimed interpretations and
uses of the test. Argument-based validity, as presented by Kane across a number of papers from
the early 1990s, has been influential in language testing (Bachman, 2005; Chapelle et al., 2008,
2010; Cheng & Sun, 2015; Llosa, 2008; Norris, 2008; Youn, 2015).


Key Concepts

Test content: The material appearing on the test used to elicit performance from the test taker. Content appears as listening or reading passages, texts to be completed, or individual multiple-choice items, for example.

Performance: The test takers’ spoken or written language (or other responses) produced in response to test prompts.

Reliability: The consistency or stability of test scores. Reliability refers to test scores, not to tests. Reliability can be estimated in a number of different ways, each of which takes into account different types of error (inconsistency); a minimal computational sketch follows this list.

Construct: A meaningful interpretation, or explanation, of performance consistency. Constructs can be posited to be real entities or constructions created by testers, or a combination of the two.

Target context: The context of interest to test score users when they interpret test scores; the domain where test takers are expected to be able to perform if they achieve high test scores.

Test use: Actions and decisions taken on the basis of test results (e.g., placement into an instructional program, certification for a job, continuation of research, conclusion about a research question).

Consequences: Results emanating from test use, including increased knowledge in the field, students placed at an appropriate level, and education of teachers about important aspects of language.

Values: Judgments about importance, worth, or usefulness of aspects of testing, including an approach to construct definition, the selection of test content, and the identification of research participants.

Inference: A deduction or logical step taken in moving from grounds consisting of an observation or accepted statement to a conclusion consisting of new knowledge.

Claim: A statement about the proposed interpretation for a particular purpose of testing that requires support from evidence.

Warrant: A statement whose credibility needs to be substantiated to support an inference.
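To make the reliability entry concrete, here is a minimal computational sketch of one common estimate, internal consistency (Cronbach’s alpha). The response matrix, sample size, and function name are hypothetical illustrations, and alpha is only one of the estimation approaches the entry mentions.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Estimate internal consistency from a persons-by-items score matrix.

    item_scores: shape (n_persons, n_items); rows are test takers,
    columns are scored items (e.g., 0/1 for dichotomous items).
    """
    n_items = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)      # variance of each item
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n_items / (n_items - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 6 test takers, 4 dichotomously scored items.
responses = np.array([
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

As the entry above stresses, the estimate attaches to these scores from this sample, not to the instrument itself; a different administration would yield a different estimate.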

Key Issues
Key issues in applying argument-based validity in both language assessment and SLA research
stem from the understanding of validity articulated by Messick (1989). As the Standards put it,
“statements about validity should refer to particular interpretations for specified uses” of tests
(AERA, APA, & NCME, 2014, p. 1). Since it is the interpretations and uses of the test that are
the objects of validation, “it is incorrect to use the unqualified phrase ‘the validity of the test’”
(AERA, APA, & NCME, 2014, p. 1). In other words, validity is not treated as a quality or entity
that can be attributed to a test and carried with the test for all prospective uses and in all contexts.
Instead, validity refers to an argument that includes particular claims about the interpretations
and uses of the test. The key issues in validity are found in the concepts required to build such
arguments, the most basic of which is inference.

Inferences Underlie Assessment Use


An inference is a logical step taken in moving from an observation or accepted statement to a new
insight or understanding. The new understanding obtained from construct-related inferences in
language testing includes conclusions that the test user draws about the learner’s language knowl-
edge, performance, or ability (Chalhoub-Deville et al., 2006). SLA researchers are typically
interested in drawing inferences about precise aspects of learners’ knowledge such as implicit

knowledge of certain grammatical features; making precise inferences requires stating a precise
construct definition (Ellis, 2005). A construct-related inference, called explanation, is only one
of several inferences made when an assessment is interpreted and used (Chapelle et al., 2010).

Table 2.1 Inferences and their meanings with the types of claims that each leads to in a validity argument. Each entry gives the inference, its meaning in a validity argument, and (after the arrow) the general claim serving as conclusion.*

Consequence Implication: The test user believes that the stated positive consequences will result from intended test use. → Test score use results in positive consequences.

Utilization: The test user concludes that the test scores will serve well for their stated use. → The test scores are useful for their intended purpose.

Extrapolation: The test user accepts that the score meaning extends to the target context. → The test scores represent the performance of interest in the target context.

Explanation: The test user surmises that the score meaning is explained by the defined construct. → The test scores reflect the intended construct.

Generalization: The test user concludes that the test has produced reliable scores in relevant research contexts. → Scores reflect performance consistency.

Evaluation: The test user trusts that relevant test takers’ performance has been summarized accurately in scores. → The test scores accurately summarize relevant performance.

Domain Definition: The test user accepts that the analysis of the domain of interest has been carried out adequately to create relevant assessment tasks with appropriate content. → Appropriate observations of test takers’ performance have been obtained.

* A validity argument for a particular test interpretation and use would contain a tailored version of these general claims to address the specific testing purpose.
Table 2.1 presents seven types of inferences that test users may make when they interpret test
scores for particular uses. The meaning of each inference is given in Table 2.1 in addition to the
wording of a general claim that each inference would lead to in a validity argument. For exam-
ple, when a generalization inference is made, the test user concludes that the test has produced
reliable scores, which allows for a claim that the scores reflect performance consistency. Since a
construct is a meaningful interpretation of performance consistency observed on an assessment,
this claim is needed to serve as the basis, or grounds, for the inference about the construct, which
is called the explanation inference. In language testing, it is not unusual to see validity arguments
with six or seven inferences, but the number and type of inferences depend on the assessment
and its purpose. Kane did not specify a certain number of inferences that validity arguments must
contain.
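To make the chaining explicit, the following is a minimal sketch, under hypothetical class and field names, of how the conclusion of each inference in Table 2.1 serves as the grounds for the next. It illustrates the logic only; it is not an established notation from Kane or the language testing literature.

```python
from dataclasses import dataclass

@dataclass
class Inference:
    name: str        # e.g., "generalization"
    grounds: str     # the claim the inference starts from
    conclusion: str  # the claim the inference leads to

# A fragment of the Table 2.1 chain: the conclusion of each inference
# serves as the grounds for the next inference up the chain.
chain = [
    Inference("evaluation",
              grounds="Observed performances on the assessment tasks",
              conclusion="The test scores accurately summarize relevant performance."),
    Inference("generalization",
              grounds="The test scores accurately summarize relevant performance.",
              conclusion="Scores reflect performance consistency."),
    Inference("explanation",
              grounds="Scores reflect performance consistency.",
              conclusion="The test scores reflect the intended construct."),
]

# Sanity check: the argument is well formed only if each conclusion feeds forward.
for lower, upper in zip(chain, chain[1:]):
    assert lower.conclusion == upper.grounds, f"gap between {lower.name} and {upper.name}"
```

Reading the chain bottom-up reproduces the reasoning in the paragraph above: evaluation yields accurate scores, generalization establishes performance consistency, and explanation interprets that consistency as the construct.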

Validation Research Investigates Inferences


Identifying inferences that underlie interpretation and use of assessments is important because
validation research investigates the validity of these inferences. The distinction between investi-
gating the validity of the test and the validity of inferences is important because the former results
in problematic conclusions. For example, Erlam’s (2006) validation study of an elicited imitation
task as a measure of L2 implicit knowledge concluded that “results suggest that as a measure of
implicit language knowledge, [the test] has both validity and reliability” (p. 489). If the test itself
has the attributes of validity and reliability, can anyone create an elicited imitation test
in the model of Erlam’s and use it for any purpose requiring a language test? Can it be used for
making university admissions decisions, precluding, for example, the need for the IELTS, which
was developed by the University of Cambridge Local Examinations Syndicate in the UK?

Someone looking to SLA research for a convenient assessment for university admissions
should, of course, be advised to more carefully consider the adequacy of what the test measures
in view of the intended test use. The test itself does not have validity. If the prospective test user
is serious about validating the intended interpretations and uses, the user would need to identify
the claims and inferences underlying the new test use. The relevant claims would be something
like claims one through seven below. The inference leading to each claim is given in parentheses.

1. The EI test yields positive consequences for test takers and test users (consequence
implication).
2. The test scores are useful for distinguishing among students who are more and less ready
linguistically for university study (utilization).
3. The test scores reflect the levels of language performance that students are likely to display
in their academic work (extrapolation).
4. The scores reflect the varying levels of implicit language knowledge of the test takers
(explanation).
5. The scores reflect performance consistency (generalization).
6. The scores accurately summarize performance relevant to implicit language knowledge
(evaluation).
7. The domain of implicit language knowledge was adequately analyzed to create test tasks that
elicit relevant performance (domain definition).

The validation research driven by such claims could include the following:

1. Interviews with students about their test preparation practices if the new test were instituted.
2. Observations of the performance of students who would be admitted based on EI test scores.
3. Correlations of EI test scores with scores on language performance in university classes.
4. Retrospective interviews about test-taking processes.
5. Estimation of internal consistency reliability, taking into account the number of test tasks.
6. Item analysis to calculate difficulty, discrimination, and model fit.
7. Focus groups conducted with content experts to explore the range of content coverage desired for a test.

These are examples of the variety of qualitative and quantitative methods that can be undertaken for validation. The choice of method and specific study design depends on the detail in the validity argument.
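As an illustration of the quantitative items in this list, the sketch below computes item difficulty and discrimination (method 6) and a criterion correlation (method 3) from a hypothetical response matrix. Reliability estimation (method 5) could follow the alpha calculation sketched in the Key Concepts section, and model fit would require dedicated IRT software, so neither is repeated here.

```python
import numpy as np

# Hypothetical dichotomous responses: rows are test takers, columns are EI test items.
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
])
totals = responses.sum(axis=1)

# Item difficulty: proportion of test takers answering each item correctly.
difficulty = responses.mean(axis=0)

# Item discrimination: correlation of each item with the total of the remaining
# items (corrected item-total correlation, excluding the item's own contribution).
discrimination = np.array([
    np.corrcoef(responses[:, i], totals - responses[:, i])[0, 1]
    for i in range(responses.shape[1])
])

# Hypothetical criterion: scores on language performance in university classes (method 3).
criterion = np.array([3.7, 2.9, 2.1, 3.9, 3.2, 2.5])

print("difficulty:", np.round(difficulty, 2))
print("discrimination:", np.round(discrimination, 2))
print(f"criterion correlation = {np.corrcoef(totals, criterion)[0, 1]:.2f}")
```

In a real study, the interpretation of each of these statistics would be tied back to the specific claim it supports (evaluation, generalization, or extrapolation), rather than reported as free-standing test qualities.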

Research Results Show (Lack of) Support for Inferences


Validation results need to be interpreted in terms of what they mean for the inferences in a valid-
ity argument that specifies the claims, inferences, and warrants in need of support for the speci-
fied test use. Gutiérrez’s (2013) investigation of a grammaticality judgment test (GJT) illustrated
the need for the specifics of the validity argument to guide the interpretation of results. Gutiérrez
introduced GJTs as having been used in past decades of research “in SLA as a measure of learners’
linguistic ability in the L2” (p. 425). If a validation study is to be undertaken to assess the support
for the claim that a GJT assesses linguistic ability in the L2, the meaning of “linguistic ability in
the L2” needs to be defined more precisely, and the warrants underlying the inference to be made
need to be specified. However, in the paper, the logic linking linguistic ability in the L2 with
implicit and explicit grammatical knowledge is not presented. Nor is the precise validity claim
that is investigated in the study. The final paragraph of the discussion of the construct allows
readers to infer that the claim is about implicit knowledge because it indicates that previous
researchers have “expressed their concerns regarding timed GJTs as a valid measure of implicit
knowledge” (p. 429). Based on this introduction and the research that was carried out, the validity claim with its inference, warrants, and conclusion might be presented using the argument-based validity notation outlined in Figure 2.1.

[Figure 2.1: an argument diagram read from bottom to top. Grounds: Grammaticality judgment test scores are reliable. Explanation inference, licensed by Warrant 1: All test items contribute to assessment of a construct consisting of one aspect of knowledge; and Warrant 2: Variation in response conditions does not affect the assessment of the construct. Claim serving as a conclusion: Scores reflect test takers’ implicit grammatical knowledge.]

Figure 2.1 Outline of an argument supporting the claim that scores on a grammaticality judgment test should be interpreted as an assessment of implicit grammatical knowledge.
Figure 2.1 illustrates the logic underlying the study, which investigates whether sup-
port is found for an inference underlying the claim that the scores on the GJT reflect test takers’
implicit grammatical knowledge. Specifically, the two warrants state the empirically testable
statements that, if supported by the data, would warrant the explanation inference leading to the
claim. The warrants are that all the test items contribute to assessment of a single construct and
that variation in response conditions does not affect the assessment of the construct. Support for
neither warrant was found. The lack of support for Warrant 1 was interpreted from the process-
ing differences found for the two types of items and from data indicating that the items with
grammatical and ungrammatical sentences loaded on different factors. The author concluded
that “differences may be interpreted as [indicating] these two types of sentences [are] measures
of implicit and explicit knowledge, respectively” (p. 444). The lack of support for Warrant 2 was
interpreted from the finding that “the two dimensions of time pressure and task stimulus signifi-
cantly affect the L2 learners’ performance on those tests” (p. 444).
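To illustrate the kind of evidence that bears on Warrant 1, the sketch below runs a rough dimensionality check on hypothetical GJT responses: if all items assess a single construct, one dominant component should account for most of the shared variance, whereas two comparable components would be consistent with Gutiérrez’s finding that grammatical and ungrammatical items loaded on different factors. A principal-component decomposition stands in here for the factor analysis reported in the study, and the data are invented for illustration.

```python
import numpy as np

# Hypothetical GJT response matrix: rows are learners; the first three columns are
# grammatical sentences, the last three ungrammatical ones (an assumption for illustration).
responses = np.array([
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0],
], dtype=float)

corr = np.corrcoef(responses, rowvar=False)   # inter-item correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # sorted largest first
proportion_first = eigenvalues[0] / eigenvalues.sum()

# A dominant first component is at least consistent with Warrant 1; two comparable
# components would instead suggest the two item types measure different things.
print(f"first component explains {proportion_first:.0%} of the variance")
```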
If the study had been framed as shown in Figure 2.1 (as an argument containing statements
in need of support for a stated claim), interpretation of the results for the validity of the GJT as a
measure of implicit grammatical knowledge in SLA research could have been stated in the con-
clusion of the article. A clear statement about validity would be useful in the conclusion of an arti-
cle whose title is “The construct validity of grammaticality judgment tests as measures of implicit
and explicit knowledge.” Lacking such clarity, however, the author’s conclusion about construct
validity appears to be, “the results of the study also emphasize the need to verify whether or not
the instruments used in a study actually measure what they are intended to measure” (Gutiérrez,
2013, p. 445).

Recommendations for Practice


The issues of identifying inferences, planning relevant research, and making clear interpretations
are likely to increase in importance for SLA research going forward. These issues are not new
in the field (e.g., Bachman & Cohen, 1998; Chalhoub-Deville et al., 2006), but technology has

expanded both the types of assessments used and their accessibility to a broad array of prospec-
tive users in the field and beyond. In this environment, where clarity is needed about the justified
interpretations and uses of assessments, individual researchers might be advised as follows:

• (Re)learn the basics. Fundamental characteristics of assessment and reporting of assessment
results need to be learned before undertaking assessment in research. Purpura et al. (2015)
provided a list of assessment-related errors commonly appearing in SLA research articles
(e.g., reporting the wrong reliability estimates). It will be difficult to move the field for-
ward without some common ground in assessment knowledge among researchers. In short,
research cannot be interpreted unless assessment results can be interpreted.
• Understand your purpose, audience and context. To write the specific statements required
for a validity argument, researchers need to be able to express the specific characteristics
of the interpretations and uses of the assessment tasks they are working with. Context-
specific statements are needed. Assessments do not attain validity by being used in previ-
ous studies.
• Explain the rationale for context-specific assessment decisions reported in research.
Rationales are needed for choices made during assessment construction, adaption, or adop-
tion; for sample selection; for the logic behind statistical calculations; and for the selec-
tion of precisely what results are reported. Rationales are key to dialogue about assessment
practices and they serve as educational tools for graduate students and for other researchers.
Writing rationales requires an understanding of the basics of assessment and the goals of
research. Rationales for interpretations and uses need to stem from rationales underlying test
development (Mislevy & Haertel, 2006; Nichols et al., 2015).
• Investigate statements rather than entities. Statements appear as claims and warrants in
validity arguments. Entities are multifaceted concepts such as reliability and validity. When
an entity named by a noun phrase such as validity or construct validity is investigated, the degree to which results
offer support is not obvious. Statements are needed because they express a proposition,
which the research results can be interpreted as supporting or not supporting.
• Interpret validation research results to elucidate the degree of support they provide for the
claims in the validity argument. Such interpretations can be made only to the extent that the
intended validity argument has been clearly stated.
• Examine the validity arguments in previous research. Progress is made in the study of com-
plex constructs and their development over time by a community of researchers who are able
to communicate about the detail of their research endeavor. Communication about assess-
ment requires concepts and language that are up to the task, in addition to researchers
who are able to use them.

In this era of open science, research results and instruments are made physically available to
anyone who wants to use them. The impetus has never been greater for addressing the theoreti-
cal, methodological, and ethical dimensions of assessment use in SLA research. The international
flow of information equally provides language testers and SLA researchers with access to, and
responsibility for availing themselves of, established practices in educational and psychological
testing. In the Standards, validity is defined as “the degree to which evidence and theory support
the interpretations of test scores for proposed uses of test scores” (AERA, APA, & NCME, 2014,
p. 1). If such widely-used statements of practice are not relevant to SLA research, alternative
frames of reference need to be articulated.
In a review article on validation in language assessment in 1999, I predicted that in the future
“language testing researchers [could] be expected to be more closely allied with these views [in
the Standards] than ever before. As a consequence, [I suggested that] for applied linguists who
think that ‘the validity of a language test’ is its correlation with another language test, now is a

good time to reconsider” (Chapelle, 1999, p. 265). Twenty-plus years on, evidence suggests
that SLA researchers are engaging with issues of validation for some assessments, as detailed in
Révész and Brunfaut, Chapter 3, this volume. With this interest demonstrated by researchers and
more clarity about the concepts and tools available in argument-based validation today, the time
may be ideal to increase the clarity of validation practices in SLA research.

Recommended Readings
Chapelle, C. A. (forthcoming). Argument-based validation in testing and assessment. SAGE Publications.
Chapelle defines the concepts in assessment required to develop and use validity arguments by situating them
historically. She illustrates how these basic concepts in assessment are made more precise and contextually
relevant by developing them into claims, inferences and warrants in validity arguments.
Lissitz, R. W. (Ed.). (2009). The concept of validity: Revisions, new directions and applications. Information
Age Publishing, Inc.
Lissitz gathers some of the key validity researchers in educational and psychological testing to present
a variety of perspectives on validity. This may be a good starting point for researchers wishing to better
understand the academic context of validation research.
Ercikan, K., & Pellegrino, J. W. (Eds.). (2017). Validation of score meaning in the next generation of
assessments. The National Council on Measurement in Education.
This volume explores the use of test takers’ process data in validation research. Process data (e.g., gathered
through think alouds, eye-tracking, and response latency data) are used in second language acquisition
research for making inferences about learning, language knowledge and strategic processes. This volume
introduces a different perspective on how such data can be used in validation research and in doing so may
suggest an expanded use for data that SLA researchers currently work with.

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 2(1), 1–34. https://doi.org/10.1207/s15434311laq0201_1
Bachman, L. F., & Cohen, A. D. (Eds.). (1998). Second language acquisition and language testing interfaces.
Cambridge University Press.
Bachman, L. F., & Palmer, A. S. (1996). Language testing in practice. Oxford University Press.
Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice. Oxford University Press.
Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71, 425–440. https://doi.org/10.1007/s11336-006-1447-6
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-
multimethod matrix. Psychological Bulletin, 56(2), 81–105. https://doi.org/10.1037/h0046016
Chalhoub-Deville, M., Chapelle, C. A., & Duff, P. (2006). Inference and generalizability in applied
linguistics: Multiple perspectives. John Benjamins Publishing.
Chapelle, C. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10(2), 157–187. https://doi.org/10.1177/026765839401000203
Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F. Bachman & A. D.
Cohen (Eds.), Second language acquisition and language testing interfaces (pp. 32–70). Cambridge
University Press.
Chapelle, C. A. (1999). Validity in language assessment. Annual Review of Applied Linguistics, 19, 254–272. https://doi.org/10.1017/S0267190599190135
Chapelle, C. A., Chung, Y.-R., Hegelheimer, V., Pendar, N., & Xu, J. (2010). Designing a computer-delivered test of productive grammatical ability. Language Testing, 27(4), 443–469. https://doi.org/10.1177/0265532210367633


Chapelle, C. A., Enright, M. E., & Jamieson, J. (2008). Building a validity argument for the test of English
as a foreign language. Routledge.
Chapelle, C. A., Enright, M. E., & Jamieson, J. (2010). Does an argument-based approach to validity make a difference? Educational Measurement: Issues and Practice, 29(1), 3–13. https://doi.org/10.1111/j.1745-3992.2009.00165.x
Chapelle, C. A., Jamieson, J., & Hegelheimer, V. (2003). Validation of a web-based ESL test. Language Testing, 20(4), 409–439. https://doi.org/10.1191/0265532203lt266oa
Cheng, L., & Sun, Y. (2015). Interpreting the impact of the Ontario Secondary School Literacy Test on second language students within an argument-based validation framework. Language Assessment Quarterly, 12(1), 50–66. https://doi.org/10.1080/15434303.2014.981334
Cronbach, L. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp.
443–507). American Council on Education.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin,
52(4), 281–302. https://doi.org/10.1037/h0040957
Douglas, D. (2001). Performance consistency in second language acquisition and language testing: A conceptual gap. Second Language Research, 17(4), 442–456. https://doi.org/10.1177/026765830101700408
Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language: A psychometric study. Studies in Second Language Acquisition, 27(2), 141–172. https://doi.org/10.1017/S0272263105050096
Erlam, R. (2006). Elicited imitation as a measure of L2 implicit knowledge: An empirical validation study.
Applied Linguistics, 27(3), 464–491. https://doi.org/10.1093/applin/aml001
Gutiérrez, X. (2013). The construct validity of grammaticality judgment tests as measures of implicit and explicit knowledge. Studies in Second Language Acquisition, 35(3), 423–449. https://doi.org/10.1017/S0272263113000041
Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112(3), 527–535. https://doi.org/10.1037/0033-2909.112.3.527
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342. https://doi.org/10.1111/j.1745-3984.2001.tb01130.x
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Praeger and Greenwood Publishing.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational
Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Llosa, L. (2008). Building and supporting a validity argument for a standards-based classroom assessment of English proficiency based on teacher judgments. Educational Measurement: Issues and Practice, 27(3), 32–42. https://doi.org/10.1111/j.1745-3992.2008.00126.x
Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and
meaning. Routledge.
McNamara, T., & Roever, C. (2006). Language testing: The social dimension. Blackwell Publishing.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). Macmillan
Publishing Company.
Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25, 6–20. https://doi.org/10.1111/j.1745-3992.2006.00075.x
Newton, P. E., & Shaw, S. D. (2014). Validity in educational & psychological assessment. Sage Publications.
Nichols, P., Ferrara, S., & Lai, E. (2015). Principled design for efficacy: Design and development for the
next generation tests. In H. Jiao & R. W. Lissitz (Eds.), The next generation of testing: Common core
standards, smarter-balanced, PARCC, and the nationwide testing movement (pp. 49–81). Information
Age Publishing.
Norris, J. (2008). Validity evaluation in foreign language assessment. Peter Lang.
Purpura, J., Brown, J., & Schoonen, R. (2015). Improving the validity of quantitative measures in second
language research. In J. M. Norris, S. Ross, & R. Schoonen (Eds.), Improving and extending quantitative
reasoning in second language research: Currents in language learning (Vol. 2). Wiley-Blackwell.
https://doi.org/10.1111/lang.12112
Weir, C. (2005). Language testing and validation: An evidence-based approach. Palgrave Macmillan.
Youn, S. J. (2015). Validity argument for assessing L2 pragmatics in interaction using mixed methods.
Language Testing, 32(2), 199–225. https://doi.org/10.1177/0265532214557113
Zumbo, B. (2014). What role does, and should, the test Standards play outside of the United States of America? Educational Measurement: Issues and Practice, 33(4), 4–12. https://doi.org/10.1111/emip.12052
