
VII

Assessment and Testing


32
Social Dimensions of Assessment and Testing
Bernard Spolsky

What is social about language assessment or testing?1 While the word “social” derives from the Latin
word for allies, in modern usage it usually refers to human (or similar animal) group interaction.
In practice, however, in language assessment, it is much more limited in scope, referring usually to
the responses of a test taker to the instructions of a test (or the test maker) to perform in a specified
way,2 or to the ethics of using a test result. In most assessment of the language proficiency of candi-
dates (whether for control of education, qualification for employment, or eligibility for immigration
or citizenship), the central goal of the process is to produce and interpret a test score rather than to
analyze the detailed responses, something that is relevant to diagnostic testing (Spolsky, 1992). To
the extent that education, employment, and immigration all affect the life of the candidates, it has
become increasingly recognized that, as Shohamy (2001) has dramatically asserted, tests, including language tests, are used to implement the power of the tester over the candidate.
This brings us to the realization that what we need to be talking about is the impact of a test, and in
particular its social, political, and economic impact. The classic example of a language test with social
impact is the shibboleth test in the Bible, where a dialectal variation was used to recognize members
of an enemy tribe, who were then killed. The modern equivalents are the language tests used (often
erroneously) to identify the country of origin of asylum seekers, a procedure that fails to recognize
that the subjects tested may have picked up the specific linguistic feature from a teacher or from
residence in a different area.
A more complex but equally powerful example of social impact was the early development of
English language tests intended to close a loophole in the 1924 Immigration Act passed in the US as a
result of the anti-foreigner feelings in the post-war period (Spolsky, 1995). The aim of the law was to
set national immigration quotas that would exclude prospective immigrants from regions other than
northern Europe, because the psychometrist Carl Brigham had given evidence in Congress about
the deleterious result of allowing non-Nordic immigrants to “contaminate the American gene pool.”
The loophole was an exception for applicants who were seeking admission to the US as students; a
student visa was automatic for anyone admitted to a US school, college, or university approved by
the Secretary of Labor.3 On being shown a certificate of admission, local US consuls were empowered
to issue a visa. Taking advantage of this loophole, foreign applicants to US institutions grew rapidly in number.
Worried by this development, in 1926 the Commissioner General of Immigration wrote a memo-
randum in which he stated that “many non-quota immigrant students gain admission to the United
States totally unfit, because of insufficient knowledge of the English language.” He went on to request
that all schools “indicate in the certificate of admission the exact knowledge of English-language the
student must have before he can be accepted.” As a result, requests were made to the College Entrance
Examination Board to develop a test to measure knowledge of English; one such request for an
examination was made by the American Association of Collegiate Registrars in 1929.
The College Board set up a commission of university admissions officers and English teachers,
which agreed that such an examination was desirable and feasible but that candidates should pay the
full cost. A smaller commission proposed a detailed outline of the examination. It should have five
parts: true-false comprehension questions based on four one-paragraph passages, a longer passage
with questions that would demonstrate the ability to deal with hypothetical and adversative state-
ments, a direct dictation, and a 250- to 300-word composition. Reports should be given for each
part of the examination, and the examination papers should be sent to the colleges so they could
determine the level of work of which the candidate was capable. Although the test was developed at
the initiative of the US government, there was no government funding; a grant of $5,000 from
the Carnegie Endowment for International Peace helped pay for development. While there were
problems in setting up the administration and supervision of the examination, some 30 candidates
took the exam in 1930, and most appeared to have done well. In the second year, 1931, 139 candidates
paid a fee of $10 to be examined in 17 countries, the largest number being 82 engineering students
in Moscow. In 1932, only 30 candidates at 12 centers took the examination, a result of the worldwide
depression. The following year, only 17 students took a revised version of the test, and in 1934, 20
candidates. In 1935, the grant was exhausted and the test was discontinued.4
A decade later, a US government agency, this time the Department of State, once again asked for
a test of English proficiency for foreign students. The College Board set up a committee, which included
two linguists, J. Milton Cowan and Charles Fries,5 to design a test. The first form of the English
examination for foreign students was ready in 1947, but there were difficulties in administration.
The single version of the test was passed on to Educational Testing Service in 19486 and available for
purchase for a number of years, although there were no normative or validity studies.
For the next few years, English language testing for student visas was left to local consular offices,
and many different tests were used. However, by 1960, American academic and governmental circles
had all been persuaded of the insecurity and inadequacy of the available English language tests for
foreigners. In one of a number of major initiatives in the field of language policy (Spolsky, 2011),
Charles Ferguson of the Center for Applied Linguistics sponsored a meeting in Washington, DC, on
the topic, which led to a call for the development of a secure and up-to-date test (Center for Applied
Linguistics, 1961). The result of some years of effort backed by Ford Foundation funding was the
creation and implementation of the Test of English as a Foreign Language (TOEFL; Spolsky, 1995).7
The process had been complex, but the product was a test that was able to exploit the growing recognition of the demand for an objective measure of English language proficiency; by the end of the
20th century, this test competed with two other major industrial English language proficiency tests,
one developed by the University of Cambridge and the second offered by Pearson Education, an
international educational corporation that was aware of the huge profits to be made.8 The growth
of the testing industry changed the nature of the profession from cottage industry to consolidated
enterprise (Clark & Davidson, 1993). What then, we may ask, was the social impact of these mass
industrial tests?
My first suspicions arose when I thought about the use being made of English test results to deter-
mine admission to US universities. As someone responsible for such testing, I could see the purely
academic arguments for such a criterion: without adequate English, a foreign student would not be
able to follow lectures or read the required material. But which foreign students were likely to already
know enough English before they came to the US? Knowing what I did about the state of English
teaching in the world, it would be those who had attended the best schools, in other words those
whose families were rich enough or well-established enough to send their children to a good school.
The social effect of English language teaching, therefore, was to strengthen the power of the social,
economic, or political elite (Spolsky, 1967). If, instead of using knowledge of English as a gatekeeping
mechanism, we had concentrated on providing courses in the foreign language or adequate English
language preparation, we could have helped many bright students obtain access to an American
university education. It is true that this was an unintended consequence of English language testing,
but it reveals the social impact of a test.
Following the argument of Searle that examination questions are probably better interpreted as
requests for performance, Edelsky, Altwerger, Flores, Hudelson, and Jilbert (1983) argued that there
are many social situations in which some students are not willing to perform. In a classic sociolin-
guistic study, Labov (1970) showed that an adult interviewer inhibited the speech performance of
young black students, whose fluent proficiency only became obvious when he left the room. What
this adds up to is the social effect of the testing situation, one in which minority students are likely to
fail to perform at their level. Of course this refers not just to language tests but also to tests in general,
in particular to the high-stakes tests that lead to so many suicides in Japan and that completely failed
to raise the standards of education in the US when built into The No Child Left Behind Act of 2001.9
Language tests in particular have been used as a gatekeeping device, particularly in order to keep out
unwanted immigrants, refugees, and asylum seekers (Spolsky, 1997).
For many years, most scholarship in language testing chose to ignore sociolinguistic issues, choos-
ing rather to follow the approach of psychometrics, a field that had been established as a method of
overcoming the “unavoidable uncertainty” of examinations (Edgeworth, 1890) by doing everything
possible to reduce unreliability. While there were earlier complaints about the effects of testing (such
as Latham, 1877; Spolsky, 1981, 1984), the opening of the field to serious consideration of social
impact only followed Lyle Bachman’s (2005) acceptance of the inclusion of impact in the definition
of test validity by Messick (1980, 1989). This unlocked the doors of language testing conferences and
journals to consideration of the growing concern about social impact and ethical issues in the use of
language tests. The breakthrough, recognized in a number of papers, such as Davies (1997b), Farhady
(1999), Fulcher (1999), Hamp-Lyons (1997), Lynch (1997), Shohamy (1997), Spolsky (1997), and
Stansfield (1993), has now been usefully summarized in a Language Learning Monograph by McNa-
mara and Roever (2006), which traces the social dimension in testing and language testing and then
goes on to deal with the major trends in its development.
McNamara and Roever start by noting the recognition by Cronbach (Cronbach & Meehl, 1955)
of the importance of including in construct validation the social dimensions of testing, and the sub-
sequent inclusion by Messick (1980, 1989) of the social impact of test score use in his definition of validity. For language testing, the important breakthrough was the incorporation of notions of communicative language ability by Bachman (1990) into language testing theory.10 Bachman (2005, 2015), by insisting on dealing with the impact of test use, makes the social dimension central to validation.
McNamara and Roever next consider two important aspects of the social dimension of lan-
guage testing, first the possibility of the testability of language proficiency as revealed by studies of
the sociopragmatics of face-to-face oral testing, and second the issues of bias and fairness. Bias and
fairness were tackled, they explain, by psychometrists who worked to demonstrate how to modify
test items to avoid discrimination against populations defined by, for example, gender. But the
language testing profession was starting to recognize wider social problems, and the result was the
development of codes of ethics and practice modeled on those used by the American Educational
Research Association and the American Psychological Association (APA).11 Each of the language
testing associations and some industrial testing agencies have developed their own guidelines for
ethical testing (Fulcher, 1999; International Language Testing Association, 2000; Spolsky, 2014;
Thrasher, 2004).

The issue of fairness has been discussed in a number of recent papers, responding to a proposal
by Xi (2010) that fairness is basic to validity, as the “social consequences, such as impartiality and
justice of actions and comparability of test consequences, are at the core of fairness.” Davies (2010)
disagrees: “I think the pursuit of fairness in language testing is vain: first because it is unattainable
and second because it is unnecessary. Everything Xi claims for fairness is covered by validity.”12 Her
claim, Davies says, adds nothing to an understanding of validity. Kane (2010) rather sees the two con-
cepts as asking the same basic question: “Are the proposed interpretations and uses of the test scores
appropriate for a population over some range of contexts?” Kunnan (2010) traces his own earlier
plea for adding fairness to validity studies. In a recent study of the philosophical basis for language
testing, Fulcher (2015) cites McNamara and Roever (2006) as providing a utilitarian argument for
fairness in validation.
Having described the addition of a social dimension to language testing theory and the con-
sequent concern of language testers and agencies for ethical issues raised by the social impact of test
use, McNamara and Roever go on to deal with some specific issues. They start with the use of lan-
guage tests by governments for identification of group membership. Perhaps this question will now
be swamped by the larger phenomenon of mass refugees fleeing murder and persecution in Arab
countries, but as long as the numbers were more manageable, language tests were regularly used to
control immigration and to determine who was entitled to status as asylum seekers. The first and
best-known example was the Australian dictation test, under which immigration officials were told to give
a dictation test in a language that an undesirable immigrant did not know (Davies, 1997a).13 Apart
from the shibboleth tests, more elaborate language or literacy tests have recently been used, they
report, to weed out ethnic Germans from Soviet countries seeking residence in Germany, to keep
down the number of Ingrian Finns trying to return to Finland from Russia, and in native title claims
in Australia. Recent studies have looked at this use of language tests to determine the eligibility of
refugees for status as asylum seekers (Eades, Fraser, Siegel, McNamara, & Baker, 2003), showing that
apart from ethical concerns, these tests regularly lack evidence of validity and reliability. Language
tests, McNamara and Roever (2006) point out, are also regularly used in deciding eligibility for citi-
zenship, with serious problems in setting criterial levels (Bessette, 2005). All of these official uses of
usually inadequate language tests for decisions about identity and citizenship emphasize the power of
tests as social instruments. In recent reviews, Hogan-Brun, Mar-Molinero, and Stevenson (2009) and
Saville (2013) continue to discuss the use of language tests in immigration and asylum decisions. A
recent paper by McNamara, Van Den Hazelkamp, and Verrips (2014) proposes a research agenda to
consider the basis for language testing as a method of assessing the origin of asylum seekers.
In the next chapter, McNamara and Roever (2006) survey briefly the social, political, and cultural
impact of school language testing, continuing a tradition which we noted earlier as beginning with
Latham (1877) and that is still active today with the current controversy over testing used to establish
centralized control of US education.14 There have been a number of recent developments in the area
of school language testing in which social arguments challenge the process. One has been the increas-
ing recognition of language diversity, bilingualism, and multilingualism. It is now widely accepted
by sociolinguists that multilingualism is a positive phenomenon (Blommaert, Leppänen, & Spotti,
2012), replacing some popular and earlier psychological assumptions that bilinguals were necessarily
less proficient in both languages.15 In a recent survey of European policies on language proficiency,
Van Avermaet (2009) praises the emphasis on multilingualism in the Common European Frame-
work of Reference for Languages (Council of Europe, 2001). This idea is echoed in the arguments
of Shohamy (2011) for bilingual and multilingual testing, presenting the notion that one result of
growing linguistic diversity is that children commonly have proficiency in more than one language.16
At the same time, concern about the growing power of industrial testing is leading to a backlash and
a growing resistance to testing.17 But even after fully acknowledging the attention that social dimensions of language now receive since linguistics expanded beyond its Saussurean constraints, Young (2012) ends with a somewhat pessimistic prediction about the isolation of supporters of ethical language testing from the mainstream.
It is true that many language testers no longer ignore the social dimension. Apart from papers
dealing specifically with social issues, the insistence of Bachman (2005, 2015) on the centrality of the
use of test scores to validation has been influential. There is a growing tendency to note the specific
social context of language testing studies. Illustrating this, about half of the papers in the two most
recent issues of Language Testing and of Language Assessment Quarterly show a concern about the
social dimension. While Fulcher (2015) is mainly concerned with the philosophy of language testing,
he does admit in the preface that “all testing has a social purpose” and that this purpose is commonly
to support a meritocracy. At the same time, he presents in his concluding chapter the arguments for
ethics and fairness that are now accepted by many language testers and finds in Robert Lado’s (1961)
philosophy of testing an echo of the optimistic values of human potential presented by Enlighten-
ment philosophers.
A selection of papers from the 2005 Language Testing Research Colloquium (Fox, Wesche, & Bayl-
iss, 2007) concludes with two papers that emphasize social aspects, one by McNamara (2007), which
calls for exploration of theories of the social context of language testing, and one by Shohamy (2007),
which reiterates the non-neutrality of language tests and the need for language testers to understand
the power of language tests in creating language policy and the need to acknowledge the positive
aspects of language diversity. The influence of the social dimension on current language testing is
also clear if one looks at the contents of a recent handbook (Fulcher & Davidson, 2013); it opens
with three articles on validity; in part three it includes four articles on the social uses of language
testing; and it concludes with four articles on ethics and language policy; thus, one-third of the book
is directly focused on social aspects, which are not ignored in other chapters.
Thus, my perhaps premature (Spolsky, 1977) claim that language testing had moved to a psycho-
linguistic-sociolinguistic stage is now fully justified, at least in the work of many language testing
theorists, if only weakly in implementation, hampered by the growth and profitability of industrial
testing and by the testing illiteracy of the general public (Taylor, 2009). But I share Young’s (2012)
pessimism: “Tests, like guns, can regularly appear to be misused. . . . I deplore the misuse of language
tests for management as a superficially attractive but fundamentally flawed policy” (Spolsky, 2013:
502–3). These dangers have become clearer as we look at the social aspects of language testing and
of industrial testing in general.

Notes

1. The use of the term “assessment” alongside “testing” is perhaps a recognition that there is estimation rather than measure-
ment involved, a more skeptical approach to dealing with human abilities.
2. Test questions are unlike normal conversational questions, where there are some well-formedness conditions (Searle,
1969): the person answering is assumed to know the answer, the person asking is assumed to want the answer, and the
asker is assumed to have a right to ask for this answer from this person. For example, if I ask someone how old he or she
is, it should be when I have a reason to ask and the right to ask, such as when I am trying to decide whether a young boy
is old enough to be counted for a Jewish prayer quorum. But in a test question, we assume that the person asking knows
the answer, and we want to find out if the test taker also knows it. Thus, tests are better considered as requests to perform.
3. This remains a serious problem: a story in the New York Times on 5 April 2016 reported the arrest in New Jersey of 21
people involved in student visa fraud.
4. And, therefore, it was not available for the socially relevant task of certifying the English proficiency of Jewish profession-
als escaping from Nazi persecution.
5. Fries brought a student, Robert Lado, with him to the meeting.
6. Established by the College Board to handle test development and administration, Educational Testing Service received
funding and tests from the Board.

7. The test was originally run by an independent board, but secret negotiations persuaded the Ford Foundation to delay further funding until the test was passed to the control of the College Board and the Educational Testing Service. Subsequently, the College Board was persuaded to drop out.
8. The two major competitors of Educational Testing Service are now Cambridge English and Pearson Education.
9. This has now been replaced by the Every Student Succeeds Act of 2015, which lowers the number of and emphasis on tests.
10. Bachman (1990) essentially follows Canale and Swain (1980), who accepted Campbell and Wales (1970) in treating
sociolinguistic competence as distinct from linguistic competence, rather than, as Hymes (1972, 1986) had seen it, as an alternative to the Chomsky model.
11. Despite the elaboration of a 200-page statement of Standards, the receptiveness of the APA to political and professional
rather than ethical considerations showed up in the involvement of psychologists in government-sponsored torture of
suspected terrorists.
12. I am reminded of a comment by an earlier Secretary of the University of Cambridge Local Examinations Syndicate that
it was enough if tests were “felt fair.”
13. McNamara and Roever (2006: 150) precede this with a fascinating selection of one-word shibboleth tests, including the
14th-century recognition and slaughter of non-Egyptians, the 1923 Japanese weeding out of Koreans, the 1937 massacre
of tens of thousands of Haitians by Dominicans, the Sri Lankan identification of Tamils, and the Lebanese using it to spot
Palestinians, concluding with the case of the Royal Canadian Mounted Police using physiological reactions to selected
lexicon (e.g. “fruit, trade, cruise”) to weed out homosexual recruits.
14. While the US Constitution leaves education to individual states, for a number of years the government has attempted to
influence it, at first through funding of programs and approaches it favored and more recently through required annual
testing in the Common Core of subjects.
15. A large number of research studies now support the claim of Ellen Bialystok about the cognitive advantages for chil-
dren and the elderly of bilingualism (Bialystok, 1991; Bialystok, Abutalebi, Bak, Burke, & Kroll, 2016; Thomas-Sunesson,
Hakuta, & Bialystok, 2016).
16. While researchers have drawn attention to the super-diversity of modern cities, there is good reason to believe that this is
not in fact new: many ancient and medieval cities were linguistically diverse.
17. See, for instance, reports of the National Center for Fair and Open Testing (http://www.fairtest.org/).

References
Bachman, Lyle F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.
Bachman, Lyle F. (2005). Building and supporting a case for test use. Language Assessment Quarterly: An International Journal,
2(1), 1–34.
Bachman, Lyle F. (2015). Justifying the use of language assessment: Linking test performance with consequences. JLTA Journal,
18, 2–22.
Bessette, Josee. (2005). Government French language training programs: Statutory civil servants’ experiences. Ottawa, ON: Uni-
versity of Ottawa.
Bialystok, Ellen. (1991). Language processing in bilingual children. Cambridge: Cambridge University Press.
Bialystok, Ellen, Abutalebi, Jubin, Bak, Thomas H., Burke, Deborah M., & Kroll, Judith F. (2016). Aging in two languages:
Implications for public health. Ageing Research Reviews, 27, 56–60.
Blommaert, Jan, Leppänen, Sirpa, & Spotti, Massimiliano. (2012). Endangering multilingualism. New York: Springer.
Campbell, R., & Wales, R. (1970). The study of language acquisition. In J. Lyons (Ed.), New Horizons in linguistics (pp. 241–260).
Harmondsworth, UK: Penguin Books.


Canale, Michael, & Swain, Merrill. (1980). Theoretical bases of communicative approaches to second language teaching and
testing. Applied Linguistics, 1(1), 1–47.
Center for Applied Linguistics. (1961). Testing the English proficiency of foreign students. Report of a conference sponsored by
the center for applied linguistics in cooperation with the institute of international education and the national association of
foreign student advisers. Washington, DC: Center for Applied Linguistics.
Clark, John L. D., & Davidson, Fred. (1993). Language-learning research: Cottage industry or consolidated enterprise. In A. O.
Hadley (Ed.), Research in language learning: Principles, process, and prospects (pp. 254–278). Lincolnwood, IL: National
Textbook Co.
Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge:
Cambridge University Press.
Cronbach, Lee J., & Meehl, Paul E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.
Davies, Alan. (1997a). Australian immigrant gatekeeping through English language tests: How important is proficiency? In A.
Huhta, V. Kohonon, L. Kurki-Suonio & S. Luoma (Eds.), Current developments and alternatives in language assessment:
Proceedings of LTRC 96 (pp. 71–84). Jyväskylä: Kopijyva Oy, University of Jyväskylä.

Davies, Alan. (1997b). Introduction: The limits of ethics in language testing. Language Testing, 14(3), 235–241.
Davies, Alan. (2010). Test fairness: A response. Language Testing, 27(2), 171–176.
Eades, Diana, Fraser, Helen, Siegel, Jeff, McNamara, Tim, & Baker, Brett. (2003). Linguistic identification in the determination
of nationality: A preliminary report. Language Policy, 2(2), 179–199.
Edelsky, Carole, Altwerger, B., Flores, B., Hudelson, S., & Jilbert, K. (1983). Semilingualism and language deficit. Applied
Linguistics, 4, 1–22.
Edgeworth, Francis Ysidro. (1890). The element of chance in competitive examinations. Journal of the Royal Statistical Society,
53, 644–663.
Farhady, Hossein. (1999). Ethics in language testing. Modarres Quarterly Journal, 3(2 Special Issue: Languages), 19–42.
Fox, Janna, Wesche, Mari, & Bayliss, Doreen. (2007). Language testing reconsidered. Ottawa: University of Ottawa Press/Les
Presses de l’Université d’Ottawa.
Fulcher, Glenn. (1999). Ethics in language testing. TAE SIG Newsletter, 1(1), 1–4.
Fulcher, Glenn. (2015). Re-examining language testing: A philosophical and social inquiry. London: Routledge.
Fulcher, Glenn, & Davidson, Fred. (2013). The Routledge handbook of language testing. London: Routledge.
Hamp-Lyons, Liz. (1997). Washback, impact and validity: Ethical concerns. Language Testing, 14(3), 295–303.
Hogan-Brun, Gabrielle, Mar-Molinero, Clare, & Stevenson, Patrick. (2009). Discourses on language and integration: Critical
perspectives on language testing regimes in Europe (Vol. 33). Amsterdam and Philadelphia: John Benjamins.
Hymes, Dell. (1972). On communicative competence. In J. Pride & J. Holmes (Eds.), Sociolinguistics (pp. 269–293). Harmondsworth, England: Penguin Books.
Hymes, Dell. (1986). Models of the interaction of language and social life. In D. Hymes & J. J. Gumperz (Eds.), Directions in
sociolinguistics: The ethnography of communication (pp. 41–66). Oxford: Blackwell.
International Language Testing Association. (2000). Code of ethics for ILTA. Retrieved February 2011 from http://www.iltaonline.com/code.pdf
Kane, Michael. (2010). Validity and fairness. Language Testing, 27(2), 177–182.
Kunnan, Antony John. (2010). Test fairness and Toulmin’s argument structure. Language Testing, 27(2), 183–189.
Labov, William. (1970). The logic of nonstandard English. In J. E. Alatis (Ed.), Report of the twentieth annual Georgetown round table meeting of languages and linguistics. Washington, DC: Georgetown University Press.
Lado, Robert. (1961). Language testing: The construction and use of foreign language tests: A teacher’s book. New York: McGraw
Hill Book Company.
Latham, H. (1877). On the action of examinations considered as a means of selection. Cambridge: Deighton, Bell and Company.
Lynch, Brian K. (1997). In search of the ethical test. Language Testing, 14(3), 315–327.
McNamara, Tim. (2007). Language testing: A question of context. In J. Fox, M. Wesche, D. Bayliss, L. Cheng, C. Turner &
C. Doe (Eds.), Language testing reconsidered (pp. 131–137). Ottawa: University of Ottawa Press.
McNamara, Tim, & Roever, Carsten. (2006). Language testing: The social dimension. Malden, MA & Oxford, UK: Blackwell Publishing.
McNamara, Tim, Van Den Hazelkamp, Carolien, & Verrips, Maaike. (2014). LADO as a language test: Issues of validity. Applied
Linguistics, 37(2), 262–283.
Messick, Samuel. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027.
Messick, Samuel. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan.
Saville, Nick. (2013). Language testing and immigration. In C. A. Chapelle (Ed.), The encyclopedia of applied linguistics. New
York: John Wiley and Sons.
Searle, John R. (1969). Speech acts: An essay in the philosophy of language. Cambridge: Cambridge University Press.
Shohamy, Elana. (1997). Testing methods, testing consequences: Are they ethical? Are they fair? Language Testing, 14(3), 340–349.
Shohamy, Elana. (2001). The power of tests: A critical perspective of the uses of language tests. London: Longman.
Shohamy, Elana. (2007). Tests as power tools: Looking back, looking forward. In J. Fox, M. Wesche, D. Bayliss, L. Cheng,
C. Turner & C. Doe (Eds.), Language testing reconsidered (pp. 141–152). Ottawa: University of Ottawa Press.
Shohamy, Elana. (2011). Assessing multilingual competencies: Adopting construct valid assessment policies. The Modern
Language Journal, 95(3), 418–429. doi:10.1111/j.1540-4781.2011.01210.x
Spolsky, Bernard. (1967). Do they know enough English? In D. Wigglesworth (Ed.), ATESL selected conference papers. Wash-
ington, DC: NAFSA Studies and Papers, English Language Series.
Spolsky, Bernard. (1977). Language testing: Art or science. In G. Nickel (Ed.), Proceedings of the fourth international congress
of applied linguistics (Vol. 3, pp. 7–28). Stuttgart: Hochschulverlag.
Spolsky, Bernard. (1981). Some ethical questions about language testing. In C. Klein-Braley & D. K. Stevenson (Eds.), Practice
and problems in language testing (pp. 5–30). Frankfurt am Main: Verlag Peter D. Lang.
Spolsky, Bernard. (1984). The uses of language tests: An ethical envoi. In C. Rivera (Ed.), Placement procedures in bilingual
education: Education and policy issues (pp. 3–7). Clevedon, UK: Multilingual Matters.
Spolsky, Bernard. (1992). The gentle art of diagnostic testing revisited. In E. Shohamy & R. Walton (Eds.), Language assessment
for feedback and other strategies (pp. 29–41). Washington, DC: National Foreign Language Center.

Spolsky, Bernard. (1995). Measured words: The development of objective language testing. Oxford: Oxford University Press.
Spolsky, Bernard. (1997). The ethics of gatekeeping tests: What have we learnt in a hundred years? Language Testing, 14(3),
242–247.
Spolsky, Bernard. (2011). Ferguson and Fishman: Sociolinguistics and the sociology of language. In R. Wodak, B. Johnstone &
P. Kerswill (Eds.), The Sage handbook of sociolinguistics (pp. 11–23). London, UK: Sage Publications Ltd.
Spolsky, Bernard. (2013). Language testing and language management. In G. Fulcher & F. Davidson (Eds.), The Routledge
handbook of language testing (pp. 495–505). London: Routledge.
Spolsky, Bernard. (2014). The influence of ethics in language assessment. In A. J. Kunnan (Ed.), The companion to language
assessment (Vol. 2, pp. 1571–1585). Malden and Oxford: John Wiley & Sons.
Stansfield, Charles W. (1993). Ethics, standards, and professionalism in language testing. Issues in Applied Linguistics, 4(2),
189–206.
Taylor, Lynda. (2009). Developing assessment literacy. Annual Review of Applied Linguistics, 29, 21–36.
Thomas-Sunesson, Danielle, Hakuta, Kenji, & Bialystok, Ellen. (2016). Degree of bilingualism modifies executive control in
Hispanic children in the USA. International Journal of Bilingual Education and Bilingualism, 20, 1–10.
Thrasher, Randy. (2004). The role of a language testing code of ethics in the establishment of a code of practice. Language Assess-
ment Quarterly, 1(2&3), 151–160.
Van Avermaet, Piet. (2009). Fortress Europe: Language policy regimes for immigration and citizenship. In G. Hogan-Brun,
C. Mar-Molinero & P. Stevenson (Eds.), Discourses on language and integration: Critical perspectives on language testing
regimes in Europe (pp. 15–43). Amsterdam: John Benjamins.
Xi, Xiaoming. (2010). How do we go about investigating test fairness? Language Testing, 27(2), 147–170.
Young, Richard F. (2012). Social dimensions of language testing. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook
of language testing (pp. 178–193). London: Routledge.

33
The Practice of Language Assessment
Glenn Fulcher

Introduction
Language teachers and applied linguists are constantly engaged with language testing and assess-
ment. Whether informal or formal, assessment procedures are designed to collect evidence for deci-
sion making. In low-stakes classroom contexts, the evidence is used to provide formative feedback
to learners (Rea-Dickins, 2006), scaffold learning (Lantolf, 2009), evaluate the quality and pace of
learning, inform pedagogy, and feed into curriculum design (Shepard, 2000; Hill, 2012). In high-
stakes contexts, identifiable ‘test events’ produce scores that are used to make life-changing decisions
about the future of learners (Kunnan, 2013). In research, tests are used to generate data that are used
to address questions regarding second language acquisition (Chapelle, 1994).
When we use tests or assessments in each of these contexts, we make a number of generic assump-
tions. These are frequently subconscious, although they do occasionally become the focus of overt
scrutiny. The four headline assumptions are:

1. The tasks undertaken are relevant to the decision to be made, or the research question to be
addressed.
2. The responses elicited by the tasks are useful in making the decision or addressing the question.
3. The ‘score’ adequately summarizes the responses.
4. ‘Scores’ can be used as necessary but not always sufficient evidence for decisions.
(Fulcher, 2015, p. 2)

The word ‘score’ is placed within quotation marks because it is used to mean any reductive ren-
dering of a complex performance or set of responses into a number, letter, or verbal description
of what we infer learners can do as a result of our observation of their performance. The ver-
bal summaries may be informal teacher comments or notes, or be phrased in terms of ‘can do
statements’ or performance-level descriptors, as commonly found in standards or framework
documents (Fulcher, 2016).
The assumptions we make in all contexts lead to the claim that testing and assessment practices
need to be deliberate. They require careful thought, planning, and documenting. This is true of even
informal classroom assessment for learning, where the primary purpose is to provide formative feed-
back to learners that helps them to see the ‘gap’ between what they can currently do and their next
goal. Thus, Black et al. (2003) suggest that planning in assessment for learning should focus on the
design of tasks that involve interaction between learners and feedback on performance from both
peers and teachers that is timely and informative. In addition, classroom management should pro-
vide time for learners to reflect on feedback and respond to it. Being deliberate in planning requires
a level of assessment literacy that includes the principles and concepts that guide practice, as well as
the practical skills in assessment construction (Fulcher, 2012a). In this chapter we focus on the latter,
as generic practices can improve the quality of testing and assessment among all professionals who
need to use language tests (Coniam, 2009). For the sake of brevity, from now on I will use the term
‘test’ to refer to any testing or assessment practice, formal or informal, as a superordinate.

The Language Test Design Cycle


The extent to which we can take the assumptions listed in the previous section for granted depends
upon the deliberate care with which we construct a test or assessment instrument. We can think of
this as the test design cycle, as illustrated in Figure 33.1.
Figure 33.1 The Test Design Cycle (Fulcher, 2010, p. 94)

Mislevy and Riconscente (2005) and Fulcher (2010; 2013) liken the process of building a test by following a design cycle to the work of an architect—a metaphor that has been widely adopted elsewhere (Green, 2014, pp. 26–29). The first thing an architect does is consider the purpose of the building. Once we know the purpose, it is possible to consider its likely uses. From there it is possible to start making other decisions, such as how large the building should be, what materials might be used in its construction, and the internal layout of the structure. This all has to be done within the
resource and financial constraints that might apply. It must also meet any regulatory requirements of
the industry or state. Planning and building a test is no different. In what follows, I break down the
process represented in the test design cycle into four distinct types of activity. While these activities
are conceptually distinct, they are not all temporally separate, as they overlap from the very start of
the design process.

Activity 1: From Test Purpose to Test Tasks


The first questions we must ask ourselves are “What is this test/assessment for?” and “What decisions
are to be made on the basis of the evidence generated?” As Ingram (1968, p. 70) reminds us: “All tests
are for a purpose. A test that is made up without a clear idea of what it is for, is no good”. In other
words, the last step in the test cycle determines the initial statement of purpose.
Second, we need to know what the test criterion is. That is, the ultimate ‘real-world’ activities that
the test takers are expected to undertake. For example, a common criterion within an educational
setting for English language learners (ELLs) is learning about mathematics in a second language set-
ting (Chapin et al., 2009). In this case, the criterion would be the domain of mathematical English
required to gain access to school academic content.
Third, we may need to state what knowledge, skills, or abilities (KSAs) a learner would need
to achieve the target goal. In language testing, KSAs are frequently referred to as constructs: the
abstract abilities that underlie successful performance. Normally labelled with abstract nouns,
these might include specific levels of fluency in language use or performance with certain levels
of accuracy. When constructs are specified in standards documents, operational illustrations are
frequently provided to give indications of the task types that might be used to evaluate constructs.
Thus, the Common Core Standards for ELLs in the United States list one construct as the “inte-
gration of knowledge and ideas”. This is an abstract construct. One suggested task type at a very
general descriptive level is:

Translate quantitative or technical information expressed in words in a text into visual infor-
mation (e.g., a table or chart) and translate information expressed visually or mathematically
(e.g., in an equation) into words.
(Common Core State Standards, 2010, p. 62)

The example is an operationalization of the construct. It represents the claim that if learners are able
to translate information between text and tables/charts/equations, they are providing evidence that
they can “integrate knowledge and ideas”.


We can see from this example how the statement of purpose, the criterion, and the construct
definition lead directly into suggested operationalizations: the design of tasks that are directly
related to the test purpose. The rationale for the use of tasks becomes part of a validation nar-
rative that underpins why we think the first two assumptions we make about our test are war-
ranted. More technically, Chapelle (2008, p. 320) refers to this as a “validity argument from
test design”. Validity is the principal quality of a test which demonstrates that it meets industry
standards for safe use.
Consider the following assessment task School Trip (Bowland Maths, 2010), which illustrates the
first part of the test design cycle.

Mr. Richards, a teacher at Bosworth School, plans to take 30 pupils on a school trip. Here are
the places they could visit.

Figure 33.2 Options for a School Visit

The class votes on which place to visit. Here are the results:
Figure 33.3 Results of the Class Vote

Taking first and second choices into account, where do you think Mr. Richards should take
them? Explain how you decided.

In this task, learners are presented with data about options for a school trip in numerical and
tabular format. The question requires the learners to manipulate the data in order to arrive at a judg-
ment about which trip is most appropriate for the class. This judgment and the reasoning by which
it is made have to be explained in writing. The writing constitutes the evidence upon which we infer
whether (and to what degree) the respondent is able to “integrate knowledge and ideas” within this
domain. The testing process is valid to the extent that our inferences are sound and lead to correct
judgments and decisions. This is the core meaning of validity in testing.
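To make the intended data manipulation concrete, here is a minimal sketch of one way the tally might be computed. Because the options table and the vote results in Figures 33.2 and 33.3 are not reproduced here, the destinations, costs, and vote counts below are hypothetical placeholders, and the weighting of two points for a first choice and one for a second is only one plausible reading of “taking first and second choices into account”, not the scheme prescribed by Bowland Maths.

```python
# A hypothetical sketch of the reasoning the School Trip task invites; all figures
# below are invented placeholders, not the data from the published task.

PUPILS = 30

# destination -> (cost per pupil, first-choice votes, second-choice votes)
options = {
    "Theme park": (20.00, 12, 6),
    "Science museum": (8.50, 10, 14),
    "Castle": (6.00, 8, 10),
}

def weighted_score(first_votes: int, second_votes: int) -> int:
    """Weight first choices more heavily than second choices (assumed 2:1)."""
    return 2 * first_votes + 1 * second_votes

for place, (cost_per_pupil, first, second) in options.items():
    total_cost = cost_per_pupil * PUPILS
    print(f"{place}: weighted score = {weighted_score(first, second)}, "
          f"total cost for {PUPILS} pupils = {total_cost:.2f}")
```

A level 4 response would not only name the winning option but also make this kind of weighting and costing explicit in clear prose, which is precisely the written reasoning that the rubric discussed later in this chapter tries to grade.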

The requirement to be explicit about test purpose cannot be stressed enough.


Consider the following two statements of purpose:

1. Upon entry to the country, the immediate needs of migrant workers with limited language
skills have been identified as (a) searching for suitable manual or semi-skilled work and
(b) accessing local social services and healthcare. In order to assess their ability to carry out the
necessary reading tasks to satisfy these needs, a reading test is required to screen new arrivals
with the purpose of planning and providing suitable language programmes.
2. Students in the college business stream are required to attend a course on negotiating contracts
and agreements for international companies. All communication should be interculturally
sensitive and avoid loss of face but achieve the primary goals of the business requirements.
Successful students (on this and content courses) will graduate with a certificate in intercul-
tural business management.

These statements are short and succinct, but they give a clear sense of the purpose of the assessment.
The statements are also indicative of the kind of content that we expect to find in a test, because they
specify the test criterion. Real-world activities may include reading job postings, reading social ser-
vice information leaflets (purpose 1), or taking part in business meetings and negotiating contracts/
agreements (purpose 2). The language that is required for each purpose is likely to be very different,
and it is unlikely that a validation narrative created to justify the tasks created for one purpose could
be easily translated into another.

Activity 2: Specification Writing and Iteration


We will return to the architecture metaphor. The first thing that an architect works on is a blueprint of
the building. In testing and assessment, the blueprint is called a test specification (Popham, 1978). The
test specification (often shortened to “test spec”) is the design document for a test and consists of an over-
all plan, together with specifications for each of the task types that may appear in the test (Davidson &
Lynch, 2002). We can think of this as the ‘structure’ and the ‘stuff’ that we are going to put into the space
created by the structure. Writing an explicit test spec forces the deliberateness of the design and develop-
ment process. Furthermore, without test and item specs, it is impossible to produce other forms of a test.
In fact, “the test” is actually the test spec—the blueprint. From the spec it is possible to create a number
of forms. The forms are generated by the same spec, and so are supposed to be parallel in the sense that
they test the same things, to the same level of difficulty, but have different content. This is often necessary
when test takers are taking a test at different times (for security purposes), or when a test is being given
Copyright © 2016. Taylor & Francis Group. All rights reserved.

to assess learning in pre- and post-test research settings. A sample specification for example 1 (migrant
workers) is provided in Fulcher (2010, pp. 139–146) as an example of how this might be done.
Apart from the task specs, the specifications also contain information that pertains to the overall
“look and feel” of the test—what we have referred to as the structure, and also the façade. Following
Mislevy et al. (2003), the other parts of a specification should include the following:
Evidence specification: The way in which the test takers are expected to respond to the tasks. This
may be closed response (such as multiple choice) or open response (such as writing, in the School Trip
task). It also includes information on how the response is to be scored and what the score means. This
part of the spec explicitly addresses headline assumptions 3 and 4. With reference to the construct
“Integration of Knowledge and Ideas” for use with the task School Trip, the following rubric for “clar-
ity and clearness of description” has been suggested:

Level 1: Describes the decision or calculations made, but this is incomplete and/or contains errors
Level 2: Describes the decision made and cost of the trip but with some errors
Level 3: Describes their decision-making process and methods but lacks clarity
Level 4: Clearly describes their decision-making process and their methods

There are problems with producing evidence models. The first is deciding how many levels (scores)
can be identified by human raters. The second is establishing the criteria that make one level differ-
ent from another. The third is putting these criteria into verbal descriptors that make sense to judges
and others who may see and use the scores. In the example above, the only real distinction between
levels 3 and 4 is the movement from “lacks clarity” in description to “clearly” describing. This begs
the question of what constitutes “clarity”. This is another construct that requires operationalization.
In practice, benchmark samples are frequently used to solve problems caused by vagueness and
imprecision in rubrics. These are selected from responses to tasks and used to illustrate typical per-
formances at each level. Judges are then expected to match future performances to a given level by
comparing it with the benchmark sample. This common strategy, should it work, is placed into the
validation narrative to support the inferences that may be drawn from the scores by linking the
meaning of each number to its benchmark. The following sample response to the School Trip task is
provided as a typical level 4 response because of its “clarity”.
Figure 33.4 Typical Level 4 Response (Bowland Maths, 2010)

Assembly specification: Tests do not normally contain one task or even one task type. The assembly
spec tells the test producer how many items of each type are required to construct a complete form.
The purpose of the spec is to ensure that each form of the test (a) covers the same range of constructs
and domains of interest, (b) has enough tasks to collect reliable evidence regarding performance, and
(c) will take the same amount of time to complete.
Delivery specification: This spec states how the test will be delivered. This may include paper-based
or computer-mediated versions, but it also includes the administrative details that are so important
for test security. For example, how much space is to be left between desks, how many invigilators are
to be used relative to the number of candidates, instructions to test takers, or restrictions on addi-
tional help that might be available (e.g., dictionaries, Internet access, and so on).
Presentation specification: The presentation spec provides the small details about precisely what
a test taker will see when presented with the test tasks. This might include the font size, whether or
not the font size may be altered if the test is delivered by computer, where icons appear, or what col-
ors are used. These are actually very important details, because presentation decisions often impact
upon test takers with disabilities (e.g., visual impairment and font/color choices). Inappropriate use
of icons or navigation may slow down learners who take more time trying to find their way through
the test than responding to the tasks.
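As a purely illustrative sketch (not Mislevy et al.’s models or Fulcher’s published specification format), the structure below shows one way a design team might keep a task spec together with its evidence, assembly, delivery, and presentation details so that each revision of a task type carries its scoring information with it; all field names and example values are invented for this example.

```python
# An illustrative data structure for a test specification; the field names and
# example values are assumptions made for this sketch, not a published schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class TaskSpec:
    name: str                          # e.g. "School Trip"
    construct: str                     # e.g. "Integration of knowledge and ideas"
    response_format: str               # "open" or "closed"
    rubric: Dict[int, str]             # evidence spec: score level -> descriptor
    benchmarks: List[str] = field(default_factory=list)  # sample responses per level

@dataclass
class TestSpec:
    purpose: str                       # statement of purpose and criterion domain
    task_specs: List[TaskSpec]
    assembly: Dict[str, int]           # task type -> number of items per form
    delivery: Dict[str, str]           # administration and security details
    presentation: Dict[str, str]       # font, colours, navigation, accommodations

# Example: a single open-response task type assembled twice into each form.
school_trip = TaskSpec(
    name="School Trip",
    construct="Integration of knowledge and ideas",
    response_format="open",
    rubric={1: "describes the decision but incomplete and/or with errors",
            4: "clearly describes the decision-making process and methods"},
)
spec = TestSpec(
    purpose="Assess ELLs' ability to integrate knowledge and ideas in mathematical English",
    task_specs=[school_trip],
    assembly={"School Trip": 2},
    delivery={"mode": "paper", "invigilation": "1 invigilator per 20 candidates"},
    presentation={"font": "12 pt, adjustable on screen"},
)
```

Because every form is generated from the same spec, holding the rubric and benchmark samples inside the task spec is one way of making sure that parallel forms are scored against the same evidence model.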
The process of spec writing begins as soon as the test purpose is being established. The purpose is
part of the specification, and so is the description of the test criterion, domains of interest, and con-
structs. These are all part of the general description of the test rationale. Each task type must also have
a spec, together with its associated evidence spec. These specs go through iterations and are revised
each time there is a change to a task type as the test designers refine their understanding of constructs
and how best to get useful evidence through test procedures. This happens most noticeably during
the process of prototyping, to which we turn next.

Activity 3: Prototyping and Piloting


Prototyping is the process of “trying out” prototype tasks to see if they work. Prototyping typically
involves asking small numbers of test takers (typically around five) to try out new tasks. As they are
doing the tasks, or immediately upon completion, they are asked to verbalize how they arrived at the
response they did and to describe any problems they faced in doing the task. Prototyping generates
two types of data for analysis: the responses to the task and the verbalization of how the participants
responded to the task. The data is studied to discover primarily whether the way participants respond to the task is relevant to the construct that the designer intended to assess and whether the item type and format are of suitable difficulty for the intended population (Nissan & Schedl, 2012, p. 283). Surprisingly,
most problems with tasks are discovered with no more than two or three iterations of this process,
with changes to the task (and the specification) at each iteration to remove the identified problems.
In contexts where learners with disabilities are expected to undertake tasks, it is essential that accom-
modations to mitigate the effects of a disability are designed and prototyped simultaneously (Abedi,
2012). Typical accommodations might include the provision of additional time, audio, amanuensis,
braille or large-print alternatives, screen magnification, or single-person test administration.
The fact that prototyping can be undertaken with so few participants means that it is feasible not
only for formal tests but also as a preliminary for the use of assessment tasks in schools. It should also
be stressed that prototyping is fundamental research. The test designer or teacher is asking a critical
validity question about the use of tasks and the meaning of the responses: do these tasks genuinely
generate useful construct-relevant evidence that allows successful inferencing and decision making?
When prototyping is undertaken in schools, in a context where teachers collaboratively write and
refine task specifications, this research enables the creation and evolution of a common understand-
ing of learning goals and the practical meaning of constructs.
Piloting is only undertaken when it is necessary to generate statistical data for a test by giving it to
a much larger sample of participants who are representative of the test-taking population. For this
an assembly model is required, and the pilot test is produced according to the delivery and presenta-
tion specification. The designers are asking the question: does this all work? (Kenyon & MacGregor,
2012). The data allows researchers to identify any task types that are too difficult or too easy and to
check that the timing of sections has been adequately estimated. Timing studies ensure that the test
administration is not “speeded”, as it is well-known that if learners do not have sufficient time to
complete a test in the time allocated, then they begin to guess, and guessing introduces score varia-
tion that is not related to the constructs of interest.
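The kind of item-level screening implied here can be sketched in a few lines of code; the response matrix, the facility band, and the discrimination cut-off below are invented for illustration rather than taken from any operational programme.

    # Illustrative pilot item analysis: flag items that look too easy, too hard,
    # or weakly discriminating. Data and thresholds are hypothetical.
    import numpy as np

    def item_analysis(responses, too_easy=0.90, too_hard=0.20, low_disc=0.20):
        """responses: persons x items matrix of 0/1 pilot scores."""
        responses = np.asarray(responses, dtype=float)
        total = responses.sum(axis=1)
        flagged = []
        for i in range(responses.shape[1]):
            facility = responses[:, i].mean()                # proportion correct
            rest = total - responses[:, i]                   # rest score (item removed)
            discrimination = np.corrcoef(responses[:, i], rest)[0, 1]
            if facility > too_easy or facility < too_hard or discrimination < low_disc:
                flagged.append((i, round(facility, 2), round(discrimination, 2)))
        return flagged

    # Six hypothetical pilot test takers, four items.
    pilot = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1],
             [0, 0, 0, 0], [1, 1, 0, 1], [1, 0, 1, 1]]
    print(item_analysis(pilot))   # (item index, facility, discrimination) for flagged items

Operational programmes would base such decisions on far larger samples and usually on item response theory rather than raw indices, but the logic of the check is the same.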

Activity 4: The Field Test and the Go/No-Go Decision


In high-stakes testing, when all iterations of the design process are complete, at least two full forms
of the test are given to a large sample of participants drawn from the test-taking population. At this
stage, the norms for the test are established. The administrative procedures and security features
are tested to ensure that all works well. At the end of the field test, the test developer can make an
informed “go/no-go” decision, which determines whether the test can now be used in live assessment
for its intended purpose.
The test roll-out is not the end of the process. Constant monitoring of test results is essential,
because the test population may change over time. As applied linguistic research improves the def-
inition of constructs and our understanding of communication, it becomes possible to generate
new operationalizations of constructs, which leads to new task types. Over time the test undergoes
upgrade retrofits as some task types are retired and new ones are introduced to better represent the
construct (Fulcher & Davidson, 2009).

Trends and Controversies


In recent years, the most significant trend in high-stakes testing has been the use of tests for second-
ary purposes for which they were not originally designed, sometimes referred to as “repurposing”
(Wendler & Powers, 2009).
I have claimed that without statements of purpose, none of our four headline assumptions regarding
score meaning make any sense at all. If I were to take a test designed to assess the readiness of well-edu-
cated high school students for university entry and use it for any of the three purposes used as examples
above, all four of my assumptions cease to be meaningful. This has been a given in test design for 100
years. As Rulon (1946, pp. 290–291) put it, “a test cannot be labeled as valid or not valid except for some
purpose.” Furthermore, the primary requirement of validation has been enshrined in testing standards
such as AERA Standard 1.1: “A rationale should be presented for each recommended interpretation and
use of test scores, together with a comprehensive summary of the evidence and theory bearing on the
intended use or interpretation” (AERA et al., 1999, p. 17). Within this view, any repurposing requires
the production of evidence and rationale to support the new use of the test, as made clear by Standard
1.4: “If a test is used in a way that has not been validated, it is incumbent on the user to justify the new
use, collecting new evidence if necessary” (AERA et al., 1999, p. 18). To the extent that there is a mis-
match between the old and new purpose, the test providers are required to undertake a change retrofit
and develop a new validation narrative/argument to support the new use (Fulcher & Davidson, 2009).
However, the trend to use language testing in policy contexts has led to uncontrolled repurposing,
fueled by the commercialization of education and the growth of the migration industry. The use of
language tests to implement immigration policies has now become commonplace around the world
(Kunnan, 2012), but with very few exceptions (such as the Occupational English Test for health
professionals, see McNamara, 1996), tests for these new purposes have not been developed. Most
governments have gone for “off the peg” solutions. For example, Read (2001, pp. 193–194) makes it
clear that while the International English Language Testing System (IELTS) test “should have been
properly re-validated to establish its appropriateness for the new purpose”, its selection for migra-
tion purposes in New Zealand was made primarily on the basis of its lack of cost to the government, compared with designing a local test for this particular purpose.
In recent years, the use of the Common European Framework of Reference (CEFR) to set “standards” for
policy implementation has compounded the problem. The policy makers decide which CEFR level is
appropriate for a particular decision (Van Avermaet & Rocco, 2013, p. 15), and through the process of
‘mapping’ an existing test to CEFR levels, any test can claim to be relevant to policy implementation.
Similarly, it is possible to state a “level” that is appropriate for entry to a health profession and use a
test of academic English to make that decision based upon a standard-setting exercise (O’Neil et al.,
2007), or to use a test of general business English to make judgments about the readiness of military
personnel for operational deployment (Tannenbaum & Baron, 2010).
This has set the scene for the most profound controversy in modern language testing practice.
The IELTS was “design[ed] specifically to measure the English language skills of candidates intend-
ing to study in academic or training contexts in English-speaking countries” (Ingram, 2004, p. 18),
not to migrate and work in specific domains. The Test of English for International Communication
(TOEIC) was designed as a test of general workplace English (Schmitt, 2005), not to make decisions
about communicative readiness in the armed forces. For Ahern (2009), it is the commoditization of
testing and education. Shohamy and McNamara (2009) see it as the politicization of test use. For
Fulcher (2016), it is the subversion of validation theory, in which “validation” is reinterpreted as
“recognition” by political institutions. Research funded by the test owners, on the other hand, merely
recommends sensitivity to the needs of policy makers in order to maintain market share (Merrifield,
2012).
The issues at stake could not be greater for the practice of test design. The test design cycle
described in this chapter rests upon the assumption that sound inferences are built into tests at the
design and development stage: validity through design, expressed in a validation narrative that tells the story of the decisions made, illustrated with the research and theory that led to design decisions.
The process is guided by a vision of the intended use and consequences of the test. Fulcher and
Davidson (2007, p. 51) call this “effect-driven testing”, in which “the ultimate test design decisions are
driven by the impacts that the test will have on stakeholders.” This is the meaning of the unbroken
cycle: test purpose and design provide the meaning to scores and justification to decision making.
In a great deal of modern global practice, by contrast, meaning is provided post hoc once tests are co-opted to serve the purposes of new social policy.

Perspectives

Validation Processes
The perspective adopted in this chapter is derived from the tradition that limits score meaning and
inference to the primary purpose for which a testing procedure is designed. This requires a statement
of purpose and the construction of a validation narrative around the design cycle. The narrative
may then be used in a validation argument (Kane, 2013) to justify the claimed meaning of the score.
For some tests, like the Test of English as a Foreign Language (TOEFL iBT), this is set out explicitly
(Chapelle, 2008) and contains two components.
The first component is the confirmatory research. That is, all the research focuses on investigating
and presenting the elements of the validation narrative that support the intended score meaning.
During the early stages of the test design cycle, the function of research is to support design deci-
sions and confirm those decisions as being useful for the test purpose. The second component is
non-confirmatory research. Usually carried out during piloting and field testing and later during live
testing, the function is to question the weakest parts of the validity argument (Haertel, 1999). Weak-
nesses lie in any claim that the score is affected by any factor that is not relevant to the purpose of the
test and its intended construct of assessment. This is summarized using the term construct-irrelevant
variance (Messick, 1989). For example, if it is suspected that the content of the test has little relevance
to the domain in which the test takers are expected to use language, a comparative study of test tasks,
topics, language use, and processes with those of the criterion could be undertaken. If it is found that
the test content as set out in the specifications is not sufficiently representative of the criterion, then
there are grounds for questioning the validity of the test.
This perspective reinforces the view that test design is a research activity, rather than a mundane
task that practitioners are expected to undertake with little time or resources.

Substantive Validation
Test designers usually assume that test takers arrive at a correct answer, or perform in particular ways,
for the reasons the designer had in mind when a task was created. We often take it for granted that we
have the ability to “see into the mind” of the test taker. The rationale for prototyping is to check with
real test takers whether or not this is the case. The fundamental question of substantive validation is:
do the test takers respond to the task in the way that the task designer thinks they will respond? The
methodology of choice is concurrent or retrospective protocol analysis, or PA for short (Bowles, 2010).
The purpose of these studies during the design process is to build in non-confirmatory research
that challenges the assumptions of the task designers. In the published literature, one of the most
informative PA studies is that of Buck (1991), in which he presented a series of listening texts and
tasks to learners to discover how they responded. When analyzing the verbal protocols, Buck discov-
ered that there were radically different interpretations of questions depending upon cultural back-
ground and personal knowledge. Construct-irrelevant factors could lead to “incorrect” responses for
perfectly good reasons. One illustrative example will suffice.

Text Extract
My friend Susan was living in West Africa and while she was living there she had a problem with
burglars. For a period of about two months every Sunday night someone was breaking into her
house through her bedroom window and was stealing something very small from her house
and she tried many things to prevent this from happening like putting bars over the windows
and hiring a guard to watch the window. And still, every Sunday night, somehow someone came
through the window and stole something.

Question
What did Susan do about the problem?

One student responded: “she employed a guardman and jammed the window closed with a bar”
(Buck, 1991, p. 74). The second part of this response would have been considered “incorrect”, but
the protocol revealed that the Japanese participant was only familiar with sliding windows and had
imagined that Susan had used something to stop them from opening. If this problem is present with
young adults, it is exacerbated when attempting to assess young learners, as in the following example:
Which is the odd one out?

(a) Eggs
(b) Flowers
(c) Vegetables
(d) Trees

Pupil: Trees.
Teacher: Why do you say that?
Pupil: They’re the only ones I can’t put in a fridge.

Social Responsibility and Accountability


The statistical evidence in piloting also allows researchers to check that no tasks discriminate against
an identifiable (or protected) sub-group of the population. All test designers and users should be
aware of the fact that if scores are sensitive to age, gender, disability, race, religion, and a number of
other protected characteristics, litigation is a potential consequence of the use of non-valid scores
(Fulcher, 2014). The practical lesson is that as the stakes become higher, the effort that must be put into deliberate design, research, and validation becomes correspondingly greater.

Conclusion and Future Directions


Much of our current practice draws directly on innovations that took place during World War I
(Fulcher, 2012b). These practices have evolved slowly over the 20th century and have been docu-
mented in guidelines and textbooks for each generation. These practices are reflected in the test
design cycle discussed in this chapter. Two issues are of particular importance at the present time and
will occupy researchers for some time to come.
The first is the ecological sensitivity of a test or assessment process to the communities that are
impacted by its use. One question we must ask is whether tests are better developed locally by edu-
cational practitioners or provided by large testing agencies that may not be aware of local contexts
within which their tests are used. Part of the question is political and involves asking whether local
or external mandates best serve the assessment needs of particular communities (Fulcher, 2012a, pp.
2–3). Of course, stakeholder views may always be sought, and local involvement in large-scale testing
studies of washback and impact may influence future revisions of a test (e.g. Green, 2007). This is
a matter of research, and perhaps consultation, but not democratic involvement (Shohamy, 2001).
Next is an epistemic question that requires research. To what extent might locally designed tests
prove more or less useful for institutions, regions, or countries, compared with international tests, for
their own particular decision-making contexts? This key validity question (Messick, 1989, pp. 14–15)
must be investigated for each context of use. Generalized models of investigation may eventually emerge, but these do not as yet exist.
The second under-researched area is the role of test design in the continuing professional develop-
ment (CPD) of teachers. Davidson and Lynch (2002, pp. 98–120) were the first to consider the group
dynamics of teachers in test design contexts, looking particularly at the decision-making processes
involved in constructing test specifications. They showed that spec development workshops gener-
ated an environment in which individuals negotiated their understanding of constructs and learn-
ing objectives in order to converge on a common understanding. Within language programmes at particular institutions, the professional development benefits of local test design may be immense, from agreeing upon constructs, to designing tasks that can be shared for teaching and assessment, to constructing common assessment criteria. One example of how such collaboration fosters pro-
fessional development is provided by Fulcher (2010, pp. 159–171). The transcript of a group discus-
sion on the value of a particular task shows how a common understanding of its suitability is arrived
at. The group actually decides that the task, which was at first considered innovative, in fact does not
appropriately assess the listening construct in which they are interested. The task is discarded. The
field does not currently have sufficient studies of the dynamics of test design, how designers arrive at
decisions, and how the process supports improved pedagogy.

Practical test design requires human decisions informed by empirical research and theoretical
understanding. The narrative of the decisions backed up by the research and rationales constitutes a
validity argument by design. The narrative makes clear the claims we wish to make for score mean-
ing and the use to which scores are put. The process of test design is one of discovery. It is a research
enterprise, but one that requires group cohesion and interaction which may, in and of itself, be
extremely valuable in professional development and team building. The end product is ideally a test
that is sensitive to the constructs of interest, relevant to the learners and the domains in which they
will use language, and useful for decision making.

Acknowledgements
I would like to thank Dr. Daniel Pead of the University of Nottingham and Bowland Maths for per-
mission to reproduce the School Trip task.

References
Abedi, J. (2012). Validity issues in designing accommodations for English language learners. In G. Fulcher & F. Davidson
(Eds.), The Routledge Handbook of Language Testing (pp. 48–61). London and New York: Routledge.
Ahern, S. (2009). ‘Like cars or breakfast cereal’: IELTS and the trade in education and immigration. TESOL in Context 19(1),
39–51.
American Educational Research Association (AERA), American Psychological Association (APA) and National Council on
Measurement in Education (NCME). (1999). Standards for Educational and Psychological Testing. Washington, DC: AERA.
Black, P., Harrison, C., Lee, C., Marshall, B. and Wiliam, D. (2003). Assessment for Learning: Putting It into Practice. Bucking-
ham, UK: Open University Press.
Bowland Maths. (2010). School Trip. UK: Bowland Maths. Available online: http://www.bowlandmaths.org.uk/index.html
Bowles, M. A. (2010). The Think-aloud Controversy in Second Language Research. London and New York: Routledge.
Buck, G. (1991). The testing of listening comprehension: An introspective study. Language Testing 8(1), 67–91.
Chapelle, C. A. (1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research 10(2), 157–187.
Chapelle, C. A. (2008). The TOEFL validity argument. In C. A. Chapelle, M. K. Enright & J. M. Jamieson (Eds.), Building a
Validity Argument for the Test of English as a Foreign Language (pp. 319–352). London, England: Routledge.
Chapin, S. H., O’Connor, C. and Anderson, N. C. (2009). Classroom Discussions: Using Math Talk to Help Students Learn,
Grades K-6. Sausalito, CA: Math Solutions Publications.
Common Core State Standards for English Language Arts. (2010). Washington, DC: US Government. Available online: http://
www.corestandards.org/ELA-Literacy/
Coniam, D. (2009). Investigating the quality of teacher-produced tests for EFL students and the effects of training in test
development principles and practices on improving test quality. System 37(2), 226–242.
Davidson, F. and Lynch, B. K. (2002). Testcraft: A Teacher’s Guide to Writing and Using Language Test Specifications. New Haven
and London: Yale University Press.
Fulcher, G. (2010). Practical Language Testing. London: Hodder Education.
Fulcher, G. (2012a). Assessment literacy for the language classroom. Language Assessment Quarterly 9(2), 113–132.
Fulcher, G. (2012b). Scoring performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge Handbook of Language Test-
ing (pp. 378–392). London and New York: Routledge.
Fulcher, G. (2013). Test design and retrofit. In C. A. Chapelle (Ed.), The Encyclopedia of Applied Linguistics (pp. 5809–5817).
Malden, MA: Wiley Blackwell.
Fulcher, G. (2014). Language testing in the dock. In A. J. Kunnan (Ed.), The Companion to Language Testing (pp. 1553–1570).
London: Wiley-Blackwell.
Fulcher, G. (2015). Re-examining Language Testing: A Philosophical and Social Inquiry. London and New York: Routledge.
Fulcher, G. (2016). Standards and frameworks. In J. Banerjee & D. Tsagari (Eds.), Handbook of Second Language Assessment
(pp. 29–44). Berlin: De Gruyter.
Fulcher, G. and Davidson, F. (2007). Language Testing and Assessment: An Advanced Resource Book. London and New York:
Routledge.
Fulcher, G. and Davidson, F. (2009). Test architecture, test retrofit. Language Testing 26(1), 123–144.
Green, A. (2007). IELTS Washback in Context: Preparation for Academic Writing in Higher Education. Cambridge: Cambridge
University Press.
Green, A. (2014). Exploring Language Assessment and Testing. London and New York: Routledge.
Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of evidence. Educational Measurement: Issues and
Practice 18(4), 5–9.
Hill, K. (2012). Classroom-Based Assessment in the School Foreign Language Classroom. Bern: Peter Lang.
Ingram, E. (1968). Attainment and diagnostic testing. In A. Davies (Ed.), Language Testing Symposium: A Psycholinguistic
Approach (pp. 70–97). Oxford: Oxford University Press.
Ingram, D. (2004). Towards language. Babel, 39(2), 16–24.
Kane, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement 50(1), 1–73.
Kenyon, D. and MacGregor, D. (2012). Pre-operational testing. In G. Fulcher & F. Davidson (Eds.), The Routledge Handbook
of Language Testing (pp. 295–306). London and New York: Routledge.
Kunnan, A. (2012). Language assessment for immigration and citizenship. In G. Fulcher & F. Davidson (Eds.), The Routledge
Handbook of Language Testing (pp. 162–177). London and New York: Routledge.
Kunnan, A. (2013). High stakes language testing. In C. A. Chapelle (Ed.), The Encyclopedia of Applied Linguistics. Malden, MA:
Wiley Blackwell.
Lantolf, J. P. (2009). Dynamic assessment: The dialectic integration of instruction and assessment. Language Teaching, 42(3),
355–368.
McNamara, T. F. (1996). Measuring Second Language Performance. London: Longman.
Merrifield, G. (2012). The use of IELTS for assessing immigration eligibility in Australia, New Zealand, Canada and the United
Kingdom. IELTS Research Reports 13 (pp. 1–32). Australia: IDP and the British Council.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (pp. 13–103). New York: American Council on
Education/Macmillan.
Mislevy, R. J., Almond, R. G. and Lukas, J. F. (2003). A Brief Introduction to Evidence-centered Design. Research Report RR-03–
16. Princeton, NJ: Educational Testing Service.
Mislevy, R. J. and Riconscente, M. M. (2005). Evidence-Centered Assessment Design: Layers, Structures, and Terminology. Menlo
Park, CA: SRI International.
Nissan, S. and Schedl, M. (2012). Prototyping new item types. In G. Fulcher & F. Davidson (Eds.), The Routledge Handbook of
Language Testing (pp. 281–294). London and New York: Routledge.
O’Neil, T. R., Buckendahl, C. W., Plake, B. S. and Taylor, L. (2007). Recommending a nursing specific passing standard for the
IELTS examination. Language Assessment Quarterly, 4(4), 295–317.
Popham, J. (1978). Criterion-referenced Measurement. Englewood Cliffs, NJ: Prentice-Hall.
Read, J. (2001). The policy context of English testing for immigrants. In C. Elder, A. Brown, E. Grove, K. Hill, N. Iwashita,
T. Lumley, T. McNamara, & K. O’Loughlin (Eds.), Experimenting with Uncertainty: Essays in Honour of Alan Davies
(pp. 191–199). Cambridge: Cambridge University Press.
Rea-Dickins, P. (2006). Currents and eddies in the discourse of assessment: A learning-focused interpretation. International
Journal of Applied Linguistics 16(2), 163–188.
Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review 16, 290–296.
Schmitt, D. (2005). Test of English for International Communication (TOEIC). In S. Stoynoff & C. A. Chapelle (Eds.), ESOL
Tests and Testing (pp. 100–102). Washington, DC: TESOL.
Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher 29(7), 4–14.
Shohamy, E. (2001). Democratic assessment as alternative. Language Testing 18(4), 373–391.
Shohamy, E. and McNamara, T. (2009). Language tests for citizenship, immigration, and asylum. Language Assessment Quar-
terly 6(1), 1–5.
Tannenbaum, R. J. and Baron, P. A. (2010). Mapping TOEIC Test Scores to the STANAG 6001 Language Proficiency Levels.
Research Monograph 10–11. Princeton, NJ: Educational Testing Service.
Van Avermaet, P. and Rocco, L. (2013). Language testing and access. In E. Galaczi & C. J. Weir (Eds.), Exploring Language
Frameworks (pp. 11–44). Cambridge: Cambridge University Press.
Wendler, C. and Powers, D. (2009). What Does It mean to Repurpose a Test? R&D Connections No. 9. Princeton, NJ: Educational
Testing Service. Available online: http://www.ets.org/Media/Research/pdf/RD_Connections9.pdf

34
Large-Scale Language Assessment
Empirical Studies
Antony John Kunnan

Introduction
Large-scale language assessments, as described by Kunnan and Grabowski (2013, p. 305), are used
for multiple purposes in multiple contexts. The multiple purposes include monitoring student prog-
ress, diagnosing student capabilities and weaknesses, selecting test takers for entrance to college and
university programs, admitting applicants to careers and for advancement of careers, and admitting
applicants for immigration and citizenship. The multiple contexts include school, college, university,
workplace, immigration, and citizenship. Such a range of purposes and contexts therefore requires
the development of language assessments at various proficiency levels (from beginning or novice
to advanced or expert level) and content appropriate to a range of test takers (from young learners
to adults). While all this complexity may not be involved in any one assessment, understanding the
specific purpose, context, proficiency level, and age range would be critical in the development of an
assessment and the research that would be required to support the claims of an assessment.
At the school level, these assessments could be used to provide diagnostic information to all
stakeholders (e.g., teachers, students, parents, and school administrators), to offer information
regarding admissibility of test takers to colleges and universities, and to ensure accountability to
stakeholders. For instance, the California Standards Test, which was developed by a private contrac-
tor for the Department of Education in California, was administered to students from grades 3 to
8 to monitor student progress and provide diagnostic feedback to students, parents, and school
administrators until 2014.1 School exit examinations include language examinations along with
science, mathematics, and social studies. In India, the Indian Certificate for School Examinations
taken at the end of grades 10 and 12 include English and many Indian, Asian, and European lan-
guages. In California, the California High School Exit Exam was used as certification of schooling
until 2016. The German Abitur and the French Baccalauréat are used as screening and selection
instruments for college and university entrance. In China, the National Matriculation English Test
serves as the university entrance test of English that makes inferences about test takers’ English
proficiency, along with scores from university entrance examinations in other secondary school
subjects (Cheng & Qi, 2006). Other well-known tests include the Test of English as a Foreign Lan-
guage internet-Based (iBT), the International English Language Testing System (IELTS), and the
Pearson Test of English (PTE). They are administered to international students whose first language
is not English and who are seeking admission to English-medium colleges and universities in
the US, Canada, and the UK. China’s College English Test (CET) is administered to all non-English-
major undergraduates to ensure that these students reach the required English levels specified in the
National College English Teaching Syllabuses.
As varied as their purposes are, large-scale assessments are typically standardized (similar tasks,
administration, scoring, and reporting), generally norm-referenced (NRT) in school, college, and
university contexts (test takers are rank-ordered based on the performance of the test taker cohort),
predominantly selected response type (such as multiple-choice, true-false, fill-in-the-blank type
response formats), with high volume (the test is administered to hundreds or thousands or more),
and high stakes (decision making is based on test scores, and the resulting career or life paths are
not easily reversed). These defining characteristics mean that the validity, reliability, and fairness of
such large-scale assessment are critical. Therefore, rigorous and thorough studies in these areas are
needed on an ongoing basis to ensure test quality. Moreover, because large-scale assessments are used
for making important decisions, it is essential that such assessments lead to beneficial consequences
for all stakeholders.

Standards for Educational Assessments


The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014 and previous
editions; Standards hereafter) have argued that test developers, test publishers, and test score users
of large-scale assessments particularly, in addition to policy makers, need to understand and adhere
to Standards that have been jointly articulated by assessment and measurement experts from three
sponsoring agencies in the US—the American Educational Research Association, the American Psy-
chological Association, and the National Council on Measurement in Education. Over the decades,
the Standards have been extremely influential among assessment agencies in the US, Canada, the
UK, and overseas. In many cases, the Standards have been adopted or adapted every time they have
been published.
The 2014 Standards have three parts: Foundations, Operations, and Testing Applications. In Part 1,
the chapters and associated standards are “Validity,” “Reliability/Precision and Errors in Measure-
ment,” and “Fairness in Testing”; in Part 2, the chapters and associated standards are “Test Design
and Development,” “Scores, Scales, Norms, Score Linking, and Cut Scores,” “Test Administration,
Scoring, Reporting, and Interpretation,” “Supporting Documentation for Tests,” “The Rights and
Responsibilities of Test Takers,” and “The Rights and Responsibilities of Test Users”; and in Part 3, the
chapters and associated standards are “Psychological Testing and Assessment,” “Workplace Testing
and Credentialing,” “Educational Testing and Assessment,” and “Uses of Tests for Program Evalua-
tion, Policy Studies, and Accountability.” Here is a listing of Standards and Clusters associated with
the three foundational chapters.

Validity
Standard 1.0 Clear articulation of each intended test score interpretation for a specified use
should be set forth, and appropriate validity evidence in support of each intended interpretation should be provided.
Cluster 1: Establishing intended uses and interpretations,
Cluster 2: Issues regarding samples and settings used in validation,
Cluster 3: Validity evidence (content-oriented evidence, evidence regarding cognitive processes, evidence regarding internal structure, evidence regarding relationships with conceptually related constructs, evidence regarding relationships with criteria, and evidence based on consequences of tests).


Reliability/Precision and Errors in Measurement


Standard 2.0 Appropriate evidence of reliability/precision should be provided for the interpreta-
tion for each intended score use.
Cluster 1: Specification for replications of the testing procedure,
Cluster 2: Evaluating reliability/precision,
Cluster 3: Reliability/Generalizability coefficients,
Cluster 4: Factors affecting reliability/precision,
Cluster 5: Standard errors of measurement,
Cluster 6: Decision consistency,
Cluster 7: Reliability/Precision of group means,
Cluster 8: Documenting reliability/precision.

Fairness in Testing
Standard 3.0 All steps in the testing process, including test design, validation, development,
administration, and scoring procedures, should be designed in such a manner as to minimize
construct-irrelevant variance and to promote valid score interpretations for the intended uses
for all examinees in the intended population.
Cluster 1: Test design, development, administration, and scoring procedures that minimize bar-
riers to valid score interpretations for the widest possible range of individuals and relevant
subgroups,
Cluster 2: Validity of test score interpretations for intended users for the intended examinee
population,
Cluster 3: Accommodations to remove construct-irrelevant barriers and support valid interpreta-
tions of scores for their intended uses,
Cluster 4: When testing requires the use of an interpreter, the interpreter should follow standard-
ized procedures and, to the extent feasible, be sufficiently fluent in the language and content of
the test and the examinee’s native language and culture to translate the test and related testing
materials and to explain the examinee’s test responses as necessary.

Based on the Standards, what obligations do large-scale assessment developers have to their
stakeholders? The first is to conduct systematic and comprehensive studies on an ongoing basis
on validation, reliability, and fairness; and the second is to assess whether an assessment’s decision
making has beneficial consequences in terms of washback for teaching and learning and to the
community. The next section examines some important studies that have addressed these mat-
ters under the headings (1) validity, (2) reliability, (3) fairness, and (4) decision making and consequences.

Empirical Studies

Validity
There has long been a recognition that rigorous validation processes and practices are needed to
safeguard score interpretation and use. As Kunnan (2013) argued:

necessary systematic validation research studies ought to be conducted by test development agencies to show to the test users (test takers, test score users, and other consumers of test information) that the tests, their score interpretations, decision and related claims can be
defended [and such] comprehensive test evaluations need to be conducted on a regular basis
for the particular purpose for which the tests are being used.
(p. 5)

The concepts of validity and that of validation (the process of addressing validity) have changed
dramatically over the decades: (1) in the 1950s and 1960s validation was conceptualized as a trinitarian
approach of providing evidence of an assessment in terms of content validity, criterion validity, and
construct validity; (2) in the 1990s, Messick (1989) proposed a unitarian approach to construct valid-
ity with criterion validity and consequential validity evidence contributing to construct validity;
(3) in the 2000s, Kane, Mislevy, and Bachman variously proposed an argument-based validity approach.
However, the statistical procedures used for validation studies have changed little: correlations, regression, and exploratory and confirmatory factor analyses have remained the most widely used.
But for systematic and comprehensive validation studies to be carried out, first there needs to be
a practical validation framework to guide test developers and researchers in conducting their work.
Chapelle, Enright, and Jamieson (2008) suggested that such a framework would offer guidance in
terms of how to prioritize and integrate both theoretical and empirical evidence into a coherent
validity argument, and how much and what types of evidence are needed to support such an argu-
ment. Inspired by the development of an argument-based approach to test validation in educational
measurement (Kane, 1992; Kane, Crooks, & Cohen, 1999), major advancements in validation frame-
works in language assessment were made in the 2000s (Bachman, 2005; Bachman & Palmer, 2010).
The principal characteristics of an argument-based approach were summed up by Chapelle and Voss
(2014, p. 1085) as follows:

(1) the interpretative argument that the test developer specifies in order to identify the various
components of meaning that the test score is intended to have and its uses;
(2) the concepts of claims and inferences that are used as the basic building blocks in an interpre-
tative argument; and
(3) the use of the interpretative argument as a frame for gathering validity evidence.

One example of this approach is the argument-based validity research conducted for iBT mainly prior
to the launch of the assessment. The findings of these studies were synthesized into a single book titled
Building a Validity Argument for the Test of English as a Foreign Language, edited by Chapelle et al. (2008).
It describes and illustrates an approach to synthesizing multiple types of evidence into an integrated
validity argument for the proposed interpretation and use of test scores. Chapelle et al. (2008) presented
the validity argument by first stating an interpretive argument containing the following six claims:

(1) that the tasks on the test were appropriate for providing relevant observations of performance
from the examinees on relevant tasks in academic domain;
(2) that the evaluation of examinee’s performance resulted in accurate and relevant summaries (test
score) of the important characteristics of the performance;
(3) that the observed scores were sufficiently consistent to generalize to a universe of expected
scores; the generalization inference is critical for standardized assessments because it needs to
be established that students’ test scores are comparable no matter which test form they take,
where they take the test, or who scores their responses;
(4) that the consistency of the expected scores can be explained by the construct of academic lan-
guage proficiency;
(5) that the construct of academic language ability predicts a target score indicating performance in
the academic context; and
(6) that the meaning of the scores is interpretable by test users, who therefore use it appropriately;
and that the test will have a positive influence on how English is taught. (p. 16)

This was the first illustration of the argument-based validation approach. Though the argument
presented did not include any rebuttals or counter-arguments, the explicitly stated warrants and
assumptions underlying the iBT validity argument made it possible for other researchers to critically
evaluate the argument’s coherence and the quality and sufficiency of its evidence.
A more recent study by Kunnan and Carr (2015), using a more traditional validation approach,
investigated the comparability of two English language proficiency tests—the General English Pro-
ficiency Test—A (GEPT-A) and iBT. The GEPT-A is a standardized assessment used in Taiwan for
assessing a test taker’s English language ability. Data was collected from 184 test takers in both Taiwan
and the United States. The instruments used were item-level participant test performance response
data from the GEPT-A reading section, scores on GEPT-A Writing Task 1, iBT reading, writing,
speaking, and listening scaled scores, and participant responses to a background information survey
and five questions involving their perceptions of the GEPT-A. Two specific analyses were conducted:
First, a content analysis was performed on the passages in the GEPT-A form and on the sample iBT
reading passages published by Educational Testing Service, Princeton, for test preparation purposes.
This analysis included using Coh-Metrix (McNamara, Graesser, McCarthy, & Cai, 2014) to analyze
the cohesion, syntax, and vocabulary used in the reading passages. Furthermore, a task analysis of the
construct coverage, scope, and task formats used in the reading comprehension questions on the
two tests was also conducted. Second, an analysis of the participant responses from the two tests was conducted.
The results of the text analysis showed the reading passages on the two tests are comparable in
many ways but differ in several key regards. The task analysis revealed that the construct coverage,
item scope, and task formats of the two tests are clearly distinct. Analysis of participant responses
indicated that the GEPT-A has good reliability and that reading comprehension items tend to func-
tion quite well. Scores on the GEPT-A and iBT are highly intercorrelated with each other. Exploratory
and confirmatory factor analyses of the test score data indicated that the two tests appeared to both
be measuring reading and writing ability but emphasized different aspects of the reading construct—
that is, the different construct definitions for the two tests are reflected in the results of the factor
analyses. The constructs of the two assessments are, therefore, not entirely similar.
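To make the factor-analytic step concrete, the sketch below runs an exploratory, unrotated factor analysis on simulated subscores. It illustrates the general technique only; it is not a reconstruction of Kunnan and Carr's confirmatory models, and the subscores and their correlations are invented.

    # Exploratory factor analysis on simulated subscores (illustration only).
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    n = 184                                    # sample size borrowed from the study
    reading = rng.normal(size=n)               # invented latent reading ability
    writing = 0.6 * reading + rng.normal(scale=0.8, size=n)   # correlated latent writing

    scores = np.column_stack([
        reading + rng.normal(scale=0.5, size=n),   # simulated "Test A reading"
        writing + rng.normal(scale=0.5, size=n),   # simulated "Test A writing"
        reading + rng.normal(scale=0.5, size=n),   # simulated "Test B reading"
        writing + rng.normal(scale=0.5, size=n),   # simulated "Test B writing"
    ])

    fa = FactorAnalysis(n_components=2).fit(scores)
    print(np.round(fa.components_, 2))         # loadings: rows = factors, columns = subscores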

Reliability
Reliability or consistency (for norm-referenced tests) and dependability (for criterion-referenced
tests) of performance and scoring are well-known markers of a high-quality assessment. Reliability
or dependability procedures could be used to check the internal consistency of all test items in a
section of a test assessing the same construct, or to examine the inter-rater reliability among raters
assessing extended writing or speaking, or, in more contemporary contexts, to investigate the reli-
ability of human raters with automated ratings.
Reliability estimates for norm-referenced tests can be generated using Classical True Score Theory
(e.g., correlations, Cronbach’s alpha, etc.), Item Response Theory, and Generalizability Theory. Com-
monly used estimates include inter-rater consistency, which is typically reported as a correlation but
should be adjusted with the Spearman-Brown prophecy formula to be used as a reliability index,
and if there are more than two ratings, Fisher Z-transformation should be applied to each correla-
tion, then averaged together, and the result should be retransformed to a correlation coefficient.
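A minimal sketch of these two adjustments, using invented correlations rather than ratings from any real scoring study, is given below.

    # Spearman-Brown adjustment and Fisher Z averaging of inter-rater correlations.
    # The correlation values are invented for illustration.
    import math

    def spearman_brown(r, k=2):
        """Reliability of the mean of k ratings, given a single inter-rater correlation r."""
        return k * r / (1 + (k - 1) * r)

    def average_correlations(rs):
        """Fisher Z-transform each correlation, average the Z values, transform back."""
        zs = [math.atanh(r) for r in rs]
        return math.tanh(sum(zs) / len(zs))

    pairwise = [0.78, 0.84, 0.81]              # hypothetical correlations among three raters
    r_avg = average_correlations(pairwise)
    print(round(r_avg, 3), round(spearman_brown(r_avg, k=3), 3))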


Another approach is to use Cronbach’s alpha as an estimate of internal consistency with multiple
scores for multiple raters, multiple tasks or items, or for subscales in an analytical scoring scale.
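For reference, the standard formulation of Cronbach's alpha for k such components (a textbook formula, not quoted from this chapter) is

    \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)

where the numerator sums the variances of the k individual components and the denominator is the variance of the total score.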
A more informative approach is to use generalizability theory to estimate the consistency of scor-
ing. This approach offers the advantage of identifying facets of the measurement process contribut-
ing most to reliability or unreliability, such as items or tasks, raters, and occasions particularly for
extended writing or speaking tasks. Many-faceted Rasch theory can also provide similar information;
in addition, it can provide ability estimates for test takers and difficulty or severity estimates for each
rater or rubric scale.
For criterion-referenced estimates of score dependability and decision consistency, Hambleton, Swaminathan, Algina, and Coulson (1978), using generalizability theory, identified three different concepts: (1) agreement of placement classification decisions, (2) agreement of decisions at cut scores, and (3) dependability of domain scores (see Kunnan, 1992, for more details).
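As a rough illustration of the first two concepts (leaving the generalizability-theory machinery itself aside), agreement (po) and kappa for a pass/fail classification made on two parallel forms can be computed from a simple cross-classification; the counts below are invented.

    # Decision consistency sketch: observed agreement (po) and Cohen's kappa
    # for pass/fail classifications on two parallel forms (invented counts).
    def decision_consistency(both_pass, pass_fail, fail_pass, both_fail):
        n = both_pass + pass_fail + fail_pass + both_fail
        po = (both_pass + both_fail) / n                     # observed agreement
        p1 = (both_pass + pass_fail) / n                     # pass rate on form 1
        p2 = (both_pass + fail_pass) / n                     # pass rate on form 2
        pc = p1 * p2 + (1 - p1) * (1 - p2)                   # chance agreement
        return po, (po - pc) / (1 - pc)                      # po and kappa

    print(decision_consistency(both_pass=52, pass_fail=6, fail_pass=8, both_fail=34))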
Kunnan (1992) investigated placement decisions made by the ESL program in four class levels
(33A, 33B, 33C, and 35) at the University of California, Los Angeles based on the New ESL place-
ment examination. He found that agreement indices (po and K) differed across cut scores; they were
highest for the lowest cut score group (po at .98 for the 33A level group) and lowest for the highest
cut score group (po at .86 for the 35 level group). Agreement indices at the cut scores were generally
acceptable across groups (po at .98 to .94). The dependability estimates for the four ESL class groups
were different: lowest for the 33C and 35 groups (Brennan’s Φ .30 and .48, respectively). Thus, the
ESL placement test was not dependable to the same extent for all groups, possibly because there were
not enough test items at all levels; if specifications and items were carefully laddered, there would have
been a better chance of assessing student ability at the different levels. In terms of actual classification of
individual students, there were problems as well. Using students’ total scores instead of specific section
scores for listening, grammar, reading, and writing resulted in misclassification of many students.
Specifically, if section scores were used, 9 students (out of 390; 2.30%) in the listening class would
have been placed differently into higher or lower levels, 41 students in the reading class (10.51%)
would have been placed differently, and 58 students in the grammar class (14.87%) would have been
placed differently. Thus, a total of 108 cases of misclassification occurred due to total cut scores.
But two recent studies conducted by in-house researchers at Educational Testing Service, Princeton,
found positive results. Wang, Eignor, and Enright (2008) reported the significant effect of academic
placement for the total iBT scores as well as for the four skills of listening, reading, speaking, and
writing. Similarly, Papageorgiou and Cho (2014) conducted a study to test the claim that scores were
able to place “students in English-language programs so that they are matched with level-appropriate
instruction” (www.ets.org/toefl_junior). Data collected included TOEFL Junior Standard total scores
of 92 ESL students in two secondary schools in the US and Europe, as well as their ESL teachers’ sug-
gestions on their placement levels, which were made within four weeks after the TOEFL administra-
tion. Teacher-suggested ESL placement levels were used as the dependent variable and regressed on
TOEFL Junior total scores using an ordinal logistic regression model. Statistical analyses revealed a
strong correlation between test scores and the teacher-assigned ESL levels. Moreover, the results from
the logistic regression analysis indicated a great deal of overlap between the teacher-assigned ESL
levels and the levels predicted from the TOEFL Junior Standard scores. The findings thus provided
some preliminary evidence to support the use of TOEFL Junior Standard as an initial screening tool
for ESL placement.
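A sketch of this kind of analysis is given below. The data, the three-level grouping, and the use of statsmodels' OrderedModel are all assumptions made for the illustration; the software and model settings of the original study are not reported here.

    # Ordinal logistic regression of teacher-assigned level on a standardized
    # total score; data and settings are invented for illustration.
    import numpy as np
    import pandas as pd
    from statsmodels.miscmodels.ordinal_model import OrderedModel

    rng = np.random.default_rng(7)
    n = 92                                         # sample size borrowed from the study
    score = rng.integers(600, 900, size=n)         # hypothetical total scores
    # Hypothetical teacher levels (0 = low, 1 = mid, 2 = high) loosely tied to score.
    level = pd.cut(score + rng.normal(scale=40, size=n), bins=3, labels=False)

    exog = (score - score.mean()) / score.std()    # standardize to help the optimizer
    model = OrderedModel(level, exog.reshape(-1, 1), distr="logit")
    result = model.fit(method="bfgs", disp=False)
    print(result.params)                           # slope for score plus threshold cut points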

Fairness
As fairness has become an increasingly important concern, and is well documented in the 2014 Stan-
dards, assessment developers are obligated to provide comparable and equitable treatment during
all phases of the assessment process to all test takers. Anticipating the 2014 Standards, Holland and
Wainer (1993) outlined three relevant aspects that are routinely used to ensure test fairness:

(1) Detailed reviews of test items by subject matter experts and members of the major sub-
groups in society (gender, ethnic, and linguistic) that, in prospect, will be represented in the
examinee population; (2) Comparisons of the predictive validity of the test done separately for
each of the major subgroups of examinees; and (3) Extensive statistical analyses of the relative
performance of major subgroups of examinees on individual test items.
(p. xiii)

Sensitivity Review
Holland and Wainer’s (1993) first aspect suggested that assessment developers ought to be responsible for developing tests that measure the intended constructs and for eliminating or minimizing construct-irrelevant characteristics in the tests in the form of linguistic, cognitive, cultural, physical, or
other characteristics. This would require them to identify and eliminate language, symbols, words
and phrases, content, and stereotyping of characters that are offensive or insulting to test takers of
different racial, ethnic, gender, or other groups.
A well-documented sensitivity review process of all assessment materials was in place at Educa-
tional Testing Service, Princeton, as early as the 1980s based on the ETS Test Sensitivity Review Process
(1980) and the ETS Standards for Quality and Fairness (1987). When these standards were translated
into practice, Ramsey (1993) documented that it required (1) ETS documents to be balanced, not
foster stereotypes, and not contain ethnocentric or gender-based assumptions; and (2) ETS tests not
to be offensive when viewed from a test taker’s perspective, not contain controversial material, and
not be elitist and ethnocentric. Sensitivity reviews of tests used in most US school and college dis-
tricts include additional aspects such as cultures, religions, socio-economic groups, and disabilities.
Hambleton and Rogers (1995) identified five areas in their review form: fairness, content bias,
language bias, item structure and format bias, and stereotyping. The way the review form works is
as follows: each reviewer would receive a reading passage and related items (testlet) or other types
(such as individual items or prompts). This testlet would then be examined in terms of the checklist
items. The checklist information from all of the reviewers would then be collated and organized into
a spreadsheet for further descriptive statistical analyses. If a testlet received checks indicating bias on
more than 25% of the checklist items, then such items were flagged for modification, revision, or
deletion.
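A bare-bones version of the collation and flagging step, under one reading of the 25% criterion (bias checks pooled across reviewers) and with invented reviewer data, could look like this.

    # Collate sensitivity-review checklists and flag testlets where more than 25%
    # of checklist entries received a bias check; reviewer data are invented.
    def flag_testlets(reviews, threshold=0.25):
        """reviews: {testlet_id: list of per-reviewer checklists of 0/1 checks}."""
        flagged = []
        for testlet, checklists in reviews.items():
            checks = sum(sum(c) for c in checklists)          # total bias checks
            possible = sum(len(c) for c in checklists)        # total checklist entries
            if possible and checks / possible > threshold:
                flagged.append(testlet)
        return flagged

    reviews = {
        "reading_testlet_1": [[0, 0, 1, 0, 0], [0, 1, 1, 0, 0]],   # hypothetical checklists
        "reading_testlet_2": [[0, 0, 0, 0, 0], [0, 0, 1, 0, 0]],
    }
    print(flag_testlets(reviews))   # testlets to return for modification, revision, or deletion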

Test Performance
Holland and Wainer’s second and third aspects suggested that scores of different subgroups should
ideally have similar predictive validity with an external criterion (such as workplace or university
performance) for these subgroups. When comparisons of subgroups are conducted and there is a sig-
nificant difference, then the assessment developer has to be sure that the difference is not due to the
assessment itself. Specifically, if scores among subgroups are significantly different, then assessment
developers and score users are responsible for examining the evidence for validity of score interpre-
tations for intended uses for test takers from subgroups.
Zumbo (2007) also makes this point succinctly:

If the average test scores for such groups (e.g., men vs. women, Blacks vs. Whites) were found
to be different, then the question arose as to whether the difference reflected bias in the test.
Given that a test comprises items, questions soon emerged about which specific items might
be the source of such bias.
(p. 224)

To put it another way, when a test is fair, the expectation is that test takers in all subgroups will fare equally well. However, when test score differences occur among test taker subgroups (say, by gender, race/ethnicity, or native language) of similar ability (as determined and matched by the total score), the question is why the differences occurred. Were they due to the ability of interest, or were they due to the test items? Specifically, were the differences due to test content (a reading passage or listening script), test format (multiple-choice or constructed response), or some other irrelevant variable (say, typing skills) that was unfair to some groups?
The 2014 Standards articulated Standard 3.6, which is relevant here:

Where credible evidence indicates that test scores may differ in meaning for relevant sub-
groups in the intended examinee population, test developers and/or users are responsible for
examining the evidence for validity of score interpretations for intended uses for individuals
from those subgroups. What constitutes a significant difference in subgroup score and what
actions are taken in response to such differences may be defined by applicable laws.
(p. 85)

An examination of such evidence to determine whether the score interpretations are the same or
not for different subgroups is called Differential Item Functioning (DIF) or Differential Testlet Func-
tioning (DTF) analysis. Once an item or testlet is identified as a DIF item or a DTF testlet, content
experts need to identify the source of the DIF/DTF, whether it is content, test format, or any form
of bias.
DIF and DTF are psychometric procedures involving statistical methods (such as regression, Item
Response Theory, and non-parametric SIB test, etc.) that detect statistical bias at the test item or
testlet level for two comparable subgroups of test takers, matched with respect to the construct being
measured by the test. As many subgroups can be formed, conducting DIF/DTF analyses in an explor-
atory mode may not yield useful results, especially in answering questions regarding the source of the
DIF/DTF. Thus, many researchers (Roussos & Stout, 2004; Zumbo, 2007) have recently argued that
a purposeful hypothesis-driven approach to identifying subgroups for DIF/DTF analyses will most
likely yield answers to questions regarding the source of the DIF/DTF. In other words, when DIF
research is exploratory and not based on hypotheses and some items are flagged for DIF, the source
of the DIF is often unclear. This leaves open the question of whether such DIF items with no clear
source of explanation are really biased items.


Relevant hypotheses for an assessment might involve subgroups of test takers defined by their native languages, such as the Indo-European (IE) and non-Indo-European (NIE) language subgroups relevant for international assessments. Thus, hypotheses could be stated as a null hypothesis, “There is no DIF or DTF between the IE and NIE subgroups,” or as a directional hypothesis, “There is DIF/DTF for the IE group (or the NIE group).” A popular line of investigation is to examine DIF by gender subgroups.
Another hypothesis that may be relevant could be subgroups formed by age groupings (e.g., ages
16–25, 26–40, 41–60, 61 and above). These hypotheses may be based on previous research, theo-
retical foundation, anecdotal evidence, or a researcher’s or test developer’s hunches. The subsequent
DIF/DTF analysis would reveal whether test performance of the two subgroups at the item or testlet
level is affected by their grouping even when bands of test takers who have similar overall scores are
matched and analyzed. As mentioned earlier, items and testlets that display DIF/DTF will then need
to be examined by content experts for the source of the DIF/DTF, whether it is content, test format,


or any form of bias. If the source of the DIF/DTF is determined to be construct-irrelevant variance,
then the item or testlet will need to be revised or removed. In terms of empirical studies, Ferne and
Rupp (2007) provided an excellent synthesis of 15 years (1990–2005) of research on DIF in language
testing. They reviewed 27 articles in terms of the tests and learner groups used in the studies, along
with DIF detection methods applied, the reporting of DIF effects, and explanations for and conse-
quences drawn from DIF results.
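As an illustration of one of the statistical methods mentioned above, the following sketch applies logistic-regression DIF detection to simulated data. It is a minimal sketch, not the analysis used in any of the studies cited: the sample size, the group labels (IE vs. NIE first language, echoing the example above), and the effect sizes are all invented, the matching variable is simply the total score, and the script assumes numpy, scipy, and statsmodels are installed.

```python
# Minimal logistic-regression DIF sketch on simulated data. Uniform DIF is
# indicated when group membership improves prediction of item success beyond
# the matching variable (total score); non-uniform DIF when the score-by-group
# interaction does.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                   # 0 = reference (IE), 1 = focal (NIE)
ability = rng.normal(0, 1, n)
total = ability * 5 + 25 + rng.normal(0, 2, n)  # matching variable: total score

# Simulate one dichotomous item with built-in uniform DIF against the focal group.
logit = 1.2 * ability - 0.8 * group
item = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def fit(X):
    return sm.Logit(item, sm.add_constant(X)).fit(disp=False)

m_base = fit(np.column_stack([total]))                        # matching only
m_unif = fit(np.column_stack([total, group]))                 # + group
m_nonu = fit(np.column_stack([total, group, total * group]))  # + interaction

# Likelihood-ratio tests for uniform and non-uniform DIF.
lr_unif = 2 * (m_unif.llf - m_base.llf)
lr_nonu = 2 * (m_nonu.llf - m_unif.llf)
print(f"Uniform DIF:     chi2(1) = {lr_unif:.2f}, p = {stats.chi2.sf(lr_unif, 1):.4f}")
print(f"Non-uniform DIF: chi2(1) = {lr_nonu:.2f}, p = {stats.chi2.sf(lr_nonu, 1):.4f}")
```

A flagged item would then go to content experts, as described above, to judge whether the statistical difference reflects construct-irrelevant content or format.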
There have been recent debates on what constitutes fairness and how and whether fairness is
related to bias. As the debates are too numerous to list, here is a sample of the debates: “What does
test bias have to do with fairness?” (Elder, 1997), the relationship between test fairness, test bias, and
DIF (Kunnan, 2007), “How do we go about investigating test fairness?” (Xi, 2010), fairness and the
Toulmin argument structure (Kunnan, 2010), and fairness and justice (Kunnan, 2014; McNamara &
Ryan, 2011).

Decision Making and Consequences


As Kunnan (2013) argued, it is both important and necessary to examine whether the consequences
of assessment scores and decisions made based on them are beneficial to test takers, the community,
and society in general. In school, college, and university contexts, the immediate consequence is to
the test taker (the learner) and then to the teacher, and then further to the teaching-learning process.
This effect on the teaching-learning process is called washback (i.e., the effect of language assess-
ments on several micro-levels of language teaching and learning inside the classroom) (McNamara,
2000). A number of empirical studies conducted in a variety of contexts have helped reveal the
complex relationship between assessment and teaching and learning, and specifically washback (Cheng
& Watanabe, 2004; Wall, 2005). For example, Qi (2005) examined why the National Matriculation
English Test (NMET) in China failed to bring about the intended washback effects of promoting
changes in English language teaching in schools. Qualitative data were collected from interviews and
questionnaires from four groups of stakeholders: eight NMET developers, six English inspectors,
388 teachers, and 986 students. Qi found that NMET’s failure to achieve its intended washback was
mainly due to the conflicting functions that the test serves, namely, the selection function and the
function of promoting change. This conflict makes the NMET “a powerful trigger for teaching to the
test, but an ineffective agent for changing teaching and learning in the way intended by its construc-
tors and the policymakers” (p. 142).
Besides washback effects, empirical studies are also necessary to check whether the use of large-
scale language assessment scores actually serve their intended or sometimes unintended purposes
and whether decisions made on the basis of assessment scores have a positive impact on stakehold-
ers. Even when an assessment is shown to be serving its intended purpose, empirical studies still
need to be conducted to see if scores used as the sole criterion in decision making are adequate and
appropriate.
More recently, O’Loughlin (2011), in his investigation into the interpretation and use of IELTS
scores for admission purposes at an Australian university, found that IELTS scores were not used to
guide English language learning except for undergraduate students, who were sometimes admitted
with an overall band score of less than 6.5. This finding suggested that there had been very few benefi-
cial educational consequences of the test. He explored how the IELTS score was used in the selection
of international ESL students within one faculty in an Australian university, and the knowledge and
beliefs that administrative and academic staff had about the test. Data gathered included relevant
university policy and procedures documents as well as statistics related to English entry, a question-
naire administered to 20 staff, and follow-up interviews with 12 selected staff. With regard to the way
IELTS scores were used in admission, the study found that (1) there was neither a principled basis


for originally establishing IELTS minimum entry scores nor further tracking of student success to
validate entry requirements; and (2) applicants’ IELTS scores were not considered in relation to other
relevant individual factors as recommended in the guidelines listed in the IELTS Handbook (2007),
including age, motivation, educational and cultural background, first language, and language learn-
ing history. Both findings indicated that the interpretation and use of IELTS test scores in this context
were invalid. O’Loughlin thus proposes that a major change needs to be made in admission policy
and procedure in Australian universities, which would allow test scores to be interpreted in relation
to other relevant information about applicants. Meanwhile, all university stakeholders, including
university policy makers, admission staff, as well as academic staff, need to be better educated in score
interpretation and use and the other accepted measures of proficiency.
One particular type of large-scale assessment that is heavily influenced by legislation, and thus carries implicit political values, is the immigration or citizenship test. In 2009, a collection of 12 case stud-
ies edited by Extra and Spotti (2009) on language testing and citizenship was published under the
title Language Testing, Migration and Citizenship: Cross-National Perspectives on Integration Regimes.
One particularly critical case within the European context concerned the Netherlands, where there
is a demand for “cultural and linguistic homogenization at the national level” (Extra & Spotti, 2009,
p. 125). It is not surprising that against such an ideological background, newcomers to the Nether-
lands have to pass three stages of testing regimes (i.e., admission to the country, civic integration after
arrival, and naturalization to citizenship). Extra and Spotti (2009) outlined, analyzed, and evaluated
these testing regimes by taking into account both the history and the phenomenology of these regimes.
For example, their evaluation of the first part of the admission test, which is supposed to measure
test takers’ knowledge of Dutch society, reveals that it is actually a hidden language test, because all
questions are in Dutch and all answers also have to be provided in Dutch. With regard to the second
part of the admission test, a computerized phone test of listening and speaking skills at the CEFR A1 minus level, Extra and Spotti cited external judgments offered by a group of four experts in linguistics, testing, and speech technology. Concerning the quality of the test, the experts concluded that
“there was not enough evidence that the proposed phone test would be valid and reliable” (p. 132).
Outside of the European context, Kunnan (2009) took a critical look at the US Naturalization Test,
which millions of citizenship applicants have been subjected to over the past two decades. Applying
the “Test Context Framework” (Kunnan, 2008), which expands the scope of the examination of tests
to include the political, economic, and legal contexts, Kunnan investigated whether the language
requirement and the test, including its purpose, content, administration, and consequence, provide
the impetus for citizenship applicants to integrate or develop a civic nationalism. A detailed analysis
of the test, including its development, content, as well as problems, led Kunnan to argue that (a) nei-
ther the test requirement nor its purpose is meaningful; (b) the test fails in its attempt to assess Eng-
lish language ability and knowledge of US history and government; and (c) the test may not have the
intended positive consequence of bringing about civic nationalism and social integration. He thus
concluded that the Naturalization Test is an undue burden on non-English-speaking immigrants and
only creates a barrier for them to acquire citizenship, and calls for a replacement of the test with an
educational program in English language and US history and government.
Australia is another country where immigration has always been a central issue. In 2007, a new
formal testing regime for people wishing to gain citizenship was launched in Australia. This new citizenship test is administered in English by computer and takes the form of 20 multiple-choice questions on
Australian institutions, customs, history, and values (Piller & McNamara, 2007). As the test also aims
to satisfy the legislative requirement that the applicant must demonstrate a ‘basic knowledge’ of
English, the language and literacy demand of the test becomes a crucial issue. Piller and McNamara
(2007) carried out a number of analyses to assess the linguistic difficulty of the resource booklet.
First, a comparison of the language of the booklet with some widely used definitions of ‘basic


English’, such as the first two levels of the CEFR, revealed that the booklet’s language is well above the
so-called basic level. Second, a lexical analysis using the web-based lexical profiler Web VP showed
that about two-thirds of the content words in the booklet are beyond the level of the top 1,000 most
frequent words in English, and that only 16.23% of the words in the booklet are of Anglo-Saxon ori-
gin. Both findings suggested the relatively abstract and complex nature of the material. Furthermore,
the lexical density of the booklet language reaches 57%, which is at the very upper end of the range
for written texts. These findings led the researchers to conclude that “the resource booklet Becoming
an Australian Citizen is certainly out of the reach of a basic user of English and would present dif-
ficulties for many native speakers of English with limited education and/or limited familiarity with
texts of this type” (p. 1).
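The kind of lexical profiling reported here can be approximated with a short script. The sketch below is illustrative only: the sample sentences, the tiny stand-in frequency list, and the crude content-word filter are placeholders for a real top-1,000 wordlist and a proper part-of-speech tagger of the kind behind Web VP.

```python
# Rough lexical-profile sketch (illustrative only): computes lexical density
# (content words / running words) and the share of content words falling
# outside a high-frequency wordlist. The wordlists and sample text below are
# hypothetical stand-ins, not the actual resource booklet or Web VP lists.
import re

TOP_1000 = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "it",
            "people", "country", "new", "law", "right", "help", "under"}
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "and", "in", "that", "it",
                  "is", "are", "be", "have", "has", "for", "with", "by", "on"}

sample_text = """Becoming a citizen involves obligations under Australian law.
Applicants demonstrate knowledge of parliamentary institutions and civic values."""

tokens = re.findall(r"[a-z]+", sample_text.lower())
content = [t for t in tokens if t not in FUNCTION_WORDS]
off_list = [t for t in content if t not in TOP_1000]

print(f"Lexical density: {len(content) / len(tokens):.0%}")
print(f"Content words beyond the high-frequency list: {len(off_list) / len(content):.0%}")
```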

Recommendations
These recommendations are addressed to assessment developers and to all the stakeholders involved in large-scale assessments. Although they are listed under different headings, there is some overlap among them.

Public Justification and Reasoning


The most important recommendation that can be made for large-scale assessment concerns public justification and reasoning. For fairness and justice to work in an assessment context, public
justification or public reasoning should be part of the process of engaging all stakeholders. Philoso-
pher John Rawls (1960) argued that it was necessary to justify policy judgments to fellow citizens so
that public consensus could be reached. He also suggested the use of the methodology of “reflective
equilibrium” to help in the public justification process. In this methodology, initial ideas, beliefs, or
theories are subjected to reason, reflection, and revision until the ideas, beliefs, or theories reach a
state of equilibrium in public justification. Economist Amartya Sen (2012) went a step further and
argued that public reasoning of a government’s policy was essential to convince the public of a new
policy. This should apply to assessment policy as well, where the relevant assessment
bodies and assessment developers will offer a full account of the issues related to an assessment, such
as purpose (selection, placement, etc.), development (item specifications, trialling, revisions, etc.),
operations (item test banking, assembly of test, etc.), research findings (validity, reliability, fairness,
and decision making), and consequences (beneficial or not) of the assessment. These issues need to
be brought to the attention of the community through descriptions, research, and findings in public
forums in a variety of venues that could include conferences, seminars, open town hall meetings, and
regular newsletters, to name a few. Of course, in both cases of public justification and public reason-
ing, it is assumed that the governments and bodies that have put assessments in place through public
policy (in school, college, university, workplace, immigration, and citizenship) are not authoritarian
and unwilling to hear community members’ concerns, if any.

Framework and Agenda for Research


It is imperative that agencies that develop and administer large-scale assessments develop a frame-
work and agenda for ongoing research, not an ad hoc plan to satisfy a particular client or customer
or satisfy an assessment standard required by funding bodies. Generally, assessment agencies are
willing to spend the time and resources to develop and launch an assessment, but they allocate fewer
resources for research and development after the assessment has been launched. Specific plans, pro-
cedures, and steps need to be in place for research on validity, reliability, and fairness. In addition, if


accommodations (such as extended time) are necessary for test takers with disabilities, research in
this area has to be conducted in order to provide the appropriate type of accommodation.

Assessment Development and Research


It may be unnecessary to make this recommendation, but given the proliferation of assessments
and of organizations that are entering this business, it may be salutary. Assessment development and
research staff need to be qualified in language assessment issues, preferably through a university
program, and then trained in systematic ways of developing assessments and researching the validity,
reliability, and fairness of assessments, and relating all of them to the consequences of assessments.
In addition, staff should have the ability to go to public forums to offer public justification and rea-
soning of the assessments they have developed and researched.

Score Users and Community Members


Assessment score users are members of the larger community who make decisions about test takers.
Such score users could be school, college, or university teachers who are responsible for placing stu-
dents into programs, workplace officials who make decisions regarding careers and promotion, and
immigration and citizenship officials who decide on applicants’ mobility and residence. They need
to understand how to read and interpret scores (scores, grades, descriptors, etc.) and understand the
limitations of scores (standard errors, cut scores, reliability of scores, etc.). In addition, they need to
be able to translate score reports that are often technical to administrators, parents, and community
members in public forums and town hall meetings. On occasion, they should also be prepared to
provide depositions in courts related to the strengths of their assessments and contest challenges
regarding any weaknesses of their assessments.

Oversight Bodies
As of now, assessments and assessment agencies are not governed by the rules and regulations of a
state regulatory board or oversight body. This means that assessments do not have to be certified to
be valid, reliable, or fair, yet they can still enter the marketplace and determine the lives of test takers.
The burden therefore is on assessment agencies to self-monitor and self-review their assessments.
Although we are not advocating setting up a regulatory authority within a particular country or
internationally, meeting the Standards (AERA, APA, & NCME, 2014) in a phased manner could be
one way of assuring stakeholders of the quality of assessments. Additionally, national or interna-
tional organizations like ALTE, EALTA, ILTA, and the like can be given a role in certifying the quality
of assessments through their elected officials. A totally unregulated marketplace with assessments
developed by unqualified and untrained staff, in some cases, with only a business model in mind, is
a very scary situation for all stakeholders.

Conclusion
As we have shown through discussions and empirical studies of validity, reliability, and fairness of
large-scale assessments, large-scale language assessments are not the ideal way to assess language
ability. This is because in many cases the intended purposes may not be well served, and unintended
consequences may play a bigger role than anticipated. But in this age, when several hundred or several thousand test takers have to be assessed in a short period of time and results and decisions have to
be made quickly, there is probably no better way than what we have. Therefore, our goal has to be to


make better large-scale assessments. In this regard, we have offered some recommendations that can help build a base for quality large-scale assessments despite their present limitations.
The most important recommendation of all those made is the one regarding public justifica-
tion or public reasoning. We argue that this should be a fundamental requirement of all assessment
agencies that develop, administer, and use large-scale assessments. As we have seen, many large-scale
assessments have the power to alter the careers and lives of test takers, and if this is done with poor-
quality assessments, the detrimental effects of such action will be long-lasting and irreversible. By
requiring assessment agencies to account publicly for their assessments, we believe the road to quality assessments can be assured.

Acknowledgment
I would like to thank Hang Li from Shanghai Jiaotong University for her reference work for this
chapter.

Note
1. Starting in 2016, the state will administer Smarter Balanced Assessments, computer-based tests aligned with the state’s new
standards for English language arts/literacy and math called California Assessment of Student Performance and Progress
(CAASPP).

References
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: Author.
Bachman, L. F. (2005). Building and supporting a case for test use. Language Assessment Quarterly, 19, 453–476.
Bachman, L. F., & Palmer, A. (2010). Language assessment in practice. Oxford, UK: Oxford University Press.
Chapelle, C. A., Enright, M. K., & Jamieson, J. M. (Eds.). (2008). Building a validity argument for the test of English as a foreign
language. New York, NY: Routledge.
Chapelle, C. A., & Voss, E. (2014). Evaluation of language tests through validation research. In A. J. Kunnan (Ed.), The com-
panion to language assessment (pp. 1081–1097). Chichester, UK: Wiley.
Cheng, L., & Qi, L. (2006). Description and examination of the national matriculation English test. Language Assessment
Quarterly, 3, 53–70.
Cheng, L., & Watanabe, Y. (Eds.). (2004). Washback in language testing. Mahwah, NJ: Lawrence Erlbaum.
Educational Testing Service. (1980). ETS test sensitivity review process. Princeton, NJ: Author.
Educational Testing Service. (1987). ETS standards for quality and fairness. Princeton, NJ: Author.
Elder, C. (1997). What does test bias have to do with fairness? Language Testing, 14, 261–277.
Extra, G., & Spotti, M. (2009). Testing regimes for newcomers in the Netherlands. In G. Extra, M. Spotti, & P. van Avermaet
(Eds.), Language testing, migration and citizenship: Cross-national perspectives on integration regimes (pp. 1–33). London:
Continuum.
Ferne, T., & Rupp, T. (2007). A synthesis of 15 years of research on DIF in language testing. Language Assessment Quarterly,
4, 113–148.
Hambleton, R., & Rogers, J. (1995). Test bias. Thousand Oaks, CA: Sage.
Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1–47.
Holland, P., & Wainer, H. (Eds.). (1993). Differential item functioning. Mahwah, NJ: Lawrence Erlbaum.
International English Language Testing System (IELTS) handbook. (2007). Cambridge, UK: Cambridge Language Assessment.
Kane, M. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535.
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice,
18, 5–17.
Kunnan, A. J. (1992). An investigation of a criterion-referenced test using G-theory, and factor and cluster analyses. Language
Testing, 18, 30–49.
Kunnan, A. J. (2007). Test fairness, test bias, and DIF. Language Assessment Quarterly, 4, 109–112.


Kunnan, A. J. (2008). Towards a model of test evaluation: Using the Test Fairness and Wider Context frameworks. In L. Taylor &
C. Weir (Eds.), Multilingualism and assessment: Achieving transparency, assuring quality, sustaining diversity. Papers from
the ALTE Conference in Berlin, Germany (pp. 229–251). Cambridge: Cambridge University Press.
Kunnan, A. J. (2009). Testing for citizenship: The U.S. naturalization test. Language Assessment Quarterly, 6, 80–97.
Kunnan, A. J. (2010). Fairness and the Toulmin argument structure. Language Testing, 27, 183–189.
Kunnan, A. J. (2013). High-stakes language testing. In C. Chapelle (Ed.), The encyclopedia of applied linguistics (pp. 1–6).
Oxford, UK: Blackwell.
Kunnan, A. J. (2014). Fairness and justice in language assessment. In A. J. Kunnan (Ed.), The companion to language assessment
(pp. 1098–1114). Chichester, UK: Wiley.
Kunnan, A. J., & Carr, N. (2015). A comparability study between the general English proficiency test—advanced and the
internet-based test of English as a foreign language. Research Report No. 6. Taipei: Language Training and Testing Center.
Kunnan, A. J., & Grabowski, K. (2013). Large-scale second language assessment. In M. Celce-Murcia, D. M. Brinton, & M. A.
Snow (Eds.), Teaching English as a second or foreign language (4th ed., pp. 304–319). Boston, MA: National Geographic
Learning/Cengage Learning.
McNamara, T. (2000). Language testing. Oxford, UK: Oxford University Press.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., & Cai, Z. (2014). Automated evaluation of text and discourse with Coh-
Metrix. Cambridge, MA: Cambridge University Press.
McNamara, T., & Ryan, K. (2011). Fairness versus justice in language testing: The place of English literacy in the Australian
citizenship test. Language Assessment Quarterly, 8, 161–178.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York, NY: Macmillan.
O’Loughlin, K. (2011). The interpretation and use of proficiency test scores in university selection: How valid and ethical are
they? Language Assessment Quarterly, 8, 146–160.
Papageorgiou, S., & Cho, Y. (2014). An investigation of the use of TOEFL® Junior™ Standard scores for ESL placement deci-
sions in secondary education. Language Testing, 31, 223–239.
Piller, I., & McNamara, T. (2007). Assessment of the language level of the August 2007 draft of the resource booklet “Becoming an
Australian citizen”. Report prepared for the Federation of Ethnic Communities. Councils of Australia (FECCA). Curtin,
ACT: FECCA.
Qi, L. (2005). Stakeholders’ conflicting aims undermine the washback function of a high-stakes test. Language Testing, 22,
142–173.
Ramsey, D. (1993). Sensitivity review. In P. Holland & H. Wainer (Eds.), Differential item functioning (pp. 367–388). Mahwah, NJ: Lawrence Erlbaum.
Rawls, J. (1960). Fairness as justice. Cambridge, MA: Harvard University Press.
Roussos, L., & Stout, W. (2004). Differential item functioning analysis. In R. Kaplan (Ed.), The Sage Handbook of quantitative
methodology for the social sciences (pp. 107–116). Thousand Oaks, CA: Sage.
Sen, A. (2012). The idea of justice. Cambridge, MA: Harvard University Press.
Wall, D. (2005). The impact of high-stakes examinations on classroom teaching. Cambridge: Cambridge University Press.
Wang, L., Eignor, D., & Enright, M. K. (2008). A final analysis. In C. A. Chapelle, M. K. Enright, & J. M. Jamieson (Eds.), Build-
ing a validity argument for the test of English as a foreign language (pp. 259–318). New York, NY: Routledge.
Xi, X. (2010). How do we go about investigating test fairness? Language Testing, 27, 147–170.
Zumbo, B. (2007). Three generations of DIF analyses. Language Assessment Quarterly, 4, 223–233.

35
Fifteen Ways to Improve Classroom Assessment
James Dean Brown and Jonathan Trace

Introduction
Assessment is just as essential to the language classroom as any teaching activity or task, yet much of the
literature on second language classroom assessment tends to limit itself to definitions of assessment, distinctions between planned and informal measures, and discussions of the issues involved in carrying out assessment. While we draw on many of these concepts and findings, our primary goal in this
chapter is to help language teachers plan, design, and carry out assessment with their students from
a practical standpoint. Hence, the title of this chapter, fifteen ways to improve classroom assessment.
Classroom assessment has been defined as “any reflection by teachers (and/or learners) on the
qualities of a learner’s (or group of learners’) work and the use of that information by teachers (and/
or learners) for teaching, learning (feedback), reporting, management or socialization purposes”
(Hill & McNamara, 2012, p. 396). While some authors have focused on these concepts in terms of
planned assessments (e.g., Leung & Mohan, 2004; McNamara, 2001), others have associated class-
room assessment with more reflective or formative measures that blend assessment and instruction
(e.g., Rea-Dickins, 2006; Torrance & Pryor, 1998). Our approach, like Hill and McNamara (2012),
includes facets of both of these notions, taking the position that all classroom assessment should be
based on careful planning and carried out systematically regardless of the particular approach being
used (e.g., Popham, 2008). [For a wide variety of examples of classroom tests developed by language
teachers, see Brown, 1998, 2013a.]
Hill and McNamara (2012) outline four steps that teachers engage in when using classroom assess-
ment: (a) planning, (b) framing (or awareness-raising), (c) conducting, and (d) using assessment.
Brown (2013b) divides classroom testing into three different steps: (a) test writing, (b) development,
and (c) validation. Inspired by both articles and what we have learned from the literature, we will
organize this chapter around the five major steps shown in Figure 35.1, each of which has three parts.

Figure 35.1 Five Steps in Classroom Assessment
Planning Assessment: Know Your Options; Match Assessment to Learning; Promote Learning with Assessment
Writing Items: Create the Best Possible Items; Write Enough Items; Check the Items Again
Compiling the Test: Organize the Items; Create Scoring Tools; Proofread the Complete Test
Using the Test: Plan the Test Administration; Give Students Feedback; Use the Feedback Yourself
Improving the Test: Analyze the Items; Check Reliability; Check Validity

Planning Assessment

Know Your Options

We believe that teachers who want their assessments to be effective and useful for student learning should make every effort to plan and integrate assessment activities directly into their curriculum from the very beginning by first knowing their options. Many language teachers are not aware that
they have far more assessment options than instructors in other fields (see Brookhart, 1999; Brown,
2013b, in press). For example, math teachers typically present their students solely with numerical
problems to solve. Granted, they use different types of problems, but that is really only one type of
test item.1 In contrast, language teachers can use at least a dozen different types of items across four
categories: (a) selected-response items, (b) productive-response items, (c) personalized-response
items, and (d) individualized-response items. Each of these provides a different kind of “assessment
opportunity” (Rea-Dickins, 2001, p. 433) for teachers to gather information about their students’
learning.

Selected-Response Items
Selected-response items require learners to choose from a set of responses rather than produce one
of their own. These include multiple-choice, true-false, and matching tasks, all of which tend to be
used to measure knowledge of the language (e.g., grammar, vocabulary) or receptive skills
(listening or reading). When properly constructed, such items can be useful for gathering informa-
tion in a quick, reliable, and objective way, which makes them ideal for classroom contexts when
teachers need a quick but broad measure of student learning (e.g., daily quizzes, diagnostic tests).
Because they are primarily limited to passive knowledge of the language (e.g., comprehension, rec-
ognition), they are less effective for assessing complex, productive forms of language (e.g., speaking,
pragmatics), and thus are best paired with other response categories when teachers want a more
complete picture of their learners’ abilities.
While they are relatively easy to administer and score, selected-response items tend to be difficult
to construct. Because they are typically scored right/wrong, they need to be designed so that multiple
possible correct answers are avoided, which limits what these sorts of items can do. For example,
they may not work well for potentially subjective material like inferences, which can lead to multiple,
arguably correct responses to the same question (e.g., Buck, 2001). Other issues, such as limiting the
guessing factor and choosing an appropriate number of plausible distractors (Rodriguez, 2005),
also make these items challenging to write.


Productive-Response Items
Productive-response items have the advantage of requiring students to actually produce a response
in the target language. These items include fill-in or cloze-type items (which usually require exam-
inees to supply words or phrases in blanks), short-answer items (that involve longer, constructed
responses), and performance items (that can range from relatively short essays to more elaborate
tasks in the language, e.g., making reservations on the phone, giving a presentation, role-plays).
Unlike fill-in or short-answer items, performance items require longer responses and can simu-
late genuine language use specifically related to what learners are doing in the classroom (Shohamy,
1995) and take into account language abilities related to target language uses that learners will most
likely need in real life (Bachman & Palmer, 1996, 2010). That said, not all performance assessments
are necessarily authentic (see Kane, Crooks, & Cohen, 1999; Wiggins, 1993, 1998), but they certainly
come closer than selected-response item types.
However, performance assessments can be challenging to design because (a) appropriately difficult
and complex tasks that also elicit the kinds of real-world language performances needed are difficult
to create (see Brown, Hudson, Norris, & Bonk, 2002; Norris, Brown, Hudson, & Yoshioka, 1998);
(b) administering and scoring is difficult for large classes, though using a rubric or checklist may help
(Davis & Kondo-Brown, 2012); and (c) careful planning is needed to ensure that performance assess-
ments produce reliable and valid scores (Kane et al., 1999). However, the advantages they offer in
terms of flexibility and matching assessment to instruction may make performance assessments ideal
for classroom uses. (For much more on performance items, see Brown, 2004b; Norris et al., 1998.)

Personalized-Response Items
Personalized-response items involve teachers setting broad guidelines for the task(s) involved, but
leaving the learners to determine the actual content. Common examples include portfolios (e.g.,
see Cresswell, 2000; Graves, 1992; Hirvel & Pierson, 2000), teacher-student conferences and inter-
views (e.g., Jang, Dunlop, Park, & van der Boom, 2015), and self/peer assessment (e.g., Butler & Lee,
2010; Dippold, 2009; Matsuno, 2009; Ross, 2006). This category of assessment often involves stu-
dents reflecting on their learning, and as such, it tends to be paired with other forms of assessment,
particularly performance items. In most cases, rather than focusing on teacher-oriented assessment
(e.g., making determinations about the mastery of content), personalized-response items focus on
the students’ development and their awareness and understanding of that development process.
Because personalized-response items tend to be learner-centered, teachers need to ensure beforehand
that their learners (a) are aware of the purpose or function of these assessments in the classroom (Rea-
Dickins, 2006; Torrance & Pryor, 1998) and (b) have sufficient language ability (or can draw on a shared
L1) to profit from them. Putting assessment into the hands of the students without clear directions
and goals can lead to learners misunderstanding or devaluing the process, which may in turn affect the
reliability of the scores and undermine the assessment altogether. Fox and Hartwick (2011) report that,
when using portfolios in the classroom, learners are often unaware of how these are used to track and
reflect on their own development and that, without a clear purpose, students may only be going through
the motions (of, say, gathering samples of their work) without being aware of why they are doing so.
Additionally, one major challenge for teachers using these assessments is maintaining clear stan-
dards of reliability and validity as personalized-response items are often mistaken for informal
approaches to measuring student learning (Brown & Hudson, 1998). Similar to productive-response
items, the use of rubrics or checklists can help both teachers and learners by establishing clear assess-
ment guidelines (Davis & Kondo-Brown, 2012).


Individualized-Response Items
Individualized-response assessment is not so much a category of items as three approaches to assess-
ment (Heritage, 2010). Like personalized-response items, these tend to focus on measuring learning
processes rather than products, and they are ideal for using assessment to affect teaching and learn-
ing as it is happening. Individualized assessments are often contrasted with summative assessments
(e.g., achievement tests), which are aimed at identifying how well the students have mastered a par-
ticular skill or outcome. The three main individualized-response approaches to date are continuous
(or formative) assessment, dynamic assessment, and differential assessment.
As the name implies, continuous assessment is the ongoing process of checking in with learn-
ers during instruction, typically through the use of probing questions or comprehension checks
(e.g., calling on students individually in the class), and it resembles what teachers naturally do
in class. Such questions are designed not to score learners but rather to provide feedback and
reinforce learning, as well as to help the teacher gauge the effectiveness of instruction and adjust
the speed or content of the class if needed. Continuous assessment has been discussed in terms of its
usefulness in classrooms (Black & William, 2009; Colby-Kelly & Turner, 2007; Leung & Mohan,
2004), as well as its reliability and validity (Clapham, 2000; Leung, 2004) and its relationship
with summative forms of assessment (Carless, 2005; Davison & Leung, 2009; Popham, 2008;
Stiggins, 2002).
Dynamic assessment is another option for teachers looking to provide rich, individualized
feedback to students about their learning processes and development. Rooted in sociocultural
models of language learning and drawing on Vygotsky’s (1978) Zone of Proximal Development,
dynamic assessment uses interaction and negotiation to measure a learner’s language ability in
linguistic terms, as well as foster learner autonomy and development through internalizing feed-
back (Poehner, 2007). Learners typically work individually with a mediator who provides ongoing
feedback, usually in the form of leading questions or recasts (e.g., see Anton, 2009; Hamp-Lyons
& Tavares, 2011; and Poehner & Lantolf, 2005). Unlike most forms of assessment, which are car-
ried out in relative isolation, dynamic assessment acknowledges that language is a social process
(McNamara, 2001).
Differential assessment takes into account that learners may prefer different learning styles. Rich-
ards and Schmidt (2010) define a learning style as:

a particular way of learning preferred by a learner. Learners approach learning in different


ways, and an activity that works with a learner whose learning style favours a visual mode of
learning, may not be as successful with a learner who prefers auditory or kinesthetic modes of
learning.
(p. 331)

Differential assessment recognizes such differences and requires teachers to begin by determining
their students’ learning preferences, which can be done with a tool like the questionnaire available
at http://www.businessballs.com/freepdfmaterials/vak_learning_styles_questionnaire.pdf. Teachers
can then provide assessment options to students that suit their preferences (see Stefanakis & Meier,
2010). Clearly, differential assessment involves the extra steps of determining students’ learning pref-
erences and developing multiple forms of assessment for each preference. Additionally, questions of
the comparability of the different forms may arise. However, tailoring assessment to the visual, audi-
tory, or kinesthetic learning styles of individual students is seen by supporters as a more equitable
and accurate reflection of what students can do.


Using Multiple Options


In short, teachers should consider the many options available to them in terms of what form their
assessments can take in the classroom and how these response types differ in the ways they can be
used and interpreted to foster student learning. It might help to think of these response types as
points along a single continuum of student learning. At one end of the continuum are the product-
oriented measures like multiple-choice items or cloze tests. In the middle of the continuum are tasks,
portfolios, and interviews, and at the other end are continuous, differential, and dynamic forms
of assessment. Thus, the continuum gradually shifts from narrow product-oriented measures to
broader process-based measures intended to affect the actual learning processes as they are hap-
pening. In more direct terms, are you interested in what your students are learning or how they are
learning, or both?

Match Assessment to Learning


Another way to increase the effectiveness of your classroom assessments is to make sure that they
are firmly linked directly to your students’ learning material and experiences. If you have clear and
detailed student learning outcomes (SLOs) for your course, matching assessments with student
learning will probably be much easier. If not, then laying out the course syllabus, textbook, and les-
son plans directly related to each assessment can help you match your assessments to your students’
learning (Hall, Webber, Varley, Young, & Dorman, 1997).
The bottom line is that you should carefully examine what you are assessing each time you do it
and compare what you are measuring to what you are teaching. If there is one single rule that will
help you match assessment to learning it is: test what you teach. There is no reason to assess the stu-
dents differently from the way they learned the material. For example, if the students learned to use
functions like greetings, seeking information, and giving information in pair work, then it would make
most sense and be fairer to test them similarly in pairs rather than in some multiple-choice format.
Brown (2013a) makes the point that classroom assessments are really just a special case of classroom
activity or materials development. By special case, he was suggesting that assessment was distinct
from classroom activities and materials only in that assessment furnishes purposeful feedback, or:

a way of observing or scoring the students’ performances and giving feedback in the form of
a score or other information (e.g., notes in the margin, written prose reactions, oral critiques,
teacher conferences) that can enlighten the students and teachers about the effectiveness of the
language learning and teaching involved.
(p. x)
Aligning tests with student learning can help you justify your classroom assessment practices,
especially as they relate to program-level or externally mandated outcomes. Most teachers are well
aware of the standards for student learning that exist beyond their classrooms (e.g., state or nation-
wide outcomes, evidence for accreditation), and while there is typically a large degree of freedom in
how they do classroom assessment, teachers sometimes need to be able to defend their individual
practices for accountability purposes (Brindley, 1998, 2001; Norris, 2006). For example, the benefits
of continuous assessment in terms of flexibility and focus on learning can be difficult to reconcile
when program-level outcomes are expected to be judged primarily by summative measures. Teasdale
and Leung (2000) note that such approaches to assessment may reflect what is happening in the
classroom, but this does not necessarily mean that learners have mastered the more generalized skills
required by influences outside of the classroom. Teachers therefore need to pay attention to how their


assessments reflect what is happening in their classroom, but also to how their assessments generalize
to broader requirements.

Promote Learning with Assessment


Assessment is often an afterthought in a teacher’s busy schedule, and as a result, it may get short
shrift. The attitude that the course is ending, so I should write a final exam serves nobody well, least of
all the students and their learning. Instead, assessment should be viewed as an ongoing and planned
process within the curriculum rather than something relegated to the end. Assessment is just as
important as any other aspect of a teacher’s job, largely because the feedback provided by assessment
is such an integral part of gaining language knowledge and skills.
The term assessment for learning has been used by several authors (e.g., Black & William, 1998;
Klenowski, 2009), most often in the context of informal assessments, particularly formative assess-
ment (Rea-Dickins, 2001, p. 453). Rea-Dickins and others (e.g., Clarke, 1998; Sadler, 1989) note that
one of the benefits of promoting learning through assessment is that it can raise student awareness of
the purposes and uses of assessment. Teachers need to inform learners about not only how they will
be tested but also how their scores will be used, what kinds of feedback they will receive, and what
they can do with that feedback. Making learners part of the assessment process is essential to making
this happen (Brown, S., 2004), and this can occur only through careful planning.
Finally, teachers often overlook the fact that they can use assessment feedback to create positive
washback (i.e., the positive effects of assessment on any learning associated with it) by shaping the
students’ expectations and behaviors (Brown, 2004a; Clarke, 1998; Rea-Dickins, 2001, 2006). For
example, when the first author of this chapter was teaching in China in the 1980s, he found that
the students largely believed in grammar-translation methods because that was what they had been
exposed to. As a result, the students had excellent knowledge about English grammar and vocabulary,
but they could not do anything with the language. In response, he and his colleagues used assessment
(e.g., role-play and group assessments, classroom student presentations) and its washback effect to
help shift the students’ views on language learning toward communicative approaches—and for the
most part it worked.

Writing Items

Create the Best Possible Items


Creating assessment items is not an ability that people are born with. It takes time and practice, and
yet it is an important skill to learn because no classroom assessment can be any better than the items
upon which it is based. The first step in creating the best possible items is to start reading up on how
to write good-quality items, and much has been written about this process in the form of item writ-
ing guidelines (see Brown, 2005; Haladyna, Downing, & Rodriguez, 2002; Kline, 1986).
Another sound approach is through the use of item specifications (Brown & Hudson, 2002;
Davidson & Lynch, 2002). Item specifications are a way of defining each part of an item by describing
the specific attributes of the prompt (i.e., what the student is given, what the items look like) and the
response (i.e., how the items are scored, how performance indicators are described, categorized, and
organized). They might also include a general description or sample items (Davidson & Lynch, 2002;
Popham, 1981) or more detailed information like timing and definitions for what is being measured
(Bachman & Palmer, 1996). When several types of items or performances are used within a single
assessment, it may be necessary to also create test specifications (Brown & Hudson, 2002; Norris et al.,
1998), which specify the broader test organization, item types, etc. Both of these processes work best


when they are produced by a team of teachers working together (Davidson & Lynch, 2002) or when
involving a peer review process (Brown & Hudson, 2002).

Write Enough Items


When it comes to actually writing the items, you should make sure you allot plenty of time, not only
so you can write good-quality items but also so you will have time to write a sufficient number of
items. It has long been recognized by measurement specialists that more items provide better infor-
mation than fewer items. As a rule, then, using more items to test a single concept or skill is a better
idea than using fewer items. Indeed, if you write too few items, you may not get a good assessment
of how your students can perform.
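One standard psychometric way to see why more items help, expressed in the measurement literature through the Spearman-Brown prophecy formula (a general result, not a procedure specific to this chapter), is sketched below with a hypothetical 10-item quiz.

```python
# Spearman-Brown prophecy formula: projected reliability when a test is made
# length_factor times as long with comparable items. Starting reliability and
# item counts below are hypothetical.
def spearman_brown(reliability: float, length_factor: float) -> float:
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current = 0.60  # assumed reliability of a 10-item quiz
for factor, n_items in [(1, 10), (2, 20), (3, 30)]:
    print(f"{n_items} items -> projected reliability {spearman_brown(current, factor):.2f}")
```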
In addition, experienced test writers know to always write more items than they will ultimately
need so they can later get rid of those items that do not function well. This is certainly the case in
standardized tests (as described in Purpura, Brown, & Schoonen, 2015, pp. 61–62), but it holds true
in classroom tests as well (Brown & Hudson, 1998).
But, how many items are enough? To answer this question for any particular assessment, teachers
should consider the following context-specific factors: (a) the number of SLOs to be assessed, (b) the
number of students in the class (especially, in performance assessment contexts, where students need
to be tested individually), (c) the time allotted for the test, (d) the resources available (e.g., technol-
ogy, authentic materials), and (e) the item types used.

Check the Items Again


As important as it is to write good items, it is just as crucial to check that those items ended up the
way you intended them to be in terms of their function, format, and appearance. This means that
you need to do some very careful item proofreading. If you find problems, you can then eliminate
particularly bad items or rewrite those that can be saved. As with any proofreading step, it is good
practice to allow some time to pass (whether that be a few days or a weekend), and then proofread
the items again. Doing so may help you to see the items with fresh eyes. It can even change your
frame of reference from that of a teacher to a test taker, which is very useful indeed.

Compiling the Test

Organize the Items


A collection of items does not a test make; that is, creating a test is much more than simply
putting the items together. Among other things, the items must be organized logically so that
students will not get confused. For example, when different item formats are used in the same
test (e.g., true/false, short-answer), the simplest approach may be to group items with the same
format together, say true/false with true/false, short-answer with short-answer, etc. Alternatively,
it may make more sense to group items that are testing the same language point (e.g., preposi-
tion items, greetings) or skill (e.g., listening, grammar) together. Additionally, teachers want to
be careful to group items related to a particular listening or reading passage together. The trick is
to make all of these organizational ideas happen at the same time, which is where the art of test
development comes in.
Item independence is also a consideration when organizing items and tasks into a test. While
writing items with different formats or tapping into different topics, language points, or skills, unin-
tentional clues may surface in the wording of one item that helps students answer another item (e.g.,


Yen, 1993). Another potential problem is that some students' answers to one item may themselves provide clues that help those students answer another item.
Organizing performance items is generally simpler, especially when there are only one or two
items or tasks for the entire test. Nonetheless, item independence remains an issue. In a situation
where scoring is based on a rubric with multiple categories, it is important that each of the rubric
categories be independent from the others. For example, Janssen, Meier, and Trace (2015) found
that two of the categories on their writing rubric, language use and vocabulary, were difficult to
distinguish when assigning scores. This created potential fairness and validity problems for their test
because some students were potentially being rewarded (or penalized) twice for the same kinds of
performance.

Create Scoring Tools


In many situations, teachers will need to rely on some kind of scoring tool like a rubric or checklist.
Having such a tool and knowing how to use it are quite different things, and there is an extensive
body of research on the importance of training raters to use rubrics in large-scale assessment (e.g.,
Lumley & McNamara, 1995; Weigle, 1994). While classroom teachers are usually more familiar than raters in large-scale assessments with what they are assessing and with the performance indicators on their rubrics, it is still important to
familiarize or re-familiarize yourself with the procedures, recording equipment, and rubric before
you begin testing. If sample performances are available (e.g., essay or video recorded samples from
a previous class or another teacher, or even alternatives like TED.com video presentations, textbook
scripted dialogues, or performances by willing colleagues or former students), then you might want
to sit down ahead of time and practice scoring some of those performances.

Proofread the Complete Test


Once a complete quiz or test is assembled, proofreading the entire document is essential. All items,
rubrics, materials, and anything else that is presented to students need to be carefully checked for
misspellings, formatting errors, formatting inconsistencies, inadvertent clues, etc. As with proof-
reading items, putting some time between when the test is first compiled and when it is reviewed
can help in detecting subtle errors or typos. One other useful strategy is to put the test on the floor
and literally stand above it and look down at it. Since you will probably not be able to read the text,
you will tend to address more global formatting questions: Are any items split across pages? Is each
reading passage visible while answering the items related to it? Are there directions at the beginning
of each set of items?
Another sound practice is to sit down and take the test yourself, because this can provide you with
insights into what your students will encounter, give you a sense of how long the test will take, and
show you how the items flow from one to the next.

Using the Test

Plan the Test Administration


As mentioned earlier, one important component of classroom assessment is making sure that learn-
ers are aware that they are being assessed (Clarke, 1998; Hill & McNamara, 2012; Rea-Dickins, 2001).
In most formal assessment situations, this issue is addressed explicitly by scheduling a special time
for the test and announcing it ahead of time. However, awareness raising can also be related to how
teachers approach the actual administration of a test.
One important consideration is the atmosphere of the room. Students are often anxious before
and during assessment, so the teacher should take care to set a calm and professional tone for the
entire process. Classroom teachers can also take advantage of being in familiar surroundings during
their assessments, and, since assessments should probably reflect what regularly happens in the class-
room anyway, teachers should aim to make them as much like regular classroom activities as possible.
During the class before a test, you might ask students if they have any questions about it. Dis-
tributing a study guide—or even more effective, the actual scoring rubric—is an excellent way to
establish students’ expectations and to clarify any areas that might be ambiguous—either within the
rubric/study guide itself or in the students’ understanding of the material that you taught.
During the administration, be careful to treat everyone exactly the same (to promote fairness).
This may mean thinking about issues that do not ordinarily cross your mind, such as answering all ques-
tions that students have (no matter how odd they may seem at the time), answering them in the
order that students raised their hands, collecting the papers at the end of the test in the same order
they were passed out, and not giving any students extra help or extra time during the test. All in all,
students should feel that they have been respected and that they were all treated fairly.
After the test, it may also be helpful to spend some time in class giving students feedback and dis-
cussing how the test went. This session can be used to help learners reflect on their own experiences
but also to provide the teacher with valuable feedback about the administration, item quality, or scor-
ing issues that may otherwise be missed. While such sessions can descend into student complaining, ideally, given the amount of planning that has gone into the assessment, complaints will be few and far between; at the very least, if complaints do arise, you will be prepared with the information you need to justify the effectiveness of the test (Popham, 2010). If used properly, talking with students
about the test can be used in a democratic approach (Shohamy, 2001), which will make assessment
less one-sided and more of a collective and collaborative effort.

Give Students Feedback


Clearly then, feedback to the learners about their performances is one of the most important func-
tions of assessment. Whether in the form of numerical scores, oral comments in student-teacher
conferences, or corrective recasts, feedback is crucial for effective learning as a way of showing learn-
ers where they stand in their language abilities, informing them of their strengths and weaknesses
vis-à-vis specific learning points, or giving them a sense of what steps they (and their teacher) can
take to reinforce and promote their language development.
Feedback in assessment is most often associated with scores of some kind (e.g., percentages, rat-
ings on a scale), which indicate a particular level of performance or amount learned in a set of lan-
guage materials or classroom activities. Handing back a test with nothing more than a score on it is
certainly efficient, but it assumes that the students understand what their scores mean. While it may
(or may not) be self-evident to teachers what a particular score or rating indicates, it may be neces-
sary to explain to students what their scores mean and why they received the score they did.
While reporting percentage scores does show students the proportion of material they learned, it
remains a very limited approach—one that seems to waste the potential for rich, detailed, and con-
structive feedback that the classroom context provides. Hill and McNamara (2012) explain a number
of different types of written or oral feedback that can be used by classroom teachers in three catego-
ries: (a) confirmatory, (b) explanatory, and (c) corrective feedback (pp. 406–408). Numerical scores
generally provide only the first type because the feedback is designed to either confirm or reject the
correctness of the answers or performances. The other two categories—explanatory and corrective—
are more forward-looking approaches tied to student development. The explanatory approach uses
positive feedback as a way of highlighting specific examples of success, while the corrective approach
tends to be directed at where learners fell short of expectations. (For more on the uses, implications,
and outcomes of feedback in classroom contexts, see Hyland & Hyland, 2006; Lightbown & Spada,
1990; Lyster & Ranta, 1997.)
More qualitative approaches to feedback that incorporate explanatory or corrective functions can
evolve into a form of personalized, supplementary instruction given on a student-by-student basis.
Conferences, written feedback (e.g., corrections, leading questions), and other forms of interactive
feedback can be used to identify difficulties, gauge and adjust learning strategies, and identify miscon-
ceptions that students may have about their particular strengths or weaknesses (Doe, 2011; Fox & Hart-
wick, 2011). Rubrics can be useful for these purposes (e.g., Stevens & Levi, 2005), as well as to clearly
define the specific kinds of performances that are expected of students and match numerical scores to
feedback in the form of qualitative descriptions (in words) of performances (e.g., Crusan, 2010).
Note that feedback is not necessarily limited to something that happens after the test. The simple
process of distributing rubrics prior to an assessment can have the benefit of providing learners with
expectations of performance (Torrance & Pryor, 1998; Tunstall & Gipps, 1996). Combined with
feedback after the assessment, learners can then evaluate how well their particular learning strategies
worked in preparing them for the test, which in turn can help them determine whether or not to
continue using those strategies.

Use the Feedback Yourself


Teachers should also consider using the feedback gained from assessment for their own purposes in
making decisions, refining curriculum, and promoting learning within their classes. Most teachers
already keep track of scores for record-keeping and grading purposes, and they ideally have some
sense of how those scores match up with the specific SLOs of the course. Assessment can also be used
to help teachers adjust their instruction and learning goals based on how students are progressing
through a course—especially if diagnostic or continuous assessments are being used. Treating learn-
ers like individuals means recognizing that they will have different starting points when they enter
the class and will likewise progress differently through the material, sometimes contrary to expecta-
tions. Teachers should therefore consider adjusting their teaching to reflect the changing needs of
the learners based on their performances on assessments. For example, in the case of diagnostic tests
given at the beginning of the term, the results can show you (and the students) what they already
know or can do (and so do not need to be taught) as well as any potential and unexpected gaps
in their language abilities that do need to be addressed. However, effective use of such diagnostic
information requires that the curriculum be structured flexibly in a way that allows for changes to be
quickly implemented (Rea-Dickins, 2006; Xu & Liu, 2009).
Three other feedback strategies may prove useful as well:

1. Sharing descriptive statistics, such as the average and the range of scores for the class's overall performance, can help students understand their performances (see Brown, 2009, and the short sketch following this list). However,
you should emphasize that scores on classroom tests need to be interpreted on an individual
basis (i.e., that you are not interested in ranking student performances, see Popham, 1978).
Nonetheless, showing students how all the scores fit together can be a useful way of framing
discussions of performance, fairness, and values in assessment.
2. Reviewing any sections of the test that seemed to be consistently problematic for a majority
of students.
3. Providing students with a longitudinal sense of their learning by having them record and
track their scores over time such that each student (and the teacher) can refer to them when-
ever needed.
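As a concrete illustration of the first strategy above, the short Python sketch below computes a few descriptive statistics for a hypothetical set of classroom scores; Brown (2009) describes doing the same kind of work in a spreadsheet, so the code is simply one possible alternative.

# Illustrative sketch: descriptive statistics for a hypothetical set of classroom test scores.
from statistics import mean, median, stdev

scores = [42, 45, 38, 47, 40, 44, 36, 48, 43, 41]   # hypothetical scores out of 50

print(f"n       = {len(scores)}")
print(f"mean    = {mean(scores):.1f}")
print(f"median  = {median(scores):.1f}")
print(f"range   = {min(scores)}-{max(scores)}")
print(f"std dev = {stdev(scores):.1f}")
# Shared with the class, figures like these frame a discussion of overall performance
# without ranking individual students.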

Improving the Test

Analyze the Items


Even after a test is administered and the grades or feedback are given to the students, the assessment
process should not be considered finished. It should instead be viewed as an iterative process, because
the information made available from using an assessment can reveal: (a) how well it functioned,
(b) ways to improve the test, (c) how score interpretations can be used to better promote learning,
(d) how well the individual items functioned, (e) how well the test met expectations (e.g., were the
items or tasks more or less challenging than expected; did the tasks elicit the kinds of language skills
they were intended to), (f) how useful and reliable the feedback was, and (g) what problems arose
during the administration (e.g., was the audio recording too soft; was the test rushed?).
While different assessment types require different methods of improvement, some strategies
are common to almost any classroom assessment. The first step is to simply reflect on the process,
perhaps even putting your thoughts into writing (especially on the issues raised in (a) to (g) in
the previous paragraph) immediately after administering, scoring, and returning the tests to the
students. A second useful step is to have your learners also reflect on their experiences with the
assessment procedure overall and with specific items, as well as on what their own expectations and
experiences were before and after the test. Third, keeping a written list of the questions students had
during or after the test can help identify places of apparent confusion. And finally, formal methods
also exist for analyzing classroom assessment items statistically (e.g., checking descriptive statistics,
checking item statistics like the difference index and B index; for more on these statistics, see Brown,
2003, 2005). We recognize that such quantitative analyses are often impractical for busy teachers,
even though such analyses would provide useful information. However, making revisions to the
test items, to the administration procedures, and to the scoring methods while all of this informa-
tion is still fresh in your mind is possible and will prove well worth the effort. Sadly, if you do not
revise until the next time you need to use the test, you may find yourself making exactly the same
mistakes again.
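For teachers who do want to try the statistical route, the Python sketch below shows the commonly used formulations of the two criterion-referenced item statistics mentioned above (see Brown, 2003, 2005 for the full treatment); the data, function names, and cut score are all hypothetical.

# Hedged sketch of two criterion-referenced item statistics in their commonly used
# formulations (see Brown, 2003, 2005). All data, names, and the cut score are hypothetical.
def item_facility(answers):
    """Proportion of students answering an item correctly (1 = correct, 0 = incorrect)."""
    return sum(answers) / len(answers)

def difference_index(post_answers, pre_answers):
    """Item facility on the posttest minus item facility on the pretest."""
    return item_facility(post_answers) - item_facility(pre_answers)

def b_index(answers, total_scores, cut_score):
    """Item facility for students at or above the cut score minus facility for those below it."""
    passers = [a for a, total in zip(answers, total_scores) if total >= cut_score]
    failers = [a for a, total in zip(answers, total_scores) if total < cut_score]
    if not passers or not failers:       # undefined if everyone passed or everyone failed
        return None
    return item_facility(passers) - item_facility(failers)

# Hypothetical responses to one item and total test scores for ten students (20-item test)
item_answers = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
total_scores = [18, 17, 9, 16, 19, 11, 15, 8, 17, 20]
print("B-index:", b_index(item_answers, total_scores, cut_score=14))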

Check Reliability
Regardless of the type(s) of tests involved, you need to ensure that your assessments are produc-
ing scores that are reliable and valid for your purposes (Brown & Hudson, 1998). Reliability is the
degree to which scores on a test are consistent, which in classroom contexts typically means how
well a test consistently identifies learners who have mastered (or not mastered) the content or skills
being assessed. One formal approach to checking reliability is to calculate a K-R21 reliability coefficient for any items that are scored right or wrong, which produces a value indicating the proportion of reliable variation in your scores (see Brown, 2005, 2012). However, again, we
recognize that mathematically determining test reliability may seem challenging and unreasonably
time-consuming for some.
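For those who do want to try it, the calculation itself is short. The sketch below implements the standard K-R21 formula for a hypothetical 20-item test, using the population variance of the total scores (see Brown, 2005, 2012 for interpretation); the scores are invented.

# Hedged sketch of the standard K-R21 formula for dichotomously scored items
# (see Brown, 2005, 2012 for interpretation). The total scores below are hypothetical.
from statistics import mean, pvariance

def kr21(total_scores, k):
    """K-R21 = (k / (k - 1)) * (1 - M * (k - M) / (k * variance)), with M the mean total score."""
    m = mean(total_scores)
    variance = pvariance(total_scores)   # variance of the total scores (n in the denominator)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * variance))

scores = [14, 17, 12, 18, 15, 11, 16, 19, 13, 17]   # hypothetical totals on a 20-item test
print(f"K-R21 = {kr21(scores, k=20):.2f}")

Values closer to 1.00 indicate more consistent scores; the invented data above yield roughly .45, which would suggest that the test needs more, or better functioning, items.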
In addition, Kane et al. (1999) note that there is often a conflict between authenticity and stan-
dardization in performance assessment. Following Fitzpatrick and Morrison (1971), they refer to the
problem that real-life language use contains a high degree of variability, yet for reliability purposes
teachers often want to set limits on how a performance should be carried out. The authors argue that
when too many restrictions are placed on a task in the name of reliability, it becomes more difficult
to equate performance under these reduced conditions to performance in real life. In most forms
of assessment other than selected-response items, reliability and authenticity seem to be at odds
because of the need to measure students individually, but according to standards that can be judged
as fair and consistent for the entire class (Brown, H. D., 2003).
Teachers can take several paths to achieve a good balance between authenticity and reliability in the classroom, many of which will come quite naturally. Kane et al. (1999) first suggested
that one of the keys to creating consistency in assessment is to start with how the test is adminis-
tered and presented to the learners. Are the instructions clear and error-free? Is the language of the
instructions appropriate for the level of the learners so that there is no confusion about what they
are required to do? Are the directions or components of the task similar to what the students have
already experienced in their regular classroom activities? Several strategies may help teachers accom-
plish these goals.
First, good directions will help in establishing consistency: make sure that the directions stand
out (i.e., are in bold type or italicized), are clearly written, and are delivered to all of the students in
the same way.
Second, planning or scripting interactions ahead of time can also be an effective way to approach
assessments that involve spontaneous language use between a teacher and a student (e.g., confer-
ences, role-plays). In these cases, even when teachers have a rubric prepared for scoring purposes, being part of the assessment process itself makes it challenging to keep track of everything that is going on while also ensuring consistency and fairness (Rea-Dickins, 2001); nonetheless, every practical effort must be made to do so.
Third, in less formalized assessments, teachers often need to establish a protocol for keeping track
of student performances. By its very definition, formative assessment should be “a planned process”
(Popham, 2008, p. 6). Thus, teachers should consider establishing protocols for gathering informa-
tion about how well students are following and applying their classroom experiences in a way that
assesses all students consistently.
Fourth, as mentioned above, rubrics are a useful way of scoring productive language tasks and of providing feedback, and both of these uses are actually ways of increasing the consistency of
measurement and ensuring that all learners are graded in the same way based on the same criteria.
However, even the best rubrics are subject to potential bias on the part of whoever is using them, so
it may be useful to employ multiple raters as a way of improving (or at least checking) the reliability
of the resulting scores (Trace, Janssen, & Meier, 2015). Since convincing another teacher to help
assess your students may prove impossible, having students do peer assessments (e.g., during class
presentations) may be a valuable way of checking the reliability of their scoring and yours. While you
would be wise not to rely completely on student scoring, checking the consistency of students’ scores
with each other and on average with your scores can be very enlightening.
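One simple way to operationalize that check is to correlate the peer averages with your own ratings, as in the hypothetical Python sketch below (all ratings are invented, and statistics.correlation requires Python 3.10 or later).

# Hedged sketch: comparing peer-assessment averages with teacher ratings for the same
# performances using a Pearson correlation. All ratings are invented for illustration.
# Note: statistics.correlation requires Python 3.10 or later.
from statistics import mean, correlation

teacher_ratings = [4, 3, 5, 2, 4, 3, 5, 4]          # teacher's rubric scores (1-5 scale)
peer_ratings = [                                     # each row: three peers' scores for one student
    [4, 5, 4], [3, 3, 4], [5, 5, 5], [2, 3, 2],
    [4, 4, 5], [3, 2, 3], [5, 4, 5], [4, 4, 4],
]
peer_means = [mean(row) for row in peer_ratings]

r = correlation(teacher_ratings, peer_means)
print(f"Teacher vs. peer-average correlation: r = {r:.2f}")
# A high positive r suggests that peers and teacher rank the performances consistently;
# a low r flags rubric categories or criteria that may need clarification before reuse.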

Check Validity
Unlike reliability, the validity of your scores has to do with the degree to which you are measuring
what you think you are measuring and, especially in classroom assessment, the degree to which what
you are measuring is related to what the students are learning. This brings our chapter full circle:
recall that, early on, we argued that you should plan your assessments so that they will match and
promote learning. Those issues are clearly still important when thinking about the validity of the
scores you derive from assessments and the decisions you base on those scores.
It is also crucial to examine the degree to which you are testing specifically what you think you are
testing. For example, if you are testing listening with a video-recorded lecture that the students use to
answer written multiple-choice questions, you must realize that you are assessing not only listening
but also reading (and probably also their testwiseness, their ability to guess, etc.). If you want to test
listening ability pure and simple, then perhaps you should think about how you are teaching it. If in
fact you are teaching listening through a series of lecture listening/multiple-choice activities, then
at least your assessments should match what students are learning. However, if you are teaching the
students to listen to sets of directions in order to draw routes on a map, then you should probably
test the students in that same way.
To address the overarching concerns of validity, teachers can try answering eight relatively straight-
forward questions (expanded here from Brown, 2012, p. 106):

1. How well do the content and format of my items match the SLOs of the class?
2. To what degree do the content and format of my items match the material I covered in class?
3. Will my students think my test items match the SLOs and material I am teaching them?
4. How well do my course SLOs meet the students’ needs?
5. To what degree do my test scores show that my students are learning something in my course?
6. How well are my assessments promoting learning in my course?
7. How do the values that underlie my test scores match my values? My students’ values? Their
parents’ values? My employer’s values?
8. What are the consequences of the decisions I base on my test scores for my students, their
parents, me, my employer?

Your responses to these questions are unlikely to be black-and-white, yes-or-no answers because
such issues are often matters of degree, and even a more-or-less positive answer to any one of these
questions does not indicate that your uses of the scores are valid. However, taken together, if you can
answer more positively than negatively to those questions that are most germane to your teaching sit-
uation, then you can argue to yourself, to your students, to their parents, and even to your employer
that the scores, feedback, and decisions that you derive from your assessments are likely valid.

Conclusions
By now, it should be obvious that we feel that sound classroom assessment is much more than scores and grades; rather, it is a tool that teachers can use to support and promote their students'
learning. Thus, while it is relatively simple to measure someone’s use of language in a purely static
assessment, we think that teachers should seriously consider ways to use their assessment items and
tasks to affect their students’ future performances and displays of language ability (Poehner, 2007).
To extend the tension between authenticity and standardization pointed out by Kane et al. (1999), it
is fine to give learners a task that is very true to real life and matches actual language use perfectly, but
if this same task cannot inform you as a teacher and, more importantly, inform your learners about
how they used the language, where their particular strengths and weaknesses lie, and what the next
step is in their development, then authenticity means very little when it comes to learning. Instead,
we believe that assessments should provide teachers with the connections they need among their
SLOs, materials, and instruction to maximize not only their students’ assessment performances but
also their learning and development. We believe that this is true of all classroom assessments, not just
those labeled as personalized- and individualized-response assessments.
We also believe that learning through assessment can only happen when the learners are treated
as part of the process, and not just the object of assessment (Popham, 2008). As Rea-Dickins (2006)
points out, this happens through building awareness with your students about the purposes, uses, and
outcomes of your tests. Building such awareness can take the forms of negotiating and consulting with
students about how they are assessed, acknowledging and reflecting their language needs and values in
assessment, and making assessment about them rather than about a score that perhaps has little or no
direct meaning otherwise. Without involving the learners, assessment risks becoming a meaningless,
isolated, and going-through-the-motions kind of process—for students and teachers alike.

Note
1. Here an item will be defined as the smallest unit of a classroom assessment procedure that produces information in the
form of a score or other feedback.

References
Anton, M. (2009). Dynamic assessment of advanced second language learners. Foreign Language Annals, 42(3), 576–598.
Bachman, L., & Palmer, A. (1996). Language testing in practice. New York: Oxford University.
Bachman, L., & Palmer, A. (2010). Language assessment in practice. New York: Oxford University.
Black, P., & Wiliam, D. (1998). Assessment and classroom learning. Assessment in Education, 5(1), 7–74.
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and
Accountability, 21(1), 5–31.
Brindley, G. (1998). Outcomes-based assessment and reporting in language learning programmes: A review of the issues.
Language Testing, 15(1), 45–85.
Brindley, G. (2001). Outcomes-based assessment in practice: Some examples and emerging insights. Language Testing, 18(4),
393–407. doi:10.1177/026553220101800405
Brookhart, S. M. (1999). The art and science of classroom assessment: The missing part of pedagogy. ASHE-ERIC Higher
Education Report (Vol. 27, No. 1). Washington, DC: The George Washington University, Graduate School of Education
and Human Development.
Brown, H. D. (2003). Language assessment: Principles and classroom practices. White Plains, NY: Pearson Longman.
Brown, J. D. (Ed.). (1998). New ways of classroom assessment. Alexandria, VA: Teachers of English to Speakers of Other Lan-
guages.
Brown, J. D. (2003). Questions and answers about language testing statistics: Criterion-referenced item analysis (The dif-
ference index and B-index). SHIKEN: The JALT Testing & Evaluation SIG Newsletter, 7(3), 18–24. Accessed online
September 29, 2013 at http://jalt.org/test/bro_18.htm
Brown, J. D. (2004a). Grade inflation, standardized tests, and the case for on-campus language testing. In D. Douglas (Ed.),
English language testing in U.S. colleges and universities (2nd ed., completely new) (pp. 37–56). Washington, DC: NAFSA.
Brown, J. D. (2004b). Performance assessment: Existing literature and directions for research. Second Language Studies, 22(2),
91–139.
Brown, J. D. (2005). Testing in language programs: A comprehensive guide to English language assessment (New edition). New
York: McGraw-Hill.
Brown, J. D. (2009). Using a spreadsheet program to record, organize, analyze, and understand your classroom assessments. In
C. Coombe, P. Davidson, & D. Lloyd (Eds.), The fundamentals of language assessment: A practical guide for teachers
(2nd edition) (pp. 59–70). Dubai, UAE: TESOL Arabia.
Brown, J. D. (2012). What teachers need to know about test analysis. In C. Coombe, S. J. Stoynoff, P. Davidson, & B. O’Sullivan
(Eds.), The Cambridge guide to language assessment (pp. 105–112). Cambridge: Cambridge University.
Brown, J. D. (Ed.). (2013a). New ways of classroom assessment, revised. Alexandria, VA: Teachers of English to Speakers of Other
Languages.
Brown, J. D. (2013b). Statistics corner. Questions and answers about language testing statistics: Solutions to problems teachers
have with classroom testing. Shiken Research Bulletin, 17(2), 27–33. Available at http://teval.jalt.org/sites/teval.jalt.org/
files/SRB-17–2-Full.pdf
Brown, J. D. (In press). Assessments in ELT: Teacher options and making pedagogically sound choices. In W. A. Renandya & H.
P. Widodo (Eds.), English language teaching today: Building a closer link between theory and practice. New York: Springer
International.
Brown, J. D., & Hudson, T. (1998). Alternatives in language assessment. TESOL Quarterly, 32(4), 653–675.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing. Cambridge: Cambridge University.
Brown, J. D., Hudson, T., Norris, J. M., & Bonk, W. J. (2002). An investigation of second language task-based performance assess-
ments. Honolulu, HI: University of Hawaii at Manoa National Foreign Language Resource Center.
Brown, S. (2004). Assessment for learning. Learning and Teaching in Higher Education, 1(1), 81–89.
Buck, G. (2001). Assessing listening. Cambridge: Cambridge University.
Butler, Y. G., & Lee, J. (2010). The effects of self-assessment among young learners of English. Language Testing, 27(1), 5–31.
Carless, D. (2005). Prospects for the implementation of assessment for learning. Assessment in Education: Principles, Policy &
Practice, 12(1), 39–54.
Clapham, C. (2000). Assessment and testing. Annual Review of Applied Linguistics, 20, 147–161.
Clarke, S. (1998). Targeting assessment in the primary classroom. Bristol, UK: Hodder & Stoughton.
Colby-Kelly, C., & Turner, C. E. (2007). AFL research in the L2 classroom and evidence of usefulness: Taking formative assess-
ment to the next level. Canadian Modern Language Review, 64(1), 9–37.
Cresswell, A. (2000). The role of portfolios in the assessment of student writing on an EAP course. In G. Blue, J. Milton, &
J. Saville (Eds.), Assessing English for academic purposes (pp. 205–220). Oxford, UK: Peter Lang.
Crusan, D. (2010). Assessment in the second language writing classroom. Ann Arbor, MI: The University of Michigan.
Davidson, F., & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. New Haven,
CT: Yale University.
Davis, L., & Kondo-Brown, K. (2012). Assessing student performance: Types and uses of rubrics. In J. D. Brown (Ed.), Develop-
ing, using, and analyzing rubrics in language assessment with case studies in Asian-Pacific languages (pp. 33–56). Honolulu,
HI: University of Hawaii at Manoa National Foreign Language Resource Center.
Davison, C., & Leung, C. (2009). Current issues in English language teacher-based assessment. TESOL Quarterly, 43(3), 393–415.
Dippold, D. (2009). Peer feedback through blogs: Student and teacher perceptions in an advanced German class. ReCALL,
21(1), 18–36.
Doe, C. (2011). The integrations of diagnostic assessment into classroom instruction. In D. Tsagari & I. Csepes (Eds.),
Classroom-based language assessment (pp. 63–76). Frankfurt am Main, Germany: Peter Lang.
Fitzpatrick, R., & Morrison, E. J. (1971). Performance and product evaluation. In R. L. Thorndike (Ed.), Educational measure-
ment (2nd ed.) (pp. 237–270). Washington, DC: American Council on Education.
Fox, J., & Hartwick, P. (2011). Taking a diagnostic turn: Reinventing the portfolio in EAP instruction. In D. Tsagari & I. Csepes
(Eds.), Classroom-based language assessment (pp. 47–62). Frankfurt am Main, Germany: Peter Lang.
Graves, D. (1992). Portfolios: Keep a good idea growing. In D. Graves & B. Sunstein (Eds.), Portfolio portraits (pp. 1–12).
Portsmouth, NH: Heinemann.
Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for class-
room assessment. Applied Measurement in Education, 15(3), 309–333.
Hall, K., Webber, B., Varley, S., Young, V., & Dorman, P. (1997). A study of teacher assessment at key stage 1. Cambridge Journal
of Education, 27(1), 107–122.
Hamp-Lyons, L., & Tavares, N. (2011). Interactive assessment: A dialogic and collaborative approach to assessing learners’
oral language. In D. Tsagari & I. Csepes (Eds.), Classroom-based language assessment (pp. 29–46). Frankfurt am Main,
Germany: Peter Lang.
Heritage, M. (Ed.). (2010). Formative assessment: Making it happen in the classroom. Thousand Oaks, CA: Corwin.
Hill, K., & McNamara, T. (2012). Developing a comprehensive, empirically based research framework for classroom-based
assessment. Language Testing, 29(3), 395–420. doi:10.1177/0265532211428317
Hirvela, A., & Pierson, H. (2000). Portfolios: Vehicles for authentic self-assessment. In G. Ekbatani & H. Pierson (Eds.), Learner-
directed assessment in ESL (pp. 105–126). Mahwah, NJ: Lawrence Erlbaum Associates.
Hyland, K., & Hyland, F. (2006). Feedback on second language students’ writing. Language Teaching, 39(2), 83–101.
Jang, E. E., Dunlop, M., Park, G., & van der Boom, E. H. (2015). How do young students with different profiles of reading
skill mastery, perceived ability, and goal orientation respond to holistic diagnostic feedback? Language Testing, 32(3),
359–383. doi:10.1177/0265532215570924
Janssen, G., Meier, V., & Trace, J. (2015). Building a better rubric: Towards a more robust description of academic writing
proficiency. Assessing Writing. Available at http://dx.doi.org/10.1016/j.asw.2015.07.002
Kane, M., Crooks, T., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice,
18(2), 5–17.
Klenowski, V. (2009). Assessment for learning revisited: An Asian-Pacific perspective. Assessment in Education: Principles,
Policy, and Practice, 16(3), 263–268.
Kline, P. (1986). A handbook of test construction: Introduction to psychometric design. London: Methuen.
Leung, C. (2004). Developing formative teacher assessment: Knowledge, practice, and change. Language Assessment Quarterly,
1(1), 19–41. doi:10.1207/s15434311laq0101_3
Leung, C., & Mohan, B. (2004). Teacher formative assessment and talk in classroom contexts: Assessment as discourse and
assessment of discourse. Language Testing, 21(3), 335–359. doi:10.1191/0265532204lt287oa
Lightbown, P. M., & Spada, N. (1990). Focus-on-form and corrective feedback in communicative language teaching. Studies
in Second Language Acquisition, 12(4), 429–448.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1),
54–71.
Lyster, R., & Ranta, L. (1997). Corrective feedback and learner uptake. Studies in Second Language Acquisition, 19(1), 37–66.
Matsuno, S. (2009). Self-, peer-, and teacher-assessments in Japanese university EFL writing classrooms. Language Testing,
26(1), 75–100.
McNamara, T. (2001). Language assessment as social practice: Challenges for research. Language Testing, 18(4), 333–349.
doi:10.1177/026553220101800402
Norris, J. M. (2006). The why (and how) of assessing student learning outcomes in college foreign language programs. Modern
Language Journal, 90(4), 576–583.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998). Designing second language performance assessments. Honolulu,
HI: University of Hawaii at Manoa National Foreign Language Resource Center.
Poehner, M. E. (2007). Beyond the test: L2 dynamic assessment and the transcendence of mediated learning. Modern Language
Journal, 91(3), 323–340.
Poehner, M. E., & Lantolf, J. P. (2005). Dynamic assessment in the language classroom. Language Teaching Research, 9(3),
233–265. doi:10.1191/1362168805lr166oa
Popham, W. J. (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W. J. (1981). Modern educational measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W. J. (2008). Transformative assessment. Alexandria, VA: Association for Supervision and Curriculum Development.
Popham, W. J. (2010). Classroom assessment: What teachers need to know (6th ed.). Boston, MA: Pearson.
Purpura, J. E., Brown, J. D., & Schoonen, R. (2015). Improving the validity of quantitative measures in applied linguistics
research. Language Learning, 65(S1), 37–75.
Rea-Dickins, P. (2001). Mirror, mirror on the wall: Identifying processes of classroom assessment. Language Testing, 18(4),
429–462. doi:10.1177/026553220101800407
Rea-Dickins, P. (2006). Currents and eddies in the discourse of assessment: A learning-focused interpretation. International
Journal of Applied Linguistics, 16(2), 163–188.
Richards, J. C., & Schmidt, R. W. (2010). Longman dictionary of language teaching and applied linguistics (4th ed.). London:
Longman Pearson.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educa-
tional Measurement: Issues and Practice, 24(2), 3–13.
Ross, J. A. (2006). The reliability, validity, and utility of self-assessment. Practical Assessment Research & Evaluation, 11(10).
Available at http://pareonline.net/getvn.asp?v=11&n=10
Sadler, D. R. (1989). Formative assessment and the design of instructional systems. Instructional Science, 18(2), 119–144.
Shohamy, E. (1995). Performance assessment in language testing. Annual Review of Applied Linguistics, 15, 188–211.
Shohamy, E. (2001). Democratic assessment as an alternative. Language Testing, 18(4), 373–391. doi:10.1177/026553220101800404
Stefanakis, E. H., & Meier, D. (2010). Differentiated assessment: How to assess the learning potential of every student (Grades
6–12). San Francisco: Jossey-Bass.
Stevens, D., & Levi, A. J. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback, and
promote student learning. Sterling, VA: Stylus.
Stiggins, R. J. (2002). Assessment crisis: The absence of assessment for learning. Phi Delta Kappan, 83(10), 758–765.
Teasdale, A., & Leung, C. (2000). Teacher assessment and psychometric theory: A case of paradigm crossing? Language Testing,
17(2), 163–184.
Torrance, H., & Pryor, J. (1998). Investigating formative assessment: Teaching, learning and assessment in the classroom. Buck-
ingham, UK: Open University.
Trace, J., Janssen, G., & Meier, V. (2015). Measuring the impact of rater negotiation in writing performance assessment. Lan-
guage Testing. doi:10.1177/0265532215594830
Tunstall, P., & Gipps, C. (1996). ‘How does your teacher help you to make your work better?’ Children’s understanding of
formative assessment. The Curriculum Journal, 7(2), 185–203.
Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197–223.
Wiggins, G. (1993). Assessing student performance: Exploring the purpose and limits of testing. San Francisco, CA: Jossey-Bass.
Wiggins, G. (1998). Letter to editor. Educational Researcher, 27(6), 20–22.
Xu, Y., & Liu, Y. (2009). Teacher assessment knowledge and practice: A narrative inquiry of Chinese college EFL teacher’s
experiences. TESOL Quarterly, 43(3), 493–513.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational
Measurement, 30(3), 187–213.