Chapter 4 - Standardized Testing

STANDARDIZED TESTING _ Every educated person has at some point been touched—if not deeply affected— by a standardized test. For almost a century, schools, universities, businesses, and governments have looked to standardized measures for economical, reliable, and valid assessments of those who would enter, continue in, or exit their institutions Proponents of these large-scale instruments make strong claims for their usefulness when great numbers of people must be measured quickly and effectively. Those claims are well supported by reams of research data that Comprise construct vali- dations of their efficacy, And so we have become a world that abides by the re- | sults of standardized tests as if they were sacrosanct. The rush to carry out standardized testing in every walk of life has not gone unchecked. Some psychometricians have stood up in recent years to caution the public against reading too much into tests that require what may be a narrow band of specialized intelligence (Sternberg, 1997; Gardner, 2000; Kohn, 2000), Organizations such as the National Center for Fair and Open Testing (www.fairtest.org) have reminded us that standardization of assessment procedures creates an illusion of validity, Strong claims from the giants of the testing industry, they say, have pulled the collective wool over the public’s eyes and in the process have incorrectly mar- ginalized thousands, if not millions, of children and adults worldwide, These socio- economic issues in standardized testing are discussed in Chapter 5. Whichever side is “right”—and both sides have legitimate cases—it is important for teachers to understand the educational institutions they are working in,and an integral part of virtually all of those institutions is the use of standardized tests So it is important for you to understand what standardized tests are, what they are not, how to interpret them, and how to put them into a balanced perspective in which we strive to accurately assess all learners on all proposed objectives. We can learn a great deal about many learners and their competencies through standardized forms of assessment. But some of those learners and some of those objectives may not be adequately measured by a sit-down, timed, multiple-choice format that is likely to be decontextualized. ‘This chapter has two goals: to introduce the process of constructing, vall- dating, administering, and interpreting standardized tests of language; and tocHaerer 4 Siandardized Testing 67 acquaint you with a variety of current standardized tests that claim to test overall language proficiency. It should be clear from these goals that in this chapter we are not focusing cen- trally on classroom-based assessment, Don’t forget, however, that standardized tests affect all classrooms, and some of the practical steps that are involved in creating standardized tests are directly transferable to designing classroom tests. WHAT IS STANDARDIZATION? A standardized test presupposes certain standard objectives, or criteria, that are held constant across one form of the test to another. The criteria in large-scale standardized tests are designed to apply to a broad band of competencies that are usually not exclusive to one particular curriculum.A good standardized test is the product of a thorough process of empirical research and development. It dictates standard procedures for administration and scoring. And finally, it is typical of a norm-referenced test, the goal of which is to place testtakers on a continuum across asange of scores and to differentiate testtakers by their relative ranking. Most elementary and secondary schools in the United States have standardized achievement tests to measure children’s mastery of the standards or competencies that have been prescribed for specified grade levels. These tests vary by states, counties, and school districts, but they all share the common objective of economical large-scale assessment, College entrance exams such as the Scholastic Aptitude Test (SAT®) are part of the educational experience of many high school seniors seeking further education. The Graduate Record Exam (GRE*) is a required standardized test for entry into many graduate school programs. Tests like the Graduate ‘Management Admission ‘Test (GMAT) and the Law School Aptitude Test (LSAT) spe- cialize in particular disciplines. One genre of standardized test that you may already be familiar with is the Test of English as a Foreign Language (TOEFL"), produced by the Educational ‘Testing Service (ETS) in the United States and/or its British coun- terpart, the International English Language Testing System (IELTS), which features standardized tests in affiliation with the University of Cambridge Local Examinations Syndicate (UCLES). They are all standardized because they specify a set of competencies (or standards) for a given domain, and through a process of construct validation they program a set of tasks that have been designed to measure those competencies, ‘Many people are under the incorrect impression that all standardized tests con- sist of items that have predetermined responses presented in a multiple-choice format. While it is true that many standardized tests conform to a multiple-choice format, by no means is multiple-choice a prerequisite characteristic. It so happens that a multiple-choice format provides the test producer with an “objective” means for determining correct and incorrect responses, and therefore is the preferred mode for large-scale tests. However, standards are equally involved in certain human- scored tests of oral production and writing, such as the ‘Test of Spoken English TSE") and the Test of Written English (TWE®), both produced by ETS.68 cHarten 4 Standardized Testing ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS Advantages of standardized testing include, foremost, a ready-made previously vali- dated product that frees the teacher from having to spend hours creating a test Administration to large groups can be accomplished within reasonable time limits In the case of multiple-choice formats, scoring procedures are streamlined (for either scannable computerized scoring or hand-scoring with a hole-punched grid’ for fast turnaround time. And, for better or for worse, there is often an air of face validity to such authoritative-looking instruments. Disadvantages center largely on the inappropriate use of such tests, for example, using an overall proficiency test as an achievement test simply because of the convenience of the standardization. A colleague told me about a course director who, after a frantic search for a last-minute placement rest, administered a multiple- choice grammar achievement test, even though the curriculum was mostly listening and speaking and involved few of the grammar points tested. This instrument had the appearance and face validity of a good test when in reality it had no content validity whatsoever. Another disadvantage is the potential misunderstanding of the difference between direct and indirect testing (see Chapter 2). Some standardized tests include tasks that do not directly specify performance in the target objective. For example. before 1996, the TOEFL included neither a written nor an oral production section. yet statistics showed a reasonably strong correspondence between performance on the TOEFL and a student's written and—to a lesser extent—oral production. The comprehension-based TOEFL could therefore be claimed to be an indirect test of production. A test of reading comprehension that proposes to measure ability to read extensively and that engages testtakers in reading only short one- or two- paragraph passages is an indirect measure of extensive reading. ‘Those who use standardized fests need to acknowledge both the advantages and limitations of indirect testing. In the pre-1996 TOEFL administrations, the expense of giving a direct test of production was considerably reduced by offering only comprehension performance and showing through construct validation the appropriateness of conclusions about a testtaker’s production competence Likewise, short reading passages are easier to administer, and if research validates the assumption that short reading passages indicate extensive reading ability, then the use of the shorter passages is justified. Yet the construct validation statistics that offer that support never offer a 100 percent probability of the relationship, leaving room for some possibility that the indirect test is not valid for its targeted use. ‘A more serious issue lies in the assumption (alluded to above) that standardized tests correctly assess all learners equally well. Welkestablished standardized tests usually demonstrate high correlations between performance on such tests and target objectives, but correlations are not sufficient to demonstrate unequivocally the acquisition of criterion objectives by ail testtakers, Here is a nomlanguage example. In the United States, some driver's license renewals require taking a paper- and-pencil multiple-choice test that covers signs, safe speeds and distances, laneCHAPTER Standardized Testing 69 changes, and other“rules of the road Correlational statistics show a strong relationship between high scores on those tests and good driving records, so people who do well on these tests are a safe bet to relicense, Now, an extremely high correlation (of per haps .80 or above) may be loosely interpreted to mean that a large majority of the dei- vers whose licenses are renewed by virtue of their having passed the little quiz are good behind-the-wheel drivers. What about those few who do not fit the model? That small minority of drivers could endanger the lives of the majority, and is that risk worth taking? Motor vehicle registration departments in the United States seem to think so, and thus avoid the high cost of behindthe-wheel driving tests Are you willing to rely on a standardized test result in the case of af the learners in your class? Of an applicant to your institution, or of a potential degree candidate exiting your program? These questions will be addressed more fully in Chapter 5, but for the moment, think carefully about what has come to be known as high-stakes testing, in which standardized tests have become the only criterion for inclusion or exclusion. The widespread acceptance, and sometime misuse, of this gate-keeping role of the testing industry has created a political, educational, and moral maelstrom. DEVELOPING A STANDARDIZED TEST While it is not likely that a classroom teacher, with 4 team of test designers and researchers, would be in a position to develop a brand-new standardized test of large-scale proportions, it isa virtual certainty that some day you will be in a position (a) to revise an existing test, (b) to adapt or expand an existing test, and/or (©) to create a smaller-scale standardized test for a program you are teaching in. And even if none of the above three cases should ever apply to you, it is of paramount importance to understand the process of the development of the standardized rests that have become ingrained in our educational institutions. How are standardized tests developed? Where do test tasks and items come from? How are they evaluated? Who selects items and their arrangement in a test? How do such items and tests achieve consequential validity? How are different forms of tests equated for difficulty level? Who sets norms and cut-off limits? Are security and confidentiality an issue? Are cultural and racial biases an issue in test development? All these questions typify those that you might pose in an attempt to understand the process of test development. In the steps outlined below, three different standardized tests will be used to exemplify the process of standardized test design: (A) The Test of English as a Foreign Language (TOEFL), Educational ‘Testing Service (ETS), (B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU), (© The Graduate Essay Test (GET), SFSU.0 Ee 70 cuarrers Standardized Testing ‘The first is a test of general language ability or proficiency. The second is a place ment test at a university. And the third is a gate-keeping essay test that all prospective students must pass in order to take graduate-level courses. As we look at the steps, one by one, you will sce patterns that are consistent with those outlined is the previous two chapters for evaluating and developing a classroom test. 1. Determine the purpose and objectives of the test. Most standardized tests are expected to provide high practicality in administration and scoring without unduly compromising validity, The initial outlay of time and money for such a test is significant, but the test would be used repeatedly. It is therefore important for its purpose and objectives to be stated specifically, Let's look at the three tests. (A) The purpose of the TOEFL is “to evaluate the English proficiency of people whose native language is not English” (TOEFL Test and Score Maral, 2001, p. 9). More specifically, the TOEFL is designed to help institutions of higher learning make “valid decisions concerning English language proficiency in terms of [their] own requirements" (p. 9). Most colleges and universities in the United States use TORFL scores to admit or refuse international applicants for admission. Various cutoff scores apply, but most institutions require scores from 475 to 525 (paper-based) or from 150 to 195 (computer-based) in order to consider students for admission. The high-stakes, gate-keeping nature of the TOEFL is obvious. (B) The ESLPT, referred to in Chapter 3, is designed to place already admitted students at San Francisco State University in an appropriate course in academic writing, with the secondary goal of placing students into courses in oral production and grammarediting. While the test's primary purpose is to make placements, another desirable objective is to provide teachers with some diagnostic information about their students on the first day or (wo of class,"The ESLPT is locally designed by university faculty and staff. (© The GET, another test designed at SFSU, is given to prospective graduate students—both native and non-native speakers—in all disciplines to determine whether their writing ability is sufficient to permit them to enter graduate-level courses in their programs. It is offered at the beginning of each term, Students who fail or marginally pass the GET are technically incligible to take graduate courses in their field. Instead, they may elect to take a course in graduate-level writing of research papers. A pass in that course is equivalent to passing the GET. As you can see, the objectives of each of these tests are specific. The content of each test must be designed to accomplish those particular ends. This first stage of goal-setting might be seen as one in which the consequential validity of the test is foremost in the mind of the developer: each test has a specific gate-keeping function to perform; therefore the criteria for entering those gates must be specified accurately. 2. Design test specifications. ‘Now comes the hard part. Decisions need to be made on how to go about structur- ing the specifications of the test. Before specs can be addressed, a comprehensive Eeecuarrer4 Standardized Testing 71 program of research must identify a set of constructs underlying the test itself. This stage of laying the foundation stones can occupy weeks, months, or even years of effort. Standardized tests that don’t work are often the product of shortsighted construct validation, Let's look at the three tests again, (A) Construct validation for the TOEFL is carried out by the TOEFL staff at ETS under the guidance of a Policy Council that works with a Committee of Examiners that is composed of appointed external university faculty, linguists, and assessment specialists. Dozens of employees are involved in a complex process of reviewing current TOEFL specifications, commissioning and developing test tasks and items, assem- bling forms of the test, and performing ongoing exploratory research related to formulating new specs. Reducing such a complex process to a set of simple steps runs the risk of gross overgeneralization, but here is an idea of how a TOEFL is created. Because the TOEFL is a proficiency test, the first step in the developmental process is to define the construct of language proficiency. First, it should be made clear that many assessment specialists such as Bachman (1990) and Palmer G@achman & Palmer, 1996) prefer the term ability w proficiency and thus speak of language ability as the overarching concept. The latter phrase is more consistent, they argue, with our understanding that the specific components of language ability must be assessed. separately. Others, such as the American Council on Teaching Foreign Languages (ACTFL), still prefer the term proficiency because it connotes more of a holistic, unitary trait view of language ability (Lowe, 1988). Most current views accept the ability argument and therefore strive to specify and assess the many components of language. For the purposes of consistency in this book, the term proficiency will nevertheless be retained, with the above caveat. How you view language will make a difference in how you assess linguage proficiency: After breaking language competence down into subsets of listening, speaking, reading, and writing, each performance mode can be examined on a continuum of linguistic units: phonology (pronunciation) and orthography (spelling), words (lexicon), sentences (grammar), discourse, and pragmatic (sociolinguistic, contextual, functional, cultural) features of language How will the TOEFL sample from all these possibilities? Oral production tests can be tests of overall conversational fluency or pronunciation of a particular subset of phonology, and can take the form of imitation, structured responses, or free responses. Listening comprehension tests can concentrate on a particular feature of language or on overall listening for general meaning. Tests of reading can cover the range of language units and can aim to test comprehension of long or short passages, single sentences, or even phrases and words. Writing tests can take on an open-ended form with free composition, or be structured to elicit anything from correct spelling to discourse-level competence. Are you overwhelmed yet? From the sea of potential performance modes that could be sampled in a test, the developer must select a subset on some systematic basis. To make a very long story short (and leaving out numerous controversies), the TOEFL had for many years included three types of performance in its organizational specifications: listening, struc- ture, and reading, all of which tested comprehension through standard multiple-choice

Chapter 4 - Standardized Testing

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Chapter 4 - Standardized Testing

Uploaded by

Copyright:

Available Formats

You might also like