
Test Review: Assessing the TOEFL iBT
Vanessa Armand
University of Illinois at Chicago
A Review for Classroom Testing and Assessment for TESOL
February 2, 2012

Introduction
The Test of English as a Foreign Language, or TOEFL test, can arguably be considered the most successful assessment method used to answer the international demand for an objective way to measure students’ English language ability. Taken by more than 25 million people around the world and accepted by more than 8,000 universities, colleges, and agencies in 130-plus countries, this norm-referenced test boldly lays claim to being the test to take for students advancing their studies into the English-speaking realm. It offers its takers a convenient and widely accepted test, boasting 4,500 test centers in 165 countries with 30-40 different testing dates, thus giving students the choice to take the test (nearly) anywhere to help them “go anywhere.”

Test Purpose
The creators of the TOEFL test aim their materials at students, in particular those in English programs and those international students wishing to study abroad at Anglophone universities where the courses in their fields of study will be taught in English. The TOEFL website tagline itself directly targets international students, cleverly claiming that “the TOEFL test is [their] passport to study abroad”. The test also appeals to visa applicants, as well as to scholarship and certification candidates who need an objective measurement of their English proficiency. Through sound reliability and validity evidence, TOEFL is able to offer its test takers and test-accepting agencies the peace of mind that their scores will not only be objective, but will also be accurate measures of the linguistic and communicative competence they will need for their academic ventures. The test makes use of tasks that measure students’ ability to integrate the skills of reading, writing, listening, and speaking, as well as to use them independently, to best perform in the academic setting.
History and Theoretical Framework
The test boasts an evidence-centered design, its creators placing heavy emphasis on the rigorous level of materials testing that has functioned not only as a basis for the very first conception of the exam, but also as a driving force in its adaptation and development in keeping with advancements in technology and academic research. Each phase of the test’s development has been influenced by the theories of language teaching that were pervasive at the time, but with each new development in the field, the TOEFL developers have taken steps to adapt to changes in what we know about second language learning and testing. From the very birth of the TOEFL in 1964 under the National Council of Testing English as a Foreign Language (it was taken over by ETS [Educational Testing Service] and the College Board a year later), the exam was focused on assessing the English proficiency level of non-native speakers wishing to enter universities. The test at that time consisted of multiple-choice items testing reading comprehension, listening comprehension, and the student’s grasp of English grammar. In 1979, writing and speaking components were added to the exam as the TWE® (writing) and the TSE® (speaking) tests to bring the test in line with research developments in communicative competence, but these two tests were optional and largely administered to international graduate students who were to serve as teaching assistants. In the years that followed, the test morphed into the TOEFL CBT (computer-based test), which attempted to meet growing demands for computer-based assessment (TOEFL Research Series 1[6], 2011). This form of the test was administered until 2005, by which time the test was greatly in need of restructuring to fit new advancements in technology as well as in understandings of student needs. The current TOEFL iBT test assesses communicative competence through the use of academic tasks that focus on the integration of receptive and productive skills.
The TOEFL website describes the current framework by stating that the new “...test will measure examinees’ English-language proficiency in situations and tasks reflective of university life...” (Jamieson et al., in TOEFL Research Series 1[6], 2011). Due to its academic focus, the TOEFL is used appropriately as a gate-keeping device by most institutions, which set admission requirements based on TOEFL scores. There is no passing or failing grade for the TOEFL; rather, it provides students (and score users) with a gauge of students’ ability to perform in an English language-based academic setting. For this reason, students may consider the test to be high-stakes because their scores determine their opportunities for future work and study, scholarships, and even acceptance for visas.

Description of Test Content
The TOEFL iBT test is broken up into four sections built around the four skills of reading, listening, writing, and speaking, with the Speaking section involving integrated reading and listening skills (see Appendix E). The test is administered online through a computer with a secure Internet testing network and utilizes technology (i.e. headsets and microphones) to administer materials and record responses. The sound files for speaking are then sent to the Online Scoring Network. It is important to note that for each of these sections, test takers are not required to have any specific prior knowledge of the subject matter being tested; rather, they are tested on how much of the material they understand from the context and how well they grasp the material in order to answer the questions effectively. The tasks are exemplary of the types of tasks that students will most likely encounter at the university level in an English-instructed class.
The Reading section (for task examples, see Appendix A) is 36-56 questions long and has a duration of 60 to 80 minutes. The test tasks require test takers to read multiparagraph texts (about 700 words long) and then answer multiple-choice questions testing vocabulary in context, main idea synthesis, understanding of factual information, and author’s intent. There are a number of items that require test takers to choose multiple responses and/or add elements to a written text. Questions such as #12 and #13 in Appendix A require an understanding of compositional organization and written discourse markers (“also” and “since”), as well as an overall understanding of the text itself. While the materials used in the excerpt in Appendix A describe an archeological wonder, the structure and language use represented would likely be found in a textbook, simulating what test takers are likely to encounter in academic readings regardless of course focus.
The second section, Listening (for task examples, see Appendix B), is composed of 34-51 questions that require test takers to choose from multiple-choice answers responding to questions posed orally over the headset provided. The task lasts about 60-90 minutes, during which test takers listen to two audio files simulating authentic input as would be heard in a university setting. The content in these two files (one simulating a lecture, the other simulating a conversation) draws inspiration and lexical elements from a study done by Biber et al. (2006), wherein the researchers compiled a corpus of 1.67 million spoken words at four universities, taken from lectures, conversations, and other spoken interactions that occurred in academic settings. In an effort to remain sensitive to students’ cultural sensitivities and field-specific knowledge, the TOEFL iBT creators did not integrate actual utterances from the corpus into the body of the test without first editing them. The content that remained was thus authentic without the possibility of being offensive to test takers of different backgrounds (TOEFL Research Series 1[4], 2011). As illustrated in the first excerpt in Appendix B, the situation represented, a conversation between a student and a teacher, is one that any student can attest to having experienced at one time or another. Negotiating meaning with a professor is a basic task that can present fundamental problems to students lacking language proficiency, and it is therefore highly appropriate as a subject on the TOEFL test. In the second excerpt, this time of an academic lecture, it is evident that the oral input for the test taker is full of fillers and real-life exchanges like student-teacher interruptions and clarifications. The test taker is asked to distinguish between filler and content, and then make inferences about main ideas, as well as relationships between students. While this lecture may be more or less formal than what the test taker would experience in his/her L1 culture, it more or less accurately simulates the Anglophone academic experience.
The Speaking section involves six tasks (for task examples, see Appendix C); two of the tasks are independent and thus involve test takers responding to broad questions on familiar topics. The other four tasks are integrated tasks divided according to the following skills: 1) listening, reading, and speaking in a campus situation and on an academic topic, and 2) listening and speaking in a campus setting and on an academic topic.
As seen in the excerpts in Appendix C, the listening/reading/speaking and listening/speaking elements touch distinct aspects of student life, as they would be found in discussions between friends about materials posted around campus, as well as in work with materials used in the classroom. Very rarely is the skill of speaking utilized in real life in the absence of other stimuli (which, of course, requires the activation of other senses and the use of other skills).
The Writing section (for task examples, see Appendix D) is composed of two tasks, totaling 50 minutes, and requires test takers to complete one independent writing task and one integrated writing task. The first task involves reading a passage, listening to a lecture, and then answering a written prompt reflecting the two stimuli. The second task requires the test taker to answer a question based on his/her own knowledge and to argue the pros and cons of the subject matter referenced in the test material. Both of these tasks are sure to be encountered by the test taker in the academic setting. University-level courses are distinctly structured around written texts, oral lectures, and an emphasis on the student’s use of his/her own ability to reason out problems using pro/con logic.

Reliability Evidence
The TOEFL iBT takes great care to provide its users with optimal information about the reliability and validity of its test by making the test manual accessible to public audiences via the Internet. Here, test creators present the PDF version of the TOEFL iBT Research Insight Series published in 2011 in order to “make important research on the TOEFL iBT available to all test score users in a user-friendly format” (Insight Series 1[1], p. 1). Generalizability and reliability analyses are done for each test form in accordance with Generalizability Theory (Insight Series 1[3], p. 4) (see Appendix G for SEM and reliability). The test also makes minimal use of test-retest and parallel-forms reliability, comparing the scores of 12,000 test takers who have taken two different forms of the test within a period of one month (Insight Series 1[3]). The correlations resulting from this analysis yielded a reliability estimate of 0.91 for the overall test score (Insight Series 1[3], p. 5). As mentioned in the Scoring/Rating section of this paper, the TOEFL iBT employs human raters for the writing and speaking sections. Inter-rater reliability is supported through rigorous rater training and certification, and a calibration test. Raters are monitored by chief raters during the scoring process for even greater reliability.
In addition, the average of all the scores that each rater assigns (per scoring session) is compared with the average score of all the raters participating in the same session. Monitor papers are also used; the old and new scores of the monitor papers are compared, and the rates of agreement between the two sets of scores indicate rater consistency in scoring across test administrations.
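The reliability checks described above can be illustrated with a minimal Python sketch. It is not the ETS procedure: the Insight Series reports analyses based on Generalizability Theory, whereas this sketch assumes the simpler classical test theory quantities (a Pearson correlation between two forms, and SEM = SD × √(1 − reliability)). The function names, the use of exact agreement for monitor papers, and any scores passed in are hypothetical and serve only to make the ideas concrete.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation between paired scores, e.g. the same
    test takers on two different forms (parallel-forms evidence).
    Assumes neither list is constant (non-zero variance)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def standard_error_of_measurement(scores, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return stdev(scores) * sqrt(1 - reliability)


def rater_mean_difference(rater_scores, session_scores):
    """One rater's mean score minus the mean of all scores in the session."""
    return mean(rater_scores) - mean(session_scores)


def exact_agreement_rate(old_scores, new_scores):
    """Share of monitor papers given the same score on both occasions
    (exact agreement is an assumption made for this sketch)."""
    if len(old_scores) != len(new_scores) or not old_scores:
        raise ValueError("need two equally long, non-empty score lists")
    matches = sum(1 for old, new in zip(old_scores, new_scores) if old == new)
    return matches / len(old_scores)
```

Under these assumptions, a reliability estimate of 0.91 combined with a score standard deviation of 10 points would give an SEM of 3 points, which is one way to read a reliability figure in terms of score precision.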

Validity Evidence
The most convincing evidence presented by the TOEFL iBT can be found in the Research Series’ volume 4 concerning validity evidence. It argues that, in addition to sound reliability evidence, the TOEFL upholds its name and reputation simply through the quality of the developers it employs and the quality of the research they bring to the process of test development. These test developers come from fields of study directly related to second language teaching and testing, and have themselves taught English around the world (p. 1). The TOEFL iBT states: “Our research, measurement and statistics team includes some of the world’s most distinguished scientists and internationally recognized leaders in diverse areas such as test validity, language learning and testing, and educational measurement and statistics” (p. 1). The series also asserts: “To date, more than 150 peer-reviewed TOEFL research reports, technical reports and monographs have been published by ETS. In addition to the 20-30 TOEFL-related research projects conducted by ETS Research & Development staff each year, the TOEFL Committee of Examiners (COE)…funds an annual program of TOEFL research by external researchers from all over the world, including preeminent researchers from Australia, the UK, the US, Canada, and Japan” (p. 1). That is to say that TOEFL iBT takes great pride in taking the initiative to constantly evaluate its exam (the content, the constructs, and even the washback effect on its test takers) so that it can best remain aligned with the needs of its audience. The test’s content and construct validities (see TOEFL explanations: Appendix H) are inextricably intertwined, each relating directly to academic target language use (TLU) domain tasks. As such, the content focuses on simulating real academic subject matter, both the subjects themselves and the form of their presentation, and the constructs focus on the integration of skills necessary to complete the tasks.
Such tasks are verified by “experts in higher education” in order to “simulate university life and coursework” (2011). One study that supported the emphasis on these tangible criteria is that of Rosenfeld, Leung & Oltman (2001) (as seen in Insight Series 1[4], p. 4). In conducting a survey of undergraduate and graduate faculty and students, this study helped emphasize the importance of integrated skills and tasks for academic success. The data collected, which stemmed from faculty and student opinions about the importance of such tasks, were heavily factored into the new TOEFL iBT. Other examples presented by the TOEFL Insight Series support a direct correlation of TOEFL iBT scores with English language proficiency and communicative competence. These examples include test takers’ self-assessment, results of local institutional tests used to assess the oral proficiency of international teaching assistants (see Appendix H), performance on research study academic reading tasks created to reflect the real-world classroom setting, and, finally, score comparability of test taker performance on the TOEFL and IELTS tests (see Appendix H). One such study, mentioned under “Test Use and Consequences” (Insight Series 1[4], p. 10), was conducted by Wall and Horák (2006, 2008a). This study looked at a small group of EFL teachers in Eastern Europe and the ways in which the new TOEFL iBT test has affected their approach to teaching and test preparation. The study concluded that the overall effect is one of increased focus on oral production and skills integration (p. 10). The developers would do well to continue this kind of research in different areas across the testing zones, and also perhaps to conduct surveys of students who have taken (and perhaps teachers who have taught or administered) both forms of the TOEFL test (pre- and post-2005 iBT). In this way, developers might better understand how test takers’ perceptions have changed concerning how well the test measures their competencies, and whether or not students have changed their study habits in response to the new iBT form and foci.

Conclusion
In conclusion, the TOEFL aptly defends its position as one of the leading assessments of English language ability. Through its strong validity and reliability evidence, it provides test score users with the assurance that the scores they are using do indeed give an accurate measurement of both linguistic and communicative competence.
For learners, it offers peace of mind that the test and its methods of preparation ready them for the realm of university-level studies in English. It is not surprising, then, that more than 25 million people have chosen this test to measure their abilities, and that more than 8,000 academic establishments across the globe accept its results.