Current Issues in Language Evaluation, Assessment and Testing

Current Issues
in Language
Evaluation,
Assessment
and Testing
Current Issues
in Language
Evaluation,
Assessment
and Testing:
Research and Practice
Edited by
Christina Gitsaki and Christine Coombe

Current Issues in Language Evaluation, Assessment and Testing:
Edited by Christina Gitsaki and Christine Coombe
This book first published 2016
Cambridge Scholars Publishing
Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK
British Library Cataloguing in Publication Data

A catalogue record for this book is available from the British Library
Copyright © 2016 by Christina Gitsaki, Christine Coombe

and contributors
All rights for this book reserved. No part of this book may be reproduced,
stored in a retrieval system, or transmitted, in any form or by any means,
electronic, mechanical, photocopying, recording or otherwise, without
the prior permission of the copyright owner.
ISBN (10): 1-4438-8590-8

ISBN (13): 978-1-4438-8590-4
TABLE OF CONTENTS
List of Tables ............................................................................................ viii
List of Figures............................................................................................ xii
List of Appendices .................................................................................... xiv
List of Abbreviations ................................................................................. xv
Preface ...................................................................................................... xix

Issues in the Analysis and Modification of Assessment Tools and Tests
Chapter One ................................................................................................. 2

How Well do Cloze Items Work and Why?
James Dean Brown, Jonathan Trace, Gerriet Janssen, and Liudmila
Kozhevnikova
Chapter Two .............................................................................................. 40

Estimating Absolute Proficiency Levels in Small-Scale Placement Tests
with Predefined Item Difficulty Levels
Kazuo Amma
Chapter Three ............................................................................................ 63

Bilingual Language Assessment in Early Intervention: A Comparison
of Single- versus Dual-Language Testing
Caroline A. Larson, Sarah Chabal, and Viorica Marian
Chapter Four .............................................................................................. 80

Frequency and Confidence in Language Learning Strategy Use
by Greek Students of English
Penelope Kambakis-Vougiouklis and Persephone Mamoukari
vi Table of Contents
Chapter Five .............................................................................................. 98

The Development of a Vocabulary Test to Assess the Breadth
of Knowledge of the Academic Word List
Lee-Yen Wang
Issues in the Creation of Assessment and Evaluation Tools
Chapter Six .............................................................................................. 118

Assessment for Learning; Assessment for Autonomy
Maria Giovanna Tassinari
Chapter Seven.......................................................................................... 137

Cultivating Learner Autonomy through the Use of English Learning
Portfolios: A Longitudinal Study
Beilei Wang
Chapter Eight ........................................................................................... 158

Sharing Assessment Power to Promote Learning and Autonomy
Carol J. Everhard
Chapter Nine............................................................................................ 177

Developing a Tool for Assessing English Language Teacher Readiness
in the Middle East Context
Sadiq Midraj, Jessica Midraj, Christina Gitsaki, and Christine Coombe
Chapter Ten ............................................................................................. 201

Foreign Language Teachers’ Proficiency: The Implementation
of the EPPLE Examination in Brazil
Douglas Altamiro Consolo and Vera Lúcia Teixeira da Silva
Issues in Language Assessment and Evaluation
Chapter Eleven ........................................................................................ 222

Vocabulary Size Assessment as a Predictor of Plagiarism
Marina Dodigovic, Jacob Mlynarski, and Rining Wei
Chapter Twelve ....................................................................................... 236

What is the Impact of Language Learning Strategies on Tertiary Students’
Academic Writing Skills? A Case Study in Fiji
Zakia Ali Chand
Current Issues in Language Evaluation, Assessment and Testing: vii
Chapter Thirteen ...................................................................................... 253

Speaking Practice in Private Classes for the TOEFL iBT Test:
Student Perceptions
Renata Mendes Simões
Chapter Fourteen ..................................................................................... 272

Assessing the Level of Grammar Proficiency of EFL and ESL Freshman
Students: A Case Study in the Philippines
Selwyn A. Cruz and Romulo P. Villanueva Jr.
Chapter Fifteen ........................................................................................ 287

Methodology in Washback Studies
Gladys Quevedo-Camargo and Matilde Virginia Ricardi Scaramucci
Contributors ............................................................................................. 305
Index ........................................................................................................ 314

LIST OF TABLES
Table 1-1: Descriptive statistics for 50 cloze passages and

reliability–Japan sample (adapted and expanded from
Brown, 1998).
Table 1-2: Descriptive statistics for 50 cloze passages and
reliability–Russia sample (adapted and expanded from
Brown, 1998).
Table 1-3: Frequencies and percentages of tests that functioned well
or poorly in terms of internal consistency reliability.
Table 1-4: Frequencies and percentages of items that functioned
well in terms of Item Facility (IF).
well in terms of Item Discrimination (ID).
Table 1-6: Summary fit statistics for analyses of 1,496 items by
country of origin.
Table 1-7: Vertical rulers for test taker ability, test version
difficulty, and test item difficulty for Japan.
Table 1-8: Vertical rulers for test taker ability, test version
difficulty, and test item difficulty for Russia.
well in terms of CTT ID and Rasch Fit.
Table 1-10: Item misfit by linguistic background for different parts
of speech for 1,496 items.
Table 1-11: Item misfit for word type (content & function) in the
Japan and Russia data for 1,496 items.
Table 1-12: Item misfit for word origin (Latinate & Germanic) in the
Russia and Japan data for 1,496 Items.
Table 1-13: Item misfit as a function of word frequency (in the
Brown Corpus found in Francis & Kuþera, 1979) in the
Table 2-1: Descriptors of overall reading comprehension (Source:
Council of Europe, 2001, p. 69).
Table 2-2: Item difficulty levels and required proficiency in the
placement test.
Table 2-3: Description of imaginary test items.
Current Issues in Language Evaluation, Assessment and Testing: ix
Table 2-4: Summary report of a candidate (Adapted from Linacre

1987, p. 25).
Table 2-5: Simulation 1.
Table 2-7: Item difficulty and responses (arranged in the order of
difficulty levels).
Table 2-8: Statistics of the candidates’ proficiency levels and
confidence intervals.
Table 2-9: Item difficulty and responses by exact scoring.
Table 2-10: Adjusted responses for Candidate A (part).
Table 2-11: Adjusted responses for Candidate B (part).
Table 2-12: Comparison of estimated proficiency levels and
Table 3-1: Demographic information for study participants.
Table 3-2: Receptive language ability as indexed by The Rossetti
(1990) Language Comprehension subtest.
Table 3-3: Expressive language ability as indexed by The Rossetti
(1990) Language Expression subtest.
Table 5-1: Participant information.
Table 5-2: Distribution of the sample words across the AWL
Levels.
Table 5-3: Omitted AWL words with Level information.
Table 5-4: Pseudowords with Level information and the original
AWL words.
Table 5-5: Descriptive statistics (freshmen = 50, seniors = 39).
Table 5-6: Comparison between freshmen and seniors in their
knowledge of non-omitted/omitted academic words
using ANOVA.
Table 5-7: Reliability analysis.
Table 5-8: Descriptive statistics for the 52 and 38 omitted items.
Table 5-9: Omitted academic words and their COCA ranks and
frequencies.
Table 6-1: Reasons for choice of components (more than one
answer possible).
Table 6-2: Effects of the self-assessment on participants (more than
one answer possible).
Table 6-3: Comparison of studies on learner autonomy.
Table 7-1: Student participants in the study.
Table 7-2: Demographic data of case study participants.
Table 7-3: Phases of treatment in the junior high school.
Table 7-4: Independent samples test of LA and its subscales.
x List of Tables
Table 7-5: Planning and reflecting.

Table 7-6: Alice’s monthly learning plan (A-P1-9).
Table 7-7: David’s weekly plan.
Table 8-1: AARP assessment overview (Writing 1 and Writing 2).
Table 8-2: AARP assessment overview (Oral Presentation).
Table 8-3: AARP mean scores for Writing 2, for Groups C and D,
with ANOVA results for self-assessment variations.
Table 8-4: Paired t-test results for Writing 2 self-assessment
Variations (AY2).
Table 8-5: Responses to questions about peer- and self-assessment.
Table 8-6: Modified extract from the AARP model showing the
relationship between assessment and degrees of
autonomy (Based on Everhard 2014, 2015a; Harris &
Bell, 1990).
Table 9-1: Example performance indicator and achievement
indicators (Source: TESOL 2010, p. 69).
Table 9-2: Criteria for the external review of the indicators and the
MCIs.
Table 10-1: Marks in the FCE and in the TEPOLI (Source: Baffi-
Bonvino, 2007, pp. 276-277).
Table 10-2: Grammatical accuracy as measured in TEPOLI (Source:
Borges-Almeida & Consolo, 2010).
Table 10-3: Grammatical complexity as measured in TEPOLI
(Source: Borges-Almeida & Consolo, 2010).
Table 11-1: Correlation coefficient (effect size) values for lexical
errors.
Table 12-1: Correlations between strategy use, gender and ethnicity.
Table 12-2: Results of the error analysis.
Table 12-3: Correlation between strategy use and language
proficiency.
Table 13-1: Classification of production variables in oral
communication tasks (Adapted from Ellis, 2003, p. 117).
Table 13-2: Data collection procedures in each phase of the study.
Table 13-3: Student perceptions of their language skills before the
course.
Table 13-4: Change in students’ perceptions (N = 17).
Table 13-5: Average perceived overall improvement in language
skills.
Table 13-6: Students’ final TOEFL iBT score.
Table 13-7: Comparative table of the TOEFL iBT scores.
Table 14-1: A scale for measuring grammar proficiency.
Current Issues in Language Evaluation, Assessment and Testing: xi
Table 14-2: Mean and standard deviation of participants’ total score

based on nationality.
Table 14-3: Means and standard deviations for the ESL participants.
Table 14-4: Means and standard deviations for the EFL participants.
Table 15-1: Studies distributed by continent.
Table 15-2: Number of instruments used in the studies.
LIST OF FIGURES
Figure 2-1: Sample representation of logistic regression analysis of a

test item (Adapted from Amma, 2001).
Figure 2-2: Sample representation of logistic regression analysis of a
candidate (Adapted from Amma, 1990).
Figure 2-3: Logistic regression analysis of Simulation 1.
Figure 2-5: Logistic fit of Candidate A with confidence interval.
Bottom layer = ‘fail’.
Figure 2-6: Logistic fit of Candidate B with confidence interval.
Bottom layer = ‘fail’.
Figure 3-1: Participants’ receptive language assessment results using
The Rossetti (1990) Language Comprehension subtest.
Error bars represent standard errors and asterisks indicate
significant differences at p < .05.
Figure 3-2: Participants’ expressive language assessment results
using The Rossetti (1990) Language Expression subtest.
Error bars represent standard errors and asterisks indicate
significant differences at p < .05.
Figure 3-3: Percent language delay in primary-language-only
assessment and in dual-language assessment using The
Rossetti (1990) Language Comprehension and Language
Expression subtests. Error bars represent standard errors
and asterisks indicate significant differences at p < .05.
Figure 4-1: An example from the SILL questionnaire employing the
[01] bar for frequency and confidence.
Figure 6-1: The dynamic model of learner autonomy (Source:
Tassinari, 2010, p. 203).
Figure 7-1: The process of the ELP practice.
Figure 9-1: Project stages.
Figure 10-1: The EPPLE examination, Oral Test, Part 1.
Figure 10-2: The EPPLE examination, Oral Test, Part 3.
Figure 13-1: Difficulties with language at the course (N = 17).
Figure 13-2: Student perceptions of their language ability at the end
of the course (N = 17).
Figure 13-3: Student self-perceived learning (N = 17).
Current Issues in Language Evaluation, Assessment and Testing: xiii
Figure 13-4: Students’ perceptions of their language skills before the

course (N = 17).
Figure 13-5: Students’ perceptions of their language skills after the
course (N = 17).
Figure 13-6: Student improvement in Task 1 and Task 2 (N = 17).
Figure 15-1: Studies per year of publication.
Figure 15-2: Data-collection instruments in the reviewed studies.
LIST OF APPENDICES
Appendix 1-A: List of participating universities.

Appendix 1-B: Example cloze test (Adapted from Brown, 1989).
Appendix 1-C: Actual word frequencies.
Appendix 1-D: FACETS input file for 50 cloze tests with 10 anchor
items.
LIST OF ABBREVIATIONS
A Self-Assessor (in Tukey-Kramer formulae)

AARP Assessment for Autonomy Research Project
ANOVA One-way Analysis of Variance
AS-unit Analysis of Speech Unit
ASC Ascendente (rising intonation)
ASHA American Speech-Language-Hearing Association
AUTh Aristotle University of Thessaloniki
AVL Academic Vocabulary List
AWL Academic Word List
AY Academic Year
B Peer-Assessor (in Tukey-Kramer formulae)
BNC British National Corpus
C Teacher-Assessor (in Tukey-Kramer formulae)
CAEP Council for the Accreditation of Educator
Preparation
CAPES Coordenação de Aperfeiçoamento de Pessoal do
Ensino Superior - Brazilian Agency for the
Development of Graduate Studies
CEEC College Entrance and Examination Center
CEFR Common European Framework of Reference
CEPA Common Educational Proficiency Assessment
CG Control Group
CI Confidence Interval
CILL Centre for Independent Language Learning
CNPq Conselho Nacional de Desenvolvimento
Científico e Tecnológico
COCA Corpus of Contemporary American English
Comm Arts 1 Communication Arts and Skills 1
Corr. Coeff. Correlation Coefficient
CRAPEL Centre de Recherches et d’Applications
Pédagogiques en Langues
CTT Classical Test Theory
EAL English as an additional language
EFL English as a Foreign Language
EG Experimental Group
xvi List of Abbrerviations
ELF English as a Lingua Franca

ELL English Language Learners/Learning
ELP English Learning Portfolio
ELT English Language Teaching
EMI English Medium Instruction
ENAPLE-CCC Ensino e Aprendizagem de Língua Estrangeira:
Crenças, construtos e competências
Eng AN Introduction to Language Arts
EPPLE Exame de Proficiência para Professores de Língua
Estrangeira
ERWL English Reference Word List
ESL English as a Second Language
ESOL English for Speakers of Other Languages
ESP English for Specific Purposes
ETS Educational Testing Service
F Female
FATEC Faculdades de Tecnologia
FCE First Certificate in English
FEU Far Eastern University
FL Foreign Language
FUB Freie Universität Berlin
GSEAT General Scholastic English Ability Test
GSL General Service List
HE Higher Education
ID Item Discrimination
IELTS International English Language Testing System
IF Item Facility
INCOMP Incomprehensible
IRT Item Response Theory
L2 Second Language
LA Learner Autonomy
LLS Language Learning Strategies
LPFLT Language Proficiency of Foreign Language
Teachers
M Male
Max Maximum
MCI Multiple-Choice Item
MENA Middle East and North Africa
MFRM Many Facet Rasch Measurement
Min Minimum
MOE Ministry of Education
Current Issues in Language Evaluation, Assessment and Testing: xvii
N Number (of participants)

n.s. Non-significant differences in One-way ANOVA
NAPO National Admissions and Placement Office
NBPTS National Board for Professional Teaching
Standards
NCATE National Council for Accreditation of Teacher
Education
NECS National English Curriculum Standards
NNS Non-Native Speakers
NRT Norm-Referenced Test
NS Native Speaker
OPI Oral Proficiency Interview
OPT Oxford Placement Test
P-A Peer-Assessment
P-T Peer-Assessment and Teacher Assessment
r Reliability
S-A Self-Assessment
S-T Self-Assessment and Teacher Assessment
SAT Scholastic Assessment Test
SD Standard Deviation
SEM Standard Error of Measurement
Sig. Significance
SILL Strategy Inventory for Language Learning
SOE School of English
SPSS Statistical Package for the Social Sciences
SVA Subject Verb Agreement
T-A Teacher Assessment
T-TRI TESOL Teacher Readiness Inventory
TECOLI Teste de Compreensão Oral em Língua Italiana
TEPOLI Teste de Proficiência Oral em Língua Inglesa
TESOL Teaching English to Speakers of Other Languages
The Rossetti The Rossetti Infant-Toddler Language Scales
TOEFL Test of English as a Foreign Language
TOEFL-iBT Test of English as a Foreign Language–Internet-
based Test
TOEIC Test of English for International Communication
UEM Universidade Estadual de Maringá
UERJ Universidade Estadual do Rio de Janeiro
UnB Universidade de Brasília
UNESP Universidade Estadual Paulista
UNIP Universidade Paulista
xviii List of Abbrerviations
UNISEB/Estacio Centro Universitário do Sistema Educacional

Brasileiro
VST Vocabulary Size Test
Y/N Yes/No
PREFACE
Language assessment, whether formative or summative, plays an

important role in second language learners’ educational experience and
learning outcomes. Whether assessment is used for student initial
screening, placement, or progression in a language course, it always
involves gathering, interpreting and evaluating evidence of learning. Such
information collected through the different assessment and evaluation
tools allows educators to identify student needs and plan a course of action
to address these needs, provides feedback about the effectiveness of
teaching practice, guides instruction and curriculum design, and provides
accountability for the system.
For language educators, assessment is perhaps one of the most difficult
and demanding tasks they have to perform given that designing valid and
reliable assessment tools requires specialized skills, while decisions about
assessment, especially ‘high-stakes’ exams, can have a lasting impact on
students’ progress and life.
Furthermore, in order for assessment to be useful, it must align itself
with the mandated standards and academic expectations of the specific
context where it occurs. Since no single type of assessment can provide all
the information that is necessary to gauge students’ progress and language
proficiency levels, educators need to incorporate a variety of assessment
techniques into their practice and be aware of approaches and methods that
can help provide valid and reliable evidence of student learning.
The edited volume presented here, Current Issues in Language
Evaluation, Assessment and Testing: Research and Practice, is a
collection of papers that address relevant issues in language assessment
from a variety of contexts and perspectives. The book is divided into three
major sections. The first section addresses Issues in the Analysis and
Modification of Assessment Tools and Tests. In Chapter One, JD Brown,
Jonathan Trace, Gerriet Janssen, and Liudmila Kozhevnikova discuss a
comparative study of analyzing cloze tests using Classical Theory Test
(CTT) item analysis and multifaceted Rasch analysis. Through the
examination and analysis of almost 7,500 cloze tests from university
students studying English as a foreign language (EFL) in Japan and
Russia, the Rasch analyses proved to be more appropriate than CCT. In
Chapter Two, Kazuo Amma proposes using a logistic regression analysis
xx Preface
with predefined item difficulty levels in order to properly assess a

student’s true proficiency level and the confidence interval, avoiding the
pitfalls that may occur from estimating proficiency based on the total test
score. In Chapter Three, Caroline Larson, Sarah Chabal, and Viorica
Marian examine the use of The Rossetti Infant-Toddler Language Scale
with Spanish-English speaking children. Their findings suggest that when
the instrument is used in the children’s primary language only, their
language skills are underestimated and their language delay is
overestimated leading to inappropriate Early Intervention referrals. In
order to maximize the efficacy, reliability and validity of The Rossetti, the
researchers recommend administering it in both the primary and the
secondary languages of the child. In Chapter Four, Penelope Kambakis-
Vougiouklis and Persephone Mamoukari present the results of their
investigation of language learning strategy use of Greek EFL students.
Their study is a pilot of a modified version of the Strategies Inventory for
Language Learning (SILL) using a bar instead of a Likert scale, measuring
frequency of strategy use as well as student confidence in the effectiveness
of the different strategies, and administering the instrument orally rather
than in writing. Their modifications of the SILL allowed them to detect
discrepancies in student understanding of language learning strategies and
their effectiveness which would not have been otherwise evident, as well
as problematic items within the SILL that need further modification prior
to future administrations of the instrument. In the last chapter for this
section, Chapter Five, Lee-Yen Wang describes the use of a Yes/No test to
measure University EFL students’ vocabulary acquisition of academic
words that were left out of the Ministry-defined vocabulary list for schools
in Taiwan. The analyses of the data highlighted the limitations of having a
centrally controlled national wordlist.
The next part of the volume, addresses Issues in the Creation of
Assessment and Evaluation Tools. In Chapter Six, Maria Giovanna
Tassinari discusses the creation of a dynamic model for assessing language
learner autonomy and provides evidence of the use of the model with
foreign language learners in a German University context. Her findings
indicate that the model initiated and maintained pedagogic dialogue
between the students and their teachers, raised students’ awareness of the
different dimensions of learner autonomy, and enhanced their reflexive
learning. In Chapter Seven, Beilei Wang describes the use of a three-
dimensional learner autonomy scale with junior high school EFL students
in China. Findings revealed that the use of the English Language
Portfolios was conducive to helping students gain learner autonomy.
Learner autonomy was also the primary interest of Carol Everhard’s study
Current Issues in Language Evaluation, Assessment and Testing: xxi
in Chapter Eight. Greek EFL students in a university context participated

in self- and peer-assessment activities of their oral and writing skills over
the course of the 5-year study. Findings suggest that the use of such
formative assessment techniques activated students’ criterial thinking and
metacognitive awareness of their learning process. The next two chapters
in this section describe the creation of assessment and evaluation tools for
measuring teacher proficiency. In Chapter Nine, Sadiq Midraj, Jessica
Midraj, Christina Gitsaki, and Christine Coombe describe the process of
compiling a contextually relevant resource for independent learning and
self-assessment in order to strengthen EFL teachers’ content knowledge,
pedagogical knowledge, and professional dispositions. The resource was
created for teachers in the Gulf Region using internationally accepted
professional standards in teaching English to speakers of other languages
(TESOL). In Chapter Ten, Douglas Altamiro Consolo and Vera Lúcia
Teixeira da Silva discuss a meta-analysis of a string of research studies
conducted within the framework of designing assessment tools for
evaluating foreign language teachers’ proficiency in the foreign language
they teach. Their meta-analysis is motivated by the need to revise and
improve the EPPLE examination of foreign language teachers in Brazil.
The third and final section of the volume, Issues in Language
Assessment and Evaluation, comprises studies that implemented different
instruments to measure learner proficiency in different language skills. In
Chapter Eleven, Marina Dodigovic, Jacob Mlynarski, and Rining Wei
describe how they used instruments such as Grammarly and the
Vocabulary Size Test (VST) to investigate possible correlations between
academic plagiarism and vocabulary knowledge in the academic writing of
University EFL students in China. Through their investigation poor
vocabulary command emerged as a major cause of plagiarism. In Chapter
Twelve, Zakia Ali Chand investigated whether there is a correlation
between language strategy use and academic writing proficiency, using the
SILL. The study involved University ESL students in Fiji, and the
preliminary results showed that students were moderate users of language
learning strategies and their academic writing was not influenced by their
use of language learning strategies indicating the need for more strategy
training in the language classroom. In Chapter Thirteen, Renata Mendes
Simões addresses an area of language teaching that has not received much
attention in research, that of one-to-one tutorials focusing on candidate
preparation for high stakes exams. The study involved Brazilian students
preparing for the Test of English as a Foreign Language (TOEFL). The
process of assessing students’ progress over the course of several weeks is
described and enhanced by the use of qualitative and quantitative data in
xxii Preface
order to provide a deeper understanding of the effectiveness of English for

Special Purposes (ESP) courses for exam preparation. In Chapter
Fourteen, Selwyn Cruz and Romulo Villanueva describe the development
and administration of a grammar proficiency test in order to investigate
issues in the grammatical proficiency of Korean and Filipino students
studying in an English Medium University in the Philippines. The final
paper in the volume, Chapter Fifteen, addresses issues of washback from
language assessment. Gladys Quevedo-Camargo and Matilde Virginia
Ricardi Scaramucci discuss the results of a meta-analysis of research
studies on washback from around the globe and their respective
methodologies and provide an account of the diverse instruments used to
investigate washback.
All fifteen papers included in this volume underwent a rigorous
selection process through a double-blind peer review process that involved
a number of notable academics. The papers underwent further review and
editing before being published in this book. Below is the list of academics
who were involved in the double blind review process:
Thomaï Alexiou Aristotle University, Greece

Ramin Akbari Modares Tarbiat, Iran
Deena Boraie American University of Cairo, Egypt
Helene Demirci Higher Colleges of Technology, UAE
Aymen Elsheikh New York Institute of Technology, UAE
Atta Gebril American University of Cairo, Egypt
Melanie Gobert Higher Colleges of Technology, UAE
Tony Green University of Bedfordshire, UK
Sahbi Hidri Sultan Qaboos University, Oman
Elisabeth Jones Zayed University, UAE
Mary Lou McCloskey EDUCO, USA
Josephine O’Brien Zayed University, UAE
Sufian Abu Rmaileh UAE University, UAE
The volume presents research studies conducted in a variety of

contexts (from early childhood to University and post-graduate studies)
from around the world covering an equally diverse range of issues in
language assessment and evaluation. It is hoped that it will be of use to
both new and seasoned researchers in the field of Applied Linguistics and
TESOL as well as teacher educators, language teachers, curriculum and
assessment designers.

ISSUES IN THE ANALYSIS
AND MODIFICATION OF ASSESSMENT
TOOLS AND TESTS
CHAPTER ONE
HOW WELL DO CLOZE ITEMS WORK

AND WHY?
JAMES DEAN BROWN, JONATHAN TRACE,

GERRIET JANSSEN,
AND LIUDMILA KOZHEVNIKOVA
Abstract
This study examined item-level data from fifty 30-item cloze tests that
were randomly administered to university-level examinees from Japan (N
= 2,298) and Russia (N = 5,170). A single 10-item anchor cloze test was
also administered to all students. The analyses investigated differences
between the two nationalities in terms of both classical test theory (CTT)
item analysis and multifaceted Rasch analysis (the latter allowed us to
estimate test-taker ability and item difficulty measures and fit statistics
simultaneously across 50 cloze tests separately and combined for the two
nationalities). The results indicated that considerably larger proportions of
items functioned well in the Rasch item analyses than in the traditional
CTT item analysis. Rasch analyses also turned out to be more appropriate
for our cloze test analysis and revision purposes than did traditional CTT
item analyses. Linguistic analyses of items that fit the Rasch model
revealed that blanks representing certain categories of words (i.e., function
words rather than content words, and Germanic-origin words rather than
Latin-origin words), and to a greater extent relatively high frequency
words were more likely to work well for norm-referenced test (NRT)
purposes. In addition, this study found that different items were
functioning well for the two nationalities.
How Well do Cloze Items Work and Why? 3
Introduction
Taylor (1953) first proposed the use of cloze tests for evaluating the
readability of reading materials in US elementary schools. In the 60s and
70s, a number of studies appeared on the usefulness of cloze for English as
a second language (ESL) proficiency or placement testing (see Alderson,
1979 for a summary of this early ESL research). Since then, as Brown
(2013) noted, this research on using cloze in ESL proficiency or placement
testing has continued, but has been inconsistent at best with reported
reliability estimates ranging from .31 to .96 and criterion-related validity
coefficients ranging from .43 to .91.
While the literature has focused predominantly on fixed interval (i.e.,
every nth word) deletion cloze tests, other bases have been used for
developing cloze tests. For example, rational deletion cloze was
developed by selecting blanks based on word classes (cf., Bachman, 1982,
1985; Markham, 1985). Tailored cloze involved using classical test theory
(CTT) item analysis techniques to select items and thereby create cloze
tests tailored to a particular group of students (cf. Brown, 1988, 1989,
2013; Brown, Yamashiro, & Ogane, 1999, 2001; Revard, 1990).
For the most part, cloze studies have been based on CTT. However,
Item Response Theory (IRT), including Rasch analysis, has been applied
to cloze in a few cases. Baker (1987) used Rasch analysis to examine a
dichotomously scored cloze test and found that “observed and expected
item characteristic curves show reasonable conformity, though with some
instances of serious misfit…no evidence for departure from unidimensionality
is found for the cloze data…” (p. iv). Hale, Stansfield, Rock, Hicks,
Butler, and Oller (1988) found that IRT provided stable estimates for cloze
in their study of the degree to which groups of cloze items related to
different subparts of the overall Test of English as a Foreign Language
(TOEFL). Hoshino and Nakagawa (2008) used IRT in developing a
multiple-choice cloze test authoring system. Lee-Ellis (2009) used Rasch
analysis in developing and validating a Korean C-test. However, Rasch
analysis has not been used to study the effectiveness of individual items.
The Study
Certainly, no work has investigated the degree to which cloze items
function well when analyzed using both CTT and IRT frameworks, and
little research has examined the functioning of cloze items in terms of their
linguistic characteristics. To address these issues and others, the following
research questions were posed, all focusing on the individual items
4 Chapter One
involved in 50 cloze tests that were administered to university-level

examinees from two linguistically different backgrounds:
1. How do the CTT descriptive statistics, reliability, and item analyses

differ for test-taker groups from different linguistic backgrounds?
2. How do Rasch item difficulty measures differ for the test-takers
from different linguistic backgrounds?
3. How do the proportions of functioning cloze items differ between
the CTT and IRT analyses, when based on the test-taker responses
from different linguistic backgrounds?
4. In what ways do Rasch item fit patterns differ in terms of factors
such as linguistic background and four cloze item linguistic
features: parts of speech, word type, word origin, and word
frequency?
5. Which linguistic characteristics will increase the probability of
cloze items functioning well during piloting?
Participants
A total of 7,468 English as a foreign language (EFL) students
participated in this study: 2,298 of these EFL students were studying at 18
different universities in Japan as part of their normal classroom activities;
the remaining 5,170 EFL students were studying at 38 universities in
Russia (see Appendix 1-A for a list of the participating universities in both
countries). In Japan, about 38.3% of the participants were women, and
61.7% of the participants were men; in Russia, 71.7% of the participants
were women, and 28.0% were men, with the remaining 0.3% giving no
response. The participants in Japan were between 18-24 years old, while in
Russia they were between 14-46 years old. The data from Japan were
collected as part of Brown (1993 & 1998); the data from Russia were
collected in 2012-2013 and served as the basis of Brown, Janssen, Trace,
and Kozhevnikova (2013). Though these samples were convenience
samples (i.e., not randomly selected), they were relatively large, which is
important as this sample size permits robust analyses of these cloze data.
It is critically important to stress that in this study we are interested in
how linguistic background affects different analyses; we do not make any
claims for the generalizability of these results to the EFL populations of all
undergraduate students in university-level institutions in these countries.
In fact, we want to stress that the samples from Japan and Russia cannot
be said to be comparable given the sampling procedures, the very different
proportions of university seats per million people available in the two
countries, the proportions of young people who go to university, and so

forth. Thus, any interpretations of these data to indicate that the English
proficiency of students in either country is higher than in the other country
are unwarranted and indefensible.
Measures
The 50 cloze tests used in this study were first created and used for
Brown (1993). The 50 passages were randomly selected from among the
adult-level books at a public library in Florida. Passages were chosen from
each book by randomly selecting a page then working backwards for a
reasonable starting place. Passages were between 366-478 words long
with a mean of 412.1 words. Each passage contained 30 items, and the
deletion pattern was every 12th word, which created a fairly high degree of
item independence relative to the more typical 7th-word deletion pattern.
The first and last sentences of all passages were left intact to provide
context. Appendix 1-B shows the layout of the directions, example items,
and answer key.
A 10-item cloze passage was also administered to all participants to act
as anchor items (i.e., items that provide a common metric for making
comparisons across tests and examinee samples). This anchor-item cloze
was first created in a study by Brown (1989), wherein it was found that
these 10 items were functioning effectively.
To check the degree to which the English in the cloze passages was
representative of typical written English, the lexical frequencies for all 50
passages combined were calculated (see Appendix 1-C) and compared to
the frequencies reported for the same words in the well-known Brown
Corpus (Francis & Kuþera, 1979, 1982; Kuþera & Francis, 1967). We felt
justified in comparing the 50 passages to this particular corpus for two
reasons. First, following Stubbs (2004), though the Brown Corpus is
relatively small, it is
“still useful because of their careful design ... one million words of written
American English, sampled from texts published in 1961: both informative
prose, from different text types (e.g., press and academic writing), and
different topics (e.g., religion and hobbies); and imaginative prose (e.g.,
detective fiction and romance).” (p. 111)
Then, too, we found that the logarithmically transformed word

frequencies of the cloze test items (to normalize the Zipfian nature of
vocabulary distributions) and the logarithmically transformed frequencies
of these same words in the Brown Corpus correlated strongly at .93
6 Chapter One
(Brown, 1998). Thus, we felt reasonably certain that these passages and
cloze items were representative of the written English language, or at a
minimum the genres of English found in US public library books.
Procedures
The 50 cloze tests were distributed to intact classes by teachers such
that every student had an equal chance of receiving each of the 50 cloze
test passages. In Japan, 42-50 participants completed each cloze test, with
a mean of 46.0 participants completing each passage. In Russia, 90-122
completed each cloze test (Mean = 103.4). All examinees in both countries
completed the 10-item anchor cloze. Twenty-five minutes were allowed
for completing the tests. Exact-answer scoring was used (i.e., only the
word found in the original text was counted as correct). This was done for
two reasons: (a) we wanted each item to be interpretable as fillable by a
single lexical item for analysis purposes; and (b) with the hundreds of
items and thousands of examinees in this study, using an acceptable-
answer scoring or any other of the available scoring schemes would
clearly have been beyond our resources.
Analyses
Initially, CTT statistics were used to analyze the cloze test data. These
statistics included: the mean, standard deviation, minimum and maximum
scores, reliability, item facility, and item discrimination. Rasch analyses
were also used in this study to calculate item difficulty measures and to
identify misfitting test items. We used FACETS (Linacre, 2014a) analysis
rather than WINSTEPS because the former allowed us to easily analyze
our nested design (i.e., multiple tests administered to different groups of
examinees). Or as Linacre put it, “Use Winsteps if the data can be
formatted as a rectangle, e.g., persons-items … Use Facets when Winsteps
won’t do the job” (Linacre, 2014b, np).
We performed the analyses in several steps. Initially, we needed to
determine anchor values through a separate FACETS analysis of only the
10 anchor items that were administered across all groups of participants.
Then, we created a FACETS input file to link our 50 cloze tests by using
our 10 anchor items (see Appendix 1-D for a description of the actual code
that was used). There were three facets in this analysis: test-takers, test
version, and test items. By using the FACETS program, we were able to
combine the 50 different cloze procedures for both nationalities into a
single analysis using anchor items, and put all of the items onto the same
true interval scale for ease of comparison (see e.g., Bond & Fox, 2007, pp.
75-90). Four of the total 1,500 items had blanks that were either missing or
made no sense, thus the total number of valid cloze items was 1,496.
Appendix 1-D also shows how we coded the data for the analysis. In
order to analyze separate tests in a single analysis using a common set of
anchor items, each examinee required two lines of response data. The first
line corresponds to the set of items for the particular cloze procedure, set
up by examinee ID, test version, the range of applicable items (e.g., 101-
130 for items 1-30 on Test 1), followed by the observed response for each
item. An additional line was also needed for examinee performance on
anchor items, with the same coding format as above except for a common
range of items for all examinees (31-40). The series of commas within the
data indicates items that were removed as explained above. Using the
same setup, we were able to run the program separately for the samples in
Russia, Japanese, and Combined (i.e., with the two samples analyzed
together as one).
Results
Classical Test Theory
Descriptive Statistics. As most previous item analyses of cloze tests have

been based on CTT, we began our analysis by focusing on the CTT
characteristics of our cloze tests and their items. Tables 1-1 and 1-2 show
the descriptive statistics and internal consistency reliability estimates for
our 50 cloze tests in test number order for each nationality. In general, the
means are low for the 30-item cloze tests, indicating that the items (scored
for exact answers) were quite difficult for the students. Tables 1-1 and 1-2
indicate that the Russia sample generally produced higher means and
standard deviations than the Japan sample.
Reliability. Tables 1-1 and 1-2 also show how the reliability estimates of
the various cloze passages were for the two nationalities. These cloze tests
functioned somewhat less reliably with the Japan sample (ranging from
.17 to .87) than with the Russia sample (ranging from .65 to .92). This
pattern could be a consequence of the greater variation and perhaps the
larger sample sizes in Russia. A synthesis of the cloze passages’ reliability
estimates is shown in Table 1-3.
8 Chapter One
Table 1-1: Descriptive statistics for 50 cloze passages and reliability –

Japan sample (adapted and expanded from Brown, 1998).
Japan
Test Mean SD Min Max N r
1 5.23 3.16 0 15 48 0.71
2 4.21 3.42 0 13 47 0.86
3 2.02 2.13 0 10 48 0.74
4 7.54 3.87 2 16 46 0.80
5 3.98 2.79 0 13 47 0.73
6 5.11 3.23 0 14 47 0.80
7 6.14 3.41 0 16 43 0.83
8 3.16 2.27 0 8 45 0.46
9 2.85 2.46 0 11 46 0.77
10 2.54 2.31 0 8 46 0.83
11 5.94 3.36 0 16 46 0.74
12 8.98 3.97 0 21 47 0.79
13 2.87 1.71 0 8 46 0.50
14 3.23 2.50 0 9 47 0.68
15 9.18 3.42 4 18 49 0.68
16 1.36 1.41 0 6 48 0.65
17 1.38 1.25 0 5 46 0.35
18 1.02 1.09 0 3 50 0.50
19 4.76 2.88 0 10 50 0.70
20 4.38 3.24 0 15 47 0.86
21 9.92 4.44 0 19 48 0.84
22 3.70 2.86 0 11 47 0.84
23 3.64 2.40 0 11 43 0.65
24 2.96 2.26 0 9 47 0.44
25 5.36 2.74 0 12 46 0.63
26 2.68 1.56 0 5 47 0.17
27 2.34 2.72 0 13 47 0.87
28 2.58 2.17 0 8 43 0.57
29 2.32 1.77 0 7 44 0.64
30 9.56 3.28 3 16 48 0.72
31 3.78 3.08 0 15 46 0.83
32 3.83 2.53 0 9 42 0.77
33 2.14 1.87 0 6 44 0.63
34 5.87 2.92 0 13 45 0.82
35 6.63 3.66 0 17 45 0.72
36 5.00 2.05 0 9 46 0.51
Japan
37 5.46 3.66 0 13 48 0.77
38 1.71 1.57 0 8 48 0.75
39 2.51 1.98 0 9 47 0.65
40 3.49 1.90 0 9 43 0.66
41 2.87 2.51 0 10 43 0.76
42 4.41 3.10 0 18 44 0.81
43 1.43 1.45 0 7 44 0.19
44 3.24 2.52 0 10 46 0.67
45 6.55 3.87 0 16 42 0.79
46 2.16 1.82 0 7 47 0.31
47 3.79 2.33 0 11 43 0.69
48 2.69 2.12 0 11 42 0.74
49 4.56 2.81 0 11 49 0.75
50 2.49 2.70 0 12 45 0.77
Mean 4.11 2.61 0.18 11.34 45.96 0.69
Note: SD = Standard Deviation; Min = Minimum score; Max = Maximum
score; N = number of items; r = reliability.
Table 1-2: Descriptive statistics for 50 cloze passages and reliability –

Russia sample (adapted and expanded from Brown, 1998).
Russia
1 6.78 3.99 0 16 120 0.75
2 7.06 4.94 0 19 102 0.85
3 3.94 3.71 0 14 103 0.81
4 9.82 6.12 0 21 105 0.89
5 6.54 4.38 0 22 106 0.82
6 5.34 4.19 0 16 102 0.83
7 8.07 6.22 0 20 103 0.90
8 3.13 3.67 0 24 101 0.86
9 4.08 3.67 0 23 105 0.81
10 3.77 4.24 0 22 102 0.87
11 5.74 4.53 0 17 101 0.85
12 9.27 4.86 0 20 115 0.83
13 3.30 3.89 0 17 105 0.86
14 5.10 4.70 0 17 107 0.87
15 8.10 5.60 0 21 106 0.89
16 2.30 2.70 0 11 115 0.77
10 Chapter One
Russia
17 2.55 2.29 0 10 109 0.65
18 1.60 2.27 0 15 100 0.78
19 6.15 5.08 0 30 102 0.88
20 5.41 5.01 0 24 97 0.89
21 10.32 6.96 0 27 103 0.92
22 3.74 3.64 0 14 102 0.83
23 3.58 3.36 0 14 102 0.79
24 2.13 2.37 0 10 101 0.71
25 4.63 4.55 0 15 102 0.87
26 4.35 3.25 0 21 100 0.77
27 3.48 3.07 0 15 100 0.75
28 4.01 3.81 0 18 102 0.84
29 3.39 2.70 0 11 102 0.70
30 12.82 5.39 0 22 111 0.83
31 4.88 3.89 0 14 101 0.82
32 4.96 3.22 0 12 101 0.79
33 2.82 2.57 0 10 102 0.71
34 7.11 4.43 0 18 102 0.83
35 6.72 5.54 0 25 103 0.87
36 4.81 4.11 0 16 96 0.83
37 8.38 5.46 0 24 103 0.87
38 2.42 2.44 0 14 106 0.74
39 3.62 3.44 0 12 103 0.80
40 3.87 4.39 0 24 90 0.88
41 4.53 3.56 0 14 101 0.79
42 4.78 4.10 0 20 93 0.84
43 2.09 2.56 0 15 99 0.76
44 4.80 4.28 0 19 102 0.85
45 9.24 6.59 0 21 101 0.91
46 3.69 3.49 0 14 93 0.80
47 3.19 2.79 0 12 104 0.73
48 2.98 3.36 0 18 108 0.75
49 4.39 4.10 0 15 122 0.86
50 3.57 3.04 0 13 109 0.76
Mean 5.07 4.05 0 17.52 103.40 0.82
Note: SD = Standard Deviation; Min = Minimum score; Max = Maximum score; N
= number of items; r = reliability.
Table 1-3: Frequencies and percentages of tests that functioned well

or poorly in terms of internal consistency reliability.
Below .50- .60- .70- .80- .90- Total

.49 .59 .69 .79 .89 .92
Frequencies
Japan 6 4 11 17 12 0 50
Russia 0 0 1 16 30 3 50
Total 6 4 12 33 42 3 100
Percentages
Japan 12% 8% 22% 34% 24% 0% 100%
Russia 0% 0% 2% 32% 60% 6% 100%
Total 6% 4% 12% 33% 42% 3% 100%
Item Analyses. One aspect of CTT item analysis that testers often examine
while developing norm-referenced tests (NRTs) is item facility (IF).
Brown (2005) recommends keeping items with an IF in the range between
.30 and .70 and discarding or replacing any items with IF values outside
that range. Table 1-4 shows at the bottom of the third column of numbers
that only 19.1% (Japan = 14.0%; Russia = 24.1%) of the items overall
were functioning well in the .30-.70 range. Interestingly, 27.6% of the
items (Japan = 31.7%; Russia = 23.5%) were not functioning at all (i.e.,
nobody answered correctly, hence IF = .00) and over 50% of the items
were in the .01 to .29 range, which further confirms that these items were
generally too difficult for these two samples.
12 Chapter One
Table 1-4: Frequencies and percentages of items that functioned well

in terms of Item Facility (IF).
Range .00 .01-.29 .30-.70 .71-1.00 Total

Frequencies
Japan 474 776 210 36 1,496
Russia 351 759 360 26 1,496
Total 825 1535 570 62 2,992
Percentages
Japan 31.7% 51.9% 14.0% 2.4% 100.0%
Russia 23.5% 50.7% 24.1% 1.7% 100.0%
Total 27.6% 51.3% 19.1% 2.1% *100.1%
Note: * = Does not add up to exactly 100% because of rounding error.
Another aspect of CTT item analysis that testers often consider when
developing NRTs is item discrimination (ID, calculated using the point-
biserial correlation coefficient in this study). ID values can range from .00
for items that do not discriminate at all between the high and low
performing examinees to 1.00 for items that are discriminating perfectly
between the high and low examinees. Negative discrimination values,
which can range down to -1.00, indicate the degree to which items are
measuring differently from the total scores on the test. Generally, CTT test
designers try to use items with the highest positive ID values available
when developing and revising NRTs. Ebel (1979) suggests these following
ID value ranges for test development: poor ID (.00-.19), marginal ID (.20-
.29), good ID (.30-.39), and very good ID (.40 and higher).
Table 1-5 shows the frequencies and percentages of cloze items in
terms of different ID value ranges for both nationalities. The Total row at
the bottom shows that on average 28.8% of the items contributed nothing
to the discrimination of these tests, though that percentage was
considerably higher in Japan (41.2%) than in Russia (16.4%). While large
proportions of items were Very Poor, Poor, or Marginal discriminators
when used with both test-taker samples, the cloze tests when used in
Russia had considerably more items in the Good and Very Good categories
(19.1% and 10.1%, respectively) than in Japan (9.2% and 2.1%,
respectively).

in terms of Item Discrimination (ID).
Very poor .01-.09
Marginal .20-.29
.00 or negative
Very good .40+

Non-functional
Good .30-.39
Poor .10-.19
Total
Frequencies
Japan 616 194 258 258 138 32 1,496
Russia 246 173 292 348 286 151 1,496
Total 862 367 550 606 424 183 2,992
Percentages
Japan 41.2% 13.0% 17.2% 17.2% 9.2% 2.1% 100%
Russia 16.4% 11.6% 19.5% 23.3% 19.1% 10.1% 100%
Total 28.8% 12.3% 18.4% 20.3% 14.2% 6.1% 100%
Many Faceted Rasch Measurement (MFRM)

Many-Faceted Rasch measurement (MFRM) was used to complement
our CTT analyses. Infit mean squares were used to identify misfitting test
items (i.e., items that were not functioning properly within the MFRM
model) and to calculate item difficulty measures (analogous to the item
facility, or IF, discussed for CTT) for each test item using FACETS
(Linacre, 2014a). Table 1-6 summarizes the fit statistics for Japan, Russia,
and the two combined.
The # Extreme Items column in Table 1-6 shows how many items did
not fit the model based on nobody answering them correctly. There were
475, 195, and 171 of these items for the Japan, Russia, and Combined
samples, respectively. Thus, about 31.8%, 13.0%, and 11.4% of the items
were extreme items, respectively. Clearly, the percentage of such items
was much higher in the Japan samples than in the Russia samples and the
two sets of samples combined.
The # Underfitting Items column in Table 1-6 shows how many items
“did not fit the general pattern of responses in the matrix, and can thus be
classified as relatively misfitting…” (McNamara, 1996, p. 171). Notice
that there were 32, 34, and 43 underfitting items in the analyses for the
14 Chapter One
Japan, Russia, and Combined samples, respectively. Thus, only 2.1%,

2.3%, and 2.9% of the items were underfitting, respectively.
The Total # Misfitting column in Table 1-6 represents the total of both
the extreme and underfitting items, for which there were 507 (33.9%), 229
(15.3%), and 214 (14.3%) of these for the Japan, Russia, and Combined
samples, respectively. Conversely, the # Fitting Items represents the total
number of items that fit the model according to the Rasch analyses. For
the 1,496-item analysis, there were 989, 1,267, and 1,282 of these for the
Japan, Russia, and Combined samples, respectively. Thus, about 66.1%,
84.7%, and 85.7% of the items fit the model. These findings underscore
two commonsense notions: that many test items when piloted will not
function properly and that item function may be related to test-taker
origin.
Table 1-6: Summary fit statistics for analyses of 1,496 items by

country of origin.
# Extreme Items*
Total # Misfitting
Item Separation
Item Reliability
# Fitting Items
# Underfitting
Item RMSE
Items
Sample Ȥ2
Japan 475 32 507 989 1.16 1.53 .70 .00

Russia 195 34 229 1,267 0.82 2.62 .87 .00
Combined 171 43 214 1,282 0.76 2.84 .89 .00
Note: * = Extreme items are those for which no examinee answered correctly.
The Item RMSE is the root mean square standard error statistic used
in calculating the separation index discussed next, but it can be interpreted
on its own as a standard error. The lower the value of RMSE, the better the
data fit the model. In this study, the RMSE values were high (ranging from
.76 to 1.16) which indicates that a number of items were not fitting the
model as well as might be desired (as discussed above).
The item separation index is an estimate of the spread of the item
estimates relative to their precision, “the number of standard errors of
spread among the items” (Bond & Fox, 2007, p. 59), which is to say that
this measure reports reliability in units of standard error. Higher values are
desired in this case. Notice that the separation index is higher for the
combined data than for the single nationalities, and also higher for the
samples in Russia than for those in Japan. In all three cases, this indicates
that the items are comparatively high in terms of the way the difficulty
estimates are spread out relative to their precision.
The item reliability statistic shown in Table 1-6 is similar to
Cronbach’s alpha (Bond & Fox, 2007), and as with Cronbach’s alpha, is
on a scale from 0.00 to 1.00. High reliability for items in this case
indicates that items’ measures of difficulty are predicted to be ordered
similarly across different iterations of Rasch modeling. The analyses
indicated moderate item reliability of .70 for the Japan samples,
considerably higher reliability of .87 for the Russia samples, and even
higher reliability at .89 for the combined samples.
The chi-square (fixed) values test the following hypothesis: “Can this
set of elements be regarded as sharing the same measure after allowing for
measurement error?” Thus, for this design, the following hypothesis is
being tested: Can these items be thought of as equally difficult? Clearly,
since all of the chi-square (fixed) statistics in this study were found to be
significant (at p < .00), this hypothesis must be rejected, that is, the answer
is no, the items cannot be thought of as equally difficult. Thus, the
variations in item difficulty estimates are probably due to factors other
than chance.
Rasch Vertical Rulers. Tables 1-7 and 1-8 present the vertical rulers
that resulted from our FACETS analyses for Japan and Russia. The first
column shows the scale for the vertical ruler, which represents the range of
scores on a true interval logit scale, centered on 0. Note that FACETs
requires at least one category to be fixed (i.e., centered on 0.00) in order to
set the parameters of the scale. We chose to center tests on 0.00 because
they were the same for both groups. Because persons and items were the
more interesting categories, they were set to float (i.e., non-centered) to
reveal their positions relative to one another. In these rulers, the range of
logit scores is from low scores at about -6 to high scores at +7. The second
column shows the test-taker ability measures for Japan ranging from about
2.5 down to -6. The third column shows the relative difficulty of the 50
cloze tests when used in Japan. The fourth column shows the test item
difficulties for Japan with a number of items at +4 (i.e., maximally
difficult because nobody answered them correctly) and others ranging
down below -5. The fifth column shows the logit scores again, and the
sixth column shows the test takers in Russia on the scale ranging from
about 6.5 down to -5. The seventh column shows the test versions in
Russia. The eighth column shows the test item difficulties in Russia with a
number of items at +7 (i.e., meaning that they were maximally difficult in
16 Chapter One
that nobody answered them correctly) and the others ranging down to
below -3.
Table 1-7: Vertical rulers for test taker ability, test version difficulty,
and test item difficulty for Japan.
Japan
Logits Test Taker Test Version Test Item Difficulty
Ability Difficulty
7 + + +
| | |
| | |
| | |
6 + + +
| | |
| | |
| | |
5 + + +
| | |
| | |
| | |
4 + + + XXXXXXXXX*
| | | *
| | | *
| | | X*
3 + + + X*
| | | *
| . | | *
| . | | *
2 + + + *
| . | | X*
| . | | *
| . | | *
1 + * + + *
| *. | | *
| **. | | X*
| ***. | *** | *
0 * ****. | ********** * *
| *****. | ***. | *
| ********. | | *
| *******. | | *
Japan
Logits Test Taker Test Version Test Item Difficulty
Ability Difficulty
-1 + ********. + + *
| *********. | | *
| *******. | | *
| *********. | | *
-2 + ******. + + *
| *****. | | *
| ****. | | *
| ***. | | *
-3 + **** + + *
| **. | | *
| *. | | *
| . | | *
-4 + . + + *
| | | *
| . | |
| . | | *
-5 + + + *
| . | | *
| | |
| . | |
-6 + *. + +
* = 23 *=3 x = 48
. = 4 . =1 * = 24
18 Chapter One
Table 1-8: Vertical rulers for test taker ability, test version difficulty,
and test item difficulty for Russia.
Russia
Logits Test Taker Test Version Test Item
Ability Difficulty Difficulty
7 + + + *********.
| | |
| . | |
| | |
6 + + + .
| | | .
| . | | *.
| . | | **.
5 + + + *.
| | | **.
| . | | *.
| . | | *.
4 + . + + **.
| . | | *.
| . | | *.
| . | | **.
3 + . + + **.
| . | | **.
| . | | **.
| *. | | **.
2 + **. + + ***.
| ***. | | ***.
| *****. | | ***
| *****. | | **.
1 + *******. + + ***.
| ********. | | **.
| *********. | | **.
| ********. | *. | **.
0 * *******. | ********. * *.
| ********. | ** | *.
| *******. | | *.
| ********. | | *.
-1 + *******. + + *.
| *******. | | .
| ******. | | .
Russia
Logits Test Taker Test Version Test Item
Ability Difficulty Difficulty
| ***. | | .
-2 + *****. + + .
| ***. | | .
| ***. | | .
| ***. | | .
-3 + **. + + .
| *. | |
| . | |
| . | |
-4 + . + +
| . | |
| . | |
| | |
-5 + . + +
| | |
| | |
| | |
-6 + + +
* = 34 *=4 * = 20
. = 28 . =1 . = 10
Comparison of CTT and Rasch Item Analysis Results

Table 1-9 shows the frequencies and percentages of items that
functioned well in terms of CTT ID and Rasch fit statistics. Notice that
according to the CTT framework, only 11.4% and 29.2% of the items for
the Japan samples and Russia samples, respectively, worked well in the
sense that they produced ID values of .30 or higher, which by extension
means that 88.6% and 70.8%, respectively, of the items for the two
samples were .29 or lower. In short, based on the CTT ID values, a fairly
small number of items—no more than 29.2%—can be said to have been
working well for NRT development purposes even in the Russia sample.
20 Chapter One

in terms of CTT ID and Rasch Fit.
CTT ID CTT ID Rasch Rasch

.30 + Below .30 Fit Misfit
Frequencies
Japan 170 1,326 989 507
Russia 437 1,059 1,267 229
Percentages
Japan 11.4% 88.6% 66.1% 33.9%
Russia 29.2% 70.8% 84.7% 15.3%
The Rasch analyses results in Table 1-9 provide a brighter picture with
larger proportions of items, 66.1% and 84.7% for the Japan samples and
Russia samples, respectively, fitting the model and thus being analyzable
and useable. The association between the nationality and fit (analyzed in a
2 x 2 contingency table with Japan and Russia on one dimension and
misfit and fit on the other) indicated a small but demonstrable degree of
association (22%) between these two factors (Ȥ² = 139.26, df = 1, p < .00;
phi = .22). We conclude from Table 1-9 that the CTT item analysis
statistics indicated that only small numbers of items are functioning well,
while the Rasch analysis indicated that relatively larger numbers of items
fit the Rasch model and were thus analyzable, interpretable, and useable
for NRT purposes.
Further analyses provided additional detail. It turned out that the
number (and percentage) of the 1,496 items that were functioning well in
the CTT items analysis (i.e., with ID = .30+) were as follows: 106 (7.1%)
worked for both nationalities; 331 (22.1%) worked uniquely in the Russia
samples; and 64 (4.3%) worked uniquely in the Japan samples. Hence, in
the CTT analyses, considerable differences occurred in which items were
working for each of the nationalities. Similarly, in the Rasch analyses, the
number (and percentage) of the 1,496 items which fit were as follows: 938
(62.7%) fit for both nationalities; 329 (22.0%) fit uniquely for the Russia
takers; 51 (3.4%) fit uniquely for the Japan test-takers. Again,
considerable differences surfaced for which items were fitting for each of
the nationalities, but less so than in the CTT analyses.
Linguistic Characteristics and Rasch Item Fit
Given that the Rasch analysis turned out to be more sensitive and
appropriate for analyzing the effectiveness of the items in this study, the
remaining analyses were based on the Rasch item fit statistics. The
linguistic characteristics considered here were (a) the parts of speech of
the word in each blank, (b) whether the word was a content or function
word, (c) whether it was of Latinate or Germanic origin, and (d) the
frequency of the word in the Brown Corpus. Based on previous research
(Brown, 1992), we had a reasonable expectation that these linguistic
characteristics would have some relationship with item performance.
Parts of Speech and Rasch Item Fit. Table 1-10 shows the item misfit
and fit in the Rasch analyses separately for Japan and Russia for the
different parts of speech across all 1,496 items, presented in alphabetical
order by part of speech. Note that the chi square (Ȥ²) statistic for items that
fit for parts of speech by Japan and Russia was only 13.68 with df = 16,
which is not significant even at the very liberal p > .50. The Cramer’s V
statistic was .078. Thus, counter to our expectations, the pattern of item fit
for the parts of speech does not vary for the two nationalities beyond what
would be expected by chance alone (in this case, with p > .50) and the
association between test-taker linguistic background and parts of speech is
only about 7.8%. Of course, none of this means that these frequencies are
significantly the same as what could reasonably be expected by chance
alone, but it did convince us that there were no variations worth further
scrutiny in this table.
Content/Function Words and Rasch Item Fit. Table 1-11 shows the
number of content and function words that misfit or fit depending on
nationality for the 1,496 items. The Ȥ² statistic for word type by nationality
2 x 2 contingency table for the frequencies of fitting items was 5.02 with
df = 1, which is significant at the .025 level. The phi statistic here was .05.
Thus, these fluctuations in frequencies are significantly different from
chance at .025 and were associated to a very small degree (5%) with
content versus function distinction. Visual inspection of the percentages
shown in this table indicated that (a) fewer items in both the content and
function word categories fit in the Japan sample, (b) in the Russia sample,
a somewhat higher percentage of function words fit the Rasch model than
content words, and (c) in the Japan sample, considerably higher
percentage of function words fit than content words.
22 Chapter One
Table 1-10: Item misfit by linguistic background for different parts of

speech for 1,496 items.
% Difference
Russia Misfit
Russia % Fit
Japan Misfit
Japan % Fit
Japan Fit
Russia Fit
Part of
Speech Total
Adjective 86 70 45 111 156 44.9 71.2 26.3

Adverb 40 63 15 88 103 61.2 85.4 24.3
Auxiliary 1 22 0 23 23 95.7 100.0 4.3
Conjunction 16 69 7 78 85 81.2 91.8 10.6
Contraction 1 3 1 3 4 75.0 75.0 0.0
Copula 2 29 2 29 31 93.5 93.5 0.0
Definite 13 96 10 99 109 88.1 90.8 2.8
Article
Gerund 3 8 1 10 11 72.7 90.9 18.2
Indefinite 8 27 0 35 35 77.1 100.0 22.9
Article
Modal 2 7 1 8 9 77.8 88.9 11.1
Noun 169 187 79 277 356 52.5 77.8 25.3
Number 7 6 1 12 13 46.2 92.3 46.2
Personal 7 61 3 65 68 89.7 95.6 5.9
Pronoun
Preposition 45 174 15 204 219 79.5 93.2 13.7
Pronoun 10 46 1 55 56 82.1 98.2 16.1
Proper Noun 14 42 8 48 56 75.0 85.7 10.7
Verb 83 79 40 122 162 48.8 75.3 26.5
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Table 1-11: Item misfit for word type (content & function) in the
% Difference
Russia Misfit
Russia % Fit
Japan Misfit
Japan % Fit
Russia Fit
Japan Fit
Word
Type Total
Content 424 601 197 828 1,025 58.6 80.8 22.2
Function 83 388 32 439 471 82.4 93.2 10.8
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Germanic/Latinate Word Origin and Rasch Item Fit. Table 1-12 shows
the number of Latinate and Germanic words that misfit or fit in the Rasch
models constructed for the 1,496 items, used with test-takers from two
nationality backgrounds. The Ȥ² statistic for the word origin by nationality
2 x 2 contingency table for the frequencies of fitting items was 6.17 with
df = 1, which is significant at the .025 level. The phi statistic here indicates
that the association was .05. Thus, these fluctuations in frequencies are
significantly different from chance at .025 and were associated to a very
small degree (5%) with Latinate versus Germanic distinction. Visual
inspection of the row percentages shown in this table indicates again that
(a) a smaller proportion of items fit in the Japan samples than in the Russia
sample, (b) in both the Russia and Japan samples, a considerably higher
proportion of the Germanic words fit the Rasch model than Latinate
words, and (c) Latinate words were less likely to fit than Germanic words
for the Russia sample and considerably less for the Japan sample.
Word Frequency and Rasch Item Fit. The point-biserial correlation
coefficients between whether or not the items fit with the raw frequencies
were .222 for the Japan data and .123 for the Russia data. Because the
frequencies had skewed distributions, we also transformed those
frequencies using a natural log transformation. The point-biserial
correlation coefficients between whether or not the items fit with the
transformed frequencies were .363 for the Japan data and .279 for the
Russia data, which were significant at p < .01. These results indicate
Rasch item fit estimates were somewhat related to item frequencies.
24 Chapter One
Table 1-12: Item misfit for word origin (Latinate & Germanic) in the
Russia and Japan data for 1,496 Items.
% Difference
Russia Misfit
Russia % Fit
Japan Misfit
Japan % Fit
Russia Fit
Japan Fit
Word Origin Total
Latinate 219 181 114 286 400 45.3 71.5 26.2
Germanic 288 808 115 981 1,096 73.7 89.5 15.8
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Pearson correlation coefficients between the item logit scores and raw
item frequencies were -.215 (p < .05) for the Japan data and -.294 (p < .01)
for the Russia data. The Pearson correlation coefficients between the item
logit scores and the transformed item frequencies were -.356 for the Japan
data and -.430 for the Russia data, both of which were significant at p <
.01. These correlations were negative because, as we expected, as the
magnitude of the difficulty estimates increased the frequencies decreased.
Nonetheless, these results indicate Rasch item difficulty estimates were
also somewhat related to item frequencies.
Table 1-13 shows the Rasch item fit frequencies separately for the
different the levels of vocabulary frequency in the Brown Corpus across
all 1,496 items for the two nationalities. Examining the percentages of
item fit on the right side of Table 1-13, it is easy to see that, below a
certain frequency threshold (i.e., as items become less and less frequent),
the infrequent items did not fit the models well. For instance, lexical items
that occurred fewer than 1,000 times in the Brown Corpus were much less
likely to fit (i.e., accounted for a much lower percentage) than more
frequent lexical items in the Japan data. Similarly, lexical items that
occurred fewer than 50 times in the Brown Corpus were much less likely
to fit than lexical items that occurred more often in the Russia data. This
result is intuitive in that less-frequent lexical items are likely to be more
unpredictably known or not known by test takers than the more frequent
items and thus are less likely to fit.
Table 1-13: Item misfit as a function of word frequency (in the Brown
Corpus found in Francis & Kuþera, 1979) in the Japan and Russia
data for 1,496 items.
% Difference
Russia Misfit
Russia % Fit
Japan Misfit
Japan % Fit
Russia Fit
Japan Fit
Frequency in
Brown Corpus
Total
30000+ 17 142 11 148 159 89.3 93.1 3.8

20000-29999 31 116 7 140 147 78.9 95.2 16.3
10000-19999 1 20 2 19 21 95.2 90.5 -4.7
5000-9999 18 128 6 140 146 87.7 95.9 8.2
1000-4999 44 166 18 192 210 79.0 91.4 12.4
100-999 143 210 53 300 353 59.5 85.0 25.5
50-99 56 69 20 105 125 55.2 84.0 28.8
10-49 99 73 49 123 172 42.4 71.5 29.1
0-9 98 65 63 100 163 39.9 61.3 21.4
Total 507 989 229 1,267 1,496 66.1 84.7 18.6
Discussion
A number of research studies (Brown, 1988, 1989, 2013; Brown,
Yamashiro, & Ogane, 1999, 2001; Revard, 1990) have examined the
issues involved in developing cloze items based on CTT item analysis.
This chapter expands on those analyses and is one of a few that apply
Rasch analysis to cloze testing. It is also one of the first studies to
systematically examine the performances of large numbers of students
from two distinct language backgrounds on 50 different cloze tests.
Recall that the title of the study was How well do cloze items work and
why? With regards to the first part of that question, How well do cloze
items work?, as mentioned above, the literature on the topic has shown
considerable variation in how well cloze tests function with reliability
estimates ranging from .31 to .96 and validity coefficients ranging from
.43 to .91 (Brown, 2013). The present study was designed to look a bit
closer at these issues by addressing five research questions.
26 Chapter One
How do the CTT descriptive statistics, reliability, and item analyses

differ for the test taker populations from two different linguistic
backgrounds?
The descriptive statistics in Tables 1-1 and 1-2 indicated that the raw
score means were low for the 50 thirty-item cloze tests administered to
samples from two different linguistic backgrounds. Reliability estimates
varied widely depending on the different cloze test forms and the language
background of the groups. IF statistics indicated that about one fifth of the
test items were contributing effectively to the variance in cloze scores for
these different test-taker populations; similarly, ID values also indicated
that that about one fifth were Good or Very Good discriminators.
From a CTT perspective, then, many of the 50 tests in this study were
not functioning particularly well in terms of central tendency, dispersion,
and reliability. These mixed results are consistent with the literature and
may be due to the fact that large numbers of the 1,496 individual items
were either not working at all or were not particularly effective as NRT
items in terms of IF and ID. These data potentially indicate that random
deletion patterns may not be the most effective way to build cloze tests;
indeed, item creation based on rational selection that considers lexical
items with relatively high word frequency may be a more productive
strategy for the development of cloze items either manually or through
automatic generation (as advocated by Coniam, 2013).
How do Rasch item difficulty measures differ for the test takers from
different linguistic backgrounds?
In the Rasch analyses, the items also proved to be difficult, that is, they
were generally suitable for students of high or very high ability levels as
indicated by the logit scores for the two test-taker groups. For both test-
taker groups, ability estimates were lower than the item logits in many
cases. Thus, as in the CTT analyses, a number of items were suitable for
students above the general ability levels of these samples. Nonetheless, the
same vertical rulers indicate that a fairly high proportion of items was also
suitable for the examinees in these samples.
How do the proportions of functioning cloze items differ between the

CTT and IRT analyses, when based on the test taker responses from
different linguistic backgrounds?
Table 1-9 presented the frequencies and percentages of items that

functioned well (in terms of CTT ID and Rasch fit statistics). As shown
there, higher percentages of items were analyzable and usable if

considered within the constraints of Rasch analysis. It appears then that
Rasch analysis was better suited for accounting for items like the ones
found in our cloze tests than was CTT item analyses.
In what ways do Rasch item fit patterns differ in terms of factors such
as linguistic background and four cloze item linguistic features: parts
of speech, word type, word origin, and word frequency?
Given that the Rasch analyses proved to be more sensitive and

appropriate for analyzing the effectiveness of the items in this study, we
investigated the degree to which parts of speech, word types, word origins,
and word frequencies accounted for any differences in how items fit the
Rasch model in the Russia and Japan samples.
No statistically significant association (p > .50) surfaced in the
numbers of items that fit for the two samples due to their parts of speech.
However, significant associations (at p < .025) were found for the
proportions of items in the Russia and Japan data that fit for both word
type (content vs. function) and word origin (Latinate vs. Germanic). Phi
statistics indicated that each of these variables only accounted for about
5% of their association with item fit. In both samples, function words
(with 93.2% and 82.4% fitting for the Russia and Japan samples,
respectively) were somewhat more likely to fit than content words (with
80.8% and 58.6% fitting for the Russia and Japan samples, respectively),
and Germanic words (with 89.5% and 73.7% fitting for the Russia and
Japan samples, respectively) were more likely to fit than Latinate (with
71.5% and 45.3% fitting for the Russia and Japan samples, respectively).
The analyses of the relationship between word frequency and item fit
showed somewhat more association. Point-biserial correlation coefficients
(rpbi) between whether or not items fit and the (natural log) transformed
frequencies of the words in the blanks were .279 for the Russia sample and
.363 for Japan. In more detail, items in the Russia samples were more
likely to fit (with 84% or more fitting) if they were based on words with
frequencies of 50 more in the Brown Corpus. The results for the Japan
samples were a bit less clear. However, in general, items based on words
with frequencies of 1000 or more were considerably more likely to fit the
Rasch model.
28 Chapter One
Which linguistic characteristics will increase the probability of cloze

items functioning well during piloting?
What does all of this mean for cloze test design? Based on the results
of these analyses, it appears that using higher proportions of frequent
words (i.e., with frequencies over 50 in the Brown Corpus) should help
produce items that fit for samples like that in Russia. However, for
samples like that in Japan, items based on words with frequencies of 1,000
or more appear to be more likely to fit the Rasch model. It is also
advisable to use higher proportions of items requiring Germanic words (as
opposed to Latinate words) as it could help produce somewhat more items
that fit for samples like those in Russia and Japan (though somewhat more
so for samples like that in Russia). Using higher proportions of items
requiring function words (as opposed to content words) may help produce
somewhat more items that fit for samples like those in Russia and Japan,
(though more so for samples like that in Russia). Note that increasing the
proportions of function words might increase the degree to which
grammar knowledge is being tested.
Certainly, sound test development practices (and the results of this
study) dictate that the best strategy for producing effective cloze tests is to
pilot those tests with larger numbers of items and then use Rasch analysis
to select those items that fit the sorts of examinees being tested. If that is
not possible, it may help to select items that tend to require function words
and words of Germanic origin in the blanks, or more importantly, words
that occur frequently in English.
Conclusion
Implicit in the second part of the question posed by the title of this
paper is the question: Why do cloze items and tests operate the way they
do? Clearly, one reason for the items being as difficult as they proved to
be is that they were natural cloze items (i.e., cloze tests developed from
passages randomly selected from a large collection of native-speaker
texts). As first demonstrated in Brown (1993), such natural cloze tests
tend to be difficult even for university level students of English, especially
when scored using the exact-answer scoring method as was the case in this
study.
The generally wider dispersion of scores found for the Russia samples
may have occurred because (a) these test-takers varied more widely in
ability levels than the Japan samples, (b) the potential for variation was
greater as a result of their higher means, or (c) both. The differences in
reliability may have been due to the higher means in the Russia samples,
to the greater variance, to the larger sample sizes, or to differences in the
test-takers (e.g., higher motivation, more familiarity with cloze format,
etc.).
The fact that much larger proportions of items functioned well in the
Rasch item analyses than in the traditional CTT analysis may be explained
by the nature of Rasch analyses which provide item difficulty estimates
based on the probability of an average test-taker answering a given item
correctly rather than on the proportion of examinees who answered
correctly as in CTT. As a result, the Rasch item difficulty estimates were
not sample dependent and were not affected as much as the CTT item
analyses statistics by the relative difficulty of the items in these 50 tests for
both samples. Hence, we were able to identify greater numbers of
functioning items, even those items which might be challenging for the
test-takers, and understand at least partially why and how the items were
functioning linguistically in interesting and interpretable ways.
In addition, the fact that we used multifaceted Rasch analysis in the
form of FACETS analysis made it possible to simultaneously analyze
(using 10 anchor items administered to all test takers) 50 cloze tests with
test-takers and items nested within tests (i.e., with different test takers and
items on each test). FACETS analysis also made it possible to link the 50
cloze tests from two nationalities and thereby put the test-taker ability
estimates and item difficulty estimates on the same scales for all tests and
both nationalities. Thus, we were able to learn that Rasch analysis is more
appropriate for our cloze test analysis and revision purposes, indeed,
considerably better than traditional CTT analyses. In addition, we found
that blanks representing certain categories of words (i.e., function words
and words of Germanic origin, and to a greater extent relatively high
frequency words) are more likely to work well for NRT purposes.
If we were to revise any or all of the 50 cloze tests in this study by
selecting only those items that functioned well from a CTT perspective (as
described in Brown, 1988) or from a Rasch perspective (based on the
results of this study), we are convinced that very different tests would
surface for the Russia and Japan samples because different items were
functioning well in the two samples. This is consistent with the starting
point for this study which was that cloze items are just another “family of
item types” (Mullen, 1979, p. 21). In fact, cloze is no more than “a
technique for producing tests, like any other technique” (Alderson, 1979,
p. 226), though, according to this study, they provide a more effective set
of items from the Rasch perspective than from the CTT viewpoint. Thus,
there is no reason to “…think that cloze tests are somehow different from
30 Chapter One
other tests” (Brown, 2013, p. 26), and we should no doubt pilot and revise
cloze tests just as we would any other tests to tailor them to the specific
range of abilities involved. However, we should also consider selecting
items based on the recommendations of this study, and thereby extend the
notion of rational deletion in useful ways.
References
Alderson, J. C. (1979). The cloze procedure and proficiency in English as
a foreign language. TESOL Quarterly, 13(2), 219-227.
Bachman, L. F. (1982). The trait structure of cloze test scores. TESOL
Quarterly, 16(1), 61-70.
—. (1985). Performance on cloze tests with fixed-ratio and rational
deletions. TESOL Quarterly, 19(3), 535-555.
Baker, R. L. (1987). An investigation of the Rasch model in its application
to foreign language proficiency testing. Doctoral thesis University of
Edinburgh, UK.
Bond, T., & Fox, C. M. (2007). Applying the Rasch model: Fundamental
measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence
Erlbaum Associates.
Brown, J. D. (1988). Tailored cloze: Improved with classical item analysis
techniques. Language Testing, 5(1), 19-31.
—. (1989). Cloze item difficulty. JALT Journal, 11(1), 46-67.
—. (1993). What are the characteristics of natural cloze tests? Language
Testing, 10(2), 93-116.
—. (1998). An EFL readability index. JALT Journal, 20(2), 7-36.
—. (2005). Testing in language programs: A comprehensive guide to
English language assessment. New York: McGraw-Hill.
—. (2013). My twenty-five years of cloze testing research: So what?
International Journal of Language Studies, 7(1), 1-32.
Brown, J. D., Janssen, G., Trace, J., & Kozhevnikova, L. (2013). Using
cloze passages to estimate readability for Russian university students:
A preliminary study. In M. A. Kulinich, V. A. Levchenko, E. G.
Kashina, L. A. Kozhevnikova, & E. A. Sokolova (Eds.),
ɉɪɨɮɟɫɫɢɨɧɚɥɶɧɨɟ ɪɚɡɜɢɬɢɟ ɩɪɟɩɨɞɚɜɚɬɟɥɟɣ ɚɧɝɥɢɣɫɤɨɝɨ ɹɡɵɤɚ ɜ
ɭɫɥɨɜɢɹɯ ɦɨɞɟɪɧɢɡɚɰɢɢ ɨɛɪɚɡɨɜɚɬɟɥɶɧɨɣ ɫɢɫɬɟɦɵ». Ɇɚɬɟɪɢɚɥɵ
ɦɟɠɞɭɧɚɪɨɞɧɨɣ ɧɚɭɱɧɨ-ɩɪɚɤɬɢɱɟɫɤɨɣ ɤɨɧɮɟɪɟɧɰɢɢ. ɋɚɦɚɪɚ, 25-
26 ɦɚɪɬɚ 2013. [English language teacher professional development:
Scaling New Heights. A Collection of Conference Papers. Samara,
March 25th–26th, 2013]. Samara, Russia: Samara State University.
Brown, J. D., Yamashiro, A. D., & Ogane, E. (1999). Tailored cloze:

Three ways to improve cloze tests. University of Hawai‘i Working
Papers in ESL, 17(2), 107-129.
Brown, J. D., Yamashiro, A. D., & Ogane, E. (2001). The Emperor’s new
cloze: Strategies for revising cloze tests. In T. Hudson & J. D. Brown
(Eds.), A focus on language test development (pp. 143-161). Honolulu,
HI: University of Hawai‘i Press.
Coniam, D. (2013). A preliminary inquiry into using corpus word
frequency data in the automatic generation of English language cloze
tests. CALICO Journal, 14(2-4), 15-33.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.).
Englewood Cliffs, NJ: Prentice-Hall.
Francis, W. N., & Kuþera, H. (1979). Brown Corpus manual. Providence,
RI: Brown University Department of Linguistics.
Francis, W., & Kuþera, H. (1982). Frequency analysis of English usage:
Lexicon and grammar. Boston: Houghton Mifflin.
Hale, G. A., Stansfield, C. W., Rock, D. A., Hicks, M. M., Butler, F. A., &
Oller, J. W. (1988). MultipleǦchoice cloze items and the Test of English
as a Foreign Language: ETS research report series #1. Princeton, NJ:
Educational Testing Service.
Hoshino, A., & Nakagawa, H. (2008). A cloze test authoring system and
its automation. In Advances in web based learning–ICWL 2007 (pp.
252-263). Berlin: Springer.
Kuþera, H., & Francis, W. (1967). Computational analysis of present-day
American English. Providence, RI: Brown University Press.
Lee-Ellis, S. (2009). The development and validation of a Korean C-Test
using Rasch Analysis. Language Testing, 26(2), 245-274.
Linacre, J. M. (2014a). Facets many-facet Rasch analysis computer
program (v. 3.71.4). Retrieved from:
http://www.winsteps.com/facets.htm
—. (2014b). Winsteps & Facets comparison. Retrieved from:
http://www.winsteps.com/winfac.htm
Markam, P. L. (1985). The rational deletion cloze and global
comprehension in German. Language Learning, 35(3), 423-430.
Mullen, K. (1979). More on cloze tests as tests of proficiency in English as
a second language. In E. J. Briere, & F. B. Hinofotis (Eds.), Concepts
in language testing: Some recent studies (pp. 21-32). Washington, DC:
TESOL.
Revard, D. (1990). Tailoring the cloze to fit: Improvement of cloze tests
through classical item analysis. Unpublished scholarly paper.
Honolulu, HI: University of Hawai’i at Manoa.
32 Chapter One
Stubbs, M. (2004). Language corpora. In A. Davies & C. Elder (Eds.), The

handbook of applied linguistics (1st ed.) (pp. 106-132). Malden, MA:
Blackwell.
Taylor, W. L. (1953). Cloze procedure: A new tool for measuring
readability. Journalism Quarterly, 30(4), 414-438.
Appendix 1-A: List of participating universities.

In Japan:
Dokkyo University
Fukuoka Teacher’s College
Fukuoka University of Education
Fukuoka Women’s University
International Christian University
International University of Japan
Kanazawa University
Kansei Gakuin University
Meiji University
Saga University
Seinan Gakuin University
Soai University
Sophia University
Tokyo University of Agriculture and Technology
Toyama College of Foreign Languages
Toyama University
Toyo Women’s Junior College
Waseda University
In the Russian Federation:

Chelyabinsk Law Institute
Chelyabinsk State University
International Market Institute (Samara)
Kazan branch of the Russian International Academy for Tourism
Kazan Military Institute
Kazan State Technical University
Kolomna State Pedagogical University
Krasnodar State University
Mordovian State University
Moscow State University
Novocherkassk Polytechnic Institute
Novosibirsk State University
Orenburg State University
Presidential Cadet College (Orenburg)
Rostov/Don Institute of Management, Business and Law
Rostov/Don State University
Ryazan State University
34 Chapter One
Samara Aerospace university

Samara State Academy of Social Sciences and Humanities
Samara State University
Samara State University of Architecture and Civil Engineering
Saratov State Pedagogical University
Saratov State University
Smolensk University for the Humanities
Solykamsk State Pedagogical University
South-Ural State University
St. Petersburg State University
Surgut State University
Syktyvkar State University
Syzran branch of Samara State Technical University
Taganrog Institute of Management and Economics
Taganrog State Pedagogical University
Togliatti Academy of Management
Tomsk Polytechnic University
Ulyanovsk State University Institute for International Relations
Volga State University of Technology (former Mari State Technical
University)
Voronezh State University
Voronezh State University of Architecture and Civil Engineering
Appendix 1-B: Example cloze test

(Adapted from Brown, 1989).
Name_________________________________________________
(Last) (First)
Native Language________________________________________
DIRECTIONS:
x Read the passage quickly to get the general meaning.
x Write only one word in each blank. Contractions (example: don’t)
and possessives (John’s bicycle) are one word.
x Check your answers.
NOTE: Spelling will not count against you as long as the scorer can read
the word.
EXAMPLE: The boy walked up the street. He stepped on a piece of ice.

He fell (1)____________, but he didn’t hurt himself.
A FATHER AND SON

Michael Beal was just out of the service. His father had helped him get
his job at Western. The (1)____________ few weeks Mike and his father
had lunch together almost every (2)____________. Mike talked a lot about
his father. He was worried about (3)____________ hard he was working,
holding down two jobs. “You know,” Mike (4)____________, “before I
went in the service my father could do just (5)____________ anything.
But he’s really kind of tired these days. Working two (6)____________
takes a lot out of him. He doesn’t have as much (7)____________. I tell
him that he should stop the second job, but (8)____________ won’t listen.
(continues for 30 items...)
Answer Key
past
day
how
said
about
jobs
energy
he
…
36 Chapter One
Appendix 1-C: Actual word frequencies.

Word Count Word Count Word Count
the 109 fire 3 mine 2
of 50 found 3 months 2
and 44 good 3 most 2
to 37 groups 3 need 2
a 33 him 3 negative 2
in 32 if 3 new 2
as 18 inches 3 off 2
he 16 into 3 one 2
at 13 Jacob 3 our 2
it 13 light 3 out 2
with 13 long 3 percent 2
for 12 might 3 possible 2
is 11 more 3 quartz 2
was 11 never 3 religion 2
I 10 only 3 Saint 2
that 10 powder 3 say 2
on 9 shy 3 school 2
be 8 than 3 see 2
his 8 think 3 something 2
not 8 thought 3 spread 2
you 8 two 3 system 2
about 7 wings 3 tall 2
all 7 work 3 then 2
had 7 against 2 this 2
how 7 also 2 though 2
no 7 America 2 too 2
they 7 an 2 town 2
were 7 any 2 while 2
by 6 because 2 wife 2
first 6 before 2 working 2
said 6 bills 2 world 2
so 6 book 2 Other words
who 6 both 2 with 1 each 1,337
are 5 came 2
but 5 cattle 2 Total 1,496
from 5 children 2
or 5 company 2
up 5 compared 2
Word Count Word Count Word Count

very 5 control 2
what 5 dead 2
been 4 develop 2
Christ 4 did 2
feet 4 dollar 2
full 4 during 2
have 4 each 2
own 4 every 2
people 4 father 2
Piglet 4 few 2
she 4 fish 2
some 4 friends 2
still 4 Galileo 2
their 4 got 2
there 4 here 2
these 4 hit 2
we 4 immediately 2
went 4 Kano 2
when 4 knows 2
which 4 let 2
behind 3 limited 2
boy 3 M.I.6 2
cage 3 many 2
could 3 May 2
day 3 me 2
38 Chapter One
Appendix 1-D: FACETS input file for 50 cloze tests with

10 anchor items.
The abbreviated FACETS input file shown below illustrates how
multiple tests can be entered into a single analysis. Note that this analysis
contained three facets (test-takers, test version, and items), and the coding
for each is displayed, with comments displayed in italics and separated
from analyzable data with a semi-colon. Test-takers were coded by group,
test number, and examinee ID for ease of identification. Test versions
were coded by the corresponding cloze test number. Items were coded
both by test version and item number (e.g., item 430 = item 30 on test 4).
Notice that the first set of items is designated as anchor items and include
a measurement value.
Facets = 3 ; Test takers, test version, items

*
Labels =
1, Test-Takers ; Coded by group, test number, and examinee ID
100101 ; Group 1, test 01, examinee ID 01
…
*
2, test version ; 50 tests
01 = cloze 1 ; Cloze test 01
…
50 = cloze 50 ; Cloze test 050
*
3, Items, A ; Test items with anchor values (A)
31 = A31,-0.5 ; Anchor item 1 with a Measure of -0.50
…
40 = A40,-0.17 ; Anchor item 10 with a Measure of -0.17
101 = T01-01 ; Cloze test 01, item 01
…
130 = T01-30 ; Cloze test 01, item 30
201 = T02-01 ; Cloze test 02, item 01
…
5030 = T50-30 ; Cloze test 50, item 30
*
Data = ; Data with two response lines per examinee.
100101,1,101-
130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0; Examinee #1
cloze item responses
100102,1,101-130,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0
…
105041,50,5001-
5030,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100101,1,31-40,0,0,0,,,0,0,0,0,0; Test taker #1 anchor item response
100102,1,31-40,0,0,1,,,0,1,0,0,0
…
105041,50,31-40,0,1,1,,,1,1,0,0,0
CHAPTER TWO
ESTIMATING ABSOLUTE PROFICIENCY LEVELS

IN SMALL-SCALE PLACEMENT TESTS
WITH PREDEFINED ITEM DIFFICULTY LEVELS
KAZUO AMMA
Abstract
In traditional placement tests a candidate’s proficiency level is
assessed based on the total test score. This estimation is inappropriate for
two reasons. Firstly, it may be affected by the arbitrary combination of test
items with varying difficulties. Too many relatively difficult items might
lead to underestimating a candidate’s true proficiency level, while too
many easy items might result in overestimation. Secondly, looking at the
total placement test score is not informative enough because it does not
show the absolute proficiency level of the candidates. This chapter
proposes using a logistic regression analysis, which properly assesses a
candidate’s proficiency level as well as the confidence interval, provided
the difficulty level of individual test items is defined in advance. The
difficulty scale can be any standardized proficiency scale (e.g., the
Common European Framework of Reference) as long as the test item
difficulty is projected on it. This technique can further allow continuous
estimation for incomplete performance (i.e., when an open-ended answer
is partially correct) as well as binary scoring (i.e., correct/incorrect). As
the main output of this kind of analysis is the proficiency level on the
difficulty scale, candidates can be placed directly in the corresponding
class level. As long as item difficulty information is provided, the
estimation can be conducted regardless of the number of candidates taking
the placement test and it can be applied to individualized online learning
programs.
Estimating Absolute Proficiency Levels in Small-Scale Placement Tests 41
Introduction
The goal of a placement test is to place the candidates on a proficiency
scale with properly defined descriptions of the target language behaviour.
The assessment of the candidates’ proficiency level should be absolutely
stipulated (i.e., criterion-referenced) and thus should not be affected by
arbitrary addition/deletion of test items. In other words, the candidate’s
proficiency measure must stay constant however many items there are in
the test that are above or below his/her proficiency level. If test items are
too difficult, the candidate will hardly answer them correctly; if they are
too easy, the candidate will quite probably get the answers right. But these
results should not affect a candidate’s proficiency assessment. As an
alternative to traditional score-based assessment, this chapter proposes a
psychometric solution to estimate the candidates’ proficiency level which
refers to a set of proficiency criteria defined in advance. Such assessment
has often been done manually and subjectively with reduced reliability as
a result. Logistic regression analysis, however, is a statistical tool that
calculates the estimated mean proficiency level of a candidate as well as
the range of the confidence interval. The chapter also reports on two
candidates in a small-scale placement test and shows how the logistic
regression analysis describes the different characteristics of their
proficiency. The discussion that follows employs a psychometric rather
than a psychological perspective (Henning, 1992). Although placement
testing involves a number of issues such as test design, reliability, validity,
and decision-making (Fulcher, 1997; Plakans & Burke, 2013; Wall,
Clapham, & Alderson, 1994), the setting of item difficulty and the
streaming of difficulty levels in the examples included are assumed to be
accurate and reliable, in order to make the argument simple and clear.
Background
Although the estimation procedure presented here is independent of
any particular language proficiency/difficulty model in theory, its practical
application is based on a linear grading system of proficiency/difficulty.
Among various second language (L2) proficiency scales the most
comprehensive and influential is the Common European Framework of
Reference for Languages (CEFR). It describes what a learner can do when
he/she has reached a certain proficiency level. The following are
descriptors of overall reading comprehension in six levels of proficiency
(Council of Europe, 2001, p. 69). Table 2-1 refers only to reading skills
42 Chapter Two
since the placement test to be described later in the chapter deals almost
exclusively with reading comprehension.
Table 2-1: Descriptors of overall reading comprehension (Source:

Council of Europe, 2001, p. 69).
Level Descriptor
A1 Can understand very short, simple texts a single phrase at a time,
picking up familiar names, words and basic phrase and rereading
as required.
A2 Can understand short, simple texts on familiar matters of a
concrete type which consist of high frequency everyday or job-
related language. Can understand short, simple texts containing
the highest frequency vocabulary, including a proportion of
shared international vocabulary items.
B1 Can read straightforward factual texts on subjects related to
his/her field and interest with a satisfactory level of
comprehension.
B2 Can read with a large degree of independence, adapting style and
speed of reading to different texts and purposes, and using
appropriate reference sources selectively. Has a broad active
reading vocabulary, but may experience some difficulty with
low frequency idiom.
C1 Can understand in detail lengthy, complex texts, whether or not
they relate to his/her own area of speciality, provided he/she can
reread difficult sections.
C2 Can understand and interpret critically virtually all forms of the
written language including abstract, structurally complex, or
highly colloquial literary and non-literary writings. Can
understand a wide range of long and complex texts, appreciating
subtle distinctions of style and implicit as well as explicit
meaning.
In the application of the estimation procedure to the placement test

discussed in this chapter, the cognitive complexity was assumed to grow
in equal steps as the level rises. Although the CEFR scale does not
guarantee this assumption, the independent variable (i.e., the
difficulty/proficiency level) had to be an interval scale in order to conduct
a logistic regression analysis, which will be explained in the next section.
With the CEFR scale in mind as reference, 10 difficulty levels were set
in the actual placement test. As stated above, the five major grades from
Basic to Advanced were divided in progressive steps, and each major

grade had two minor steps to ensure adjustment of relative
difficulty/easiness within the major step (see Table 2-2).
Table 2-2: Item difficulty levels and required proficiency in the

placement test (Adapted from Amma, 2013).
Level Grade Task required

10 Advanced Can critically evaluate the text
9 Advanced (corresponding to C2).
8 High-Intermediate Can reformulate messages
7 High-Intermediate (corresponding to C1).
6 Low-intermediate Can understand simple implicatures
5 Low-intermediate (corresponding to B1 and B2).
4 Beginning Can understand literal meaning of
3 Beginning sentences (corresponding to A2).
2 Basic Can understand short and simple ideas
1 Basic (corresponding to A1).
In Table 2-2 above, the correspondence between the tasks involved in

the placement test and the can-do descriptors in the CEFR is not clearly
guaranteed. As Weir (2005) contends, “CEFR is not seen as a prescriptive
device but rather a heuristic, which can be refined and developed by
language testers to better meet their needs” (p. 298). The CEFR is vague
and fails to take account of contextual factors such as purpose of task
completion, response format (i.e., true/false or short answer), time
constraints for the task, channel of communication (telephone, face to
face, etc.), discourse type, text length, topic or content knowledge, lexical
competence, and structural and functional competence (Weir, 2005). Thus,
the CEFR should include descriptions comparable to language tasks and
“operationalize criterial distinctions between levels in their [test writers’]
tests” (Weir, 2005, p. 298).
Attempts to clarify the target proficiency do not seem very successful
so far. In a recent study on the identification of criterial features of learner
language, Hawkins and Buttery (2010) described learner error types of
grammatical structures comparable to the CEFR levels, based on the
understanding that the “CEFR levels are underspecified with respect to
key properties that examiners look for when they assign candidates to a
particular proficiency level and score in a particular L2” (p. 2). However,
the more specific the descriptor becomes, the less likely the target task is
generaliseable, since the context, linguistic or extralinguistic, may act as a
44 Chapter Two
determinant factor, including such performance variables as content

knowledge and willingness to communicate. For instance, the statement
‘It’s a fine day’ uttered by a cyclist in heavy rain can be either incorrect, if
taken literally, or correct, if taken ironically. Thus, it is often difficult to
make an accurate correspondence between a single language item,
especially grammar and vocabulary, and a proficiency level.
Staying in the realm of competence, Harsch and Rupp (2011) propose
a ‘level-specific approach’ in assessing writing skills in which “tasks are
used that are each targeted at one specific level,” thus “written responses
of the students are then assessed by having trained raters assign a fail/pass
rating using level-specific rating instruments” (p. 2). The contrast between
‘level-specific’ and ‘multilevel’ approaches can be comparable to certain
aspects of language literacy. For instance, whether a foreign language
learner can write his/her name correctly in the generally accepted
orthography in the target language pertains to the level-specific domain,
i.e., a learner in a certain level can perform the task either correctly or
incorrectly with little grey zone in between. On the other hand, the
absolute size and depth of the learner’s vocabulary cannot be assessed but
by the relative relevance of its use in the discourse he/she creates, i.e., a
learner with poor vocabulary may be able to communicate better than
someone with richer vocabulary in certain situations. According to Harsch
and Rupp (2011), some examples of specific purposes of tasks for B1 are
“pass on information”, “provide reasons for actions and comments”, and
text types for B1 are “write notes and messages”, “write personal letters,
simple formal letters and emails”, and “write reports/articles” (p. 12).
Although the tasks are specifically defined, the success/failure of the
candidates’ performance still depends to a large extent on the rater’s
subjective judgment. Meanwhile the task characteristics assigned to the
CEFR levels by Harsch and Rupp (2011) are linearly prescribed from
“very short prompts on concrete topics” (A1) to “topics of personal as well
as general interest” (B1) to “concrete as well as abstract, also unfamiliar
topics” (C1) (p. 12). The present study first follows this binary pass/fail
judgment criterion in the placement test, and considers later whether a
partial acknowledgement of success (e.g., assigning Level 4 to an
incomplete performance for a Level 6 task) is plausible.
Throughout this chapter, it is assumed that the CEFR levels and the
proficiency levels in the present placement test refer to points on a
proficiency/difficulty scale, but not bands of proficiency/difficulty that
stretch from a certain point on the scale to another. The names of grades in
Table 2-2 were given merely to help the reader understand the direction of
the scale. Thus, if a candidate’s proficiency is judged as above Level 4 but
below Level 5, for instance, he/she should be placed in a class ‘Low-

intermediate (lower)’.
To conclude this section, the criteria in Table 2-2 are one manifestation
of the general guidelines by the CEFR. The accuracy of the criteria of the
placement test may probably affect the result of analysis, but this is a
separate issue and the focus here is on the psychometric procedure of
placement on the assumption that the criteria adopted in this study are
appropriate for the present purpose.
Logistic Regression Analysis

The main problem with traditional test scoring pertains to its
reliability. Traditionally, the proficiency level of test takers is estimated by
the total score of a test, while the validity of test items, i.e., whether their
difficulty matches the test-taker’s ability, is not guaranteed. As a result, if
a test writer has written too many easy items, the candidate will get a
relatively high score. In contrast, if the test writer prepares too many
difficult items, the candidate is more likely to get a lower score than when
he/she is provided with appropriate items. Consider an imaginary case of
item difficulty levels (Table 2-3), where it is assumed that the difficulty
rises in the following order: A1 < A2 < B1 < B2 < C1.
Table 2-3: Description of imaginary test items.
Item #1 #2 #3 #4 #5
Points 10 10 10 10 10
Difficulty A1 A2 B1 B2 C1
If a candidate’s true proficiency (which is yet unknown) is just above

A2, she/he will most likely pass #1 and #2 but fail the rest of the items,
thus obtaining 20 points out of 50 (or 40%). The rationale behind this
prediction is based on the relative distance between item difficulty and
test-taker ability. Henning (1987) describes the logic concisely as follows:
“The farther person ability is below item difficulty, the more unlikely will
be success in responding to the item. Similarly, the farther person ability is
above item difficulty, the more unlikely will be failure in responding to the
item.” (p. 122)
This pass rate will change if we add #6 with C1 difficulty: (10 + 10 + 0

+ 0 + 0 + 0) / 60 = 33.3%; if we add #6 with A1 difficulty again: (10 + 10
+ 0 + 0 + 0 + 10) / 60 = 50%. Thus, setting a cut-off score by an absolute
46 Chapter Two
point is arbitrary and unreliable. As long as the candidates are placed in

relative order (as in ordinary entrance examinations) this scoring system
will work. But if the decision maker wants to know what they can do, the
score-based assessment does not provide the necessary information.
Another problem with the traditional approach is that it is difficult to
specify the test taker’s proficiency level on the absolute scale. In the case
of ordinary entrance examinations the decision maker may accept the
candidates from the top scorer to someone on the borderline. But in
placement tests, the information of the order of test-takers alone does not
help the decision maker understand what proficiency level they are in.
To solve these problems, it is proposed here using a nominal logistic
regression analysis. Simply put, it is a nonlinear regression model applied
for estimating probabilities of categories as dependent variables when a
continuous variable is given as an independent variable. In the example of
Table 2-3 above, if a candidate passes items with A2 difficulty repeatedly
and fails items with B1 difficulty even though the responses include
errors, his/her ability is estimated somewhere between A2 and B1.
Nominal logistic regression analysis deals with this kind of case where
item difficulty levels are a continuous variable and pass/fail responses are
categorical variables. In mathematical terms, the probability of a
candidate’s performance (i.e., pass/fail) is predicted by the following
equation (1):
(1) p(x) = EXP(ß0 + ß1 x)/(1 + EXP(ß0 + ß1 x))
where x represents the difficulty level and p(x) represents the probability
of either pass or fail of the item. ß0 and ß1 are parameters that characterize
this candidate, and EXP represents an exponent of e, a constant equal to
2.71828 or the base of natural logarithm. It is where the probability of
response is 0.5 that the candidate’s ability matches the item difficulty.
This formula for logistic regression can be found in various books on
multivariate statistics, e.g., Lloyd (1999). Logistic regression is an analytic
method often used in item analysis. Figure 2-1 below shows an example of
a graphical representation of logistic regression analysis applied to a test
item in an actual test conducted for Japanese university students of English
as a foreign language (EFL) (N = 316) in which one grammatically correct
sentence should be chosen out of four options: (a) Grandma went
shopping, (b) Grandma went to shop, (c) Grandma went shop, and (d)
Grandma went to shopping (Amma, 2001). Note that the responses are
reduced to either ‘Pass’ (for option (a)) or ‘Fail’ (for other options or no
answer).
The graphical representation of probability of responses created by the

statistical package JMP (SAS Institute, 2012) shows how likely one is to
pass this item given a certain proficiency level. The horizontal axis shows
the candidates’ proficiency by means of the total score of the test. The
vertical width of the bottom layer at the total score of a certain candidate
indicates his/her probability of correct response. The total score at which
the vertical line crosses the 50% probability level corresponds to the
estimated difficulty of the item.
Figure 2-1: Sample representation of logistic regression analysis of a test item

(Adapted from Amma, 2001).
Unlike ordinary logistic regression analyses in which all test takers’

responses are incorporated in a single item, the author proposes a single-
person summary of all responses in one test. If we accumulate the pass/fail
responses to items of varying difficulties, we can estimate the candidate’s
proficiency level, which is indicated by the vertical line that crosses the
50% pass/fail probability level. Figure 2-2 is a sample image of a
graphical output of logistic regression analysis, taken from another study
(Amma, 1990). It illustrates how an S-shaped logistic curve predicts the
transition of pass/fail probability (vertical axis) of test items as the
difficulty of items (horizontal axis) varies. The proficiency of this
candidate (ID = S005) is comparable to the horizontal point at which the
vertical probability measures 0.5, indicated by the dotted line. An S-
48 Chapter Two
shaped (or reverse-S-shaped, depending on the order of categories in

layers) logistic curve can be observed in any sound proficiency test.
Nominal logistic regression is a kind of non-linear regression for
estimating categorical probability when the independent variable is
continuous and dependent variable categorical (JMP, 2002).
Figure 2-2: Sample representation of logistic regression analysis of a candidate

(Adapted from Amma, 1990).
Linacre (1987) describes a very basic mechanism of calculating

proficiency in logits:
“Let’s say that we know that a particular person has a 75% chance of
succeeding on a question about state capitals. This 75% probability of
success means that he has 3 chances succeeding to 1 of failing, so that the
scale value is the natural logarithm of 3/1 = log (3) = 1.1 logarithmic units
(“logits”). A 50% chance of success, or a 50% chance of failure would be 1
chance of succeeding to 1 chance of failing, giving a scale value of the
logarithm of 1/1 = log (1) = 0.” (pp. 4-5)
Linacre’s BASIC program can produce a summary report of a candidate

after taking 10 test items (see Table 2-4 below). Because Linacre’s interest
was to pick up a proper item for efficient estimation of a person’s
proficiency in a computer adaptive test (CAT), the item difficulty
information was already stored in the item bank, and this information was
to be re-estimated by Rasch model calculation every time the test was

administered. This is where a placement test is different from CAT. In
CAT, the candidate’s proficiency is obtained as a result of adaptation by
convergence over several trials, and the item to be presented is chosen
after every trial. In a placement test, in contrast, all items must be prepared
in advance and the estimation of a candidate’s proficiency is done after all
the items are presented. In other words, it is possible that a placement test
includes improper items either too easy or too difficult—even though the
level setting of individual items is correct—which are likely to affect the
result in a traditional assessment method.
Table 2-4: Summary report of a candidate (Adapted from Linacre

1987, p. 25).
Item Difficulty Right/Wrong

2 96 Right
24 99 Right
1 104 Wrong
25 114 Right
7 106 Wrong
13 111 Right
12 105 Right
15 109 Wrong
3 85 Surprisingly Wrong
18 103 Right
Note: This candidate scored in the range from 101 to 115 at about 108 after 10
questions.
The advantage of logistic regression analysis is that the estimated

proficiency level is independent of the combination of easy items and
difficult items, as long as the responses are consistent. Figure 2-3 below is
the result of a simulation using the data in Table 2-5, where there are
supposedly too many easy items.
50 Chapter Two
Item Difficulty Response

#1 2 Pass
#2 2 Pass
#3 2 Pass
#4 2 Pass
#5 2 Pass
#6 2 Pass
#7 2 Pass
#8 2 Pass
#9 8 Fail
#10 8 Fail
The estimated proficiency level is 5.0. Compare it with Figure 2-4

based on another set of simulation data in Table 2-6 where there are too
many difficult items.
Item Difficulty Response

#1 2 Pass
#2 2 Pass
#3 8 Fail
#4 8 Fail
#5 8 Fail
#6 8 Fail
#7 8 Fail
#8 8 Fail
#9 8 Fail
#10 8 Fail
The estimated proficiency level is again 5.0. This robustness alone

makes logistic regression analysis advantageous over the traditional
method of simple summing.
52 Chapter Two
Estimation of Proficiency Level and Confidence Interval

The purpose of the present placement test was to judge whether the
two candidates who wished to be transferred to another department were
capable of the language tasks involved in the regular course. That is,
whether the candidates’ pass or fail depends totally on their estimated
proficiency levels. Unlike ordinary entrance examinations, the decision
maker can reject the candidates, if they do not reach the required
proficiency level.
The test material included several types of questions from lower-order
reading comprehension to higher-order inferencing and paraphrasing or
summarising. Formal grammar and vocabulary were also involved as part
of the reading skill. The test items were partly multiple-choice and partly
open-ended. Below are two sample test items, one multiple-choice and the
other open-ended.
1. Education
The literacy rate in Finland is 99% and the number of newspapers and
books printed per capita is one of the highest in the world. The nine-year
comprehensive school (peruskoulu) is one of the most equitable systems in
the world—tuition, books, meals and commuting to and from school are
free. All Finns learn Swedish and English in school and many also study
German or French. (Source: Lehtippu, 1996, p. 31).
#3. How much is the education cost in Finland?

a. Basically free.
b. Parents pay 10% of the entire cost.
c. The state supports about half of the expense.
d. Free only for primary and secondary education.
2. The Conquest of Reality

The idea of a revival was closely connected in the minds of the Italians
with the idea of a rebirth of ‘the grandeur that was Rome’. The period
between the classical age, to which they looked back with pride, and the
new era of rebirth for which they hoped was merely a sad interlude, ‘The
Time Between’. Thus the idea of a rebirth or renaissance was responsible
for the idea that the intervening period was a Middle Age—and we still use
this terminology. (Source: Gombrich, 1984, pp. 167-169).
#14. Why was the period before the Renaissance called the ‘Middle Age’
(underlined part)?
What was special about this test was that, unlike ordinary entrance
examinations, the test writer specified a difficulty level for each item as he
wrote the test. Since the test writer was in contact with the teaching staff
who were familiar with the level of teaching and goals of the curriculum,
it was easy to connect the difficulty levels to the proficiency required in
the specific course (see Table 2-2).
After the test was administered the candidates’ individual responses for
open-ended questions were judged as either pass or fail, depending on
whether they satisfied the required proficiency of the item in question. In
the case of multiple-choice items, the responses were simply pass or fail.
The rater’s job was to calculate the estimated proficiency level of the
candidates using the pass/fail information. For example, candidate A
passed items #6, #10, and #17 whose difficulty levels were all 6 but failed
items #8, #12 which were in Level 7. From this fact alone we may infer
that her proficiency estimate is somewhere between 6 and 7. But we also
have to consider the contradictory responses. She failed the relatively easy
item #19 (Level = 6), and passed the relatively difficult item #14 (Level =
7). Candidate B had more such contradictory responses (see Table 2-7).
Table 2-7: Item difficulty and responses (arranged in the order of

difficulty levels).
Item Level Candidate A Candidate B

#1 2 Pass Pass
#2 2 Pass Pass
#3 4 Pass Pass
#4 4 Pass Fail
#5 4 Pass Fail
#7 5 Pass Pass
#11 5 Pass Pass
#6 6 Pass Pass
#9 6 Pass Pass
#10 6 Pass Fail
#19 6 Fail Fail
#17 6 Pass Fail
#8 7 Fail Fail
#12 7 Fail Fail
#14 7 Pass Pass
#16 8 Fail Fail
#20 8 Fail Fail
#13 10 Fail Fail
#15 10 Fail Fail
#18 10 Fail Fail
54 Chapter Two
A logistic regression analysis conducted by JMP yielded Figures 2-5

and 2-6 indicating the probability of pass and fail as a function of the
difficulty level. The difficulty level (‘Predicted Adjusted level’) at which
the fail rate is 0.50 is the point of the candidate’s proficiency level. The
two candidates’ proficiency levels and their 95% confidence intervals
(indicated by the arrows in Figures 2-5 and 2-6) were calculated by the
‘inverse prediction’ procedure (Table 2-8).
Figure 2-5: Logistic fit of Candidate A with confidence interval. Bottom layer =
‘fail’.
Figure 2-6: Logistic fit of Candidate B with confidence interval. Bottom layer =
‘fail’.
Table 2-8: Statistics of the candidates’ proficiency levels and

Candidate Level Lower 95% Upper 95%

A 6.924 6.600 7.482
B 6.925 6.216 8.001
In the actual data, both Candidate A and Candidate B are estimated to

have similar levels of proficiency, close to high-intermediate in Table 2-2.
But the confidence interval shows that Candidate B is less stable than
Candidate A. Since the proficiency levels have been obtained, the next
step is to judge whether each candidate will pass or fail, depending on the
critical proficiency level that the institution requires.
Advanced Estimation of Proficiency Level

and Confidence Interval
The analyses so far made were based on a simplified version of the
data. The actual test included some items with open-ended questions,
instead of multiple-choice. If the answer was not perfect an exact level
was assigned directly. For instance, #15 was originally Level 10, but
56 Chapter Two
Candidate A’s performance was judged as equivalent to Level 5.

Candidate A had three such cases and Candidate B had five (Table 2-9).
The problem here was that the rater had to deal with two different
kinds of data—binary and continuous data—on one scale. However, an
exact rating does not fit logistic regression analysis, since the dependent
variable must be categorical in nominal logistic regression analysis. So the
rater doubled the responses, changed the difficulty level to the rated level,
split the weight, and finally made one response “Pass” and the other
“Fail”, because the exact rating means the probability of the candidate’s
answering the item correctly is 50% (Tables 2-10 and 2-11).
Table 2-9: Item difficulty and responses by exact scoring.
Item Level Weight Candidate A Candidate B

#1 2 3 Pass Pass
#2 2 3 Pass Pass
#3 4 3 Pass Pass
#4 4 3 Pass Fail
#5 4 3 Pass Fail
#7 5 3 Pass Pass
#11 5 3 Pass Pass
#6 6 3 Pass Pass
#9 6 4 Pass Pass
#10 6 4 Pass 5
#19 6 6 Fail Fail
#17 6 8 Pass 4
#8 7 3 Fail Fail
#12 7 3 Fail Fail
#14 7 8 Pass Pass
#16 8 8 Fail 7
#20 8 8 6 Fail
#13 10 8 Fail 4
#15 10 8 5 Fail
#18 10 10 6 4
Table 2-10: Adjusted responses for Candidate A (part).
Item Level Weight Candidate A

#11 5 3 Pass
#15 5 4 Pass
#15 5 4 Fail
#10 6 4 Pass
#20 6 4 Pass
#20 6 4 Fail
#18 6 5 Pass
#18 6 5 Fail
#19 6 6 Fail
#17 6 8 Pass
#12 7 3 Fail
#14 7 8 Pass
#16 8 8 Fail
#13 10 8 Fail
Table 2-11: Adjusted responses for Candidate B (part).
Item Level Weight Candidate B

#13 4 4 Pass
#13 4 4 Fail
#17 4 4 Pass
#17 4 4 Fail
#18 4 5 Pass
#18 4 5 Fail
#10 5 2 Pass
#10 5 2 Fail
#11 5 3 Pass
#19 6 6 Fail
#12 7 3 Fail
#16 7 4 Pass
#16 7 4 Fail
#14 7 8 Pass
#20 8 8 Fail
#15 10 4 Fail
58 Chapter Two
In the case of Candidate A, item #15 has a new level, it is now Level 5
because the rater judged her performance as corresponding to Level 5, the
weight is 4, half the original weight, and one response is ‘Pass’, and the
other one is ‘Fail’. The result of this process shows an expansion of
confidence interval as well as a drop in proficiency level, even more
notably with Candidate B (see Table 2-12 below).
Table 2-12: Comparison of estimated proficiency levels and

Candidate Scoring Level Lower 95% Upper 95%

A dichotomous 6.924 6.600 7.482
A exact 6.552 6.089 7.253
B dichotomous 6.925 6.216 8.001
B exact 5.401 3.872 6.774
Discussion
It appears that the exact estimation leads to decreased accuracy even
though it was intended to increase it. One reason for this seeming decrease
in reliability would be the possible inconsistency of the subjective rating
with the rest of the dichotomous data; the finer the rating is as one
pinpoints the level, the less accurate the conclusion becomes than when
one stays in a rough estimation. That is, an exact judgement that a
candidate’s proficiency is at Level 6.0 when his/her true proficiency is at
Level 6.5 is less accurate than a vague judgement that the candidate’s
proficiency is somewhere below Level 10. It is a matter of whether we
trust the rater’s case judgement or the latent ability structure that the
candidate is assumed to follow. In other words, should our analysis be
data-driven or model-driven? This question may remind us of the contrast
between Item Response Theory (IRT) and Rasch Model analysis. If we
understand the nature of variability in human behaviour, however, the
reality may more likely lie in obscurity than clearly focused measurement.
One reservation has to be made with this estimation method using
logistic regression analysis. The present data happened to have no missing
data. Where there are some, they will be ignored from calculation because
the difficulty level in equation (1) is not obtained. Revuelta (2004) points
out that raters cannot distinguish a simple accidental absence of data from
intentional avoidance of answering, and proposes a correction programme.
Although he deals with self-adaptive tests, the same possibility may occur
in placement as well as other time-constrained tests when the candidate
discards an item after attempting to solve it and, finally, deciding it is too

difficult.
To end our discussion, let us consider the application of the present
technique to two cases. First, we can guarantee the equivalence of two
tests, to some extent, where both test items and test takers are completely
different. A good example is the National Center Test in Japan (National
Center for University Entrance Examinations, n.d.). It is a nationwide
public examination for admission to most of the universities/colleges in
Japan, taken by roughly 530,000 students (about 45% of the population at
age 18). Although it is one of the largest high-stakes tests in Japan, there
has been no known valid means of equating test results across years. But,
since the test writing criteria which the test writers rely on remain more or
less the same, the test-takers’ absolute proficiency can be estimated with
reference to a standardised proficiency scale (such as the CEFR) by
specifying difficulty levels of items at the time of writing the items.
Because it is a large-scale examination, it is convenient to reevaluate the
item difficulty and to adjust the criteria for assigning difficulty levels.
Another application can be made to small-scale online proficiency tests
such as M-Reader (M-Reader, n.d.). M-Reader is a collection of quizzes
for graded readers intended to promote extensive reading for students. The
result of students’ taking the quiz serves to verify that they have read the
books. The quizzes are written by voluntary teachers, so it may be an extra
job to assign a difficulty level to each item, but a little additional effort as
an option would let the students realise which proficiency level they are in.
It would also help quiz writers to control the difficulty level by referring to
the difficulty criteria. It is worth adopting the logistic regression analysis
in estimating the students’ proficiency because the combination of easy
items and difficult items is not controlled. Besides, since the estimation
business is carried out for individual students, rather than in a large group
of test takers at one time in institutional examinations, the proposed means
of estimation is advantageous for its adaptability to on-demand requests.
Conclusion
This chapter described how the use of logistic regression analysis made
it possible to successfully estimate test-takers’ absolute proficiency levels,
which was impossible by traditional placement based on raw scores.
Proficiency was described in terms of a set of levels, which included the
rough can-do statements (Table 2-2). This qualitative information is useful
when the candidates are streamed into classes. The information of the
60 Chapter Two
confidence interval helps judge whether the estimated proficiency level is

reliable or not.
The strongest advantage of the present procedure is its application to
small-size tests. Thus, even if there are only one or two test-takers—in
which case IRT cannot be used—an estimation of the test-takers’
proficiency levels is obtainable, as long as the item difficulty is reliably
specified. A large-scale standardised test, such as the Test of English for
International Communication (TOEIC) or one of the Cambridge tests, may
contain reliable item information, and in that sense it is advised to take
such a test. However, it is more meaningful for teachers to write test items
reflecting the local context and judge the acceptability of candidates by
matching their proficiency descriptions with the curriculum contents. In
fact, large-scale standardised tests do not provide detailed feedback of
individual item responses.
The present procedure is capable of further refinement. Instead of the
rough descriptors of difficulty levels, we could use the CEFR, IELTS
(International English Language Testing System), OPI (Oral Proficiency
Interview), or any other predefined proficiency scale. There is a high
practical need for estimating candidates’ proficiency with reference to the
established scales, but studies in criterion-referenced testing (CRT) have
only referred to the application of IRT in the connection of item difficulty
with person ability. Brown and Hudson (2002), for example, explain how
examinees are characterised by specific can-do statements in combination
with IRT. When the item level information is predefined, we do not need
IRT, which requires a vast number of test takers.
A possible drawback in the proposed process would be the difficulty in
securing stable item difficulty information, hence the need for test-writer
training. The question of whether the difficulty levels are accurately
specified is a crucial issue with respect to the validity/reliability of the
entire test. Furthermore, the validity/reliability is also affected by the
number of test items prepared. Obviously, when the quality of test items is
high in terms of internal consistency, a relatively small number of test
items will suffice. However, these two are independent issues outside the
present focus and it is assumed that the difficulty specification is done
with high credibility.
Further study is needed for the treatment of partial assignment of
proficiency levels. Some open-ended items can end up with responses
whose degree of perfection is judged by a rater. These partial assignments,
provided that they correspond to the difficulty levels initially defined, are
an absolute estimation of the candidate’s proficiency. They have more
information than reduced responses of pass/fail. We need to investigate
further validation of the incorporation of these responses harmoniously

into other dichotomous responses for which the true proficiency level is
expressed in inequality equations (i.e., if one passes an item of Level 5,
his/her true level is at 5 or above).
References
Amma, K. (1990). Unpublished internal document. [A grammar test
conducted at a private university in Tokyo.]
—. (2001). Variations of parsing strategies among EFL learners of
different proficiency levels. Ronso (Bulletin of the Faculty of
Humanities, Tamagawa University), 41, 79-115.
—. (2013). Criterion-referenced testing in small-scale placement: A case
study. Paper presented at the English Language Education and the
CEFR in Japan, JACET Kanto Chapter Meeting, June 16, Aoyama
Gakuin University, Tokyo, Japan.
Brown, J. D., & Hudson, T. (2002). Criterion-referenced language testing.
Cambridge: Cambridge University Press.
Council of Europe. (2001). Common European framework of reference for
languages: Learning, teaching, assessment. Cambridge: Cambridge
University Press.
Fulcher, G. (1997). An English language placement test: Issues in
reliability and validity. Language Testing, 14(2), 113-139.
Gombrich, E. H. (1984). The story of art. Oxford: Phaidon Press.
Harsch, C., & Rupp, A. A. (2011). Designing and scaling level-specific
writing tasks in alignment with the CEFR: A test-centered approach.
Language Assessment Quarterly, 8(1), 1-33.
doi: 10.1080/15434303.2010.535575.
Hawkins, J. A., & Buttery, P. (2010). Criterial features in learner corpora:
Theory and illustrations. English Profile Journal, 1(1), 1-23.
doi: 10.1017/S2041536210000103.
Henning, G. (1987). A guide to language testing: Development,
evaluation, research. Rowley, Massachusetts: Newbury House.
—. (1992). Dimensionality and construct validity of language tests.
Language Testing, 9(1), 1-11.
JMP. (2002). JMP Version 5: Statistics and graphics guide. Cary, North
Carolina: SAS Institute.
Lehtippu, M. (1996). Finland: A lonely planet travel survival kit.
Hawthorn, Victoria, Australia: Lonely Planet Publications.
62 Chapter Two
Linacre, J. M. (1987). A computer program for adapting testing by

microcomputer. MESA Psychometric Laboratory, Memorandum
No.40, University of Chicago.
Lloyd, C. (1999). Statistical analysis of categorical data. New York: John
Wiley & Sons.
M-Reader (n.d.). Retrieved from:
http://mreader.org/mreaderadmin/s/html/about.html
National Center for University Entrance Examinations (n.d.). Retrieved
from: http://www.dnc.ac.jp/
Plakans, L., & Burke, M. (2013). The decision-making process in
language program placement: Test and nontest factors interacting in
context. Language Assessment Quarterly, 10(2), 115-134.
Revuelta, J. (2004). Estimating ability and item-selection strategy in self-
adapted testing: A latent class approach. Journal of Educational and
Behavioral Statistics, 29(4), 379-396.
SAS Institute. (2012). JMP version 10.0.0. Cary, North Carolina: SAS
Institute.
Wall, D., Clapham, C., & Alderson, J. C. (1994). Evaluating a placement
test. Language Testing, 11(3), 321-344.
Weir, C. (2005). Limitations of the Common European Framework for
developing comparable examinations and tests. Language Testing,
22(3), 281-300.
CHAPTER THREE
BILINGUAL LANGUAGE ASSESSMENT

IN EARLY INTERVENTION:
A COMPARISON OF SINGLE-
VERSUS DUAL-LANGUAGE TESTING
CAROLINE A. LARSON, SARAH CHABAL,

AND VIORICA MARIAN
Abstract
Despite a growing number of bilingual children enrolled in Early
Intervention language services, methods of administering language
assessments to bilingual children are not standardized. This study reports
clinically-meaningful differences in bilingual children’s receptive and
expressive language outcomes when their language skills are assessed in
the primary language versus in both the primary and secondary languages.
Eleven Spanish-English speaking children (ages 1;11 to 2;11) with
language delay enrolled in Early Intervention were assessed using The
Rossetti Infant-Toddler Language Scale (Rossetti, 1990) in their primary
language only, and then in both their primary and secondary languages.
When assessed in only one language, bilingual children’s language skills
were underestimated by 1.4 months for receptive language and 2.2 months
for expressive language; language delay was overestimated by 4.7% for
receptive language and by 7.8% for expressive language. Single-language
assessments would lead to inappropriate Early Intervention referral for 3
of the 11 tested children. It is therefore suggested that assessing bilingual
children in only one language leads to a significant underestimation of
receptive and expressive language abilities and a significant overestimation
of language delay. Consequently, the efficacy, reliability, and validity of the
assessment are compromised and best practice as mandated by speech-
language pathology certification organizations is not achieved.
64 Chapter Three
Introduction
The number of bilingual children in the United States, as well as
throughout the world, is rapidly growing, due, in part, to globalization,
migration, and an increased prevalence of bilingual education options. For
example, of school-age children in the United States, 22% speak a
language other than English in the home (Lowry, 2011). Within certain
areas, such as large cities, an even higher percentage of families speak
more than one language in the home. For instance, a language other than
English is spoken in 35.5% of Chicago residences (United States Census
Bureau, 2013). Children in these homes who are developing more than one
language are generally believed to have language disorders at a similar
rate as children acquiring only one language (Kohnert, 2010). As a result,
the caseload makeup for speech-language pathologists often includes
children with language delay who are developing bilinguals.
When young monolingual and bilingual children fail speech-language
screenings or are referred by pediatricians due to speech-language concerns,
they undergo language assessment to determine eligibility for Early
Intervention services. For example, in Illinois a child is considered eligible
for speech-language services when he or she demonstrates a 30% or more
delay in one or more areas of speech, language, or communication, when
he or she presents with a medical diagnosis that typically results in
developmental delay, or when he or she is determined to be at risk of
substantial developmental delay (Illinois Department of Children and
Family Services, 2003; Illinois Department of Human Services
Community Health and Prevention Bureau of Early Intervention, 2009).
Eligibility for speech-language services through the Early Intervention
program in the United States is often determined based on assessment
outcomes of The Rossetti Infant-Toddler Language Scale (Rossetti, 1990).
The Rossetti is a criterion-referenced assessment of preverbal and verbal
areas of communication and interaction for children up to three years of
age. The skill age at which all criteria are demonstrated and the resulting
percent of receptive or expressive language delay relative to chronological
age decide the children’s eligibility for Early Intervention.
The Rossetti is often used in the Early Intervention program as it is
familiar to Early Intervention clinicians across disciplines (e.g.,
occupational therapists, social workers, etc.) (Marchman & Martinez-
Sussmann, 2002) and because few other assessment tools cover a similar
breadth of developmental domains within the birth to three age range. Like
many assessments structured for use with young children (e.g., Bzoch,
League, & Brown, 2003; Hedrick, Prather, & Tobin, 1984; Marchman &
Bilingual Language Assessment in Early Intervention 65
Martinez-Sussmann, 2002; Rescorla, 1989; Wetherby & Prizant, 1993),

The Rossetti is primarily informal, follows a checklist format, and involves
multiple sources reporting the presence or absence of specified skills. The
Rossetti is often preferred over other assessments due to ease of administration
in the home environment and applicability to the Early Intervention
program assessment requirements (Illinois Department of Human Services
Community Health and Prevention Bureau of Early Intervention, 2009).
Background
Despite its use within the Early Intervention program, methods of
administering The Rossetti assessment to bilingual children are not
standardized. When The Rossetti is used to assess bilingual children,
accepted practices include measuring language abilities in only the child’s
primary language, in only the child’s secondary language, or across both
developing languages.
One concern with assessing bilingual children’s language skills in only
their primary or secondary language is that developing bilinguals with
language delay often display uneven skill distribution and shifting
development across languages, as well as individual variation in their
developmental trajectories (Kohnert, 2010). For example, a child may
have relatively even expressive vocabulary skills in Spanish and English,
but demonstrate more advanced verb conjugation skills in English. Even in
typically-developing bilingual children, language acquisition is
characterized by variable timeframes and patterns of development, which
cause difficulty in obtaining valid assessment outcomes (e.g., Kohnert &
Goldstein, 2005; Marian, 2008; Marian, Faroqi-Shah, Kaushanskaya,
Blumenfeld, & Sheng, 2009). Therefore, single-language assessment of
developing bilinguals may not accurately reflect their language abilities
and may not be best practice. Indeed, previous research with school-age
bilinguals suggests that both languages should be measured and
considered as a composite in order to reduce the risk of misdiagnosis and
inappropriate individualized education plans (Kohnert, 2008; Kohnert,
2010; Marian et al., 2009; Roseberry-McKibbin, Brice, & O’Hanlon,
2005).
While such risks in the school-age population are well documented,
there is little research examining language assessment methods with birth
to three-year-old bilingual children who have language delays (Dollaghan
& Horner, 2011). Within typically-developing populations, the
language(s) of assessment can affect measures of young bilinguals’ total
vocabulary size (Core, Hoff, Rumiche, & Señor, 2013; Hoff, Core, Place,
66 Chapter Three
Rumiche, Señor, & Parra, 2012; Thordartottir, Rothenberg, Rivard, &

Naves, 2006), grammatical ability (Hoff et al., 2012), and syntax
(Thordartottir et al., 2006). Similar metrics are likely to be impacted by the
language of assessment for children with language delays. Understanding
how bilingual children’s assessments are impacted by the use of single- or
dual-language practices is important for early and accurate detection of
language disorders. Early assessment allows for the provision of Early
Intervention speech-language services to the young bilingual population,
which results in faster gains and possible prevention or minimization of
deficits (National Joint Committee on Learning Disabilities, 2006; Paul,
2007; Woods & Wetherby, 2003).
Because the Early Intervention population often includes bilingual
children with language delays, the current study aimed to determine
whether young bilingual children’s language assessment outcomes were
different when evaluated in only one language as opposed to in both of the
children’s developing languages.
The Study
The study reported in this chapter looked at the differences in
expressive and receptive language measures on The Rossetti for birth to
three year old bilingual children with language delay when they were
assessed in their primary language versus in both their primary and
secondary languages. It was hypothesized that assessment outcomes
provide a more accurate picture of the developing bilingual’s language
level when skills are measured across both developing languages.
Therefore, it was predicted that when administering The Rossetti to young
bilingual children with language delay in only one language, outcomes
will underestimate language abilities and overestimate language delay.
Participants
Participants were 11 children (2 girls; 9 boys) of Hispanic descent
ranging in age from 1;11 to 2;11 (Mean = 2;5, SD = 0;4.8), born in the
United States to bilingual Spanish-English speaking parents. All
participants included in the study were assigned to Early Intervention
speech-language services and required annual or 6-monthly Early
Intervention mandated reassessment. All participants passed a hearing
screening within one year of the testing date. Verbal consent was obtained
from the participants’ parents prior to the evaluation.
Information about participants’ demographic information, linguistic
backgrounds, and language skills was obtained from parent reports and
Early Intervention initial evaluation reports (see Table 3-1). Five
participants were reported to use English as their primary language; six
participants were reported to use Spanish as their primary language. On
average, participants made 78% of their expressions in their primary
language (SD = 10.8%) and 22% of their expressions in their secondary
language (SD = 10.8%).
Table 3-1: Demographic information for study participants.
Secondary Language
Secondary Language
Primary Language
Primary Language
Type of Diagnosed
% Expression in
% Expression in
Participant
Gender
Delay
Age
1 M 1;11 English Spanish 80 20 Language

3 F 2;4 Spanish English 80 20 Language
4 M 2;11 Spanish English 75 25 Language
6 M 2;6 English Spanish 90 10 Developmental
7 M 1;11 English Spanish 85 15 Developmental
9 F 2;1 Spanish English 90 10 Language
Materials
Participants were assessed according to Early Intervention standards
using The Rossetti Infant-Toddler Language Scales (Rossetti, 1990) at
home with the presence of a parent, the treating therapist (first author), and
an interpreter who had been assigned by the program to the child’s case at
the onset of service provision. The Rossetti assesses skills across
developmental domains including Interaction-Attachment (e.g., ‘Plays
away from familiar people’), Pragmatics (e.g., ‘Uses words to protest’),
Gesture (e.g., ‘Gestures to request action’), Play (e.g., ‘Stacks and
68 Chapter Three
assembles toys and objects’), Language Comprehension (e.g., ‘Identifies

four objects by function’), and Language Expression (e.g., ‘Uses sentence-
like intonational patterns’), and is used with young children ages birth-
three years (1990).
Because our interest is in the assessment of children’s language skills,
the current study focused on the expressive and receptive language
domains of The Rossetti. Within each domain, children’s skills were
assessed within three month intervals (e.g., 21-24 months of age).
Receptive language measures included: total number of words understood;
the ability to follow two-step directions; the ability to identify body parts;
the ability to answer wh-questions; and the ability to identify objects by
category. Expressive measures included: total number of words spoken;
the frequency with which the child expressed two word phrases; the ability
to verbalize two different needs; the ability to use words to interact with
others; and the ability to imitate animal sounds. Because a child may not
spontaneously produce all of these behaviors within the context of a single
session with the clinician, scores on each domain were credited with equal
weight based on parent report, assessor observation, and/or assessor
elicitation. Behaviors not observed or elicited by the parent or assessor
were considered not yet present.
The evaluator was a state licensed and Early Intervention credentialed
practicing speech-language pathologist-clinical fellow. All interpreters
were Early Intervention credentialed Spanish-English bilinguals who were
familiar to the child and family. Interpreters were assigned to each child at
the onset of Early Intervention service provision.
Procedure
The Rossetti parent questionnaire and test criterion are available in
Spanish and English; however this study’s administration used only the
English questionnaire and test criterion, as an interpreter was present to
translate the questions from English to Spanish. Parent interviews were
completed in English with Spanish interpretation prior to the assessment to
determine participants’ demographic and linguistic backgrounds, and then
with The Rossetti parent questionnaire during the assessment period.
Follow-up questions and clarification questions were used as needed to
ensure adequate and appropriate interpretation of assessment questions.
Within each assessment period, The Rossetti was administered twice:
first only in the participant’s primary language (i.e., credit was only given
for skills demonstrated or reportedly observed in the primary language),
and then in the child’s primary and secondary languages (i.e., credit was
given for skills in either and/or both languages). During primary language
administration, all activities were conducted in the child’s primary
language only, and the child received credit for skills demonstrated in that
language only. For example, a child whose primary language was Spanish
would not receive credit for a skill demonstrated in English. During dual-
language administration, all activities were conducted in a ratio that
matched the parent-reported ratio of Spanish to English expression.
Children were awarded credit for all skills, regardless of their language of
demonstration. Because the assessment accounts for skills that parents
have observed but that may not have been demonstrated during the
assessment period, and because it is a criterion-referenced assessment with
general skill benchmarks, practice effects across single- and dual-language
assessments were not problematic.
The parent interview, primary language assessment, and dual-language
assessment occurred within the same contact period. Sessions lasted
approximately one hour and involved child-directed and therapist-directed
structured play activities, similar to a typical therapy session (e.g., shared
storybook reading, symbolic play with a toy farm, and putting together
puzzles).
Scoring and Data Analysis

The assessment was scored and analyzed by the treating therapist with
adherence to testing procedures. Skills observed or elicited by the
assessing therapist were scored online, and parent-reported skills were
credited offline within one week of administration. All assessment reports
were reviewed by the assessing therapist’s clinical fellowship mentor.
The Rossetti assigns age levels based on the presence of all skills
within a domain’s three month interval. In order to be scored within an age
range, the child must have demonstrated all skills within that interval (i.e.,
if one or more skills from a given level were not present, the child was
assigned a lower age level for that domain). Skills were awarded if they
were observed or elicited by the parent, evaluator, or other reporter (e.g.,
daycare teacher or caregiver). Children were assessed at the highest
reported skill level (e.g., if parents reported that the child used two-word
phrases frequently but the evaluator elicited two-word phrases only
occasionally, the skill was assigned as ‘uses two-word phrases frequently’
(Rossetti, 1990).
Percent language delay was calculated by dividing the child’s lowest
assessed age by his or her chronological age, multiplying that number by
100, and then subtracting 100 (Rossetti, 1990; West Virginia Department
70 Chapter Three
of Human Resources, 2009). For example, a child with a chronological

age of 30 months who demonstrated a receptive language age of 21-24
months would present with a 30% delay in receptive language.
Results
All data were analyzed using paired t-tests to compare outcomes when
assessments were conducted in only the child’s primary language versus
his or her primary and secondary languages. Results revealed that single-
language outcomes underestimated the participants’ receptive and
expressive language skills.
Primary Language Testing

When assessed in the primary language only, participants’ average
receptive language skill age was 20.5 months (SD = 5.8 months),
representing a mean delay of 28.4% (SD = 16.2%). Average expressive
language skill age was 18.8 months (SD = 6.4 months), representing an
average delay of 34.1% (SD = 18.3%). When including scattered skills
(i.e., all ages at which skills were demonstrated), single-language
assessment revealed a highest receptive skill-age average of 21.8 months
(SD = 5.9 months) across participants and a highest expressive skill-age
average of 22.4 months (SD = 5.1 months).
Dual-Language Testing
When assessed in both primary and secondary languages, participants’
receptive skill age was 21.8 months (SD = 6.2 months), representing a
delay of 23.6% (SD = 15.8%). Expressive skill age was 21 months (SD =
6.6 months), representing an average delay of 26.3% (SD = 19%). When
accounting for scattered skills (i.e., skill distribution) across both
languages, average highest receptive skill-age was 24.3 months (SD = 4.9
months) and average highest expressive skill-age was 24.3 months (SD =
4.9 months).
Single- Versus Dual-Language Testing

The data were compared using t-tests. The results of the analyses
suggest that assessment in only the primary language significantly
underestimated receptive skill age by an average of 1.4 months (SD = 1.6
months, t(10) = 2.8868, p < .05) (see Table 3-2 and Figure 3-1) and
expressive skill age by an average of 2.2 months (SD = 1.4 months, t(10) =
5.1640, p < .05) (see Table 3-3 and Figure 3-2). Primary language
assessment also significantly overestimated the language delay by 4.7%
(SD = 5.7%, t(10) = 2.7368, p < .05) for receptive skills and by 7.8% (SD
= 5.4, t(10) = 4.8348, p < .05) for expressive skills (see Figure 3-3). The
findings also suggest that single-language assessment significantly
underestimated scattered skills by 2.5 months (SD = 1.8 months, t(10 =
4.5000, p < .05) in the receptive domain and 1.9 months (SD = 2.0 months,
t(10) = 3.1305, p < .05) in the expressive domain.
Discussion
The results of the present study confirm that assessing bilingual
children in only one language leads to a significant underestimation of
participants’ receptive and expressive language abilities and a significant
overestimation of their language delay. Scattered skill measurement,
which provides treatment planning and skill distribution information, was
also significantly underestimated. As a result of obtaining inaccurate
assessment outcomes, eligibility determination and treatment planning are
therefore compromised when assessing language skills in only one
language, and implementation of best practice (ASHA, 2010) is not
achieved. We conclude that clinicians working with bilingual children
must measure highest skill levels across both languages to obtain accurate
diagnostic and treatment planning information.
72 Chapter Three
Table 3-2: Receptive language ability as indexed by The Rossetti

(1990) Language Comprehension subtest.
Skill Age Assessment Highest Skill Age

Assessment
Primary Language Only
Primary Language Only

Skill Age Difference

% Delay Difference
(months; % delay)
(months; % delay)
Dual Language
Dual Language
Participant
(months)
(months)
(months)
(months)
1 15; 35% 18; 21% -3 -14% 15 21 -6
2 24; 20% 24; 20% -0 -0% 24 27 -3
3 27; 4% 27; 4% -0 -0% 27 27 -0
4 27; 25% 30; 17% -3 -8% 27 30 -3
5 21; 25% 21; 25% -0 -0% 24 27 -3
6 12; 60% 12; 60% -0 -0% 15 18 -3
7 15; 35% 18; 22% -3 -13% 15 18 -3
8 18; 49% 21; 40% -3 -9% 21 24 -3
9 18; 14% 18; 14% -0 -0% 21 21 -0
10 30; 14% 33; 6% -3 -8% 33 33 -0
11 18; 31% 18; 31% -0 -0% 18 21 -3
Mean 20.5; 28.4% 21.8; 23.6% -1.4* -4.7%* 21.8 24.3 -2.5*
Note: * = significant difference at p < .05
30 * * Primary Language
25
Age in Months
20 Primary and Secondary

Language
15
10
5
0
Skill Age Highest Skill
Age
Figure 3-1: Participants’ receptive language assessment results using The Rossetti
(1990) Language Comprehension subtest. Error bars represent standard errors and
asterisks indicate significant differences at p < .05.
30
* * Primary Language
25
Age in Months
Primary and
20
Secondary Language
15
10
5
0
Skill Age Highest Skill
Age
Figure 3-2: Participants’ expressive language assessment results using The Rossetti
(1990) Language Expression subtest. Error bars represent standard errors and
74 Chapter Three
Table 3-3: Expressive language ability as indexed by The Rossetti

(1990) Language Expression subtest.
Skill Age Assessment Highest Skill Age

Assessment
Only (months; % delay)

% Delay Difference
Primary Language
Primary Language
(months; % delay)
Dual Language
Dual Language
Only (months)
Participant
(months)
(months)
(months)
1 18; 21% 21; 9% -3 -12% 18 21 -3
2 24; 20% 24; 20% -0 -0% 24 27 -3
3 21; 25% 24; 14% -3 -11% 21 27 -6
4 24; 33% 27; 25% -3 -8% 33 33 -0
5 24; 14% 24; 14% -0 -0% 27 27 -0
6 9; 70% 9; 70% -0 -0% 27 27 -0
7 12; 48% 15; 35% -3 -13% 21 24 -3
8 18; 49% 21; 40% -3 -9% 21 24 -3
9 15; 28% 18; 14% -3 -14% 18 18 -0
10 30; 14% 33; 6% -3 -8% 21 24 -3
11 12; 53% 15; 42% -3 -11% 15 15 -0
Mean 18.8; 21; 26.3% -2.2* -7.8%* 22.4 24.3 -1.9*
34.1%
Note: * = significant difference at p < .05
35.00%
* * Primary Language
30.00%
25.00% Primary and

Percent Language Delay
Secondary
20.00% Language
15.00%
10.00%
5.00%
0.00%
Comprehension Expression
Figure 3-3: Percent language delay in primary-language-only assessment and in

dual-language assessment using The Rossetti (1990) Language Comprehension
and Language Expression subtests. Error bars represent standard errors and
Clinical Implications
The results of the present study are relevant for Early Intervention
initial evaluation and ongoing assessment methods. Frequently, initial
evaluations assess developing bilinguals in the primary language or
secondary language only, or the evaluation report does not discuss the
language of assessment. Consequently, questions may be drawn as to the
accuracy of children’s eligibility determination, as well as their speech-
language treatment planning. For example, if dual-language assessment
protocols are not followed, three of the eleven tested participants would
receive inappropriate referral for Early Intervention services. Although
these three participants would meet the 30% delay criterion when assessed
in only one language and could therefore be eligible for Early Intervention
services, when assessed across both of their languages, these participants’
language skills would fall within the average range for bilingual children.
Assessing children in only one language and inappropriately referring
them for services may cause these children’s families to direct limited
familial resources to the children’s treatment, as well as cause undue stress
on the family. Additionally, occupying a finite number of clinicians and
76 Chapter Three
limited funding is not warranted for these children. Children who are
significantly delayed and who actually meet the eligibility requirements
may linger on a waitlist or receive no services as children whose
development is age-appropriate receive treatment. Furthermore, not
accounting for a child’s second language perpetuates negative bias against
bilingual language learners and the differences in their course of language
development as compared to monolingual language development.
Appropriate treatment planning may also be impacted by single-
language assessment as treating therapists develop therapeutic goals and
establish the language of treatment based on the children’s initial
evaluation reports. Developing a treatment plan based on inaccurate
assessment outcomes and skill distribution information is not best practice,
and may hinder the child in reaching his or her full communicative
potential. Also, due to a lack of continuity and infrequent contact between
assessing and treating therapists in Early Intervention, the treating
therapist may not be able to determine how and in what language the
child’s skills were measured based on unreported or inaccurately-reported
language of assessment in the initial evaluation reports. Consequently, the
Early Intervention language assessment process must accurately and
thoroughly account for developing bilinguals’ composite language skills.
The research presented here has direct implications for how language
assessments should be structured. Prior to initiating the assessment process
for children who are developing more than one language, the assessor
must complete a thorough case history with the child’s parent or caregiver,
utilizing interpretation services as necessary. The case history should
include information related to medical history and current health status
(e.g., birth weight, hospitalizations, familial medical history),
developmental milestones (e.g., age the child first walked, first words),
linguistic environment (e.g., primary language, language input/output,
community language), and concerns regarding the child’s language skills
(e.g., the child uses less than five true words and jargoning to
communicate). Assessments should then measure the child’s highest
language skill across both developing languages, as well as scattered skills
and other qualitative information (e.g., the child produces the pronouns ‘I’
and ‘me’ in English and ‘me’ in Spanish independently, but is able to also
produce ‘yo’ in Spanish given support). For example, a child with a
primary language of English and secondary language of Spanish who is
able to follow 2-step directions in English and 1-step directions in Spanish
should receive credit for following 2-step directions. Measuring the
highest reported and observed language skills across languages ensures
that all of the child’s skills are given credit. As a result, the assessment
yields a more appropriate eligibility determination.
Conclusion
To conclude, we have shown that assessing bilingual children in only
the primary language can underestimate their language abilities, and may
result in inaccurate eligibility determination and over-identification of
language delays. Therefore, it is vital that language assessments in
children acquiring multiple languages account for abilities across all
developing languages. Measuring children’s skills in all developing
languages (as opposed to skills in only one language) yields a more
accurate and complete assessment, which has immediate benefits for
appropriate service eligibility determination and treatment planning.
While our current findings provide support for the use of dual-
language assessments when determining children’s eligibility for Early
Intervention services, future research will need to explore the use of
single- versus dual-language assessments as evaluated by independent
raters. Although concerns of examiner bias in the present study were
minimized because all evaluations were thoroughly reviewed and
approved by a non-treating clinician not involved in the present study,
more rigorous evaluation methods are prudent to ensure that the
differences between single- and dual-language assessments are reproducible
across a variety of contexts and populations. Assessment outcomes will
also need to be evaluated across other diagnostic tools (e.g.,
Communication and Symbolic Behavior Scales by Weatherby & Prizant,
(1993); The Language Development Survey by Rescorla (1989); etc.).
Finally, future research will need to investigate the magnitude of
misdiagnoses by expanding the participant selection to more diverse
groups of language speakers (e.g., sequential language learners) and
demographic makeups (e.g., high versus low socioeconomic status). By
ensuring that all children receive accurate diagnoses and referrals for Early
Intervention treatment, best practice standards will be met and increased
therapeutic success will be achieved.
References
American Speech-Language-Hearing Association. (2010). Code of ethics.
Retrieved from: http://www.asha.org/Code-of-Ethics/
Bzoch, K. A., League, R., & Brown, V. (2003). The receptive-expressive
emergent language scale third edition. Austin, TX: Pro-Ed.
Core, C., Hoff, E., Rumiche, R., & Señor, M. (2013). Total and conceptual
78 Chapter Three
vocabulary in Spanish-English bilinguals from 22 to 30 months:

Implications for assessment. Journal of Speech, Language, and
Hearing Research, 56(5), 1637-1649.
Dollaghan, C. A., & Horner, E. A. (2011). Bilingual language assessment:
A meta-analysis of diagnostic accuracy. Journal of Speech, Language,
and Hearing Research, 54, 1077-1088.
Hedrick, D. L., Prather, E. M., & Tobin, A. R. (1984). Sequenced
inventory of communication development-Revised. Torrance, CA:
Western Psychological Services.
Hoff, E., Core, C., Place, S., Rumiche, R., Señor, M., & Parra, M. (2012).
Dual language exposure and early bilingual development. Journal of
Child Language, 39(1), 1–27.
Illinois Department of Children and Family Services. (2003). Title 89:
Social services. Retrieved from:
www.wiu.edu/ProviderConnections/pdf/Rule_500.pdf
Illinois Department of Human Services Community Health and Prevention
Bureau of Early Intervention. (2009). Early Intervention service
descriptions, billing codes and rates: Early Intervention provider
handbook. Retrieved from:
www.wiu.edu/ProviderConnections/pdf/ServiceDescriptionManual09-
10.pdf
Kohnert, K. (2008). Language disorders in bilingual children and adults.
San Diego, CA: Plural Publishing.
—. (2010). Bilingual children with primary language impairment: Issues,
evidence and implications for clinical actions. Journal of
Communication Disorders, 43(6), 456-473.
Kohnert, K., & Goldstein, B. (2005). Speech, language, and hearing in
developing bilinguals: From practice to research. Language, Speech,
and Hearing Services in Schools, 36(3), 169-171.
Lowry, L. (2011). Bilingualism in young children: Separating fact from
fiction. Retrieved from: www.hanen.org/Helpful-Info/Articles.aspx
Marchman, V. A., & Martinez-Sussmann, C. (2002). Concurrent validity
of caregiver/parent report measure of language for children who are
learning both English and Spanish. Journal of Speech, Language, and
Hearing Research, 45(5), 983-997.
Marian, V. (2008). Bilingual research methods. In J. Altarriba, & R. R.
Heredia (Eds.), An introduction to bilingualism: Principles and
processes (pp. 13-38). Mahwah, NJ: Lawrence Erlbaum.
Marian, V., Faroqi-Shah, Y., Kaushanskaya, M., Blumenfeld, H. K., &
Sheng, L. (2009). Bilingualism: Consequences for language, cognition,
development, and the brain. The ASHA Leader, 14, 10-13.
National Joint Committee on Learning Disabilities. (2006). Learning

disabilities and young children: Identification and intervention.
Retrieved from: www.ldonline.org/article/11511/
Paul, R. (2007). Language disorders from infancy through adolescence:
Assessment and intervention. St. Louis, MO: Mosby Elsevier.
Rescorla, L. (1989). The Language Development Survey: A screening tool
for delayed language in toddlers. Journal of Speech and Hearing
Disorders, 54(4), 587-599.
Roseberry-McKibbin, C., Brice, A., & O’Hanlon, L. (2005). Serving
English language learners in public school settings. Language, Speech,
and Hearing Services in Schools, 36(1), 48-61.
Rossetti, L. (1990). The Rossetti Infant-Toddler Language Scale: A
measure of communication interaction. East Moline, IL: Linguisystems.
Thordartottir, E., Rothenberg, A., Rivard, M. E., & Naves, R. (2006).
Bilingual assessment: Can overall proficiency be estimated from
separate measurement of two languages? Journal of Multilingual
Communication Disorders, 4(1), 1-21.
United States Census Bureau. (2013). State and county QuickFacts.
Retrieved from: quickfacts.census.gov/qfd/states/17000.html
West Virginia Department of Human Resources. (2009). WV birth to three
percentage conversion chart. Retrieved from:
www.wvdhhr.org/birth23/files/wvbtt_perc_conv_%20chart.pdf
Wetherby, A. M., & Prizant, B. (1993). Communication and symbolic
behavior scales. Baltimore, MD: Paul H. Brookes Publishing.
Woods, J. J., & Wetherby, A. M. (2003). Early identification of and
intervention for infants and toddlers who are at risk for Autism
Spectrum Disorder. Language, Speech, and Hearing Services in
Schools, 34, 180-193.
CHAPTER FOUR
FREQUENCY AND CONFIDENCE

IN LANGUAGE LEARNING STRATEGY USE
BY GREEK STUDENTS OF ENGLISH
PENELOPE KAMBAKIS-VOUGIOUKLIS
AND PERSEPHONE MAMOUKARI
Abstract
The study reported in this chapter involved twelve Greek learners of
English in an oral administration of a translated and validated version
(Gavriilidou & Mitits, 2013) of the Strategy Inventory for Language
Learning (SILL) questionnaire (Oxford, 1990). There were two
innovations in this study, the first of which concerns the participants’
reporting of not only the frequency of use of each language learning
strategy (LLS), but also of their confidence in the effectiveness of each
strategy. The employment of this extra parameter provided the researcher
the potential to identify factors in learner strategy use not usually detected
by the indication of frequency use only. The second innovation concerns
the use of the bar (Kambaki-Vougioukli & Vougiouklis, 2008) instead of
the usual Likert scale, as this can be more flexible for both the participants
and the researcher. The results of the study showed deviations between the
frequency of strategy use and students’ confidence in the effectiveness of
the language learning strategies indicating that learners either appreciated
the effectiveness of a strategy but they did not know how to use it or that
they used a strategy without firmly believing in its usefulness. These
findings suggest the need for pedagogical interventions in order to raise
the learners’ awareness of language learning strategies and how to use
them effectively. Additionally, more proficient learners reported a higher
frequency and confidence in LLS use than their less proficient peers, while
the age of the learners did not seem to affect LLS use.
Language Learning Strategy Use by Greek Students of English 81
Introduction
Language learning strategies (LLS) are the conscious or semi-
conscious mental processes employed for language learning and language
use (Cohen, 2003). Research has shown that strategies may facilitate
language learning. As a consequence, strategic behavior has greatly
concerned research in language learning (Chamot, 2007; Wharton, 2000).
Moreover, there is enough convincing evidence that language learning
strategies can and should be taught (Chamot, 2005; Cohen & Macaro,
2007; Graham & Macaro, 2008).
Research has also indicated that the use of language learning strategies
can often be unclear since it depends on various factors, such as the
learners’ age, their target language proficiency, and the socio-cultural
context (see Tragant & Victori, 2012 and references therein). Moreover,
the different methodological tools selected to investigate use of LLS may
lead to discrepancies between studies.
Background
Strategy Inventory for Language Learning (SILL)
Oxford’s (1990) Strategy Inventory for Language Learning (SILL)

questionnaire has maintained its reliability, validity, utility (Oxford, 1996)
and, consequently, its popularity among researchers for more than three
decades. SILL measures how frequently learners use memory, cognitive,
comprehension, metacognitive, affective and social language learning
strategies, as described by Oxford (1990). SILL is used to identify the
level of strategy use (low, medium, high) and the statistical tool used to
measure this frequency is the 5-point Likert scale. Most studies on LLS
have employed this measurement for comparable results. Recently,
however, there have been researchers who argue that SILL has a lot more
potential not yet investigated and identified. For instance, Bull and Ma
(2001, p. 174) introduced the Learning Style-Learning Strategies addition
to SILL to measure “similarity between individual learning strategies”,
which may raise learner awareness of LLS use and usefulness.
Confidence is also an important, yet not systematically studied, factor
in the process of language learning. It has been investigated in association
with communication strategies (Kambaki-Vougioukli, 1992a, 1992b,
2001) and among regular student populations in Greece (Mathioudakis &
Kambaki-Vougioukli, 2010). Also, Intze and Kambaki-Vougioukli (2009)
and Intze (2010) investigated confidence in association with the strategy
82 Chapter Four
of guessing among Muslim learners of Greek as a second/foreign language

and found statistically significant differences between males and females,
with the latter being better at guessing and more confident too, compared
to their male peers.
The use of the SILL questionnaire hides a potential danger that may be
related to the actual time of the questionnaire completion and generates
certain questions:
x How familiar are the learners with the certain strategies mentioned
in the questionnaire?
x Are they sure they really employ the strategies they claim they do
or do they think so because they have heard the teacher or their
peers emphasize their importance?
Although one would assume that when learners claim they use a
strategy, they are most likely to consider it effective, we have reasons to
believe that this might not probably be the case. A series of studies
(Kambaki-Vougioukli, 2012, 2013) included confidence along with
frequency in the SILL questionnaire, namely the learners were asked to
specify not only how frequently they used each strategy but also how
confident they felt of its effectiveness. Results from these studies indicate
that when the learners claim they use a strategy, this does not necessarily
imply that they consider it effective as evidenced by the low confidence
scores. There have also been cases where learners claimed they did not use
a strategy but seemed confident that this strategy would help them in
language learning.
Finally, the close relation between the learners’ proficiency and the
frequency of strategy use would be of particular interest together with the
measurement of their confidence of strategy effectiveness.
Moreover, SILL questionnaires are generally in written form and their
data analysis process is usually quantitative. However, the oral
administration of SILL may glean important insights by stimulating the
learners’ individual experiences and by allowing the expression of
attitudes, feelings and behaviors, possibly opening up new topic areas. A
researcher might be able to better explain why a particular response was
given through a qualitative analysis of such results, alongside a
quantitative one.
Proficiency and Age in LLS Use

It has been found that more advanced learners are usually more
proficient in LLS use than less advanced learners (Magogwe & Oliver,
2007). However, there are studies that show no such connection (e.g.,
Phillips, 1991). Discrepancies across studies in this respect may be due to
differences between the participants’ cultural background (Psaltou-Joycey,
2008) and/or to the different ways in which proficiency is measured,
namely, based on the learners’ grades or the learners’/teachers’ relevant
opinions or independent proficiency tests (Tragant & Victori, 2012). In
addition to that, there is the question of whether advanced strategy use is
the outcome or the reason for high proficiency levels and there seems to be
a bidirectional relationship between the two, and interference in both ways
(Green & Oxford, 1995; MacIntyre, 1994; McDonough, 1999).
Similarly, inconclusiveness in the literature regards how age affects
LLS use. In short, while more mature learners are expected to be more
resourceful in LLS use, such an expectation is not verified in all studies
(Psaltou-Joycey & Sougari, 2010). As far as the interaction between age
and proficiency is concerned, Tragant and Victori (2006) conducted a
study with Spanish adolescent learners of English as a foreign language
(EFL), where the learners’ English proficiency was based on their school
grades. Results from this study showed that LLS use is affected by age,
irrespective of proficiency. It should be mentioned, however, that the
methodological instrument in the latter study was not the SILL
questionnaire.
Strategy Use by Learners in Greece

Kazamia (2003) investigated the strategy profile of Greek adults
learning English as a foreign language and the way they perceive the
tolerance of ambiguity while learning a foreign language. Kambakis-
Vougiouklis, Mamoukari, Agathopoulou and Alexiou (2013) during their
research with Muslim children learning English as a foreign language,
having Turkish as their native language and Greek as a second language,
also recorded the level of confidence of the students about the usefulness
of the strategies.
Accordingly, Psaltou-Joycey (2008) mainly focused on the effect of
factors such as age, proficiency and cultural background of university
students learning Greek as a second language. In Gavriilidou and Psaltou-
Joycey (2009), light is shed on issues such as the definition of strategies,
ways of recording them, strategies employed by effective learners, factors
84 Chapter Four
that influence the choice of strategies and the teaching of strategies. Another
study by Gavriilidou and Papanis (2010) investigated the effectiveness of
direct strategy teaching with suggested activities for Muslim students. In
2009, Psaltou-Joycey and Kantaridou investigated multilingualism in
relation to the use of learning strategies as well as learning styles.
Learning strategies are also investigated in the project “ĬĮȜȒȢ 2012”
with the translation and validation of the SILL questionnaire in Greek and
Turkish, aiming to the collection of useful data regarding learning strategy
use.
The previous studies suggest that there is close relation between the
learners’ proficiency and the frequency of strategy use, however, there has
been no recording of the learners’ confidence that the strategies they
employ are actually effective towards their learning.
The study reported in this chapter is part of a larger research project
that involved both Turkish-Greek bilingual students and native-Greek
students. Students provided their responses to the SILL questionnaire in
the form of oral protocols, i.e., face-to-face interviews, in order to have
their frequency of strategy use and confidence of strategy effectiveness
recorded. The oral administration of the SILL allowed interviewees to ask
for clarifications and the researcher to pose further questions and reach
more accurate conclusions about students’ use of language learning
strategies.
The Study
This study is part of a wider investigation conducted in Thrace, Greece
with two groups of learners of English, one Muslim, i.e., native speakers
of Turkish, and the other Christian, i.e., native speakers of Greek. The
terms Christian and Muslim are conventionally used to distinguish the two
groups. The Muslim group results have already been presented elsewhere
(Kambakis-Vougiouklis, Mamoukari, Agathopoulou, & Alexiou, 2013). In
this chapter, we focus on the LLS of the Greek native speakers and both
their frequency of strategy use as well as their confidence in the
effectiveness of those strategies. The study was guided by the following
research questions:
1. How and to what extent does investigating the learners’ confidence

in the effectiveness of a strategy, provide more information
regarding LLS use?
2. Are there any problematic items in the initial version of the SILL
questionnaire, i.e., items that are not well understood by the
learners?
3. Is the learners’ strategic behavior affected by their proficiency in
English (in combination with their age), and if so, how?
Participants
The learners in our study were all Greek and were recruited from the
first three grades of a public secondary school in Thrace (a prefecture in
the north-east of Greece). There were a total of 12 participants (six male
and six female), aged 12-15 years and learning English as a foreign
language. The sample comprised: four students from each grade: two of
low and two of high level in English; one male and one female in each
proficiency level. The learners’ level of English language proficiency was
estimated according to their performance in class and their course grades
by their English teacher, who was also one of the investigators in the
present research study. Learners of intermediate English language
proficiency were not included in the sample because previous research
found differences in LLS use only between learners of low and high
proficiency in the target language (Magogwe & Oliver, 2007).
An Alternative Statistical Tool: The [01] Bar

Kambaki-Vougioukli and Vougiouklis (2008) and Kambaki-
Vougioukli et al. (2011), investigating the possible hidden potential in the
SILL questionnaire, introduced an alternative way of measuring the
learners’ responses, which is the use of a bar [01] instead of the
conventionally used Likert scales on the assumption that such a tool
facilitates the collection and processing of the data. Responses on the bar,
i.e., 0_________________________________1, range from 0, which
represents a completely negative answer or attitude, to 1, which represents
a completely positive answer or attitude.
The greatest difference between using a Likert scale and using the bar
lies in the fact that the completion of the SILL questionnaire using a Likert
scale assumes that the learners fully understand the usually fine difference
between the different grades of the scale. On the other hand, through the
use of the bar, learners are allowed to indicate their use or attitude towards
a strategy by cutting the bar at any point they think that expresses their use
or attitude towards any item. There is no influence to their responses to the
questions by their linguistic knowledge, as it is mostly a hands-on
86 Chapter Four
procedure that requires them to ‘feel’ or sense their position on the bar,
rather than consciously think of the wording or having to choose from any
suggested division pre-arranged for them. Replacing the discrete character
of Likert scales by a fuzzy one, such as that of the bar, seems even more
suitable when a questionnaire is not in the learners’ mother tongue and
where insufficient linguistic knowledge of the target language may distort
the validity of the questionnaire. Similarly, at the results processing stage,
when using a Likert scale, researchers must decide in advance how many
divisions will be used. By contrast, such an initially predetermined
decision is not required by the employment of the bar. Moreover, it is
possible to process the same data using different subdivisions, for a
number of reasons including that of comparability with different research
studies.
The bar was first introduced at a length of 10 cm but was later
modified at 6.2 cm, which is the Golden Ratio of 10. The reason for this
change is that, as argued, since human eyes are used to the decimal
system, people can easily divide a 10 cm long bar equally, which is not
desirable in our case. On the other hand, a bar length of 6.2 avoids familiar
divisions, leaving the participant free to choose from an infinite number of
points (Vougiouklis & Kambaki-Vougioukli, 2011). Finally, Kambaki-
Vougioukli et al. (2011) compared the fuzzy bar with the Likert scale in an
application of a departmental evaluation questionnaire among all students
of the Department of Education in Alexandroupolis, Greece, asking the
students to specify which method they preferred. The results yielded an
overwhelming majority of 98% in favour of the bar.
Instruments and Data Collection Procedure

The questionnaire used in this study was the Greek version of the 50-
item SILL (Oxford, 1990) translated and validated by Gavriilidou and
Mitits (2013). Each question was followed by two separate bars. The first
bar was for measuring frequency of LLS use and the second one for
measuring confidence in the effectiveness of each strategy, as exemplified
in Figure 4-1.
Lannguage Learninng Strategy Use by Greek Studdents of English
h 87
Figure 4-1: A
An example froom the SILL qu
uestionnaire em
mploying the [0
01] bar for
frequency andd confidence.
The queestionnaire wasw orally ad dministered too all learnerrs during

individual innterviews witth their Engliish teacher. T The learners explained
e
their decisioon each time thhey marked where
w they cutt either of the bars. The
learners hadd been briefly instructed by the teacher-reesearcher abou ut how to
fill in the SILL questionnaire using g the bar, w which was something
completely nnew to them; they seemed to t understand it straight away. Then,
they were assked to pay attention to thee fact that nott only did they y have to
indicate how w often they used
u gy, but also hhow confident they felt
a strateg
with each sstrategy, or, inn other words, how effecttive they thou ught each
strategy wass. At this speccific moment, most learnerss reacted by saaying that
if they claim m they oftenn use a strategy, this impllies they conssider this
strategy effeective. They were,
w then, tolld that this miight not be neecessarily
so and that iit was an issuee to be investiigated. All intterviews were recorded
with the learrners’ consentt.
It was hyypothesized thhat when con nfidence was hhigher than frrequency,
then this straategy might need
n to be systematically taaught to learneers. If, on
the other haand, there wass lower confid dence than thee actual frequeency, one
could assumme that the leaarners probablly used the strrategy as a rou utine, not
really apprecciating its valuue.
Resu
ults
Within thhe content-analysis techniqque, all the ansswers were no
ormalized
into groupss on the bassis of two crriteria: (a) cconfidence, wherew the
deviation beetween frequeency of use an nd confidence in the effectivveness of
each strateggy for everyy single quesstion was exxamined; and d (b) the
questionnairre comprehennsion (wording g of the quesstions that miight have
caused somme problems). Also, a decision was maade on the (arbitrary)
convention tthat if the diffference between the confideence and the frequency
f
88 Chapter Four
scorings was 6 on the 6.2 bar, then it was negligible and no further
investigation was necessary. If it were higher, we estimated that it would
need investigation.
Confidence and Strategy Use

The main questions that concern this part of the analysis were:
1. Are the learners confident that the strategy they employ is effective
so they score high confidence where they score high frequency of
use?
2. Do they use certain strategies often but they are not sure of their
effectiveness, so they score lower in confidence?
3. Do they rarely use a strategy but nevertheless score high
confidence in this strategy?
Certain SILL items drew our attention regarding the way the learners
perceived and answered these items, always in relation to the confidence
factor. The items that were of greatest interest are the following:
Q.3 (Memory Strategy): Combining the image with the sound of a new
word. Eight out of the twelve participants scored equally in frequency and
confidence. Two scored lower in confidence, while two students appeared
to be rather puzzled, and one of them paused for quite some time before
scoring. Pausing for quite some time was translated as a problematic
behavior and was recorded accordingly. The student either did not
understand the description of the strategy or was in confusion of whether
s/he did that or not.
Q.5 (Memory Strategy): I use flashcards in order to remember the new
words (with the new word on one side and the definition or other
information on the other side). Two out of the twelve students scored with
no apparent deviation between frequency and confidence, while ten
students scored much lower in frequency than in confidence. One of the
students asked what exactly was implied by the word ‘flashcards’. The
interviewer explained the word so that the student could proceed with the
scoring. The majority of the students did not use the strategy even though
they considered it to be rather effective. This could be translated as a need
for instruction on how to actually make better use of the strategy, or as a
fact related to the participants’ age, who as teenagers may not use
flashcards for learning as much as very young learners.
Q6. (Memory Strategy): I physically act out new English words. Four
students scored equally in frequency and confidence, whereas eight
students scored remarkably lower in frequency. However, it is believed

that the students were not fully aware of the meaning of the verb ‘to act
out’, as one of them commented positively and said that she would speak
in English with her aunt. In this case, there is again the issue of under-
using the strategy in the learning process, even though most subjects feel
confident that the strategy is helpful. However, considering the age of the
students helps us explain why they did not choose to make use of the
strategy: low confidence and embarrassment to expose oneself is quite
common during adolescence. More reinforcement would be necessary so
as to enhance the use of this language learning strategy.
Q14. (Cognitive Strategy): I watch English language TV shows spoken
in English or go to movies. Six students scored minimal frequency and
very high confidence, while only one student scored higher in frequency.
There were four students who had the same level in both frequency and
confidence. This is another case of insufficient instruction of the strategy,
so the majority of the students do not use it while learning the second
language. Since the majority of students scored lower in frequency, clearly
denoting that they do not make use of that strategy, the need for
instruction is apparent, so that the students can fully exploit the benefits of
this strategy and achieve higher levels of language proficiency.
Q15. (Cognitive Strategy): I read books and magazines. Seven out of
the twelve students scored very high in confidence, despite the low scoring
in frequency. Three students scored equally on both bars, whereas two
students marked a much lower scoring in frequency compared to the high
scoring in confidence. The majority scored higher in confidence, leaving
the frequency bar under-scored. It is possible that if the students had been
taught the use and value of this strategy, they could have incorporated it in
the learning process.
Q37. (Meta-Cognitive Strategy): ǿ have clear goals for improving my
English skills. Seven students scored equally on both bars (confidence and
frequency). Five students out of the twelve gave a higher score on the
confidence bar. Three students made a connotation to their grades and
commented on that. However, they did not make any reference to their
goals, as if they did not fully understand the meaning of the English word
“goals”, as well as what setting goals entails and they could not fully
understand what they were supposed to do. In this case, there are almost
equal numbers in the students who either scored higher in confidence, or
equally on both bars, frequency and confidence. Still, instruction could be
given so as to clarify what exactly most students feel that helps them learn
the language.
90 Chapter Four
Q44. (Affective Strategy): I talk about the way I feel when learning
English. Eight out of the twelve students scored equally on both bars,
except for four students who marked a higher score on the confidence bar.
Most of them made no comments, apart from one who admitted that he
does feel stressed when he talks in English and another student, who
wanted to make sure he got it right, asking, or rather repeating the
statement, as if seeking for further explanation (which was not provided,
as he immediately proceeded with the scoring). The students seem to be
aware of the strategy and also have the confidence that the particular
strategy is helpful.
Problematic Perception of Questionnaire Items

During the administration of the SILL, there were certain items that
were not easily understood by the participants and needed further
explanation.
Question 4. (Memory Strategy): The use of rhymes. Eight students
scored higher in confidence whereas only one scored higher in frequency.
One of the students (advanced level) paused to understand the question
and another student (beginner) asked for further clarification, somewhat
puzzled and did not seem to fully understand the question. Both of the
students were girls. Most of the students scored higher in the confidence
bar.
Question 41. (Affective Strategy): Rewarding oneself when achieving
a goal in the target language. Six students scored higher in confidence,
and there were three that scored lower in confidence, and also three others
that had equal scoring both in the confidence and frequency bar. Two
students were embarrassed and one of them giggled not knowing what to
answer. He was encouraged by the interviewer to do so. Another student
made a rather long pause, as if she was trying to understand the question.
There was also a case of a student who directly asked for clarification, not
understanding what the interviewer meant by ‘rewarding’.
Question 46. (Social Strategy): Asking English speakers to correct me
when I talk. Six students scored equally both in frequency and confidence,
whereas four had a higher score in confidence. There were three cases of
students who were unable to reply to the question, probably because they
never had the experience of interacting with an English speaker before and
consequently were unable to answer, so the interviewer had to provide a
real life example taken from their student-life experience, in order to
encourage them to score on the bars.
Question 21. (Cognitive Strategy): Finding the meaning of an English

word by dividing it into parts that I understand. Five students gave an
equal score to both bars, and four out of the twelve students had
considerably lower scores in confidence than frequency. In this particular
question, the students had a hard time providing an answer, as some of
them could not understand what way they should answer. More
specifically, five students gave a wrong score, meaning they scored very
high in the frequency bar, whereas they reported that they always tend to
translate the words into Greek in order to understand them. Four of those
students were beginners. There were also six cases that required further
clarification, four of whom directly asked the interviewer what the
question actually meant, and what the required information was.
Question 27. (Compensatory Strategy): Reading English without
looking up every new word. Six students had equal scoring and four scored
higher in confidence. The interviewer noticed that what the students
reported orally, was not in accordance with the score they marked on the
frequency bar, indicating that they may not be fully aware of the actual
meaning of the reported strategy. In the case of a fifth student, the above
mistake passed unnoticed, and there was no cohesion of the verbal and the
written data. A sixth student had difficulty figuring out the meaning of the
statement, and made a rather awkward pause before scoring. The advanced
students tended to make awkward pauses in order to think about the
answer they should give. They were not interrupted until some time had
passed, so they were asked if there was anything wrong. Similar issues
were encountered with the following questions:
Q9: I say or write new English words several times.

Q11: I practice the sounds of English.
Q12: I start conversations in English.
Q22: I make summaries of information that I read or hear in English.
Q34: I plan my schedule so I will have enough time to study English.
Q39: I try to relax whenever I feel afraid of using English.
Q42: I notice if I am tense or nervous when I am using English.
On the other hand most of the somehow ‘problematic’ questions that

troubled the group of the beginners, lead the students to express their
uncertainty and ask for clarification, so as to make sure they are on the
right track. Such cases were in the following questions: Q18-I look for
words in my own language that are similar to new words in English
(twice); Q20-I try to find the meaning of a new word by dividing it into
parts; Q33-I try to find out how to be a better learner of English; Q43-I
92 Chapter Four
encourage myself to speak English even when I am afraid of making a

mistake. There were also cases of long pauses or hesitations in questions:
Q16-I write notes, messages, letters in English; Q17-I first skim an
English passage, then go back and read it; Q19-I try to find patterns in
English; Q29-If I can’t think of an English word, I use a word or phrase
that means the same thing. As it appears from the scoring on the bars,
most of the students scored very high in confidence, although the scoring
on the frequency bar was not as high. This could be explained as a need
for further instruction, so that the students are not only confident that the
particular strategy is of great importance but they also know how to use it
in order to enhance their language learning.
Other Problematic Areas

There were certain questions in the questionnaire that were particularly
problematic and might possibly need attention/revision before any future
administration of the instrument. For example, the question ‘I try not to
translate word-for-word and I read English without looking up every new
word’ caused a lot of confusion. Although some learners stated that they
avoid looking up words they still scored towards the left end of the bar (0),
which means they actually translate and look up words in dictionaries.
Others stated that they do look up words but scored towards the right end
of the bar (1), as if stating that they avoid it. Most of the learners asked for
clarification, others who did not, scored not in compliance with the
comment they made, contradicting themselves.
In total, 4 out of the 6 students in the beginner level either gave wrong
scoring or had to ask for clarification, whereas 3 out of the 6 advanced
students did the same thing.
The above prove the important advantage of the oral administration of
the SILL, combined with individual interviews as it allows clarifications
and may prevent subjects from making the wrong assumptions about any
of the SILL items. Lastly, it became apparent that the negatively worded
items were particularly problematic and they may need to be reworded in
future administrations of the instrument.
There were also questions that appeared to be quite similar, e.g., ‘I ask
English speakers to correct me when I talk’ and ‘I ask for help from
English speakers’. Learners told the researcher that one of them could
have been eliminated. To be more precise, 4 out of the 6 beginner students
gave almost identical scoring to both questions, and so did all of the
advanced students. The comments that the learners made indicated that
these two questions were identical in the mind of the students, and
therefore they extracted similar information from them. The questions that
were more problematic than others in the sense that they needed further
explanation or the students misinterpreted if no clarification was given
were questions Q21-I try not to translate word-for-word, and Q27-I read
English without looking up every new word. However, questions Q46-I
ask English speakers to correct me when I talk and Q48-I ask for help
from English speakers were dealt as if they expressed the same strategy
and therefore had the same impact on the students.
With regards to the two questions above, ‘I ask English speakers to
correct me when I talk’ and ‘I ask for help from English speakers’, there
was one more interesting observation. The lower level students reported
that they do not use these strategies but they believe that seeking help and
correction from others could help them. In contrast, the majority of the
advanced learners scored lower in confidence and some of them verbally
stated that they do not wish to be corrected or that they do not consider it
as a helpful strategy, i.e., they do not like it. This could be explained as a
refusal of the higher proficiency students to be corrected, as these students
are the ones who usually perform well, not only in English, but in other
subjects as well. Consequently, they might consider the correction by a
native speaker or by their teacher as a failure or negative exposure that
could cause them to ‘lose face’.
Discussion
Concerning the first question of this research study about whether
confidence affects learners’ choice of strategy, it was shown that in a
number of items there was great deviation between frequency of use and
confidence in the effectiveness of the strategy. This could imply an appeal
for strategy instruction, as the learners appear to be confident that the
specific strategy might help them, even if their frequency of use indicates
that they do not use the strategy often or even not at all in some cases. This
is an important finding as it demonstrates the difference between what is
used and what is considered useful. However, such an assumption would
have been impossible without the introduction of the parameter of
confidence and without the use of the bar.
As for the second question, there were a number of items in the
questionnaire that were identified as problematic items. These items
caused confusion among learners or were considered similar, and often
resulted in incorrect responses. Probably, such items need to be revised
and reworded before using this instrument again (see Dörnyei, 2003, as
well as Roszgowski & Soven, 2010 for suggesting similar improvements
94 Chapter Four
in questionnaires).
Finally, concerning the third research question, namely how
proficiency in English affects the learners’ strategic behaviour, it is
evident that it does, as the level of the students, beginners-advanced,
seems to influence not only their perception of the actual items of the
questionnaire, but also their recorded responses.
Conclusion
In the current study there is the obvious limitation of the small number
of participants since it was a pilot study. Future research with a larger
sample would allow quantitative analyses and correlations that would
provide more valid conclusions. The above limitation should also be taken
into serious consideration as, due to the way of the administration of the
suggested instrument–oral administration and individual interviews–the
fact that it cannot be applied to a large number of learners provides us with
very restricted data.
As a general conclusion, we could point out that apart from certain
improvement and/or changes that need to be performed on the
questionnaire to make it more appropriate for the specific learners, the
need for instruction is apparent as it will boost the learners’ strategy use so
as to make them more efficient and autonomous, and probably encourage
and reinforce their self-study. Moreover, the format of the data-collection
could be adapted, so that a bigger number of participants could be
included, and therefore more valid information could be extracted through
the use of a differentiated format of the same questionnaire, aiming to its
administration with larger groups of learners.
Acknowledgement
This study is part of the Thales Project MIS 379335. It was carried out
within the National Strategic Reference Frame (Ǽ.Ȉ.Ȇ.ǹ.) and was co-
funded by the European Union (European Social Fund) and the national
resources.
References
Bull, S., & Ma, Y. (2001). Raising learner awareness of language learning
strategies in situations of limited resources. Interactive Learning
Environments, 9(2),171-200.
Chamot, A. U. (2005). Language learning strategy instruction: Current

issues and research. Annual Review of Applied Linguistics, 25, 112-
130.
—. (2007). Accelerating academic achievement of English language
learners: A synthesis of five evaluations of the CALLA Model. In J.
Cummins & C. Davison (Eds.), The international handbook of English
language learning (pp. 317-331). Norwell, MA: Springer Publications.
Cohen, A. D. (2003). The learner’s side of foreign language learning:
Where do styles, strategies, and tasks meet? International Review of
Applied Linguistics in Language Teaching, 41(4), 279-291.
Cohen, A. D. & Macaro, E. (2007). Language learner strategies: Thirty
years of research and practice. Oxford: Oxford University Press.
Dörnyei, Z. (2003). Questionnaires in second language research:
Construction, administration, and processing. Mahwah, New Jersey:
Lawrence Erlbaum Associates.
Gavriilidou, Z., & Mitits, L. (2013). Adaptation of the Strategy Inventory
for Language Learning (SILL) for students aged 12-15 into Greek: A
pilot study. Paper presented at the 21st International Symposium of
Theoretical and Applied Linguistics, April 5th-7th, School of English,
Aristotle University of Thessaloniki, Greece.
Gavriilidou, Z., & Papanis, A. (2010). The effect of strategy instruction on
strategy use by Muslim pupils learning English as a second language.
Journal of Applied Linguistics, 25, 47-63.
Gavriilidou, Z., & Psaltou-Joycey, A. (2009). Language learning
strategies: An overview. Journal of Applied Linguistics, 25, 11-25.
Graham, S., & Macaro, E. (2008). Strategy instruction in listening for
lower-intermediate learners of French. Language Learning, 58(4), 747-
783.
Green, J. M., & Oxford R. (1995). A closer look at learning strategies, L2
proficiency, and gender. TESOL Quarterly, 29(2), 261-297.
Intze, P. (2010). Accuracy and confidence in Modern Greek vocabulary of
native and non-native speakers in Western Thrace (in Greek). Doctoral
thesis, Democritus University of Thrace, Greece.
Intze, P., & Kambaki-Vougioukli, P. (2009). Lexical guessing: Accuracy
and confidence of pupils of Greek as a first or second language.
Journal of Applied Linguistics, 25, 65-83.
Kambaki-Vougioukli, P. (1992a). Greek and English readers’ accuracy
and confidence when inferencing meanings of unknown words.
Proceedings of the 6th International Symposium on the Description
and/or comparison of English and Greek (pp. 89-112). Thessaloniki,
Greece: Aristotle University.
96 Chapter Four
—. (1992b). Accuracy and confidence of Greek learners guessing English

word meaning. Doctoral thesis, University of Wales, U.K.
—. (2001). īȜȦııȚțȒ ĮʌȠțĮĲȐıĲĮıȘ ʌĮȚįȚȫȞ İʌĮȞĮʌĮĲȡȚȗȠȝȑȞȦȞ Įʌȩ ĲȘȞ
ʌȡȫȘȞ ǼȈȈǻ: MȚĮ ȥȣȤȠȖȜȦııȚțȒ ʌȡȠıȑȖȖȚıȘ. PHASIS, ICPBMGS, 4,
71-81.
—. (2012). SILL revisited: Confidence in strategy effectiveness and use of
the bar in data collecting and processing. In Z. Gavriilidou, A.
Efthymiou, E. Thomadaki, & P. Kambaki-Vougioukli (Eds.), Selected
papers of the 10th ICGL (pp. 342-353). Komotini, Greece: Democtitus
University of Thrace.
—. (2013). Bar in SILL questionnaire for multiple results processing:
Users’ frequency and confidence. Sino-US English Teaching, 10(3),
184-199.
Kambaki-Vougioukli, P., & Vougiouklis T. (2008). Bar instead of a scale.
Ratio Sociologica, 3, 9-56.
Kambaki-Vougioukli, P., Karakos A., Lygeros N., & Vougiouklis, T.
(2011). Fuzzy instead of discrete. Annals of Fuzzy Mathematics and
Informatics (AFMI), 2(1), 81-89.
Kambakis-Vougiouklis, P., Mamoukari P., Agathopoulou, E., & Alexiou,
T. (2013). Oral application of SILL questionnaire using the bar for
frequency and evaluation of strategy use by Muslim pupils in Thrace.
Paper presented at the 21st ISTAL, April 5th-7th, Aristotle University of
Thessaloniki, Greece.
Kazamia, V. (2003). Language learning strategies of Greek adult learners
of English: Volumes I and II. Doctoral thesis, The University of Leeds,
UK.
MacIntyre, P. D. (1994). Toward a social psychological model of strategy
use. Foreign Language Annals, 27(2), 185-195.
Magogwe, J. M., & Oliver, R. (2007). The relationship between language
learning strategies, proficiency, age and self-efficacy beliefs: A study
of language learners in Botswana. System, 35(3), 338-352.
Mathioudakis, ȃ., & Kambaki-Vougioukli, P. (2010). The adjectival
identity of Ulysses in Kazantzakis’ ȅDYSSEY through the fuzzy sets.
Paper presented at the 4th European Congress of Greek Studies,
September 9th-12th, University of Granada, Spain. Retrieved from:
http://www.eens.org/EENS_congresses/2010/Mathioudakis_Nikolaos_
Kambaki-Vougioukli_Penelope.pdf
McDonough, S. H. (1999). Learner strategies. Language Teaching, 32(1),
1-18.
Oxford, R. L. (1990). Language learning strategies: What every teacher
should know. New York: Newburry House.
—. (1996). Employing a questionnaire to assess the use of language

learning strategies. Applied Language Learning, 7(1&2), 25-45.
Phillips, V. (1991). A look at learner strategy use and ESL proficiency.
CATESOL Journal, 4, 57-67.
Psaltou-Joycey, A. (2008). Cross-cultural differences in the use of learning
strategies by students of Greek as a second language. Journal of
Multilingual and Multicultural Development, 29(3), 310-324.
Psaltou-Joycey, A., & Kantaridou, Z. (2009). Plurilingualism, language
learning strategy use and learning style preferences. International
Journal of Multilingualism, 6(4), 460-474.
Psaltou-Joycey, A., & Sougari, A. (2010). Greek young learners’
perceptions about foreign language learning and teaching. In A.
Psaltou-Joycey & M. Mattheoudakis (Eds.), Advances in research on
language learning and teaching: Selected papers (pp. 387-401).
Thessaloniki: The Greek Applied Linguistics Association.
Roszkowski, M. J., & Soven, M. (2010). Shifting gears: Consequences of
including two negatively worded items in the middle of a positively
worded questionnaire. Assessment & Evaluation in Higher Education,
35(1), 113-130.
Tragant, E., & Victori, M. (2006). Reported strategy use and age. In C.
Munõz (Ed.), Age and foreign language learning rate. The BAF
project (pp. 208-236). Clevedon: Multilingual Matters.
Tragant, E., & Victori, M. (2012). Language learning strategies, course
grades, and age in EFL secondary school learners. Language
Awareness, 21(3), 293- 308.
Vougiouklis, T., & Kambaki-Vougioukli, P. (2011). On the use of the bar.
China-USA Business Review, 10(6), 484-489.
Wharton, G. (2000). Language learning strategy use of bilingual foreign
language learners in Singapore. Language Learning, 50(2), 203-244.
CHAPTER FIVE
THE DEVELOPMENT OF A VOCABULARY TEST

TO ASSESS THE BREADTH OF KNOWLEDGE
OF THE ACADEMIC WORD LIST
LEE-YEN WANG
Abstract
This research study investigated English as a foreign language (EFL)
college students’ vocabulary acquisition of a group of 52 Academic Words
that were excluded from the national wordlist for high school students in
Taiwan. The study found that the 52 omitted words were acquired
significantly less by both the freshman and senior students in Taiwan
compared with the other non-omitted 518 academic words. In addition, 38
of the 52 omitted words are also on the Academic Vocabulary List (AVL),
which was made available in 2013 (Davies, 2012), as part of the Corpus of
Contemporary American English (COCA). Again, these 38 shared words
were also acquired significantly less than the non-omitted academic words
by the same groups of freshman and senior students in this research study.
These findings highlight the limitations of having a centrally controlled
national wordlist for students to learn, as anything omitted from that list
will have a high probability of being missed subsequently in later
acquisition.
Introduction
In Taiwan, before students can be admitted to a college, they have to
take at least one mandatory national exam, and English is a required
subject. Much like other Asian countries, in Taiwan college entrance
examinations are high-stakes exams (Guo, 2005; Ng & Renshaw, 2009)
and students are encouraged to study hard for them early in their
The Development of a Vocabulary Test 99
childhood. In this environment, students’ English proficiency in high

school highly correlates with how they will perform in college (Luo, 2005).
Therefore, vocabulary acquisition during this period in students’ education
can be crucial. The vocabulary included in the English teaching materials
in Taiwan has to conform to the guidelines established by the English
Reference Word List (ERWL). This list determines what teachers teach
and what students learn. In general, teaching, learning, and testing form a
highly connected relationship. Due to the pressure of the high-stakes
exams, high school teachers try their best to cram all the listed words into
their lessons while students work hard to master them. After high school,
there is no word list available to guide English language teaching and
learning and the Ministry of Education (MOE) in Taiwan does not specify
other vocabulary lists to augment the ERWL for college students. In
previous research (Wang, 2015), it was found that there were 52 academic
words omitted from the ERWL. This finding provided a unique
opportunity for investigating the subsequent acquisition of the 52 omitted
academic words relative to the rest of the ERWL words.
Background
High school English education in Taiwan is under the guidance of the
ERWL, which is published and released by the College Entrance and
Examination Center (CEEC), a non-profit organization commissioned by
the MOE in Taiwan to administer the nationwide college entrance
examination (CEEC, 2015). Wang (2015) compared the ERWL with
West’s General Service List (GSL) (West, 1953), Coxhead’s (2000)
Academic World List (AWL), the 5,000 Frequency Dictionary (Davies &
Gardner, 2010) from the Corpus of Contemporary American English
(COCA), and the most frequent 500,000 words in the COCA. The study
concluded that the ERWL is a reasonable wordlist, but the comparison
study identified a set of 52 AWL words which were omitted from the
ERWL probably because when Jeng (2005) was compiling the ERWL, he
was unaware of the availability of the AWL by Coxhead (2000). This
study leverages this finding as a prism to investigate the acquisition of
vocabulary by English as a foreign language (EFL) college students in
Taiwan from the perspective of these 52 academic words.
Vocabulary learning is essential for language development. Wilkins
(1972) states that “without vocabulary, nothing can be conveyed” (p. 111),
not ignoring the importance of grammar, but stressing the role of
vocabulary in conveying ideas. Lexical knowledge is essential in all skill
100 Chapter Five
areas, including writing (Engber, 1995; Laufer & Waldman, 2011),

listening (Chang, 2007; Nation, 2006), speaking (Joe, 1998; Koizumi,
2013), and fluency (Harrington, 2006; Pellicer-Sanchez & Schmitt, 2012).
Vocabulary is, in general, categorized into three main groups, which
include the GSL-based high-frequency words (West, 1953), academic
vocabulary, and technical and low-frequency words (Nation, 2001).
West’s GSL was built using a corpus of 5 million words (West, 1953), and
it can cover between 80% and 90% of the words in a text. Technical
vocabulary is highly subject-dependent and its usage is discipline-specific.
Researchers and scholars also explored vocabulary items or phrases with a
higher frequency of occurrence in academic texts than in other genres
(Anderson 1980; Coxhead, 2000; Cummins & Man, 2007; Gardner &
Davies, 2014; Simpson & Mendis, 2003), namely, academic vocabulary.
The AWL, a list of 570 word families, was created from academic texts, in
four main areas: the arts, law, science, and commerce. These four areas
were further divided into 28 subjects (Coxhead, 2000, 2011) and the
academic words were categorized in ten sublevels with the highest
frequency group in Level 1 (Coxhead, 2000). Hyland and Tse (2007)
duplicated Coxhead’s corpus study with a corpus of 3.3 million words.
Their findings showed that Coxhead’s AWL has a skewed distribution
among the subcorpora, and 27% of the AWL items have very low
occurrences in at least one of the three subcorpora. Futrthermore, it was
found that the AWL can only account for 78% of the words when the
AWL is coupled with the GSL, rather than the usual 85% reported by
Coxhead (Hyland & Tse, 2007). This led Hyland and Tse to conclude that
there may be no universal academic vocabulary that can satisfy all the
academic disciplines. Eldridge (2008) argued that Coxhead compiled a set
of more general vocabulary than the GSL, that the AWL serves a
supportive function and the words are “not likely to be glossed by the
content teacher” (Flowerdew, 1993, p. 236). However, Coxhead (2011)
reviewed the impact and significance of the AWL and showed that the
AWL has a deep influence in the field of English language teaching and
learning.
While the AWL (Coxhead, 2000) can achieve 10.1% of text coverage
in the 3.5 million words of academic texts in her collection, Gardner and
Davies (2014) reported that their academic vocabulary list (AVL) can
cover 14% of their academic corpus of about 120 million words. This is
achieved by having 1,991 academic word families and 3,015 lemmas, but
without the GSL as the base. Since COCA’s AVL is a recent product, this
study will also investigate how many of the 52 omitted AWL words
overlap with COCA’s AVL and further examine the acquisition of these
overlapped words by Taiwanese students.
Properly assessing a learner’s vocabulary knowledge plays a
significant role in facilitating efficient language teaching and learning.
Assessment can evaluate vocabulary from a variety of perspectives: size,
depth, fluency, and other cognitive and association skills (Meara, 2002;
Meara & Wolter, 2004; Read, 2000; Schmitt, 2014; Sonbul & Schmitt,
2013; Tannenbaum, 2006). Read (1993, 2000) developed tests to assess
word association, including knowledge of collocation, derivative forms of
a stem word, and polysemous meaning senses.
Nation (2001) differentiated the passive vocabulary that a person can
understand in reading and listening, from the active vocabulary, which a
person can use in speaking and writing. A convenient way to measure
students’ passive or receptive vocabulary knowledge is through a checklist,
where students mark YES/NO (Y/N) to self-report whether they know
these words (Meara & Buxton, 1987). These Y/N tests are incapable of
surveying vocabulary depth (Laufer & Goldstein, 2004; Read, 2000), but
they are simple to administer and they are effective in assessing
vocabulary size or breadth (Read, 2007). However, the self-reporting
nature of Y/N tests requires researchers to implement further checks to
assess the reliability of the self-reporting. Pseudowords were introduced to
the Y/N tests in the reading comprehension assessment by Anderson and
Freebody (1983), with two approaches to creating a pseudoword. The first
one is to add a prefix or suffix to a real word, e.g., ‘steal’ becomes
‘stealment.’ Modifying the vowel or the consonant, one or two at a time,
forms the second method of constructing a pseudoword. Pseudowords
cannot be counted as real words, so they are dealt with as ‘hit’ and ‘false
alarm.’ A hit indicates that the pseudoword is correctly recognized as a
pseudoword, while a false alarm is when the test-taker claims to know a
pseudoword as if it were a real word. The relative numbers of these two
measures can indicate if the test result can be trusted Meara (1992).
The Study
This study is based on the findings by Wang (2015) that there is a set
of 52 AWL words that were omitted from the ERWL. A question that
arises is whether the omitted vocabulary items can be acquired later in
college and how they are acquired. In addition, because the COCA also
released the AVL, this study leverages the AVL to investigate the number
of overlaps between the omitted AWL words and the AVL and the
102 Chapter Five
acquisition of these overlapped words. To solicit students’ knowledge

about vocabulary, a Y/N test was used as a convenient tool, knowing it
could be affected by incorrect self-reports from respondents. To address
this deficiency, a popular approach is to embed questions with
pseudowords in the test.
Research Questions
In the current study, three questions were posited to investigate the
acquisition of the 52 omitted academic words in the AWL relative to the
set of non-omitted AWL words:
1. Is there a significant difference between the acquisition of the

non-omitted academic words and the omitted academic words
between freshman and senior college students in Taiwan?
2. Does the inclusion of pseudowords affect the design of the Y/N
questions in the vocabulary test?
3. For the omitted words that are both on the AWL and the AVL, is
there a significant difference in the acquisition of the overlapped
words between freshman and senior college students in Taiwan?
Participants
Two classes of freshman and one class of senior students from a
private university in Taipei, Taiwan participated in this study. This
university has a policy of assigning students to freshman English classes
by the level of their General Scholastic English Ability Test (GSEAT),
which is a mandatory exam every high school senior student in Taiwan
has to take in early February of each year. There are 15 levels in the
English subject. The freshman students who participated in the study were
at the GSEAT Level 13 or above, which is in one of the top percentiles of
76% to 82% of all students who took the exam. One student was at Level
15, which is in the top percentile of 88% to 100%. These two classes of
freshman students had the highest GSEAT score in the university. A total
of 50 freshman students participated in this study. A further 39 senior
students participated in this study from the English department. Table 5-1
summarizes some general information about the study participants.
Table 5-1: Participant information.
Participants No. Male Female Age Range First Language

Freshmen 50 21 29 18-20 Chinese
Seniors 39 4 35 21-24 Chinese
Compilation of Y/N Checklist

There are 570 word families in the AWL. It is not feasible to use the
Y/N checklist to ask students to report on so many questions at one time.
With 52 omitted words, this leaves 518 words to be sampled. The formula
introduced by Krejcie and Morgan (1970) was used to perform a sampling.
Using this method, a total of 220 words from the AWL at a 5.02% margin
of error were selected for inclusion in the Y/N checklist.
A further 82 pseudowords were chosen for inclusion in the Y/N
checklist, with a margin of error 9.07%, again calculated from the formula
by Krejcie and Morgan (1970). If a 5% margin of error were to be
maintained, it would require 160 pseudowords, and the overall questions
would exceed the size of an A3 paper. The words in the pseudoword set
were modified by changing a consonant or a vowel according to the
approaches by Anderson and Freebody (1983). In this approach,
pseudowords are fabricated in synforms (Laufer, 1988), which can be
confusing to Y/N test-takers.
Finally, these two sets (220 words sampled from the AWL and 82
pseudonyms) were then combined with the 52 omitted academic words to
form the final Y/N checklist set of 354 words. The sampling process did
not randomly select the words from each of the 10 levels in the AWL but
treated all the words as a single pool. Table 5-2 shows how many of the
354 words on the Y/N checklist belonged to which AWL Level while
Table 5-3 provides a list of the 52 omitted AWL words. Since the number
of words per level differed, there was a need to check if this represented a
uniform distribution across the ten academic levels defined in the AWL by
Coxhead (2000). On average each level should have 35.4 words because
the total is 354 words with 10 levels. A Chi-square analysis was performed
to use 35.4 as the expected value for each level to check against the
subtotal in each level, and it was not statistically significant (Ȥ2 (9.842, 9)
= 0.3634, p< .05). Therefore, the sampling process was proven to be a
uniform distribution.
104 Chapter Five
Table 5-2: Distribution of the sample words across the AWL Levels.
AWL Level 1 2 3 4 5 6 7 8 9 10 Total

Set 1 (non-omitted) 26 31 18 32 24 22 23 14 23 7 220
Set 2 (omitted) 1 0 5 4 3 7 7 10 5 10 52
Set 3 (pseudowords) 12 10 7 4 10 9 8 11 8 3 82
Total 39 41 30 40 37 38 38 35 36 20 354
Table 5-3: Omitted AWL words with Level information.
Words Level Words Level Words Level

adjacent 10 discrete 5 intrinsic 10
aggregate 6 domain 6 invoke 10
albeit 10 empirical 7 legislate 1
amend 5 entity 5 levy 10
append 8 fluctuate 8 negate 3
arbitrary 8 hierarchy 7 notwithstanding 10
attribute 4 hypothesis 4 offset 8
automate 8 ideology 7 ongoing 10
concurrent 9 implicate 4 paradigm 7
constrain 3 incidence 6 parameter 4
convene 3 incorporate 6 practitioner 8
criteria 3 infrastructure 8 predominant 8
deduce 3 inhibit 6 protocol 9
denote 8 innovate 7 qualitative 9
deviate 8 integral 9
Table 5-4 presents the list of the 82 pseudowords included in the Y/N
checklist. In this table, the pseudowords, the AWL Levels, and the original
words for the pseudowords are presented. For instance, ‘abandon’ was
modified to ‘abendon’ using vowel modification. The word ‘availabler’
was created by adding the suffix ‘-er’ to the word ‘available’. The word
word ‘breif’ was created from ‘brief’ with an ‘ie’ and ‘ei’ swap. The
purpose was to gauge if pseudowords derived from legitimate academic
words in the AWL would produce different responses between freshman
and senior students.
Table 5-4: Pseudowords with Level information and the original AWL
words.
Pseudoword L Original W Pseudoword L Original W

abendon 8 abandon coerperate 6 cooperate
acadamy 5 academy coopler 7 couple
accompeny 8 accompany crecter 1 create
acheive 2 achieve desing 2 design
acknowladge 6 acknowledge dymension 4 dimension
assast 2 assist emphesis 3 emphasis
attein 9 attain enhiance 6 enhance
availabler 1 available enture 3 ensure
banefit 1 benefit evaruate 2 evaluate
breif 6 brief eventuel 8 eventual
chellange 5 challenge exband 5 expand
chairter 8 chart espend 5 expand
circumstiance 3 circumstance flexibel 6 flexible
coite 6 cite fourmula 1 formula
clerify 8 clarify genertion 5 generation
coharent 9 coherent hypothisis 4 hypothesis
commance 1 commence imblicit 8 implicit
commision 2 commission implisit 8 implicit
cmunicate 4 communicate imbly 3 imply
consast 1 consist inkome 1 income
contrdict 8 contradict individaul 1 individual
ratioe 5 ratio salect 2 select
regulete 2 regulate spedific 1 specific
restract 2 restrict sdyle 5 style
refeal 6 reveal supmit 7 submit
ritid 9 rigid tehnology 3 technology
skope 6 scope salect 2 select
whereask 5 whereas widspread 8 widespread
inevatable 8 inevitable morme 9 norm
106 Chapter Five
Pseudoword L Original W Pseudoword L Original W

indegrity 10 integrity nowithstanding 10 notwithstanding
invoulve 1 involve occubie 4 occupy
jornual 2 journal overlape 9 overlap
justefy 3 justify persive 2 perceive
layert 3 layer percist 10 persist
leglel 1 legal predede 6 precede
marture 9 mature precite 5 precise
midia 7 media prioridy 7 priority
purcue 5 pursue prosibit 7 prohibit
qualitetive 9 qualitative morme 9 norm
thoery 1 theory uniforn 8 uniform
thezis 7 thesis uniqeu 7 unique
violit 9 violate vesible 7 visible
Note: W = Word; L = Level.
Final Y/N Checklist

The words in each of the three sets were tagged (omitted, non-omitted,
and pseudowords) and then combined to be randomized. The randomized
words were numbered consecutively, from 1 to 354. The Y/N checklist
done on paper and the instructions asked students to indicate if they knew
each word.
Since there have been no studies to have such an extensive checklist to
be completed by students, a pilot test was conducted with a group of six
students, four freshmen and two seniors, to see how much time it would
take for the students to complete the test. The checklist completion time
was found to be between 11 and 16 minutes, so the formal checklist
completion time was set to 20 minutes, and it was found to be sufficient.
The checklists for both senior and freshman students were
administered and collected in the first three weeks of the fall semester
when freshmen had just started their college study in Taiwan. SPSS
PASW Statistics 18 was used for data analysis.
An appropriate metric in this study was to compare the relative ratios
between the omitted words and non-omitted words. The following
variables were needed to explain this metric:
x OC: the total count of ‘Yes’ responses in the omitted AWL words
in the checklist.
x OT: the total number of omitted AWL words in the ERWL (N =

52).
x NC: the total count of ‘Yes’ responses in the non-omitted AWL
words.
x NT: the total number of non-omitted words determined by the
sampling process (N= 220).
x PC: the total number of ‘hits’ (i.e., ‘Yes’ responses) for
pseudowords.
x PT: the total number of pseudowords (N = 82).
x PP: PC/PT (the total number of ‘hits’ (i.e., ‘Yes’ responses) for
pseudowords divided by the total number of pseudowords).
x OP: OC/OT (the total count of ‘Yes’ responses in the omitted
AWL words in the checklist divided by the total number of omitted
AWL words in the ERWL).
x NP: NC/NT (the total count of ‘Yes’ responses in the non-omitted
AWL words divided by the total number of non-omitted words
determined in the sampling process).
x RP: NP/OP (the value derived from calculating the NP divided by
the value derived after calculating NT).
The concept and calculation of PP, OP and NP is straightforward,

however RP needs to be explained. If students’ knowledge of the omitted
AWL words has the same distribution as the non-omitted words, the
relative ratio will be 1. For instance, if a student reports 110 ‘Yes’ out of
220 and 26 ‘Yes’ for the 52 omitted words, we will have the following:
NP = NC110/NT220 = 0.5
OP = OC26/OT52 = 0.5
RP = NP/OP = 1
If this student keeps the same NC, 110 ‘Yes’, but only 39 ‘Yes’ out of
52 (OT), the calculation will then be as follows:
NP = NC110/NT220 = 0.5
OP = OC39/OT52 = 0.75
RP = NP/OP = 0.667
On the other hand, if this student keeps the same NC, 110 ‘Yes’, but
reports only 13 ‘Yes’ out of 52 (OT), the calculation will then be as
follows:
108 Chapter Five
NP = NC110/NT220 = 0.5
OP = OC13/OT52 = 0.25
RP = NP/OP = 2
Evidently, if the relative ratio (RP) between NP and OP is much larger

than 1, it indicates that the count of the omitted words in the checklist (i.e.,
the number of omitted words a student has indicated that he/she knows) is
relatively much smaller than the non-omitted ones (i.e., the number of
non-omitted words a student has indicated that he/she knows) and vice
versa. When the RP value is close to 1, it means these two counts are also
close to each other. If OP is zero (i.e., a student did not know any of the
omitted words), the calculation for RP is reversed, but the reversion
applies to the whole calculation.
Results and Discussion

With regards to the first research question, whether there is a
significant difference between the acquisition of the non-omitted academic
words and the omitted academic words between freshman and senior
college students in Taiwan, the results of the statistical analyses are shown
below in Tables 5-5 and 5-6.
Table 5-5 shows that the variable NC for the senior students had a
value, on average, of 176.41 words out of 220, and this is 80.2%. The total
count of ‘Yes’ responses in the omitted AWL words (in variable OC) for
the freshman and senior students were 9.74 (18.7%) and 16.692 (32.1%),
respectively. It is clear that freshman students knew an average of 73.1%
of academic words that are on the ERWL, but only 18.7% of academic
words that are not on the list. The situation improves considerably in the
senior students, but the percentage ratio is still 80.2% to 32.1%. The ratio
of NP/OP indicates the ratio between the total number of non-omitted
words to the total number of omitted academic words. This was 4.868 and
3.183 for the freshmen and seniors respectively, and both are significantly
greater than 1. Table 5-6 presents the results of an ANOVA comparison,
and it shows that there is a statistically significant difference between
freshman and senior students in this ratio measurement.
Table 5-5: Descriptive statistics (freshmen = 50, seniors = 39).
95%
95% CI
Year DV Mean SD Min Max Range CI
Upper
Lower
NC 160.80 26.439 102 204 102 144.23 174.12
Freshmen
NP .731 .1202 .4634 .927 .464 .6556 .791437

OC 9.74 4.861 2 21 19 8.80 16.24
OP .187 .093 .0385 .404 .365 .1693 .312
NP/OP 4.869 2.400 2.202 13.473 11.270 3.226 5.733
NC 176.41 31.263 100 218 118 176.27 191.52
NP .802 .142 .455 .991 .536 .80122 .871
Seniors
OC 16.69 8.037 2 35 33 14.39 19.82

OP .321 .1545 .039 .673 .635 .277 .381
NP/OP 3.183 2.068 1.45 13.710 12.264 2.322 4.123
Note: SD = Standard Deviation; DV = Dependent Variable; CI = Confidence
Interval; NT, NP, OT, OP, NP/OP are defined in the previous section.
Table 5-6: Comparison between freshmen and seniors in their

knowledge of non-omitted/omitted academic words using ANOVA.
Type III Averaged Observed

Source df F Sig Ș2
Sum of Square Sum of Square Power
Between 62.256 1 62.256 12.177 0.001 0.123 0.932
Within 444.802 87 5.113
Sum 507.058 88
Note: df = degrees of freedom; F = ANOVA result; Sig = Significance level; Ș2 =
effect size.
The combined results from Tables 5-5, and 5-6 indicate that the ratio
of the non-omitted to the omitted academic words for senior students was
lower than that for freshman students; and both ratios are significantly
greater than 1, which indicates that these two groups of words are not
known by the students in equal proportion. This can be attributed to the
110 Chapter Five
washback effect (Shohamy, Donitsa-Schmidt, & Ferman, 1996) as high

school teachers in Taiwan are not teaching words that are not covered in
the ERWL. Instead, they are drilling heavily the words that are defined in
the ERWL. It is not surprising then that even senior English majors fail to
acquire a significant amount of the omitted AWL words. These results are
a clear indication that words that are not in the ERWL will likely not be
learned by the students.
The second research question in this study concerned the inclusion of
the pseudowords and how this might affect the design of the Y/N test. The
inclusion of pseudowords in the test aimed to assess how the respondents
handle the checklist and the quality of their self-reports. Some researchers
argue against the value of the pseudowords because overestimations and
false alarms are not a concern as the error rate due to false alarms for high
proficiency students in general and Asian students in particular is low
(Harrington & Carey, 2009; Mochida & Harrington, 2006; Stubbe, 2012).
In this study, both the freshman and the senior groups were composed of
high proficiency students. The internal consistency of the test was
calculated using Cronbach’s Alpha (see Table 5-7) and it was found to
have the lowest value for Set 3 (pseudowords) most likely because the
pseudowords were harder to recognize than the real words. For Set 1
(non-omitted words) Cronbach’s Alpha was 0.969, and this was higher
than the other two sets. The total checklist was composed of the three sets
combined (S1+S2+S3) and it had a Cronbach’s Alpha of 0.964 which was
less than that of S1+S2 (0.973). Usually the longer the questionnaire the
better the value of Cronbach Alpha. Since the inclusion of pseudowords in
the test decreased Cronbach’s Alpha, and since it was also a low reliability
construct as evidenced by its 0.759 Cronbach’s Alpha, the conclusion in
this study is not favorable for including the pseudowords in the checklist
as suggested in other research (Harrington, 2006; Harrington & Carey,
2009; Stubbe, 2012).
The key question is then: did the use of pseudowords in the Y/N
checklist make a difference between freshman and senior students? A
MANOVA was performed to see if the vector mean of the 82
pseudowords in Set 3 was different between freshman and senior students.
The results showed that there was no difference between the two groups of
students (F (10, 78) = 1.223, p = .095, Ș2 = 0.905, Observed Power =
0.422).
Table 5-7: Reliability analysis.
Category Set 1 Set 2 Set 3 S1+S2 S1+S2+S3

No. of items 220 52 82 272 354
Cronbach’s Alpha 0.969 0.871 0.759 0.973 0.964
Note: Set 1 = non-omitted words; Set 2 = omitted words; Set 3 = pseudowords.
Finally, the last question this study attempted to answer was whether
there was a significant difference in the acquisition of the omitted words
that were both on the AWL and AVL by freshman and senior students.
Gardner and Davies (2014) reported that 451 out of the 570 AWL are in
the first 4,000 most frequent lemmas in the COCA. This leaves an
interesting question about how many of the 52 omitted words are in the
AVL. From COCA’s perspective, this would show how important the
omitted words are. A comparison was performed, and it was found that 38
of them were in COCA’s AVL. This indicates that from the perspective of
COCA’s AVL, these 38 words should not have been omitted by the
ERWL.
Table 5-8 shows the comparison in students’ performance between the
52 omitted academic words and the subset of 38 words (see Table 5-9) that
are in the new academic vocabulary list released by Gardner and Davies
(2014).
Table 5-8: Descriptive statistics for the 52 and 38 omitted items.
Year 95% 95%

Set 1: freshmen Mean Std Min Max Range CI CI
4: seniors Lower Upper
OP1/NP1 4.869 2.4 2.202 13.473 11.27 3.226 5.733
52 Items
OP4/NP4 3.183 2.068 1.45 13.71 12.264 2.322 4.123
OP1/NP1 5.559 5.654 1.62 28.5 26.88 3.114 8.004
38 Items
OP4/NP4 3.229 2.091 1.37 10.02 8.65 2.418 4.040
Note: RP= OP/NP˗Std = Standard deviation; CI = Confidence Interval
112 Chapter Five
Table 5-9: Omitted academic words and their COCA ranks and
frequencies.
AWL Level
AWL Level
AVL Rank
AVL Rank
Frequency
Frequency
COCA
COCA
COCA
COCA
Word Word
implicate 4 319 12,440 invoke 10 805 2,685

hypothesis 4 345 11,533 thesis 7 813 2,613
incorporate 6 370 10,580 aggregate 6 854 2,334
attribute 4 373 10,491 arbitrary 8 905 1,990
domain 6 481 7,319 concurrent 9 996 1,571
constrain 3 485 7,237 whereby 10 1003 1,535
underlie 6 499 6,818 discrete 5 1008 1,516
ongoing 10 533 6,114 adjacent 10 1026 1,390
hierarchy 7 537 6,012 denote 8 1028 1,377
innovate 7 545 5,901 fluctuate 8 1202 856
scenario 9 555 5,761 intrinsic 10 1243 784
practitioner 8 571 5,482 simulate 7 1288 714
empirical 7 591 5,127 negate 3 1357 607
entity 5 619 4,578 deviate 8 1396 544
parameter 4 631 4,426 deduce 3 1413 517
paradigm 7 637 4,356 qualitative 9 1444 470
infrastructure 8 673 3,913 append 8 1736 203
predominant 8 720 3,330 offset 8 1916 68
integral 9 777 2,891 legislate 1 1970 37
Note: AWL = Academic Word List; COCA = Corpus of Contemporary American
English; AVL = Academic Vocabulary List.
Looking at the results of the analysis, it is obvious that the shorter list
(38 items) was more difficult for the freshman students than for the senior
students because for the senior students the OT4/NP4 in the 52 items was
3.183 while the OT4/NP4 in the 38 items was 3.229. In contrast, these two
values for the freshmen were 4.869 and 5.559, respectively. Furthermore,
the ANOVA comparison between the freshmen and the seniors for the
short list was found to be significant (F (1, 87) = 10.401, p = .002, Ș2 =
0.107, Observed Power = 0.891), indicating that there is a significant
difference in the acquisition of the 38 words between the freshman and the
senior students, with seniors outperforming the freshmen. Nevertheless,

the ratio of OT/NP for the senior students is still far from the ideal
situation, which is 1. This would occur if the acquisition of the omitted
words was the same as the acquisition of the non-omitted words.
Conclusion
The study reported in this chapter shows that there is indeed a
significant difference between freshman and senior college students in
Taiwan in their receptive knowledge of academic words that are on the
ERWL list and those that are not on the list. The freshmen knew an
average of 160.8 words out of the 220 sampled words or 73.1%, while the
seniors reported that they knew 176.41 words out of the 220, or 80.18%.
In contrast, the freshman and the senior students reported that they knew
9.74 (18.7%) and 16.69 (32.09%) words out of the omitted 52 words,
respectively. The average ratio of the non-omitted words to the omitted
words was 4.869 times for the freshman students and 3.183 times for the
seniors. The gap becames smaller as students learned more English, but it
was still far from the ideal 1:1 ratio. Further research is needed to find out
the reasons for senior English majors failing to acquire the omitted AWL
words. In addition, pseudowords were found not to have a discriminating
effect between freshman and senior students in this study. A
cross-examination with the new academic vocabulary list from COCA
found that 38 of the omitted AWL) words are also present on the new list
(AVL), and there is also a significant difference in acquiring these
overlapped words between both groups of students. Although vocabulary
testing is better in context (Read, 2004; Read & Chapelle, 2001), this
354-item checklist serves to show a significant learning discrepancy. The
findings of this study can be useful to the CEEC in order to help refine and
augment the ERWL, while college and high school English language
instructors can use the information in this study to help find ways to teach
the omitted words to their students.
References
Anderson, J. (1980). The lexical difficulties of English medical discourse
for Egyptian students. English for Specific Purposes Newsletter, 37, 4.
Anderson, R. C., & Freebody, P. (1983). Reading comprehension and the
assessment and acquisition of word knowledge. In B. A. Hutson (Ed.),
114 Chapter Five
Advances in reading/language research, Vol. 2 (pp. 231–256).

Greenwich, CT: JAI Press.
Bauer, L., & Nation, P. (1993). Word families. International Journal of
Lexicography, 6(4), 253-279.
CEEC. (2015). GSAT and AST. Retrieved from:
http://www.ceec.edu.tw/CeecEnglishWeb/E07Process_GSAT.aspx
Chang, A. C. S. (2007). The impact of vocabulary preparation on L2
listening comprehension, confidence and strategy use. System, 35(4),
534-550.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),
213-238.
—. (2011). The academic word list ten years on: Research and teaching.
TESOL Quarterly, 45(2), 355-362.
Cummins, J., & Man, E. Y. F. (2007). Academic language: What is it and
how do we acquire it? In J. Cummins & C. Davison (Eds.),
International Handbook of English Language Teaching (pp. 797-810).
New York: Springer.
Davies, M., & Gardner, D. (2010). A frequency dictionary of
contemporary American English: Word sketches, collocates, and
thematic lists. London: Routledge.
Davies, M. (2012). Corpus of contemporary American English
(1990–2012). Retrieved from: http://corpus.byu.edu/coca/
Eldridge, J. (2008). No, There isn’t an ‘academic vocabulary,’ But…”: A
reader responds to K. Hyland and P. Tse’s “Is there an ‘academic
vocabulary’?”. TESOL Quarterly, 42(1), 109-113.
Engber, C. A. (1995). The relationship of lexical proficiency to the quality
of ESL compositions. Journal of Second Language Writing, 4(2),
139-155.
Flowerdew, J. (1993). Concordancing as a tool in course design. System,
21(2), 231-244.
Gardner, D., & Davies, M. (2014). A new academic vocabulary list.
Applied Linguistics, 35(3), 305-327.
Guo, Y. (2005). Asia’s educational edge: Current achievement in Japan,
Korea, Taiwan, China, and India. Lanham MD: Lexington Books.
Harrington, M. (2006). The Yes/No test as a measure of receptive
vocabulary knowledge. Language Testing, 23(1), 73-98.
Harrington, M., & Carey, M. (2009). The on-line Yes/No test as a
placement tool. System, 37(4), 614-626.
Hyland, K., & Tse, P. (2007). Is there an “academic vocabulary”?. TESOL
Quarterly, 41(2), 235-253.
Jeng, H. H. (2005). The methodologies and principles for compiling the

vocabulary list of college entrance examination center. Education
Research Monthly, 138, 5-17.
Joe, A. (1998). What effect do text-based tasks promoting generation have
on incidental vocabulary acquisition? Applied Linguistics, 19(3),
357-377. doi: 10.1093/ applin/ 19.3.357.
Koizumi, R. (2013). Vocabulary and speaking. The Encyclopedia of
Applied Linguistics. Hoboken, NJ: Wiley.
Krejcie, R. V., & Morgan, D. W. (1970). Determining sample size for
research activities. Educational and Psychological Measurement, 30(3),
607-610.
Laufer, B. (1988). The concept of ‘synforms’ (similar lexical forms) in
vocabulary acquisition. Language and Education, 2(2), 113-132.
Laufer, B., & Goldstein, Z. (2004). Testing vocabulary knowledge: Size,
strength, and computer adaptiveness. Language Learning, 54(3),
339-436.
Laufer, B., & Waldman, T. (2011). VerbǦnoun collocations in second
language writing: A corpus analysis of learners’ English. Language
Learning, 61(2), 647-672.
Meara, P., (1992). EFL vocabulary tests. ERIC Clearinghouse.
—. (2002). Review Article: The rediscovery of vocabulary. Second
Language Research, 18(4), 393-407.
Meara, P., & Buxton, B. (1987). An alternative to multiple choice
vocabulary tests. Language Testing, 4(2), 142–154.
Meara, P., & Wolter, B. (2004). V_links: Beyond vocabulary depth.
Angles on the English Speaking World, 4, 85-96.
Nation, I. S. P. (1983). Testing and teaching vocabulary. Guidelines, 5(1),
12-25.
—. (2001). Learning vocabulary in another language. New York:
Cambridge University Press.
—. (2006). How large a vocabulary is needed for reading and listening?
Canadian Modern Language Review, 63(1), 59-81.
Ng, C. H., & Renshaw, P. D. (2009). Reforming learning: Concepts,
issues and practice in the Asia-Pacific region: An introduction. The
Netherlands: Springer.
Pellicer-Sanchez, A., & Schmitt, N. (2012). Scoring yes-no vocabulary
tests: Reaction time vs. nonword approaches. Language Testing, 29(4),
489-509.
Read, J. (1993). The development of a new measure of L2 vocabulary
knowledge. Language Testing, 10(3), 355-371.
116 Chapter Five
—. (2000). Assessing vocabulary. Cambridge: Cambridge University

Press.
—. (2004). Plumbing the depths: How should the construct of vocabulary
knowledge be defined? In P. Bogaards & B. Laufer (Eds.), Vocabulary
in a second language: Selection, acquisition and testing (pp. 209-227).
Amsterdam: John Benjamins.
—. (2007). Second language vocabulary assessment: Current practices and
new directions. International Journal of English Studies, 7(2),
105-125.
Read, J., & Chapelle, C. (2001). A framework for second language
vocabulary assessment. Language Testing, 18(1), 3-32.
Schmitt, N. (2014). Size and depth of vocabulary knowledge: What the
research shows. Language Learning, 64(4), 913-951.
Shohamy, E., Donitsa-Scmidt, S., & Ferman, I. (1996). Test impact revisited:
Washback effect over time. Language Testing, 13(3), 298-317.
Simpson, R., & Mendis, D. (2003). A CorpusǦbased study of idioms in
academic speech. TESOL Quarterly, 37(3), 419-441. doi:
10.2307/3588398.
Sonbul, S., & Schmitt, N. (2013). Explicit and implicit lexical knowledge:
Acquisition of collocations under different input conditions. Language
Learning, 63(1), 121-159.
Stubbe, R. (2012). Do pseudoword false alarm rates and overestimation
rates in Yes/No vocabulary tests change with Japanese students’
English ability level? Langauge Testing, 29(4), 471-488.
Tannenbaum, K. R. (2006). Relationships between word knowledge and
reading comprehension in third-grade children. Scientific Studies of
Reading, 10(4), 381-398.
Wang, L-Y. (2015). A study of the national high school English wordlist
in Taiwan. In C. Gitsaki, M. Gobert, & H. Demirci (Eds.), Current
issues in reading, writing and visual literacy: Research and practice
(pp. 134-156). Newcastle, UK: Cambridge Scholars Publishing.
West, M. (1953). A general service list of English words. London:
Longman, Green company.
Wilkins, D. (1972). Linguistics in language teaching. London: Arnold.
ISSUES IN THE CREATION OF ASSESSMENT
AND EVALUATION TOOLS
CHAPTER SIX
ASSESSMENT FOR LEARNING;

ASSESSMENT FOR AUTONOMY
MARIA GIOVANNA TASSINARI
Abstract
It is generally accepted that autonomy is a matter of degree, or degrees,
which fluctuate and therefore it could be assumed that the way in which
language teaching and learning is approached can make a significant
difference to the degree of autonomy and consequently the degree of
autonomy may influence language learning. It is also well-documented
that assessment plays an influential role in learning and that, like
autonomy, assessment is also a matter of degrees: the greater the degree of
involvement of the learner in the assessment process, the greater the
degree of autonomy that can be achieved. Although assessment in
language learning and teaching contexts is usually intended as assessment
of the language competencies, it is the intention of this chapter to show
that assessment of learning competencies and of competencies for
autonomy should play a role in a curriculum aiming at fostering learner
autonomy and reflexive learning. The research project reported in this
chapter was conducted in a German higher education context and involved
the development and use of a dynamic model of autonomy. Once the
nature of autonomy had been examined and the views of theorists and
practitioners in the field had been taken into account, dimensions of
autonomy and their sub-elements were integrated within a dynamic model
for initiating and continuing pedagogic dialogue between students and
their teachers/advisers at the Freie Universität Berlin. The model of
autonomy proved to be reliable, provided a clear picture to learners and
their teachers/advisers, and showed its potential to be used iteratively.
Being online, the model is available for anyone to use and its value lies in
the formulation of a profile for learners, which helps them understand their
Assessment for Learning; Assessment for Autonomy 119
learning and their degree of autonomy, builds self-awareness and enhances

their metacognitive skills.
Introduction
Assessment plays a central role in language teaching and learning. In
institutional forms of assessment, such as tests or certifications, we can
often observe that entire curricula and syllabi are directed to making
learners able to ‘pass the test’ (Prodromou, 1995). In teaching and learning
settings aiming at the development of learner autonomy, the capacity of
the learner to assess/evaluate their own progress and their learning process
is a pivot in the development of learner autonomy (Dam, 1995).
Empowering learners to assess their own language competencies and their
learning process is therefore one of the main challenges for language
educators. This can be done while putting in place formative assessment
modes, in which learners play an active role through self- or peer-
assessment. These forms of assessment are referred to in the literature as
assessment for learning and assessment as learning (Boud, 2000; Colbert
& Cummings, 2014).
Although assessment in second language education is mostly intended
as assessment of language competencies, in the literature on learner
autonomy researchers have started to investigate, besides assessment
modes of language competencies, also forms of assessing the learner’s
disposition and capacity for autonomy, i.e., self-directing their own
learning (Sheerin, 1997). This chapter describes an approach towards
assessment of/for autonomy developed in a German higher education (HE)
setting, at the Freie Universität, Berlin (FUB). The project was undertaken
as part of the author’s doctoral studies, with the aim of encouraging and
promoting the autonomy of the language learners involved, young adult
language learners in HE, using tools which, combined with advising,
aimed to increase learner awareness and to explore the possibility of
greater learner empowerment in the language learning process.
The study was conducted while the author was setting up a self-access
centre, the Centre for Independent Language Learning (CILL), at the FUB.
Recognizing that an essential element of the self-access centre was to
provide learners and learning facilitators, such as teachers and advisers,
with various forms of support for developing learner autonomy, the study
was undertaken with the aim of (i) operationalizing the notion of learner
autonomy on the basis of a critical analysis of existing definitions and
descriptions, and (ii) developing a tool for supporting the learners’
awareness and reflection on their learning competencies.
120 Chapter Six
After describing the theoretical background of the study, the

exploration which led to the development of the dynamic model of
autonomy, and the different stages of its development, this chapter
illustrates the investigation conducted with learners and teachers at the
Language Centre of the FUB, and its results, in particular the learners’
feedback, as a way to deal with the crucial questions concerning
assessment of and for autonomy. Afterwards, a brief comparison between
the dynamic model of learner autonomy and other existing models for
assessing or measuring learner autonomy is made and finally conclusions
are drawn.
Operationalizing Learner Autonomy

The dynamic model of learner autonomy, which is the focus of this
chapter, is based upon extensive research and a critical analysis of several
definitions and descriptions of learner autonomy, which elicited core
aspects, components, contextual aspects and possible degrees of learner
autonomy (see Benson, 2001; Dickinson, 1987; Holec, 1981; Little, 1991;
Littlewood, 1996, 1999; Martinez, 2005; Oxford, 2003).
The first step towards the model was formulating a definition of
learner autonomy focusing on learners’ general competencies in different
learning contexts and situations. As a result of this research, learner
autonomy was defined as the metacapacity, i.e., the second order capacity,
of the learner to take control of their learning process to different extents
and in different ways according to the learning situation. Learner
autonomy is a complex construct, a construct of constructs, entailing
various dimensions and components. Essential components of learner
autonomy are:
x a cognitive and metacognitive component (cognitive and

metacognitive knowledge, awareness, learners’ beliefs);
x an affective and a motivational component (feelings, emotions,
willingness, motivation);
x an action-oriented component (skills, learning behaviours,
decisions);
x a social component (learning and negotiating learning with
partners, advisers, teachers).
Learner autonomy is inasmuch a metacapacity, as essential to learner

autonomy is the learner’s capacity to activate an interaction and a balance
among these dimensions in different learning contexts and situations.
Based upon this definition and the descriptions of characteristics of

autonomous learners, of learners’ attitudes, behaviours and strategies (see,
among others, Breen & Mann, 1997; Candy, 1991; Oxford, 1990), a set of
descriptors of attitudes towards learning, of language learning competencies
and behaviours, for each component of learner autonomy, was developed.
The components were then put into relation with each other through/in the
dynamic model.
Within an explorative-interpretative research approach, the first
versions of the dynamic model and the descriptors were discussed in
workshops with experts, first at the Centre de Recherches et d’Applications
Pédagogiques en Langues (CRAPEL), at the Université de Lorraine, and
then at the Language Centre of the FUB; subsequently, both dynamic
model and descriptors were further developed according to the results of
the validation. The first versions of the dynamic model and the descriptors
were in German and French. The validated German version was tested
with students at the CILL and students and teachers of a language module
at the Language Centre, and then translated into English.
The research approach combined therefore a theoretical-analytical
approach which comprised the analysis of definitions and descriptions in
the literature, the development of a definition of learner autonomy and of
the dynamic model, with an empirical, explorative-interpretative approach,
constituted by the validation of the dynamic model by experts and by the
testing of the model conducted with students and teachers.
The Dynamic Model of Learner Autonomy

The dynamic model of learner autonomy (available at http://www.
sprachenzentrum.fu-berlin.de/slz/index.html), as shown in Figure 6-1,
takes the form of a sphere and entails the dimensions previously identified
as being characteristic of learner autonomy: an action-oriented dimension,
a cognitive and metacognitive dimension and an affective and
motivational dimension. A social component is integrated into each of
these dimensions. Distinctions between these dimensions remain on the
theoretical level since they are all closely intertwined; however, on the
practical level, having these distinctions is useful to both learners and
teachers or advisers, so that they can better reflect on competencies, skills
and strategies and use opportunities that present themselves in the
language learning context to improve and enhance them.
Each component of the dynamic model has a set of descriptors, with
specific statements about competencies, skills and behaviours, formulated
as can-do statements (for example, ‘I can organise a time and a place for
122 Chapter Six
my learning’; ‘I can set myself a task’; ‘I can recognize my strengths and

weaknesses as a learner and reflect on these’). Together, these descriptors
constitute a checklist covering all the main aspects of autonomous
language learning; however, they are not intended to be exhaustive or
normative, but rather serve as a spectrum of competencies which function
as a tool in raising learners’ awareness of autonomous language learning
processes in a higher education setting. The full list of descriptors can be
seen online.
Figure 6-1: The dynamic model of learner autonomy (Source: Tassinari, 2010, p.
203).
This model is structurally and functionally dynamic. It is structurally

dynamic, because each component is directly related to all the others (as
shown by the arrows in Figure 6-1). It is functionally dynamic, because
learners can decide to enter the model from any component and move
freely from one component to another without following a given path,
according to their needs and purposes. For example, they can start with
‘planning’ if they would like to focus on this aspect of the learning
process, and then move to ‘evaluating’, or to ‘motivating myself’, or to

any other component they want to reflect on. This dynamic feature is an
essential characteristic of the model, and makes it possible both to account
for the complexity of learner autonomy and to operationalize it, breaking it
down in smaller portions. On the online version of the dynamic model, the
interrelationships among the components and the descriptors are
represented by hyper-textual links.
Assessment and Autonomy

Learner autonomy, as many researchers and practitioners agree, is a
matter of degree or stages within a continuum, namely the autonomy-
heteronomy continuum (Everhard, 2015). The variety of forms learner
autonomy can take in disparate contexts and situations calls, on the one
hand, for a deeper analysis and differentiation of this complex capacity,
which can be undertaken, among others, through specific assessment tools,
in order to better investigate the construct and to help researchers and
practitioners to better scaffold its development. On the other hand,
assessing how autonomous learners are, especially in institutional
contexts, expose us to the risk of “unnecessarily inserting autonomy into
the regimes of accountability and assessment that dominate our
professional lives” (Benson, 2015, p. xi). One of the crucial questions
related to the assessment of autonomy is: Why should we assess autonomy
(see also Lai, 2011, cited in Everhard, 2015, p. 29)? Moreover, the next
relevant question is: Who should conduct the assessment? The learner or
the teacher/adviser?
The aim of the assessment approach underlying the dynamic model is
to make learners (more) aware of their potential as autonomous learners,
and to allow them to be initiators of and responsible for the (self-)
assessment process. The results of the self-assessment should not be
considered from the perspective of assessment of learning, or of
accountability; on the contrary, it should be seen as a chance for reflection
in and on the learning process, as an assessment for learning.
Self-Assessment with the Dynamic Model of Learner Autonomy

A prerequisite for the self-assessment process is the learner willingness
to assess their own learning competencies. Therefore, the use of the
dynamic model is voluntary, rather than imposed. The self-assessment
process can be conducted either in language advising or in classroom
settings and entails four steps: a) Getting started; b) Choosing components
124 Chapter Six
and descriptors; c) Assessing one’s own competencies; and d) Comparing

perspectives. According to the setting, these steps may slightly differ. In
the first stage, information is elicited about learners’ previous experience
of and beliefs on autonomous language learning. In advising settings, this
reflection can be conducted by the learner alone, using the questions in the
‘Getting started’ section, or with the language adviser during the advising
session. In classroom settings, the reflection can be conducted with a
partner or in small groups and eventually discussed in plenary. This
process of reflection can be very useful, since learners’ perceptions,
beliefs and previous experiences may strongly influence their attitude
towards (autonomous) language learning, their decision-making, and,
ultimately, the learning process itself (see Barcelos & Kalaya, 2011;
Cotterall, 1995, 1999; Navarro & Thornton, 2011).
In the second stage (i.e., choosing components and descriptors), it is
important that learners actually select the aspects of the learning process
that they would like to reflect upon. Whereas in classroom settings the
choice of components can be linked to specific tasks, such as
implementing an individual learning plan, or choosing resources and tasks
for individual objectives, in self-directed learning and language advising
settings learners should select themselves the focus of their self-
assessment according to their needs. For each component, learners can
also choose which descriptors they believe are relevant to their particular
language learning process.
The third step is self-assessment, where learners can choose between
three answers for each descriptor, which are: ‘I can’, ‘I want to learn this’
and ‘This isn’t important for me’. In advising settings, this step can be
tackled when the student is alone, rather than in the advising session, so
that they are free of pressure regarding time and they can answer the
questions within an environment in which they feel comfortable. In
classroom settings, the self-assessment can be done individually or with a
partner. For learners who are not experienced in self-assessment, it may be
useful to exchange views and experiences with a partner.
The final stage (i.e., comparing perspectives) involves discussing the
outcome of the self-assessment either with the adviser or with the teacher.
This is a key element in the evaluation process, involving a pedagogical
dialogue where the adviser/teacher and learner reflect together in order to
compare their perspectives on the learner’s competencies and on the
learning process. This pedagogical dialogue is the core of the assessment
process and crucial for the development of learner autonomy (Little,
1995). In language advising settings, the process followed in the dialogue
is that suggested by Kelly (1996). Firstly, the adviser listens to the learner
(active listening). The adviser then asks questions seeking further details
and clarification, reformulates the learner’s statements, summing up the
information elicited. Finally, the adviser aims to focus on what the
learner’s next priorities are and asks for their next steps.
This style of pedagogical dialogue is useful in that learners are not left
alone to cope with self-assessment, in which they may have the tendency
to be too kind to themselves or too strict. Without specific criteria with
which to judge, learners might have difficulty in assessing their
competencies in various situations. Having the descriptors allows the inner
perspective of the language learner to interact and be compared with an
external perspective on autonomous language learning. Most importantly,
the dialogue with the adviser/teacher or peers is real and, as such, it has
the potential to unleash powerful and meaningful interaction, where
internalized understandings can be brought to the surface and become
externalised. The benefits to learners are that they are enabled to reflect
deeply, without constraints. They can initiate the topics for discussion and
by doing so they gain insights into their own attitudes and competencies
and establish the basis for decision-making. This capacity for reflection
and consequent action is both the aim and the outcome of the evaluative
reflection process.
The descriptors in the model are not provided with a numeric answer
system, because giving a numeric score to the different answers would
imply a hierarchy among the components and the descriptors and would
severely compromise the learners’ ability to freely choose the components
and the descriptors upon which they would like to reflect. Furthermore, a
scored test is not advisable from a pedagogical point of view, since it
could give learners the false impression that there is a full score to reach.
Therefore, the assessment is merely formative and qualitative, resulting in
enhancing metacognitive processes and, most importantly, it can be
repeated, when and as needed, with the learner modifying and changing
the focus, as required. All of these features contribute to making the model
dynamic.
Testing the Dynamic Model:

The Empirical Research Procedure
Whereas the first part of the research study aimed at developing the
dynamic model and the descriptors and was based, as mentioned above,
upon a critical analysis of the literature and a validation with experts in
discussions and workshops, the second part of the research, which
followed the creation of the dynamic model, aimed at testing the dynamic
126 Chapter Six
model of autonomy and its descriptors with learners and teachers, to

ensure the model was a comprehensible and a useful support tool in the
development of their autonomy. The research was conducted over a period
of nine months both in language classroom settings (with 15 student
participants and 2 teachers) and in the CILL (with 6 student participants).
The language course followed in the classroom setting focused both on
promoting language competencies (French) and competencies specific to
learner autonomy (see Denorme, 2006 for a description of the approach
used). Additionally, students had access to the CILL, which is part of the
Language Centre at the University. The CILL offers materials in 30
languages and different forms of support for autonomous language
learning such as a language advising service, workshops and study guides.
The use of the CILL is optional and language learning in the CILL can be
undertaken as individual language learning, as a language learning
partnership (tandem learning) or within tutorials. Sessions with advisers
are open to all, can be attended once or more and aim to encourage
reflection on learning, support in decision-making, offer help in choosing
tasks and learning materials, or in evaluating progress and/or the learning
process.
Of the 21 student participants, 14 were native German speakers and
seven others were variously speakers of Italian, Chinese, Hungarian,
Turkish and Farsi. Eight of the participants were enrolled in courses for
language specialists, while the remaining 13 were attending non-specialist
language courses. The languages the participants were learning were
French (17), Spanish (1), German (2) and English (1). The learners who
were following a CILL mode of learning did so for various reasons, which
included the undertaking of remedial language work beyond the
classroom, following an individual learning plan, learning in tandem or
working on a project.
The approach taken to gathering data was qualitative, making use of:
a) preliminary questionnaires and interviews with learners and

teachers in order to determine their understandings of autonomy;
b) self-assessment using the dynamic model and descriptors (which, at
the time of the research, was conducted in paper format);
c) follow-up feedback questionnaires and interviews related to the
self-assessment; and
d) discussion of the learner profile which emerged from learner self-
assessment, through an advising session.
In the case of the CILL learners, their understandings of learner

autonomy were gleaned through individual interviews, while in the case of
the classroom learners, the same information was gleaned through
questionnaires and follow-up discussions. Learners were then invited to
complete the self-assessment, choosing the components and descriptors
which they wished to reflect upon, without the need to work through all of
them. The CILL learners completed the self-assessment on their own,
whereas the classroom learners of the language module did the same,
individually, during one session in the classroom. Feedback on the self-
assessment was gathered through questionnaires and interviews.
Two teachers participated in the research study, the focus of their
interviews being their feedback on the usefulness of the dynamic model
and the descriptors in an autonomy-oriented language module. Similarly,
the teachers were first interviewed about their understanding of learner
autonomy, and then they were asked to give their feedback on the self-
assessment with the dynamic model in the classroom setting.

Learner opinions on what constitutes learner autonomy were quite
diverse. Some regarded it as simply ‘learning without a teacher’, others as
‘self-aware learning’, while others saw autonomy as the ability to choose
among tasks, having the ability to correct oneself, choosing the pace and
rate of learning and, above all, being able to self-assess their own learning.
According to the learners’ answers, discipline, motivation, time
management, appropriate resources and learning environments play a role
in successful autonomous learning.
Of the 21 participants, 20 participants chose between two and eight
components for self-assessment, while only one student chose to assess
herself in all components. The choice of components and descriptors for
self-assessment shows that all areas are almost equally represented, with a
slight preference for ‘motivating oneself’ and ‘dealing with my feelings’
(chosen by 16 students), whereas ‘monitoring’ and ‘evaluating’ were less
focused on (chosen by 12 and 9 students respectively). Asked to give a
reason of their choice, some said they chose components in accordance
with the areas they felt were troublesome, while others chose areas in
which they felt confident and able to manage well. Some simply followed
the steps as they were presented, while other students, mostly the students
enrolled in the language module, gave no particular reason for their
choices (see Table 6-1).
128 Chapter Six
Table 6-1: Reasons for choice of components (more than one answer
possible).
Reason for Choice of Components Total

It is problematic, difficult for me 7
I followed the sequence of learning 3
It is relevant for autonomous language learning 2
It is easy for me, something I can do 2
I am interested in it 1
No answer given 11
What was encouraging was that all but one of the 21 participants gave
positive feedback concerning the self-assessment. The majority found the
model useful and thought that the self-assessment process gave them the
impetus for self-reflection, increased awareness of their learning processes
and helped them set further goals to improve their language learning.
Learners felt that the model also made them become more conscious of the
choices and opportunities open to them and more competent in making
decisions with regard to their future learning (see Table 6-2). Such decisions
might include choosing to undertake new learning tasks, trying new
strategies, joining a learning group or finding a tandem partner. Learners
might also decide to leave a course or change courses in order to meet their
more specific learning needs. Difficulties that the learners identified related
to autonomous learning were variations in motivation levels, the challenges
involved in self-assessment of their language skills, and their ability to select
suitable materials, to plan and to manage their time efficiently.
The results of the investigation showed that the dynamic model is a
valid tool which supports evaluation, raises awareness, reflection and
decision-making. Through reflection on skills and competencies, learners
were brought to a state of greater awareness, they could identify their
strengths and shortcomings and recognise areas in which they needed
support. This contributed to improving the learning process and to greater
regulation by learners. The following comments by two of the students
illustrate these points:
“I have learned that I have a problem with managing: I always learn, but
before [the self-assessment] I wasn’t aware of this problem. I start
learning, then I get side-tracked and I don’t make progress.” (Student 19)
“[The descriptors] allow you to understand which things you prioritize

when […] learning a language autonomously, and allow you to understand
how many opportunities you can exploit and which opportunities you do
actually exploit and which you do not, what can be improved and, as for
many things, if I have only my own point of view, maybe I am only able to
see certain things. […] It’s a test that, since it has no grade, one can do it
freely and it allows you to realize your own pros and contras.” (Student 4)
Table 6-2: Effects of the self-assessment on participants (more than

one answer possible).
Effect of the Self-Assessment Using the Dynamic Model Total

Stimulates reflection 8
Gives an overview about different methods for language 6
learning
Enhances decision-making 4
Helps awareness 3
Enhances awareness of priorities 3
It is stimulating, motivating 2
It is interesting 1
Helps awareness of issues with autonomous learning in 1
institutional context
It is frustrating 1
I think I can do it quite well 1
No answer 1
The only negative feedback on the self-assessment came from a learner

at the CILL, who had been learning English in self-access mode for a long
time (i.e., he had been working very meticulously through various video
courses during several months). He found the self-assessment with the
checklist frustrating, difficult to understand and too time consuming.
Interestingly, the preliminary interview with him showed that although he
was working in a self-access mode, he was not interested in developing
learner autonomy. On the contrary, he defined himself as dependent on the
materials with which he was learning. His feedback is of great value for
this research study, since it shows that self-assessment of learner
autonomy may be useful only in contexts in which learner autonomy is an
educational goal, otherwise it may be useless or even counterproductive.
As stated by some of the participants, the descriptors offer learners
another perspective on autonomous language learning. Reading the
descriptors and reflecting on the competencies and/or behaviours they
describe provide learners with another perspective, another point of view
on their own learning, stimulating self-reflection and self-assessment.
Comparing perspectives is crucial to the self-assessment and evaluation
process. One of the key steps of this self-assessment process is precisely
130 Chapter Six
the pedagogical dialogue in which learner and adviser, or learner and

teacher share and discuss their own perspectives. This dialogue is valuable
for both learners and advisers or teachers: for advisers and teachers, it
provides insights regarding the competencies or skills with which learners
need further guidance or support. For learners, it is an opportunity to
reflect on their own language learning, to externalise their own beliefs and
attitudes towards learning, to review or confirm their own assessment, or
to address questions, if need be. At the end of this process, learners should
be able to make more informed decisions about their further learning.
However, since learners may not be used to this reflection, it is the
duty of the adviser and/or teacher to choose settings and pedagogic
practices which enhance reflection and which always take into account the
needs and attitudes of the learners. The possible “learner resistance”, as
Cotterall and Malcolm (2015) put it, may be
“one of the greatest hurdles to implementing new assessment procedures.

Introducing more innovative assessment practices demands transformative,
personal change […]; willingness to participate […]; an inclination to
adopt an unfamiliar role […]; and the investment of time for no immediate
reward […]. Overcoming such resistance requires mechanisms for
engaging learners in reflecting on and articulating their beliefs about their
and their teacher’s role in language learning, and the means and purposes
of assessment.” (p. 171)
Therefore, the adviser and/or teacher also has to maintain a careful

balance between the focus on language itself and the focus on learning
processes and learning competencies, since learners may not always be
concerned about the latter.
Comparison between the Dynamic Model and other

Approaches to Assessment or Measurement of Autonomy
During the development and testing of the dynamic model of learner

autonomy, other researchers were investigating similar research questions,
identifying dimensions of learner autonomy and developing instruments
for describing, assessing or measuring autonomy (Cooker, 2012; Dixon,
2011; Murase, 2010) (see Table 6-3). This indicates the relevance and in
some way the urgency for inquiry into these fields.
Table 6-3: Comparison of studies on learner autonomy.
Aims Operationalizing learner autonomy; Developing a tool for

self-assessment and development of learner autonomy
Cooker, 2012
Dimensions Learner control; Metacognitive awareness; Critical

reflection; Motivation; Learning range; Confidence;
Information literacy
Method Q-methodology
Instrument Formative (self-) assessment tool: a learner generated
instrument, potentially unlimited; Languages: English
Use/Setting Self-access learning
Aims Developing a quantitative instrument for measuring learner
autonomy; Comparing results of the quantitative instrument
with teachers’ evaluation; Helping teachers to help learners
to develop learner autonomy
Dimensions Autonomy is a multidimensional concept; Autonomy is
Dixon, 2011
variable; Autonomy is a capacity; Autonomy is

demonstrated; Autonomy requires metacognition; Autonomy
involves responsibility; Autonomy involves motivation;
Autonomy involves social interaction; Autonomy is political
Method Critical reflexive mixed methods: first exploratory and then
quantitative
Instrument Questionnaire: Long List (256 items) and Short List (50
items); Languages: English and Chinese
Use/Setting Classroom learning, self-access learning
Aims Operationalizing learner autonomy; Developing an
instrument for measuring learner autonomy
Dimensions Technical (behavioural, situational); Psychological
Murase, 2010
(motivational, metacognitive, affective); Political-

philosophical (group/individual, freedom); Socio-cultural
autonomy (social-interactive, cultural)
Method Quantitative
Instrument Measuring Instrument for Language Learner Autonomy
(MILLA) (113 items); Languages: Japanese and English
Use/Setting Classroom learning
Aims Operationalizing learner autonomy; Developing an
instrument for reflection, self-assessment & learning support
Tassinari, 2010
Dimensions Cognitive and metacognitive; Motivational; Affective;

Action-oriented; Social
Method Exploratory-interpretative, qualitative
Instrument Dynamic model with descriptors (133 descriptors in total);
Languages: German and English
Use/Setting Self-access learning, language advising, classroom learning
132 Chapter Six
These research approaches have some aspects in common, but also

differ in some others. The primary aim of the research studies is describing
and operationalizing the construct of learner autonomy, but with different
scopes and forms: Cooker (2012) and Tassinari (2010) created a tool for
self-assessment, reflection and the development of learner autonomy,
whereas Dixon (2011) and Murase (2010) developed a quantitative
instrument to measure learner autonomy. The formative self-assessment
tool developed by Cooker is a learner generated instrument, based on
statements and learner profiles; the dynamic model, as described above, is
a self-assessment tool for learners which can be used in an adaptive and
recursive way; whereas both Dixon and Murase developed a questionnaire
with a Likert-scale.
Based upon a critical analysis of the literature, all these researchers
define autonomy as a multidimensional construct, whose main dimensions
are cognitive, metacognitive, motivational, affective and social. Although
slight differences can be observed in the perspective and in the focus on
some dimensions (e.g., Murase (2010) and Dixon (2011) stress the
political dimension; Cooker (2012) and Tassinari (2010) underline more
the aspects of control and critical reflection), many similarities exist in the
statements describing the learner’s competencies, attitudes, beliefs and
behaviours.
More relevant for the purpose of the present chapter are the differences
in the research methodology used in order to develop and validate the
instruments. The methods range from a merely quantitative approach
(Murase, 2010) to a prevailing qualitative one (Tassinari, 2010).
Concentrating on the learner perspective, Cooker (2012) makes use of
techniques and procedures of the Q-methodology, which allows for a
systematic understanding of subjectivity and of the participants’ point of
view about the subject of investigation. Finally, Dixon (2011) chose a
mixed methods approach with critical reflexive methods, starting with an
exploratory phase in order to explore the viability and appropriateness of a
quantitative approach for his research purpose, and continuing with
quantitative methods in order to develop and validate his questionnaire.
Among the findings of the research, I would like to point out the
conclusion in Dixon’s (2011) study. After having tested his questionnaires
(a Long List, with 256 items and a Short List, with 50 items) and
compared the results of the measurement with the questionnaires and the
teachers’ evaluation of the learners’ autonomy, he concluded that the
closed-item questionnaire he developed “cannot be claimed to be a
measure of autonomy; rather, the data provided by the questionnaire need
to be viewed “in context and in consultation with the learner” (Dixon,
2011, pp. 313-314). In other words, according to Dixon, autonomous

learning cannot be measured in the abstract, but it has to be considered as
situated in a real context; and it cannot be measured only through
questionnaires or statements, but the learners’ voice has to be listened to.
Conclusion
The aims of the research illustrated in this chapter were to create a
dynamic model of learner autonomy which would offer a comprehensive
description of language learning competencies, skills, attitudes and
behaviours, to be used to support reflection in autonomous language
learning processes. The self-assessment proposed by the dynamic model
offers a learner-centred, dynamic and recursive approach which involves
the collaboration of learners and advisers or teachers within a pedagogical
dialogue and can be renegotiated according to the changing focus of the
learning context and/or situation.
The research participants were positive about their use of the dynamic
model, stressing that the self-assessment stimulated their awareness and
their reflection on their learning competencies, and helped them recognize
strengths and/or issues in their learning process. Out of this reflection, they
could better focus on priorities and make decisions for their further
learning. With the use of the dynamic model, learners reached greater
awareness of themselves as autonomous learners, through the processes of
critical thinking and evaluation, which encouraged metacognitive
development.
However, self-assessment, both of language and of learning competencies,
can be very difficult for learners who are not used to it. The descriptors in
the model provide learners with criteria for conducting the self-
assessment. In addition, the pedagogical dialogue with advisers and/or
teachers gives learners the opportunity to compare their own perspective
with an external perspective, and therefore to enhance their self-awareness
and critical reflection, which are key aspects of learner autonomy. Thus,
the pedagogical dialogue aims at encouraging learners to reflect and take
on a more agentic role than they might previously have been accustomed
to in their language learning.
Since learners (and maybe even teachers) may be reluctant to engage
in such innovative practices of self-assessment, it is necessary to integrate
self-assessment in learning and teaching settings that foster autonomy and
to support it through reflection on learners’ and teachers’ roles and beliefs.
Due to the complex, developmental and fluctuating nature of learner
autonomy, a qualitative research approach, such as the one adopted in this
134 Chapter Six
inquiry, seems to be appropriate for investigating questions related to the

assessment of learner autonomy. In the same way, as learners’ reflection
enhances understanding of the learning process, critical, reflexive, and
exploratory research methods help the researcher both gain better insight
into the research object and be open-minded and without preconceptions
regarding the research findings.
From the results of the present investigation some questions arise
which could be explored in further research. One area worthy of further
investigation concerns the effects of self-assessment of learner autonomy
and of awareness-rising intervention in a longitudinal study, in order to see
what role they play in the development of learner autonomy and/or
learning competencies and if they influence changes in learners’ learning
behaviours and habits. Another research field could focus on forms and
techniques of the pedagogical dialogue both in the classroom and language
advising setting in order to provide practitioners with appropriate tools and
a wide repertoire of skills for fostering their own and their learners’
development.
References
Barcelos, A., & Kalaya, P. (2011). Introduction to beliefs about SLA
revisited. System, 39(3), 281-289.
Benson, P. (2001). Teaching and researching autonomy in language
learning. Harlow, Essex, UK: Pearson Education.
—. (2015). Foreword. In C. J. Everhard & L. Murphy (Eds.), Assessment
and autonomy in language learning (pp. viii–xi). Basingstoke, UK:
Palgrave Macmillan.
Boud, D. (2000). Sustainable assessment: Rethinking assessment for the
learning society. Studies in Continuing Education, 22(2), 151-167.
Breen, M. P., & Mann, S. J. (1997). Shooting arrows at the sun:
Perspectives on a pedagogy for autonomy. In P. Benson & P. Voller
(Eds.), Autonomy & independence in language learning (pp. 132-149).
London: Longman.
Candy, P. C. (1991). Self-direction for lifelong learning. San Francisco:
Jossey Bass.
Colbert, P., & Cummings, J. J. (2014). Enabling all students to learn
through assessment. In C. Wyatt-Smith, V. Klenowski & P. Colbert
(Eds.), Designing assessment for quality learning (Vol. 1, pp. 211-
231). Heidelberg, Germany: Springer.
Cooker, L. (2012). Formative (self-)assessment as autonomous language
learning. Doctoral thesis, University of Nottingham, UK.
Cotterall, S. (1995). Readiness for autonomy: Investigating learner beliefs.

.
System, 23(2), 195-205.

—. (1999). Key variables in language learning: What do learners believe
about them? System, 27(4), 493-513.
Cotterall, S., & Malcolm, D. (2015). Epilogue. In C. J. Everhard & L.
Murphy (Eds.), Assessment and autonomy in language learning (pp.
167-175). Basingstoke, UK: Palgrave Macmillan.
Dam, L. (1995). Learner autonomy, 3: From theory to practice. Dublin:
Authentik.
Denorme, M. (2006). Autoformation et diversification des publics.
Mémoire professionnel. Compte rendu d’expérience: l’élaboration
d’un curriculum dans le cadre d’un dispositif de langue autonomisant.
Université Charles-De-Gaulle, Lille 3.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge:
Dixon, D. (2011). Measuring language learner autonomy in tertiary-level
learners of English. Doctoral thesis, University of Warwick, UK.
Retrieved from: http://wrap.warwick.ac.uk/58287
Everhard, C. J. (2015). The assessment-autonomy relationship. In C. J.
Everhard & L. Murphy (Eds.), Assessment and autonomy in language
learning (pp. 8-34). Basingstoke, UK: Palgrave Macmillan.
Holec, H. (1981). Autonomy in foreign language learning. Oxford:
Pergamon.
Kelly, R. (1996). Language counselling for learner autonomy: The skilled
helper in self-access language learning. In R. Pemberton, E. S. L. Li,
W. W. F. Or & H. Pierson (Eds.), Taking control: Autonomy in
language learning (pp. 93-113). Hong Kong: Hong Kong University
Press.
Lai, J. (2011). The challenge of assessing learner autonomy analytically.
In C. J. Everhard & J. Mynard (Eds.), Autonomy in language learning:
Opening a can of worms (pp. 43-49). Canterbury, UK: IATEFL.
Little, D. (1991). Learner autonomy 1: Definitions, issues and problems.
Dublin: Authentik.
—. (1995). Learning as dialogue: The dependence of learner autonomy on
teacher autonomy. System, 23(2), 175-181.
Littlewood, W. (1996). Autonomy: An anatomy and a framework. System,
24(4), 427-435.
—. (1999). Defining and developing autonomy in East Asian contexts.
Applied Linguistics, 20(1), 71-94.
136 Chapter Six
Martinez, H. (2005). Lernerautonomie: Ein konzeptuelles Rahmenmodell

für den Fremdsprachenunterricht...und für die Fremdsprachenforschung.
Fremdsprachen Lehren und Lernen, 34, 65-82.
Murase, F. (2010). Developing a new instrument for measuring learner
autonomy. Doctoral thesis, Macquarie University, Sydney, Australia.
Navarro, D., & Thornton, K. (2011). Investigating the relationship
between belief and action in self-directed language learning. System,
39(3), 290-301.
should know. Englewood Cliffs, NJ: Newbury House.
—. (2003). Toward a more systematic model of L2 learner autonomy. In
D. Palfreyman & R. C. Smith (Eds.), Learner autonomy across
cultures (pp. 75-91). Hampshire / New York: Palgrave Macmillan.
Prodromou, L. (1995). The backwash effect: From testing to teaching.
English Language Teaching Journal, 49(1), 13-25.
doi: 10.1093/elt/49.1.13
Sheerin, S. (1997). An exploration of the relationship between self-access
and independent learning. In P. Benson & P. Voller (Eds.), Autonomy
and independence in language learning (pp. 54-65). London:
Longman.
Tassinari, M.G. (2010). Autonomes Fremdsprachenlernen: Komponenten,
Kompetenzen, Strategien. Frankfurt am Main, Germany: Peter Lang.
CHAPTER SEVEN
CULTIVATING LEARNER AUTONOMY THROUGH
THE USE OF ENGLISH LEARNING PORTFOLIOS:
A LONGITUDINAL STUDY
BEILEI WANG
Abstract
The study reported in this chapter explored whether portfolio-based
assessment is effective in fostering learner autonomy through a
longitudinal study in a Chinese junior high school. A three-dimensional
learner autonomy scale was administered to both the experimental and
control groups. The questionnaire findings revealed that English Learning
Portfolios (ELP) were conducive to helping students gain learner
autonomy, which was further supported by the case study results. The
study also showed that the ELP template and development need constant
negotiation and adjustment in accordance with learner needs and
environmental constraints. Therefore, some implications and suggestions
are provided in this regard.
Introduction
Learner autonomy has been a popular research topic in the past thirty
years since Holec (1981) first used the word ‘autonomy’ in his report of
the Council of Europe’s Modern Language Project. In China, during the
past decade, learner autonomy, as a remedy for the conventional problem
of teacher-centeredness, already found its way to the National English
Curriculum Standards (NECS) (2001), a national English teaching
syllabus for primary and secondary English language education in China.
To help learners assume a more active role in English language
learning, formative assessment has been suggested by the NECS, in
addition to summative assessment, so that learners can assess their own
138 Chapter Seven
performance and that of their peers. In doing so, it is believed that learners
can attach more importance to the learning process than the learning
results as learning and assessment are reciprocally integrated (Little &
Erickson, 2015). In fact, the issue of test-oriented, time-consuming but
ineffective English language teaching in China has received a lot of
criticism since the late 1990s (Dai & Hu, 2009). However, research by Shu
(2004) has shown that most English teachers in secondary schools were
unaware of the exact requirements or suggestions put forth in 2001 by the
NECS. In practice, teachers in many schools still immersed their students
in exercises and discrete point quizzes and tests, activities which were not
designed to improve student language competence. Because of the
high-stakes nature of the senior high school entrance examinations, tests
were still considered to be the most powerful measure of students’
performance and teachers’ teaching abilities.
The issue of over-emphasis on teaching and formal assessment is also
evident in a core language journal published in Chinese, namely, Foreign
Language Teaching in Schools. For example, even a decade after the
NECS was introduced, the themes of the published papers in the 2011
journal were classroom teaching, which constituted 51.4% of all journal
content, followed by high-stakes tests, which covered 13.8% of the articles.
Articles on formative assessment were conspicuously lacking.
To bridge the gap between societal needs, educational policy and
reality, researchers and scholars launched collaborative university-school
English language teaching (ELT) projects (Wang & Zhang, 2014). The
present study was but a part of a collaborative ELT project between a
junior high school and a foreign language university. The goal of the
three-year longitudinal study was to empower teachers and learners with
innovative development in course design, assessment and teacher training.
The present study focused on reform in assessment by integrating the
English learning portfolio (ELP) into the assessment system in an effort to
foster learner autonomy.
Background
Learner autonomy (LA), though considered “an elusive notion” (Bown,
2009, p. 572) and embodied in various terms (Dickinson, 1987; Sheerin,
1991; Wenden, 1991; White, 1999; Zimmerman & Schunk, 2001), is a
three-dimensional concept in the present study: metacognitive, affective
and social. Metacognition is accepted as a key element in LA as learners
are supposed to take charge of their own learning (Holec, 1981) and “take
Cultivating Learner Autonomy: A Longitudinal Study 139
ownership (partial or total) of many processes which have traditionally

belonged to the teacher” (Littlewood, 1999, p. 71). That is, learners
assume responsibilities of self-planning, self-management and
self-reflection. The affective dimension was also recognized by Cotterall
(1995), Benson (2005), and Confessore and Park (2004), and it addresses
students’ willingness and readiness to learn English. In addition, the social
dimension, whereby learners cannot learn without social interaction, was
supported by Little (2005) and Benson (2005).
Moreover, LA is not a steady state achieved by learners, but it is
progressive and changeable. Nunan (1997) suggested that learners grow
through five stages, while Littlewood (1999) proposed two levels of
autonomy: proactive autonomy and reactive autonomy. Littlewood’s two
levels of autonomy were adopted for the present study. His view was that
learners might move back and forth along the continuum due to the
influence of contextual factors. The main purpose of the current study was
to cultivate leaners’ reactive autonomy and help them move towards
proactive autonomy.
Portfolios, originally an alternative to assessment (Banfi, 2003;
Cummins & Davesne, 2009; Nunes, 2004), began to be used as a way to
promote language learner autonomy especially after the enactment of the
Common European Framework of Reference (CEFR) and the European
Language Portfolio. The language portfolio has been implemented at
different levels, ranging from tertiary institutions in Spain (González, 2009)
to primary schools (Little, 2009), fourth and fifth graders in Turkey
(Yilmaz & Akcan, 2011), and Spanish language classrooms in 23
American high schools (Moeller, Theiler & Wu, 2012). The portfolio
helped make invisible learning outcomes visible and helped students
develop a “metacognitive understanding of language” (Kohonen, 2006, p.
34). The most widely used portfolio was the European Language Portfolio,
the first draft of which was published in 1995 and revised in 2001, as part
of the European Year of Languages (Trim & Bailly, 2001). The European
Language Portfolio was comprised of three obligatory components: a
Language Passport, a Language Biography and a Dossier. The portfolio
was adapted into LinguaFolio, an American version, which was adopted in
the official project for the 2005 Year of Languages by the National
Council of State Supervisors of Foreign Languages in the USA (Moeller,
Theiler & Wu, 2012).
In China, there were limited studies on the use of portfolio and learner
autonomy. Xu’s (2007) book on learner autonomy at the tertiary level was
a comprehensive one, touching upon the portfolio but failing to cover the
140 Chapter Seven
relevant research about the portfolio and learner autonomy. Gong and Luo
(2002) introduced what portfolio assessment is and how to develop it in
schools by giving some examples and practical suggestions. However,
there was no systematic description about any empirical studies on this
issue. Rao (2006) carried out a 6-month study during which he integrated
the portfolio in his class instruction among university students and gained
positive feedback from students, but his study did not include a control
group. Lo (2010) recorded how she used a reflective portfolio to promote
learner autonomy in a journalism course. In her study, questionnaires were
administered to learn more about learners’ gains in journalism.
The longitudinal study reported in this chapter posited two research
questions:
1. Can portfolio assessment promote learner autonomy among

younger learners in China?
2. What are the learner perceptions of the English learning portfolio
(ELP)?
The first research question attempted to examine whether portfolio

assessment can be a possible solution to the existing English education
problem in China, by helping learners take an active role in the learning
process. The second research question further explored learners’
viewpoints about ELP and LA for the sake of further improvements on
ELP practice.
The Study
This study took place in a Chinese junior high school as part of a large
ELT project, in collaboration with a university research team, which was
made up of six doctoral candidates and their supervisor. The needs
analysis in the preparatory phase revealed that English teaching at that
junior high school was still teacher-dominant and test-oriented, though
teachers and students agreed on the importance of the communicative
function of English. The ELP was thus used as a means of assessment in
order to change the test-oriented teaching and learning and to promote
learner autonomy. A quasi experiment was conducted to see the
effectiveness of portfolio assessment in fostering learner autonomy over a
period of three years. There were two groups of students: the experimental
group (EG), who received the ELP assessment intervention, and the
control group (CG), who received the traditional assessment.
Questionnaires were administered to both the EG and CG. Furthermore, a

separate case study was conducted with six of the EG students.
Participants
The number of participants grew over the duration of the study as
every year new students were enrolled in the school. The newly-enrolled
students were assigned to twenty parallel classes according to their
performance on the placement test covering three subjects: Chinese, math
and English. Every year, two out of the twenty classes were randomly
selected to be the EG group.
In Year 1, the EG was composed of EG1-1 (Year 1, Class 1) and
EG1-2 (Year 1, Class 2), with a total of 109 students. Two English
teachers participated in the project on a voluntary basis. In the following
two years, another two classes were recruited into the EG. Similarly, the
CG also grew in number in the three consecutive years, Year 1, Year 2 and
Year 3. Table 7-1 displays the numbers of study participants over the three
years of data collection.
Table 7-1: Student participants in the study.
Participants Year 1 Year 2 Year 3

EG (7th graders) 109 111 114
CG (7th graders) 110 110 112
This study excluded data collected from the learners enrolled in Year 3
because of the drastic changes in the local educational policy and the
forthcoming entrance examinations, but the questionnaires were still
administered.
The participants for the case study were selected by stratified case
sampling (Duff, 2007). A total of 6 participants were selected according to
their language proficiency, gender, and learning goals: Alice from EG1-1,
Bob and Elian from EG1-2, Cathy and Jack from EG2-1 (Year 2, Class 1),
David from EG2-2 (Year 2, Class 2). Pseudonyms have been used in the
reporting of the data to protect the privacy and identity of the students. The
demographic information for the case study participants is displayed in
Table 7-2.
142 Chapter Seven
Table 7-2: Demographic data of case study participants.
Alice Bob Elina Cathy Jack David

Class EG1-1 EG1-2 EG1-2 EG2-1 EG2-1 EG2-2
Gender Female Male Female Female Male Male
Learning
a a, c a, b, c c c d
goals
English CEFR CEFR CEFR CEFR CEFR CEFR
proficiency A1-A2 A1-A2 A2 A1-A2 <A1 A1
Notes: a = to pass exams; b = to go abroad; c = to communicate; d = to learn
foreign cultures. A1-A2 = higher than A1, lower than A2; <A1 = lower than A1.
ELP Treatment
The use of the portfolio assessment in the school underwent four
phases: preparatory phase, phase 1, phase 2 and phase 3 (see Table 7-3). In
phase 1 and phase 2, the ELP was a must for the EG students; in phase 3,
the ELP was not required but voluntary, in order to mirror the progression
from reactive autonomy to proactive autonomy.
Table 7-3: Phases of treatment in the junior high school.
Preparatory Phase 1 Phase 2 Phase 3

Phase (Year 1) (Year 2) (Year 3)
EG ELP template ELP piloting ELP ELP
designing & & adjusting implementation implementation
negotiating (Obligatory ) (Obligatory) (Voluntary)
CG No portfolio requirement
Note: EG = experimental group; CG = control group.
The ELP template was adapted from the European Language Portfolio,
and it was composed of the learner profile, learning records and learning
materials, corresponding to biography, passport and dossier of the
European Language Portfolio respectively.
The initial portfolio template was proposed by the research team and
negotiated with the school teachers for the details. The learner profile
included the age of starting English learning, English learning goal(s),
learning style (O’Brien, 1990, cited in Joy, 2012), and the most
unforgettable English learning experience, which helped students set their

goals with the help of their teachers. The learning records covered learning
goals and plans, learning journals (reflections), and the self and peer
assessment as well as the parent feedback form to help monitor the
learning process. The learning materials were the learning outcomes,
including the materials collected or written out of class as well as the
exercise books and test papers.
To raise students’ metacognitive awareness, the ELP practice was
goal-oriented (see Figure 7-1). EG students set their learning goals
according to their self-assessment results by using the graded
self-assessment grids, made their learning monthly and weekly plans and
reflected on their learning to see to what extent their plans were realized.
The reflective learning journals were handed in to teachers on a weekly
basis.
In-class and out-of-class learning materials were collected and used as
learning outcomes to boost students’ confidence in learning. EG learners
were encouraged to read, listen or write after class and present what they
learned in class. Meanwhile, their classmates assessed their presentation
performance by using the peer-assessment worksheets. During the process,
teacher-student, student-student and student-parent communications were
involved.
In practice, in the preparatory phase, before the EG students entered
the school, a needs analysis was conducted to know about the assessment
system in the school and the feasibility of the portfolio assessment. Then,
the research team and the school teachers worked together to decide on the
initial portfolio template, which also served as part of teacher training. The
teachers were informed about why the portfolio assessment was beneficial
to learning, what it was and how to help develop the portfolio.
In phase 1, when the seventh graders were enrolled, they took the
placement test and they were allocated to the EG and CG. CG learners
followed the traditional way of teaching and learning with no portfolio
assessment. EG students took the orientation classes in the first week to
find out about their active role in learning, to learn to do the
self-assessment, and to set realistic learning goals and plans with the help
of their teachers. Then, the students were asked to manage their own
learning with the help of the individualized ELP. The rubrics for the ELP
assessment were given in the guidelines to ensure the quality of portfolios.
The portfolios were reviewed not only by the teacher but also by peers on
a monthly basis. At the end of the semester, EG learners did the
questionnaires and 6 learners of each EG class were interviewed. Based on
144 Chapter Seven
the students’ feedback, the research team and the EG teachers simplified
the portfolio template.
Figure 7-1: The process of the ELP practice.
In phase 2, EG1 and EG2 students were required to keep the simplified
ELPs, which were revised according to the suggestions from students and
teachers. This time, EG2 students learned to develop the portfolio with the help
of not only their teachers but also their peers. EG1 students presented their
learning outcomes and shared their experiences in the orientation week. Small
adjustments were made based on the on-going needs analysis in the process.
In phase 3, EG1, EG2 and EG3 students were not required but
encouraged to develop ELPs. The questionnaire and interviews were
conducted to see whether there was a difference between EG and CG.
Instruments
Data were gathered through the ELP, the questionnaire and the case
studies and the data from these multiple sources helped achieve the effect
of triangulation (Duff, 2007; Nunan &Bailey, 2009).
A 5-point Likert scale on learner autonomy was adapted from Oxford
and Burry-Stock’s (1995) Strategy Inventory for Language Learning
(SILL). This instrument was used for two main reasons: SILL has been the
most widely accepted questionnaire to evaluate LA; SILL incorporates the
metacognitive, affective and social dimensions of language learning, which
is consistent with the three-dimensional construct of learner autonomy (i.e.,
metacognition, willingness and communication) used in this study. The
questionnaire items focus on the measurable behaviors, including the
items portraying metacognition like ‘I have got my own way of learning
English’, ‘I adjust my learning methods sometimes’, ‘I have a clear
learning goal’, ‘I make learning plans according to my learning goal’, and
‘I often reflect on my English learning’. Examples of items depicting
willingness were the following: ‘I am interested in English’, ‘I have
confidence in English learning’, and ‘I feel happy when learning English’.
Example items showing communication were: ‘I exchange my views of
English learning with peers’, ‘I talk with others in and out of class’, ‘I take
the chance to speak English’, and ‘I also use nonverbal language (gestures,
facial expressions, etc.) to express myself when I communicate with
others’. The questionnaire was designed, translated and administered in
Chinese to avoid misunderstandings and misinterpretations.
All members of the research team were involved in the validation of
the questionnaire items and initially they put forward 43 items. Then, the
questionnaire was piloted with 395 learners to ensure its construct validity
via factor analysis after the reliability test of Cronbach alpha coefficient.
The results showed that KMO was 0.920, and the Barlett’s test correlation
matrix was 0.000. Factor analysis revealed that nine factors stood out by
means of the principal component analysis, which were a bit scattered. As
a result, items with extraction communities lower than 0.5 were eliminated
from the scale.
Another factor analysis was conducted to verify the reliability and
construct validity of the revised 29-item scale. Statistical analysis revealed
146 Chapter Seven
a KMO of 0.931. Factor analysis further validated the construct of the

scale. Out of 29 items in the final version of the questionnaire, 18 items
manifest metacognition, 6 items reflect willingness, and 5 items
demonstrate learner communication. To ensure reliability, the questionnaires
in the study were administered not by the classroom teachers but by the
research team and teachers of other subjects in the school.
In the case studies, the learning journals and interview notes of the
participants were additional sources of data to explore the development of
learner autonomy. In the learning journals, students were asked to
summarize their progress each week and seek improvement in the
following week. The interviews aimed at students’ perceptions about the
portfolios: whether developing the portfolios was conducive to their
English language learning, whether they were satisfied with their
portfolios, whose portfolios were good, for what reasons, and what help or
support they expected. The interview notes and learning journals in the
case studies were categorized according to the three-dimensional construct
of learner autonomy.

Questionnaire Findings
In response to the first research question whether portfolio assessment
makes any difference to the EG in terms of learner autonomy, an
independent samples t-test was performed comparing EG and CG student
responses to the questionnaire. The findings revealed that ELP was
conducive to LA development for learners at the junior high school level
(see Table 7-4).
As is seen in Table 7-4, both CG and EG displayed quite a high degree
of learner autonomy. The negative t value revealed that the EG had
produced better results in terms of learner autonomy, especially in
metacognition, communication and willingness to learn autonomously.
The significant difference between EG and CG in LA validated the
effectiveness of the ELP.
Specifically speaking, the Levene’s test for equality of variances
showed no significance in any area, thus meeting the assumption of
equality variances. The t-test results suggest that there were significant
differences between the CG and EG in overall LA (p = .011),
metacognition (p = .017), communication (p = .035) and willingness (p
= .017). The findings indicate that EG learners had become more
autonomous than CG learners, to which the use of the ELP might have
contributed, especially in students managing and planning their own
learning.
Table 7-4: Independent samples test of LA and its subscales.
Mean Std. Deviation t-test

Sig.
Factor CG EG CG EG t df (2-tailed)
LA (Total) 3.927 4.094 .6098 .6916 -2.546 395 .011
Metacognition 3.939 4.105 .6317 .7293 -2.408 395 .017
Communication 3.803 3.963 .7384 .7648 -2.114 395 .035
Willingness 4.013 4.193 .7575 .7376 -2.394 395 .017
Note: CG = Control Group; EG = Experimental Group.
The metacognition was comprised of three sub-dimensions: planning,

managing and reflecting. The t-test results revealed a significant difference
between the CG and EG in managing (p = .002), but no significant
difference in planning (p = .073) and reflecting (p = .126). That means,
generally speaking, EG learners could manage their learning better than
CG learners. But interestingly, there were marked differences in planning
and reflecting between EG1 and EG2 (see Table 7-5). That is, EG2 had a
much higher degree of reflection. It is possible that during the second year
of junior high school and having used the ELP for a year, students in EG2
had better developed their skills to plan and reflect, which is evidence of
how the ELP can help develop students’ autonomy over time.
Table 7-5: Planning and reflecting.
Subscale Group Mean Std. Deviation

Planning EG1 4.029 1.029
EG2 4.299 .661
Reflecting EG1 3.75 1.009
EG2 4.31 .652
148 Chapter Seven
In conclusion, EG learners did exhibit more autonomous behaviors in

language learning than the CG learners, especially in the subscale of
metacognition and willingness. But whether the significant difference was
closely correlated with the ELP needs further investigation.
Case Study Findings

In the case studies, interviews were conducted and learning journals
were analyzed to elicit learners’ perceptions about portfolio assessment
and seek their constructive suggestions for revising and further developing
the ELP. Learners’ feedback on ELP was categorized into three areas:
metacognition, willingness and communication to probe more deeply into
the relationship between ELP assessment and LA development.
Phase 1
EG learners’ significant progress in their metacognitive abilities was
well supported by their weekly plans and reflections. Alice’s ELP won
favorable comments from her teacher Allan and her peers, and was
selected as one of the best 5 at the end of the first term. In the learning
journals, Alice viewed the portfolio as an aid to time-management:
“We will be clear about what to do instead of wasting time after making
plans. When we finish homework every evening, we will try to reach our
goal. And on weekends, we will spend time in learning English according
to plans and often check if the plan has been fulfilled. Otherwise, we might
waste time in watching TV and playing with computers.” (A-J2-11)
According to the journal entry, the portfolio helped Alice manage her
time and learning. Learning plans are an essential part of the portfolio and
they can guide students on what to learn and can give them insight into
when is the best time to learn outside of the classroom. The research team
and teachers gave students a list of recommended learning materials. Then,
students had the freedom to select the materials according to their tastes
and interests. As the school did not give students much extra time during
the school day, students had to use their free time to do the assigned
homework for the various subjects. Therefore, Alice’s plan of learning
English mostly on weekends was quite practical. Table 7-6 shows an
excerpt of Alice’s monthly learning plans. In the interview (A-I1), she
reported that she had made some plans in primary school but could not
stick to them. Thanks to the portfolio, she could persist in doing
out-of-class learning step by step together with her classmates under the
guidance of the teacher. She also gained a sense of achievement when her
portfolio was selected as one of the best. Her experience proved that the
portfolio could help learners to plan, manage and monitor their English
learning.
Table 7-6: Alice’s monthly learning plan (A-P1-9).
Materials for out-of-class learning Learning time
New Concept English Book 1

Learn at least 100 lessons with good 7:00-8:00am on Sundays
pronunciation(reading aloud)
The Elephant Man
8:00-8:25pm on Sundays
Can read it fluently and translate it well
Grammar for Middle School Learners
8:00-8:25pm on Tuesdays
by Bo Bing, Unit 1-3
English songs: There you’ll be, Ben,
3:00-5:00pm on Fridays
I’ll follow him and learn to sing them
Do three reading comprehensions 9-10:30am on Saturdays
Read some newspapers or magazines
Optional
Make some English newspapers
Elina, another good ELP keeper, viewed the portfolio as her learning
outcome and a stock of learning materials:
“I can carry out my plans and make some adjustments. For the out-of-class
learning, I’d like to separate my own learning materials from the teacher’s
so that I can clearly know what I have gained after class. More importantly,
I have begun to review all the materials in ELP. After all, we develop ELP
for our own sake…” (E-J2-11)
In a summary of her learning journals, Elina portrayed the first winter

vacation as having completed a substantial amount of learning activities:
“This winter holidays did not last long but I felt happy and substantial for I
benefited a lot from autonomous learning. I read two parts of Gulliver’s
Travels and wrote two book reports. Besides, I wrote two poems and
learned to sing four English songs. I translated an article and made two ppt.
Also, I finished learning 7A, New Concept English and Li Yang English
150 Chapter Seven
and took notes... I devoted my time to English learning after finishing

doing other assignments.” (E-J1-18)
Elina showed her devotion to the portfolio and to English learning

during the winter vacation: “Usually, I got up at 6:30, then listened to Li
Yang English ... At noon, … Every evening, I listened to some English
songs, watch some English movies or listen to Gulliver’s Travels.” At the
end of the holidays, she reported: “I’ve made great progress in English
speaking and vocabulary. I will keep working hard and I’m sure that I will
succeed.” When the new term began, her teacher, Betty, also noted Elina’s
progress in spoken English and set her portfolio as an example to
encourage other students. Elina was devoted to English learning in her
spare time, even on her way home. At the end of phase 2, she scored the
highest on the oral test. Elina’s experience was living proof of how the
portfolio could help learners achieve their goals and maintain their interest
in English learning not only in class but also out of class.
Bob, an average learner in the class, reflected on his ELP:
“I have always taken ELP as a collection of learning materials. This

semester I have collected 29 pages of out-of-class learning materials, 8
presentation drafts and 6 assessment sheets with the table of contents, test
papers with tags and other things, weekly plans, reading materials and
books. I make good plans but sometimes cannot finish all of them. So I’ll
improve it next time.” (B-J3-13)
Bob had a sense of achievement when reviewing his portfolio. At first,

he did not take an interest in English learning, but in other science subjects.
When he was asked to do out-of-class learning as obligatory practice, he
selected pop science articles in English and shared them with his
classmates. To his surprise, his presentations were well-accepted by his
classmates and gradually he developed an interest in English. Evidently
presentations and peer recognition facilitated his English learning. From
the learning journals, the research team and the teachers were delighted to
learn that ELP played a significant role in helping learners make plans and
manage their time, especially during weekends and vacations. In other
words, the use of ELP helped learners manage their out-of-class English
learning and witnessed their progress in the process.
However, in practice, the teachers and some students also found the
ELP time-consuming due to its complicated design. As most students in
the school spent their weekdays doing the assigned homework for the
various subjects, they had little time for extra learning. Therefore, the
teachers and the research team decided to streamline the ELP and use it
mainly for learning during weekends and holidays. In the second term of
Year 1, the ELP was simplified. Based on the ongoing needs analysis, in
Year 2, the ELP was further revised and mainly focused on planning and
reflecting in addition to self-assessment and peer-assessment in class.
Phase 2
For Cathy, David and Jack, the ELP template underwent some changes
in format. Their weekly plans were moved out of the ELP folder and they
were put together with the reflections in the learning journals so that the
students could clearly see whether or not they had accomplished their
plans.
As the research team and teachers were more experienced in the
second year, activities in the orientation week were used to increase the
students’ awareness of planning and doing out-of-class learning. After
watching the film Akeelah and the Bee and interacting with the EG1
students, Cathy immediately took action. She began to keep learning
journals in English in the hope of improving her English in addition to
reflecting. She wrote in the journal:
“I keep writing learning journals in English every week and Carol [her
teacher] corrects my mistakes. In this way, I benefit a lot, not only to learn
English but also in writing.” (C-J2-16)
Therefore, the weekly journals not only helped Cathy learn how to
learn but also helped her improve her English writing. The journal also
served as a bond in the teacher-student communication.
Similarly, David summarized his one-year of learning as:
“I can better manage my study with the help of the teacher and ELP after I
entered this junior school about a year ago. The weekly plan in my
learning journals and the monthly plan can help me well arrange my study.
I often cooperate with my classmates in class, and practice dialogues and
prepare for presentations. I like this learning mode.” (D-J2-16)
His journal also showed how ELP helped raise learners’ metacognitive
awareness in planning and managing their leaning. As a matter of fact,
students could not only learn to plan their learning but they also made use
152 Chapter Seven
of learning skills and strategies and tried to find out the most suitable way
of learning for themselves. As Cathy put it:
“I have learned quite a lot of learning strategies. I gained some of them by

myself, and obtained more by summarizing and sharing them in class. In
this way, I find out-of-class learning easier to manage.” (C-J2-16)
It could be claimed that it is easier for successful learners like Alice,

Elina, Cathy and David to plan and manage their studies. But there is room
for every learner to work towards autonomy. Jack, an ineffective learner,
was a case in point. He failed the written placement test, with a score that
was 24 points lower than the average score. As far as spoken English was
concerned, his language proficiency was lower than A2 in reference to the
CEFR scale. Jack’s learning plan was very practical in that the out-of-class
learning was mostly related to the assigned homework. Considering his
language proficiency, Carol, his English teacher, helped him make an
individualized plan to help him finish his homework more efficiently and
work on his weaknesses so as to boost his confidence in English learning.
In his learning journals, Jack expressed his gratitude to Carol more than
once:
“This week I learned how to learn English and how to present what I learn.
Thank you, Carol. You taught me how to take notes and use notes. The
learning strategies we learn in every unit were really helpful.” (J-J1-2)
To Jack, the ELP served as the bond between him and Carol. When
encountering some problems in English learning, Jack would ask
questions in the learning journal. For instance, he wrote:
“Carol, I have two questions. First, will we learn phonetic symbols in the
junior high school? I did not learn in the primary school. Second, what can
I do if I meet with some new words in a test?” (J-J1-3)
Carol responded patiently when reading his journal:
“Don’t worry. We will learn phonetic symbols in the following week. Then
you can learn to pronounce new words and correct wrong pronunciations.
As for the new words you meet with in reading, I have two suggestions.
Firstly, it is common to run into new words. You can skip them if they do
not keep you from understanding the content or guess the meaning in the
context. Secondly, …”
Gradually Jack’s interest in English grew. In his journal (J-J1-4), he

showed his learning slogan: “Enjoy what you are learning and present
what you have learned to the full extent.” The change in his willingness to
learn English was revealed in his journals and learning materials.
The case studies results thus provided further evidence for the role that
the portfolio played in fostering learner autonomy. The metacognitive
abilities and English learning willingness were promoted in the learners.
The journals of Alice, Elina, Bob and David highlighted the positive
impact of the planning and monitoring role of the portfolio; the journals of
Elina and Jack reflected their changing attitude towards English learning;
and the learning journals of Cathy and Jack revealed that ELP helped
facilitate teacher-student communication after class especially on their
learning plans and strategies. As a result, all participants in the case
studies moved closer to proactive autonomy.
Furthermore, the difference in the questionnaire results in the planning
and reflecting between EG1 and EG2 could also be explained by the way
their plans were made and implemented. In the first phase, learning
journals were kept in a notebook and handed in every week, while the
learning plans were placed in the portfolio folder for the monthly peer
review and random checks by the teachers. In the second phase, EG2
learners kept their weekly plans and reflections together in their learning
journals, which was more convenient for the learners themselves, their
peers and teachers to monitor their learning. The revised format of plans
drew planning and reflecting closer, which can be seen in one of David’s
weekly plans (see Table 7-7).
In short, the revised portfolio was better suited to foster learner
autonomy by helping students plan and reflect, facilitating communication
between teachers and students, as well as, helping students share
experiences with their classmates. The constant negotiation and
adjustment of the ELP template showed that the ELP-based approach
could contribute to LA development among young learners but its function
would be limited in a high-stakes learning context, where learners are
assigned burdensome homework. The study also showed that when
learners learned to plan and manage their English learning, teachers were
expected to play multiple roles, such as guide, helper, facilitator, and
counselor, in order to diagnose student’s problems and provide
constructive suggestions and timely help. That is, the development of
autonomy in students placed higher expectations on their teachers.
154 Chapter Seven
Table 7-7: David’s weekly plan.
Weekly plan Results

Listening Listen and imitate Listened and read after it for
Lesson 137, 141, 143 in 5-6 times. Can read 137, 143
New Concept English very fluently.
B1. Read after the tape Imitated the tones of
and try to pronounce the different people with
words clearly. emotions.
Speaking … …
Reading … …
Translating … …
Others … …
Conclusion
This longitudinal study showed that ELP assessment employed as an
obligatory element in English learning in the first two phases enabled
learners to move closer to reactive autonomy and gradually progress to
proactive autonomy with teacher guidance and peer feedback.
The questionnaire results revealed significant differences between the
EG and CG students in LA, particularly in metacognition, while the case
studies validated the positive relationship between the ELP and LA among
EG learners in a high-stakes context. Though the study was conducted in a
top-down approach, on-going needs analysis was important. In the study,
the ELP template was negotiated and adjusted more than once to cater for
learner needs and to consider contextual constraints in the high-stakes
learning context.
The positive feedback concerning the simplified ELP template also
shows that there is no one-size-fits-all or best portfolio template. It is
always important for researchers, teachers and learners to figure out the
most suitable one for their learners in a certain context. A clear purpose
and focus in the portfolio is equally important, as it can help learners, who
may otherwise find it too difficult to achieve their goals and fulfill their
plans, develop their interest and confidence in the learning process. This
way, ineffective learners like Jack can also develop their interest in
English learning. In addition, Jack’s case also showed that portfolio use
should be adapted to the individual learner. Teachers are expected to
provide sufficient scaffolding and assistance in goal-setting and
plan-making especially in the beginning so that learners can make a

feasible and individualized learning plan.
In conclusion, the ELP can fulfill the goal of fostering learner autonomy
among younger Chinese learners but the effectiveness of the ELP lies in the
on-going needs analysis of the learners, and the constant efforts and
cooperation among learners, teachers and researchers. It is only through this
cooperation that adjustments and further improvements can be made.
References
Banfi, C. S. (2003). Portfolios: Integrating advanced language, academic,
and professional skills. ELT Journal, 57(1), 34-42.
learning. Beijing: Foreign Language Teaching and Research Press.
Bown, J. (2009). Self-regulatory strategies and agency in self-instructed
language learning: A situated view. The Modern Language Journal,
93(4), 570-583.
Confessore, G. J., & Park, E. (2004). Factor validation of the Learning
Autonomy Profile (Version 3.0) and extraction of the short form.
International Journal of Self-directed Learning, 38(1), 39-58.
Cotterall, S. (1995). Readiness for autonomy: Investigating learner beliefs.
System, 23(2), 195-205.
Cummins, P. W., & Davesne, C. (2009). Using electronic portfolios for
second language assessment. The Modern Language Journal, 93(1),
848-867.
Dai, W., & Hu, W. (2009). Foreign language education in China:
Development and research (1949-2009). Shanghai: Shanghai Foreign
Education Press.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge:
Duff, P. (2007). Case study research in applied linguistics. Lawrence
Erlbaum Associates Inc.
Gong, Y., & Luo, S. (2002). The alternatives in English assessment.
Beijing: People’s Education Press.
González, J. Á. (2009). Promoting student autonomy through the use of
the European Language Portfolio. ELT Journal, 63(4), 373-382.
Holec, H. (1981). Autonomy and foreign language learning. Oxford:
Pergamon Press.
Joy, M. (2012). Learning styles in the ESL/EFL classroom. Beijing:
Foreign Languages Teaching and Research Press.
156 Chapter Seven
Kohonen, V. (2006). Student autonomy and the European Language

Portfolio: Evaluating the Finnish pilot project (1998-2001). Retrieved
from: http://www.caslt.org/pdf/LECTURA%202_Kohonen,%20V.pdf
Little, D. (2005). The Common European Framework and the European
Language Portfolio: Involving learners and their judgments in the
assessment process. Language Testing, 22(3), 321-336.
—. (2009). Language learner autonomy and the European Language
Portfolio: Two L2 English examples. Language Teaching, 42(2),
222-233.
Little, D., & Erickson, G. (2015). Learner identity, learner agency, and the
assessment of language proficiency: Some reflections prompted by the
Common European Framework of Reference for Languages. Annual
Review of Applied Linguistics, 35, 120-139.
Littlewood, W. (1999). Defining and developing autonomy in East Asian
contexts. Applied Linguistics, 20(1), 71-94.
Lo, Y. F. (2010). Implementing reflective portfolios for promoting
autonomous learning among EFL college students in Taiwan.
Language Teaching Research, 14(1), 77-95.
Moeller, A. J., Theiler, J., & Wu, C. (2012). Goal setting and student
achievement: A longitudinal study. The Modern Language Journal,
96(2), 153-169.
Nunan, D. (1997). Strategy training in the language classroom: An
empirical investigation. RELC Journal, 28(2), 56-81.
Nunan, D., & Bailey, K. M. (2009). Exploring second language classroom
research: A comprehensive guide. Heinle: Cengage learning.
Nunes, A. (2004). Portfolios in the EFL classroom: Disclosing an
informed practice. ELT Journal, 58(4), 327-335.
Oxford, R. L., & Burry-Stock, J. A. (1995). Assessing the use of language
learning strategies worldwide with the ESL/EFL version of the strategy
inventory for language learning (SILL). System, 23(1), 1-23.
Rao, Z. (2006). Helping Chinese EFL students develop learner autonomy
through portfolios. Reflections on English Language Teaching, 5(2),
113-122.
Sheerin, S. (1991). Self-access. Language Teaching, 24(3), 143-157.
Shu, D. (2004). FLT in China: Problems and suggested solutions.
Shanghai: Shanghai Foreign Education Press.
Ministry of Education. (2001). The National English Curriculum
Standards. Beijing: Beijing Normal University Publishing Group.
Trim, J. L. M., & Bailly, S. (2001). Common European Framework of
Reference for Languages: Learning, teaching, assessment: A guide for
users. Council of Europe Publishing.

Wang, Q., & Zhang, H. (2014). English teachers’ action research. Beijing:
Foreign Languages Teaching and Research Press.
Wenden, A. (1991). Learner strategies for learner autonomy.
Prentice-Hall.
White, C. (1999). Expectations and emergent beliefs of self-instructed
language learners. System, 27(4), 443-457.
Xu, J. (2007). Autonomy in college foreign language learning–from theory
to practice. Beijing: China Social Sciences Press.
Yilmaz, S., & Akcan, S. (2011). Implementing the European Language
Portfolio in a Turkish context. ELT Journal, 66(2), 166-174.
Zimmerman, B. J., & Schunk, D. H. (2001). Self-regulated learning and
academic achievement: Theory, research and practice (2nd ed.).
Lawrence Erlbaum Associates, Inc.
CHAPTER EIGHT
SHARING ASSESSMENT POWER TO PROMOTE

LEARNING AND AUTONOMY
CAROL J. EVERHARD
Abstract
In investigations of the theory and practice of foreign language
learning and teaching, attention has often been focused on the evaluation
of language competencies, on the one hand, and on the nature of autonomy
in language education, on the other. Although learner-centred assessment
is recognized as an essential element in promoting learner autonomy, the
matter of the relationship and interconnection between teaching, learning,
assessment and autonomy remains largely unexplored. With reference to
empirical research conducted in a higher education context in Greece, the
author takes a closer look at the nature of the relationship between
teaching, learning, assessment and autonomy. The study explores how
practice in peer- and self-assessment of oral and writing tasks can both
help develop language competencies and allow learners to exercise more
autonomy in the learning process. By sharing assessment power with
learners and encouraging them to take part in assessment processes using
pre-determined sets of criteria and offering feedback to peers, an attempt
was made to move from summative methods of assessment, which align
with transmissional modes of teaching and assessment of learning, to more
formative methods of assessment, which align with transactional modes of
teaching and assessment for learning. Based on the study findings, it is
argued that the activation of criterial thinking, metacognitive processes
and awareness facilitates progression to more transformative approaches
to teaching and to sustainable assessment or assessment as learning, which
leads to more liberatory and lifelong learning.
Sharing Assessment Power to Promote Learning and Autonomy 159
Introduction
Although a great deal of interest has been generated in autonomy in
language learning and its promotion, at the same time it has become
apparent to teachers and researchers that assessment practices have not
really changed in ways that might help promote autonomy and, indeed,
may even come into conflict with the learner-centred practices which the
fostering of autonomy demands. The research project which is reported in
this chapter explored the use of assessment, not as a means of policing,
ranking or providing certification, but rather as a means to promote a
greater sense of ‘being’ as a learner and to create a greater sense of ‘self’
(Everhard, 2012). Conventional assessment passes judgment on learners
and categorizes their learning efforts by means of grades. This type of
summative assessment highlights overall weakness, but does not help
learners identify exactly where their weak points lie and how these
weaknesses can be overcome and converted to strengths. Hence the need
for more formative approaches in which assessment aims to promote
learning through increasing learner awareness.
Murphy (2015) has pointed out that although the rhetoric of an
institution may indicate the desire to encourage and promote autonomy in
learners, particularly in the distance education setting she describes, in
practice, the type of teaching, guidance, materials and support on offer
may actually contradict these aims. Nguyen and Walker (2016) indicate
that it is assessment practices which are inappropriate or poorly thought-
through which can actually detract the most from the good teaching on
offer, since the same benefits offered by good teaching are not captured in
assessment approaches (Boud, 1995, cited in Nguyen & Walker, 2016).
Raappana (1997) notes the dangers of such contradiction in a Finnish high
school setting and believes that opportunities for metacognition and self-
assessment among learners are key to promoting autonomy and
encouraging teachers to avoid teaching to the test. Black and Jones (2006),
in a British primary education language-learning context, also highlight
the importance of metacognition and how the assimilation of criteria,
through peer-assessment, helps learners progress to assessing their own
work with greater ease and understanding. Heron (1981, p. 66) highlights
the fact that “people are not used to criterial thinking” and so what is
important is “consciousness raising about criteria and criterial thinking”,
to enable self- and peer-assessment tasks and skills to become part of the
language learner’s repertoire.
What will be described in this chapter are details of an approach which
was taken in a Greek higher education (HE) setting towards assessment.
160 Chapter Eight
The participants in the research study were 1st year English majors and the
aim was to encourage and promote the autonomy of the language learners
involved by sharing assessment power. The study used particular
pedagogical and assessment tools, combined with advising sessions, to
increase learner awareness. It explored the possibility of greater
empowerment and autonomy of language learners through implementing
learner-focused assessment.
The Study
The setting for the research project, entitled the Assessment for
Autonomy Research Project (AARP), was undertaken in the context of a
first-year first-semester Language Mastery I course at the School of
English (SOE), at the Aristotle University of Thessaloniki (AUTh). The
aims of the research were to establish if learners, when given the
opportunity to peer-assess, could do so with objectivity using the criteria
given and whether once having had practice in peer-assessment, learners
could use the same criteria to conduct self-assessment with the same
objectivity. It was hoped that by assuming such responsibility and being
given the opportunity for critical and criterial thinking, learners could use
the assessment process as an opportunity for learning, through offering
and receiving feedback and playing a role in their assessment that had
equal standing with the instructor’s. In addition, by developing greater
self-awareness, the assessment process might lead learners to greater
autonomy and self-direction in their language learning.
As would be expected, the majority of participants were indeed first-
year students who had entered the SOE through competitive state exams.
The number of students participating in the project each year averaged out
at around 50, with about 250 participants in total over the 5-year period.
An anomaly of the entrance examination system was that students had
achieved high grades in order to qualify for entry to the SOE, but they
were not necessarily highly proficient in English. Language Mastery I was
the first of two courses on offer for the development of language skills,
with this first course being concerned with description and narrative, while
Language Mastery II focused on argumentative and persuasive discourse.
Each of these courses consisted of four contact hours per week, though as
English majors, all of their courses from the four departments of English
Literature, American Literature, Linguistics and Translation were delivered
in English.
Diagnostic testing of new entrants to the SOE, using the Oxford

Placement Test (OPT), would generally reveal the majority to be at B2
level within the Common European Framework of Reference scales
(CEFR), while discussion during advising sessions tended to show that the
majority had certification in English which placed them at C1 or C2 level,
but the passing of time (usually two years or more) and neglect, through
prioritizing study of other subjects for entry to university, had, in most
cases, led to language attrition.
The AARP had a duration of five years and involved learners in peer-
and self-assessment of oral and writing skills. The focus of both the
speaking and writing assignments and their assessment was description
and narrative. The oral assignments were evaluated through triangulated
peer-, self- and teacher assessment, and they were based on an individual’s
performance in an in-class presentation of 3-5 minutes’ duration, at a time
selected by each learner. There was a great deal of choice related to the
subject-matter, the manner and the content of the presentation. Students
could choose to present alone, as a pair or as a member of a small group,
but they were assessed as individuals and upon their particular
contribution. The presentation was judged according to a set of pre-
determined criteria, in the form of a checklist, with five criteria which
could each be judged on a scale of 1 (weak) to 5 (strong). All assessors
were required to use the same criteria. Although, clearly, assessors knew
who the presenters were, in each case, peer-assessors were encouraged to
maintain anonymity and a number rather than a name was used to
distinguish who the assessor was. This was insisted upon in order to
promote objectivity, responsibility and trust. The oral grade accounted for
10% of each student’s total course grade and the weighting of peer-, self-
and teacher oral assessment was equal, each constituting one third of the
10%. The peer-assessment was calculated, at a later date, as the mean of
all the peer-scores which had been awarded for that particular
presentation. Peer assessment was thus an iterative process.
In the case of writing skills, peer assessors were required to assess only
one peer and assessment was of two writing assignments, at different
stages in the course, one of which was concerned with paragraph writing
and the other with essay writing. As in the case of oral assessment, the
criteria were pre-determined and consisted of five criteria which could be
judged on a scale of 1 (weak) to 5 (strong). Again, all assessors were
required to use the same criteria.
The whole assessment process was arranged in such a way that it fitted
in well with pedagogical processes and was intended to cause as little
disruption to normal classroom routines as possible. The assessment
162 Chapter Eight
criteria checklists were introduced early on in the course so that learners

would have time to raise any questions and doubts about the assessment
process and would also have time to become familiar with the criteria
which would be used for triangulated assessment. The students were also
primed on the advantages of this form of assessment, which had made it
popular with students in previous years. Although it is sometimes argued
among researchers that learners should be involved in the creation of
assessment criteria themselves in order to validate the assessment process,
there is a lack of empirical evidence to suggest that in practice this is the
case and, antithetically, there is fairly strong evidence (Bowen, 1988;
Orsmond, Merry & Reiling, 2000) which suggests that learner assessment
can be more effective when learners are actually presented with ready-
made assessment criteria with which to judge their own work or that of
their peers.
It has also been suggested, on occasion, that it is more appropriate that
the process of self-assessment should precede that of peer-assessment
(Dam, personal communication; Pope, 2005), but again there is not much
evidence in the literature to support this view. Indeed, in a variation
/deviation from the norm which occurred in the AARP, some students
took part in additional self-assessment with the 2nd Writing Assignment
(Writing 2), so that they performed self-assessment before they submitted
their assignments and again, with a delay, after peer-assessment, as was
customary on the AARP. This delay between peer-assessment and self-
assessment of writing tasks was with a view to enabling learners to
sufficiently ‘distance’ themselves from the task, something which Boud
(1995) considers to be prerequisite. This deviation from the norm for self-
assessment on the AARP produced some surprising results, which will be
discussed in more detail in a later section.
In terms of research procedures, a Pre-Study was conducted in the first
semester of the first academic year of the AARP, with the author-
researcher’s two groups of Language Mastery I, with 27 students in Group
A and 23 students in Group B. For the Pre-Study and throughout the
duration of the AARP, these students were predominantly female,
predominantly Greek-speaking, within the age-range 17-19 and, as
mentioned previously, with a predominant score of CEFR B2 level on the
OPT, which was used for diagnostic testing purposes in the first week of
entry to the course.
The Pre-Study was conducted to ensure that all the research
instruments which would be in use, could be deemed reliable. Some of
these included Learner Profile Cards, a Learner Contract, the Criterial
Checklists and an Assessment Questionnaire. The latter was completed at
the end of the assessment cycles to gather student opinions about the
assessment process. Having ensured the consistency and reliability of
these instruments, the same pedagogical and research procedures were
followed in the next three years. Again, each year the author-researcher’s
two groups of Language Mastery I students were involved in the same oral
and writing assessment procedures, thus involving in total six groups in
the three-year period.
In the final year of the AARP, in the Post-Study, claims made in the
literature, that training is necessary for accurate peer-assessment, were put
to the test, by providing training for the two groups involved, in the form
of mock assessment, both for the oral presentation and for the first writing
assignment. It should be noted that in the four years previous to the Post-
Study, learners learned how to peer-assess by doing peer-assessment.
There was no training involved.
In the case of the Post-Study intervention, which involved training for
oral peer-assessment, nine older student volunteers offered live
presentations which were assessed by both groups simultaneously, using
the pre-determined criteria for oral assessment previously provided. The
grading and comments of the learners were saved in duplicate and brought
for discussion and comparison with the instructor’s assessment in the next
lesson. In the case of the writing assessment, as suggested by Dickinson
(1992), frozen data were used and five sample answers, drawn from
students’ assignments in previous years, were randomly selected for
training purposes with the learners. They were asked to assess the first
three samples at home and these were then discussed together in class.
Then they were asked to assess the remaining two samples in class
(without consultation), together with a ‘live’ sample for peer-assessment.
On completion, they recorded their assessments in duplicate and their
assessment of the two remaining samples was discussed at the next class
meeting. In addition, appointments were made with students in small
groups so that their peer-assessment of the mock samples and of the ‘live’
example could be discussed with the instructor, as could any issues arising
from their oral assessment training. The outcome of this intervention on
the part of the instructor, using training in peer-assessment, rather than
practice alone, revealed some interesting results. Also, a new questionnaire
was devised, in English rather than Greek, so that information specific to
learners’ impressions and opinions of the training intervention could be
gathered.
The assessment data, both quantitative and qualitative was processed
and analysed statistically in a number of ways in order to be able to
spotlight and highlight findings from a variety of perspectives and views.
164 Chapter Eight
Results
In considering the results of the AARP study, it has to be remembered
that although assessment of both Writing and Speaking Skills involved
triangulated peer-assessment, self-assessment and teacher assessment, the
derivation of peer-assessment grades was significantly different in each
case, as will be explained. What is similar, however, is the fact that
learners learned to assess by doing. As mentioned previously, there was no
attempt to train them to peer-assess until the last year of the 5-year project.
Some of the results derived from the statistical analysis are shown in
Tables 8-1 and 8-2. If we consider the ANOVA results derived from
assessment of the first writing assignment, concerned with paragraph
writing, under the column ‘Writing 1’ in Table 8-1, we see that in the Pre-
Study, AY1, there were no significant differences between the groups,
indicating alignment between peer-, self- and teacher assessment. The
same occurred with the second writing assignment, designated ‘Writing 2’
in Table 8-1, which was concerned with essay writing. Again, there were
no significant differences between peer-, self- and teacher assessment.
With regards to the Main Study (AY2-AY4), there was some consistency
in the assessment of the Writing 1 task, but not the same type of
consistency as in the Pre-Study (AY1). Only one out of the two groups in
each year, in each case the second group, namely Groups D, F and H,
showed alignment in assessment between peer-, self- and teacher
assessment.
In the other groups, namely Groups C, E and G, in Writing 1, there was
divergence in assessment. There was no consistency in how these groups
diverged, as revealed by the Tukey-Kramer test. The Pearson correlation
coefficients, on the other hand, revealed some alignment between self-
assessment and teacher-assessment (S-T) in Group C, with a value of 0.61
and in Group F, with a value of 0.51. In the case of Group C, there was a
peer-assessment and teacher-assessment (P-T) correlation value, similar to
that of the S-T, with a value of 0.62.
In Writing 2, in the Main Study (AY2-AY4), the ANOVA results
revealed only one group, Group F, with no significant differences between
peer-, self- and teacher assessment, similar to the results in Writing 1.
Similarly, Groups A and B in the Pre-Study (AY1) showed no significant
differences for either Writing 1 or Writing 2. For the remaining groups,
the Tukey-Kramer tests revealed some consistency in assessment
behaviour. For Groups C, D and E, there was non-alignment between self-
assessment and peer-assessment, but there was alignment between peer-
assessment and teacher assessment. Also, in the case of Groups G and H,
there was misalignment between self-assessment and peer-assessment, but

there was no misalignment with teacher assessment.
Table 8-1: AARP assessment overview (Writing 1 and Writing 2).
Writing 1 Writing 2
One-way ANOVA &
One-way ANOVA &
Pearson corr. coeff.


Tukey-Kramer
Tukey-Kramer
Year &
Group
P-T
P-T
S-T
S-T
AY1
n.s. n.s.
Group A
AY1
n.s. n.s.
Group B
AY2 p = .002 p < .001
0.62 0.61 0.48
Group C A>C=B A>B=C
AY2 p = .010
n.s.
Group D A>B=C
AY3 p = .006 p = .002
Group E A>C only A>B=C
AY3
n.s. 0.51 n.s.
Group F
AY4 p = .005 p = .048
0.47
Group G A>B only A>B only
AY4 p = .017
n.s. 0.66
Group H A>B only
AY5* p < .001 p = .021
0.65
Group I A>B>C A>B only
AY5*
n.s. 0.66 n.s.
Group J
Note: n.s. = non-significant; ANOVA - A = Self; B = Peer; C = Teacher;
PEARSON - S-T = Self-Teacher; P-T = Peer-Teacher; AY = Academic Year;
AY5* = Only participants in the Post-Study Intervention exercises have been
included.
In terms of Pearson correlation co-efficients, there were interesting

correlations for S-T for Groups C, G and H, with values of 0.48, 0.47 and
166 Chapter Eight
0.66 respectively, revealing a tendency towards alignment between self-

assessment and teacher assessment.
In the Post-Study (AY5), despite the time spent on training for peer-
assessment, the same pattern emerged as in the years before, with
significant differences in the first group of the year, Group I, and no
significant differences, as shown by ANOVA, for the second group, Group
J. Only Group J displayed alignment between peer-, self- and teacher
assessment in both the writing assignments. The Tukey-Kramer equation
of A>B>C for Group I in Writing 1 indicated complete non-alignment of
peer-, self- and teacher assessment, while for Writing 2, the non-alignment
for the same group was between self- and peer-assessment. The Pearson
correlation coefficient for P-T in Group J was 0.66 for Writing 1 and in
Group I it was 0.65 for Writing 2, indicating inconsistency between the
two groups.
The results derived from learner involvement in the assessment of oral
skills on the AARP are altogether different from those for the writing
skills. This is most likely due to the fact that in the case of oral peer-
assessment, the mean peer grade was arrived at by whole-group criterial
thinking rather than the thinking and decision-making of an individual, as
was the case with the assessment of writing. The alignment between peer-,
self- and teacher assessment which is present in the assessment of oral
presentations in the Pre-Study from both Groups A and B is repeated five
more times in the Main Study and by both groups who were involved in
the oral intervention in the Post-Study (see Table 8-2). Not only is
alignment shown by the ANOVA test, but also Pearson correlation
coefficients produced P-T values of 0.75 and 0.51 respectively for Groups
A and B in the Pre-Study (AY1), and P-T values of 0.46, 0.44 and 0.52 for
Groups F, G and H respectively in the Main Study (AY2-AY4). The P-T
correlation value for Group I in the Post-Study (AY5) was 0.51. These
correlation values show a tendency towards alignment between peer-
assessment and teacher assessment. At the same time, correlations were
also produced for S-T, with correlation coefficients of 0.45, 0.42 and 0.45
for Groups C, F and H respectively in the Main Study, indicating a
tendency towards alignment between self-assessment and teacher
assessment.
Group E seemed to fit the profile of a completely rogue group, since
there is no alignment between peer-, self- and teacher assessment either
for Writing 1, Writing 2 or for Oral presentation assignments, making it
unique. It is possible that members of Group E aimed to demonstrate their
autonomy by refusing to conform regarding assessment. Elsewhere,
plausible reasons for ‘cheating’ within this group, pertaining to
collaboration on oral presentations, have been suggested (Everhard,

2015b).
Another interesting phenomenon, previously referred to, was
uncovered through asking learners to self-assess their second writing
assignment twice. In Groups C and D, in the Main Study (AY2), there was
a unique variation in the way self-assessment of Writing 2 was conducted
since learners conducted self-assessment of their assignments twice-over:
(1) on submission of the assignment, and (2) in the usual way, with a delay
after peer-assessment had been completed.
Table 8-2: AARP assessment overview (Oral Presentation).
Oral Presentation
One-way ANOVA &

Tukey-Kramer
P-T
S-T
Year & Group
AY1 Group A n.s 0.75
AY1 Group B n.s. 0.51
AY2 Group C n.s. 0.45
AY2 Group D n.s.
AY3 Group E p < .001 A=B>C
AY3 Group F n.s. 0.46 0.42
AY4 Group G n.s. 0.44
AY4 Group H n.s. 0.52 0.45
AY5* Group I n.s. 0.51
AY5* Group J n.s.
Note: n.s. = non-significant; ANOVA - A = Self; B = Peer; C = Teacher;
PEARSON - S-T = Self-Teacher; P-T = Peer-Teacher; AY = Academic Year;
AY5* = Only participants in the Post-Study Intervention exercises have been
included.
Unfortunately, not all of the students from Groups C and D

participated in this variation, so that only 20 of the 22 participants in
Group C and 13 of the 18 participants in Group D, were involved.
Nevertheless, it seems important to report the outcome of this particular
experiment and the results given in Table 8-3 might go some way towards
168 Chapter Eight
throwing light on the results for these groups shown in Table 8-1. In Table
8-3:
1. ‘Actual’ refers to the means derived from all the participants who
were involved in self-assessment after conducting peer-assessment,
which was the normal procedure on the AARP and it is that on
which the results in Table 8-1 are based.
2. ‘Variation 1’ presents the means derived from learner-assessment
for those students from Groups C and D, who submitted self-
assessment with their assignments, before peer-assessment
processes took place. The S-A means presented in Variation 1 are
therefore those based on their first self-assessment, before
involvement in peer-assessment.
3. ‘Variation 2’ presents the same means as in ‘Actual’ (i.e., self-
assessment conducted after peer-assessment), but with the means
derived only from those students who performed self-assessment
twice, i.e., only those students who took part in Variation 1.
Table 8-3: AARP mean scores for Writing 2, for Groups C and D,
with ANOVA results for self-assessment variations.
AY2 - Group C - Actual AY2 - Group D - Actual

Mean SD SEM N Mean SD SEM N
S-A 8.80 0.67 0.142 22 S-A 8.44 1.05 0.248 18
P-A 7.94 1.12 0.240 22 P-A 7.41 1.26 0.297 18
T-A 7.42 0.81 0.174 22 T-A 7.41 1.03 0.242 18
AY2 - Group C -Variation 1 AY2 - Group D - Variation 1
S-A 8.54 0.70 0.156 20 S-A 8.11 1.19 0.329 13
P-A 7.86 1.15 0.258 20 P-A 7.29 1.31 0.362 13
T-A 7.51 0.79 0.177 20 T-A 7.29 1.08 0.300 13
AY2 - Group C -Variation 2 AY2 - Group D -Variation 2
S-A 8.87 0.62 0.139 20 S-A 8.71 1.08 0.300 13
P-A 7.86 1.15 0.258 20 P-A 7.29 1.31 0.362 13
T-A 7.51 0.79 0.177 20 T-A 7.29 1.08 0.300 13
Note: S-A = Self-Assessment; P-A = Peer-Assessment; T-A = Teacher
Assessment; AY = Academic Year; SD = Standard Deviation; SEM = Standard
Error of Measurement; N = Number of participants.
Some interesting points emerged from these data. Firstly, from the
teacher-researcher’s point of view, it is gratifying to see that there seems
to be consistency in her assessment of Groups C and D, in that the T-A
means were calculated at 7.42 for Group C and 7.41 for Group D. Both
with regard to peer-assessment and self-assessment, Group C tends to be a
little more generous in its self-assessment when compared with Group D.
What is very interesting is the exact match between peer-assessment and
teacher-assessment in Group D. It is strange that this same alignment does
not appear in Group D self-assessment, with the discrepancy in S-A means
with T-A means rising from 0.82 for the first self-assessment to 1.42 for
the second self-assessment. There is also an increase in discrepancy from
1.03 to 1.36 in Group C, when comparing the means for S-A and T-A,
between the first self-assessment and the second. This increase in
discrepancy is alarming, when one would actually expect closer alignment
between S-A and T-A after the experience of peer-assessment.
What is most significant is the fact that the highest means in all cases,
whether Actual, Variation 1 or Variation 2 for both Groups C and D were
produced by S-A, with higher S-A values in each case awarded by Group
C, as compared with Group D. Most interesting also is the fact that the
second self-assessment exceeds the first self-assessment, with S-A means
rising from 8.54 to 8.87 in Group C and even more steeply, from 8.11 to
8.71 in Group D. This inflated self-assessment differs from findings in the
Far East where modesty prevails and self-assessment means tend to be
lower than T-A and P-A, both for writing (Matsuno, 2007, 2009) and
speaking (Chen, 2006).
With regard to Standard Deviation (SD), the highest level of SD was
displayed by P-A in all cases, indicating that Peers were awarding a wider
range of grades than both S-A and T-A. One-way ANOVA revealed
significant p values for both Groups C and D in the Actual and in
Variation 2, but in the case of Variation 1, while Group C still produced a
p value of .002, which is considered significant, in the case of Group D,
the p value was 0.147, which is considered non-significant. Tukey-Kramer
Multiple Comparison Tests revealed very similar results in Groups C and
D (Actual) and Groups C and D (Variation 2), with the pattern A>B=C,
indicating alignment between peer-assessment and teacher assessment.
In order to understand better the differences in assessment behaviour
between Variation 1 and Variation 2, a paired sample t-test of the two self-
assessments of the two groups was conducted and the results are shown in
Table 8-4. Both the ANOVA and the t-test revealed significant increases
in S-A values for both Groups C and D, with the paired t-test revealing an
increase of 0.33 for Group C and an increase of 0.60 for Group D between
the first self-assessment conducted and the second. These increases occur
after peer-assessment processes, which we would have expected to have
been a form of training in using the criteria and to have had more of a
170 Chapter Eight
regulatory effect on subsequent self-assessment processes. On the

contrary, peer-assessment appears to have had a negative effect on self-
assessment, since S-A means have increased the second time around.
Table 8-4: Paired t-test results for Writing 2 self-assessment

Variations (AY2).
N Mean Difference SD SEM

AY2 Group C 20 0.33 0.54 0.121
95% CI for mean difference: (0.077; 0.583)
t-value = 2.73 p = .013
N Mean Difference SD SEM
AY2 Group D 13 0.60 0.63 0.17
95% CI for mean difference: (0.221; 0.979)
t-value = 3.45 p = .005
Note: CI = Confidence Interval; SD = Standard Deviation; SEM = Standard Error
of Measurement; N = Number of participants.
The results overall could be seen as an indication that in Group D

(though the sample is small), and possibly also in Group C, there was
sufficient assessment “maturity” (Ritter, 1998, p. 79), before peer-
assessment processes, for these students to self-assess themselves
reasonably accurately, indicating that practice through peer-assessment
with these groups perhaps led to less rather than more self-assessment
accuracy. It is otherwise hard to explain why students had the tendency to
overrate in self-assessment processes, in Groups C and D, to the extent
that they did, after an experience of peer-assessment which had proved to
be rather successful.
Qualitative Data
Qualitative data was gathered from participants at the end of each
AARP semester by means of a post-questionnaire. In the case of Groups C
and D, completed questionnaires were received only from 13 participants
from Group C and 16 students from Group D. T-analysis of their
responses to 10 of the questions showed lack of agreement between the
groups concerning the matter of objectivity in peer-assessment, concerning
how easy self-assessment was and how easy it was to be objective in self-
assessment. Table 8-5 shows some responses by students to these
questions, which seem to be representative of opinions in these groups.
Table 8-5: Responses to questions about peer- and self-assessment.
Question 4 – How easy did you find it to be objective in peer-

assessment?
C9 – In the beginning it was difficult to assess ourselves and our
classmates, and if we didn’t give a good grade, we felt bad.
D8 – I had never found myself before in similar circumstances and I
confess that it caused me anxiety that I would perhaps assess someone
more strictly than I should.
Question 5 – How easy was it to assess yourself (self-assessment)?
C4 – I learned to be more objective with myself, although it was hard to
grade myself.
D3 – The problem was that I couldn’t really assess myself. I found
everything perfect!
Question 6 – How easy did you find it to be objective in self-
assessment?
C11 – You try to evaluate yourself and the others objectively and you
learn from this procedure.
D4 – I had never assessed with such criteria before and thus the whole
concept was a little complicated.
Many researchers (including Benson, 2001; Everhard, 2015a; Nunan,

1997; Sinclair, 1997, 2000, 2011) suggest that autonomy is a matter of
degree. Some stress the essential link between assessment, particularly
self-assessment and autonomy (Dam, 1995; Harris, 1997; Little, 2009).
Others suggest that autonomy cannot exist without the ability to self-
assess (Hunt, Gow, & Barnes, 1989). This view seems to be substantiated
by Harris and Bell (1990), who see both autonomy and assessment as a
matter of degree, asserting, as does Dickinson (1987), that the greater the
degree of self involvement in assessment, the greater the degree of
autonomy that can be enjoyed. Everhard (2014), in her model of autonomy
for the AARP, visualises autonomy as being on a continuum. When
decisions about learning are fully teacher-controlled, Everhard suggests
that there is a state of heteronomy. The more that learners are involved in
decision-making and assessment processes, the greater the degree of
autonomy that is promoted. A modified extract from the model is shown in
Table 8-6.
The details in Table 8-6 are derived from Everhard (2014), Everhard
(2015a) and Harris and Bell (1990) and show how degrees of autonomy
are linked to degrees of assessment. Summative forms of assessment tend
to promote intellectual heteronomy, while learner-centred and more
172 Chapter Eight
formative types of assessment, involve the learners in metacognitive

thinking and decision-making, leading to transformative, liberatory and
sustainable learning, with learners taking greater responsibility and
initiative.
Table 8-6: Modified extract from the AARP model showing the
relationship between assessment and degrees of autonomy (Based on
Everhard 2014, 2015a; Harris & Bell, 1990).
Intellectual Academic Academic Intellectual

heteronomy autonomy autonomy autonomy
– no – low degree of – medium – high degree
autonomy. autonomy. degree of of autonomy.
autonomy.
Language Language Collaborative On-going
instructor instructor defines assessment realistic self-
evaluation. assessment involving peer, assessment of
Criteria for mechanisms, but self and learning
assessment learners may instructor. achievements
may be provide some and success.
hidden. justifications for
answers and
solutions.
Traditional Collaborative Peer assessment Self-
educational assessment – – formative assessment –
assessment – formative assessment. sustainable
summative assessment. assessment.
assessment.
Conclusion
While Tassinari’s research project in Berlin (this volume; Tassinari,
2015) focuses on learning competencies, this project in Thessaloniki,
focuses on language competencies; however, there are some similarities
between the two studies. Before proceeding to peer- and self-assessment
tasks, participants in the AARP were encouraged to reflect on their
speaking and writing skills by means of the Learner Contract and small-
group counselling meetings with the teacher. As with Tassinari’s dynamic
model, where learners reach awareness of themselves as autonomous
learners, in the case of the AARP they become aware of their strengths
and weaknesses in the productive skills and in their approach to language
learning in general. Together with the teacher/counsellor, learners in

Thessaloniki were asked to consider some aims and goals which they
could have and to reflect on the means by which they could achieve these
goals. This is very similar to the procedures in the Berlin study and, in
both cases, assessment of these abilities promoted metacognition.
AARP participants are encouraged to take action along with other
group members who have similar weaknesses and similar targets. Goals
and possible ways of achieving them are recorded on the contract and it is
signed and dated, to make it binding, by the teacher and the learner. It,
then, remains to the learners to decide how to proceed during their own
time. The means by which this agreement is reached in both the Berlin and
Thessaloniki projects is through pedagogical dialogue, aimed at
encouraging learners to reflect and take on a more agentic role, as opposed
to the more passive role they may have previously been used to during
their language learning careers.
The combination of the pedagogical approach and the assessment
approach which involved learners in decision-making processes, raised
awareness of themselves as learners and helped them, through developing
critical thinking skills, to recognise both their strengths and their
weaknesses in their speaking and writing, helped them move beyond the
state of ‘learned helplessness’ (Dörnyei, 1994, p. 276) and heteronomy
which may have been promoted in their previous language learning
experiences.
The aims of the triangulated assessment implemented and of the AARP
overall, were not to promote agreement between learner-awarded and
teacher-awarded marks, because as Rühlemann (2002, p. 4) points out,
such expectations would have been “futile”. Orsmond et al. (1996; 1997)
and Sadler (1989) insist that it is worth taking the risk with marks in order
to develop what are necessary skills. Black, Harrison, Lee, Marshall, and
Wiliam (2003) agree that it is only through practice in peer- and self-
assessment that critical thinking can be developed. Janssen-van Dieten
(1989) points out that poor results in self-assessment show the need for it
rather than its abandonment and feels it should be pursued.
Although sharing assessment power with learners is not without risks,
teachers and advisers are beginning to recognise its usefulness as a
“teaching tool”, more than a “grading procedure” (Orsmond et al., 1996, p.
245). There is the realization that “validity of judgments” has to take
precedence over “reliability of grading” (Sadler, 1989, p. 122), and that
avoidance of learner involvement in assessment of the kinds described
here denies learners opportunities to learn from assessment, develop
greater autonomy and sustain this learning throughout their lives.
174 Chapter Eight
References
learning (1st ed.). Harlow, UK: Pearson Education.
Black, P., & Jones, J. (2006). Formative assessment and the learning and
teaching of MFL: Sharing the language learning road map with the
learners. Language Learning Journal, 34(1), 4-9.
Black, P., Harrison, C., Lee, C., Marshall, B., & Wiliam, D. (2003).
Assessment for learning: Putting it into practice. Maidenhead, Berks.
& New York, NY: Open University Press.
Boud, D. (1995). Enhancing learning through self-assessment. London:
RoutledgeFalmer.
Bowen, J. (1988). Student self-assessment. In S. Brown (Ed.), Assessment:
A changing practice (pp. 47-70). Edinburgh, UK: Scottish Academic
Press.
Chen, Y.-M. (2006). Peer and self-assessment for English oral
performance: A study of reliability and learning benefits. English
Teaching & Learning, 30(4), 1-22.
Dam, L. (1995). Learner autonomy, 3: From theory to practice. Dublin:
Authentik.
Dickinson, L. (1987). Self-instruction in language learning. Cambridge,
UK: Cambridge University Press.
—. (1992). Learner autonomy, 2: Learner training for language learning.
Dublin: Authentik.
Dörnyei, Z. (1994). Motivation and motivating in the foreign language
classroom. The Modern Language Journal, 78(3), 273-284.
Everhard, C. J. (2012). Re-placing the jewel in the crown of autonomy: A
revisiting of the ‫ލ‬self’ or ‫ލ‬selves’ in self-access. Studies in Self-Access
Learning Journal, 3(4), 377-391. Retrieved from:
http://sisaljournal.files.wordpress.com/2009/12/everhard1.pdf
—. (2014). Exploring a model of autonomy to live, learn and teach by. In
A. Burkert, L. Dam & C. Ludwig (Eds.), The answer is learner
autonomy: Issues in language teaching and learning (pp. 29-43).
Faversham: IATEFL.
—. (2015a). The assessment-autonomy relationship. In C. J. Everhard &
L. Murphy (Eds.), Assessment and autonomy in language learning (pp.
8-34). Basingstoke: Palgrave Macmillan.
—. (2015b). Investigating peer- and self-assessment of oral skills as
stepping-stones to autonomy in EFL higher education. In C. J.
learning (pp. 114-142). Basingstoke: Palgrave Macmillan.
Harris, D., & Bell, C. (1990). Evaluating and assessing for learning (2nd
ed.). London & New York: Kogan Page & Nichols Publishing.
Harris, M. (1997). Self-assessment of language learning in formal settings.
English Language Teaching Journal, 51(1), 12 -20.
Heron, J. (1981). Assessment revisited. In D. Boud (Ed.), Developing
student autonomy in learning (pp. 55-68). London: Kogan Page.
Hunt, J., Gow, L., & Barnes, P. (1989). Learner self-evaluation and
assessment – a tool to autonomy in the language learning classroom. In
V. Bickley (Ed.), Language teaching and learning styles within and
across cultures (pp. 207-217). Hong Kong: Institute of Language in
Education, Education Department.
Janssen-van Dieten, A.-M. (1989), The development of a test of Dutch as
a second language: The validity of self-assessment by inexperienced
subjects. Language Testing, 6(1), 30-46.
Little, D. (2009). The European language portfolio: Where pedagogy and
assessment meet. Strasbourg: Council of Europe.
Matsuno, S. (2007). Self-, peer-, and teacher-assessment in Japanese
university EFL writing classrooms. Doctoral thesis, Temple
University, Tokyo, Japan.
—. (2009). Self-, peer-, and teacher-assessments in Japanese university
EFL writing classrooms. Language Testing, 26(1), 75-100.
Murphy, L. (2015). Autonomy in assessment: Bridging the gap between
rhetoric and reality in a distance language learning context. In C. J.
learning (pp. 143-166). Basingstoke: Palgrave Macmillan.
Nguyen, T. T. H., & Walker, M. (2016). Sustainable assessment for
lifelong learning. Assessment & Evaluation in Higher Education,
41(1), 97-111. doi: 10.1080/02602938.2014.985632
Nunan, D. (1997). Designing and adapting materials to encourage learner
autonomy. In P. Benson & P. Voller (Eds.), Autonomy and
independence in language learning (pp. 192-203). Harlow, Essex:
Longman.
Orsmond, P., Merry, S., & Reiling, K. (1996). The importance of marking
criteria in the use of peer assessment. Assessment and Evaluation in
Higher Education, 21(3), 239-250.
Orsmond, P., Merry, S., & Reiling, K. (1997). A study in self-assessment:
Tutor and students’ perceptions of performance criteria. Assessment &
Evaluation in Higher Education, 22(4), 357-368.
Orsmond, P., Merry, S., & Reiling, K. (2000). The use of student derived
marking criteria in peer and self-assessment. Assessment & Evaluation
in Higher Education, 25(1), 23-38.
176 Chapter Eight
Pope, N. K. L. (2005). The impact of stress in self- and peer assessment.

Assessment & Evaluation in Higher Education, 30(1), 51-63.
Raappana, S. (1997). Metacognitive skills, planning and self-assessment
as a means towards self-directed learning. In H. Holec & I. Huttunen
(Eds.), Learner autonomy in modern languages: Research and
development (pp. 123-137). Strasbourg: Council of Europe.
Ritter, L. (1998). Peer assessment: Lessons and pitfalls. In S. Brown (Ed.),
Peer assessment in practice: SEDA Paper 102 (pp. 79-85).
Birmingham, UK: SEDA.
Rühlemann, C. (2002). Sharing the power: Action research into learner
and teacher co-evaluation. Humanising Language Teaching, 4(1), 1-
11. Retrieved from: http://www.hltmag.co.uk/jan02/mart.htm
Sadler, D. R. (1989). Formative assessment and the design of instructional
systems. Instructional Science, 18(2), 119-144.
Sinclair, B. (1997). Learner autonomy: The cross-cultural question.
IATEFL Newsletter, 139, 12-13.
—. (2000). Learner autonomy: The next phase? In B. Sinclair, I. McGrath
& T. Lamb (Eds.), Learner autonomy, teacher autonomy: Future
directions (pp. 4-14). Harlow: Pearson Education.
—. (2011). Learner training. In C. J. Everhard & J. Mynard, with R. Smith
(Eds.), Autonomy in language learning: Opening a can of worms (pp.
91-98). Canterbury: IATEFL.
Tassinari, M. G. (2015). Assessing learner autonomy: A dynamic model.
In C. J. Everhard & L. Murphy (Eds.), Assessment and autonomy in
language learning (pp. 64-88). Basingstoke: Palgrave Macmillan.
CHAPTER NINE
DEVELOPING A TOOL FOR ASSESSING

ENGLISH LANGUAGE TEACHER READINESS
IN THE MIDDLE EAST CONTEXT
SADIQ MIDRAJ, JESSICA MIDRAJ,

CHRISTINA GITSAKI, AND CHRISTINE COOMBE
Abstract
The effectiveness of teachers significantly correlates with students’
educational achievement. Ensuring teachers are well prepared for the
classroom is of primary importance in the educational system of any
country. In the field of English as an additional language (EAL), there are a
number of tools that have been developed to gauge teacher readiness for
taking on the profession. However, the existing tools and resources have a
North-American orientation and they are not a good fit for other contexts
that may require a different cultural sensitivity on the part of the teacher.
This is particularly acute in situations where teachers trained in the West are
employed in non-western contexts as is the case in the United Arab Emirates
(UAE). The purpose of the project described in this chapter was the creation
of a contextually relevant resource for independent learning and self-
assessment to strengthen EAL teachers’ content knowledge, pedagogical
knowledge, and professional dispositions. This chapter describes the context
of the study and the process of developing the resource by first compiling
EAL standards and indicators that are culturally responsive and cater to the
needs of EAL teachers in the UAE and the greater Gulf Region, and then by
producing over 300 assessment items designed to measure specific EAL
indicators. Valuable insights from the project leaders are shared in an effort
to facilitate the process of developing similar resources for other EAL
contexts. It is envisioned that the resource will support teacher learning and
178 Chapter Nine
measure their progress in applying strategies, methods, and theories in EAL

teaching/learning-related situations.
Introduction
Recent studies at the classroom level have found that teacher
effectiveness is a strong determinant of differences in student learning
(Cavalluzzo, Barrow, Henderson et al., 2015; Cowan & Goldhaber, 2015;
Hawk, Coble, & Swanson, 1985; Tchoshanov, 2011). Empirical evidence
has shown that students who have high quality teachers make significant
and lasting learning gains. These findings make the identification and
evaluation of teacher effectiveness a major priority in today’s classrooms.
Despite the importance of highly effective teachers, there is no one scale
or agreed-upon list of criteria to evaluate what makes an effective teacher
in general or an effective second language teacher in particular. This
dearth of research carries over into the Middle East context where studies
about teacher effectiveness are lacking.
An important research strand in teacher effectiveness and one that is
central to the conceptualization and development of the teacher resource
described in this chapter is the knowledge base on what is known as self-
efficacy. Self-efficacy, a belief in one’s capability to execute the actions
necessary to achieve a certain level of performance, is an important
influence on behavior and affect, relating to individuals’ goal setting,
effort expenditure and levels of persistence (Deemer & Minke, 1999;
Bandura, 1993). When applied to teachers, the self-efficacy (or teacher
efficacy) construct has been associated with teachers’ instructional
practices and attitudes toward students (Ashton, Webb, & Doda, 1983;
Bender, Vail, & Scott, 1995; Midgley, Anderman, & Hicks, 1995).
In addition, teacher efficacy has been defined as both context and
subject-matter specific. In terms of context, the effects of efficacy have
been studied with pre-service, novice and in-service teachers at various
levels of education (i.e., elementary, middle and secondary school) and in
various contexts (i.e., urban, suburban and rural). However, no empirical
findings to date have been presented on the efficacy of teachers of English
as an additional language (EAL) in the Gulf or Middle East context. It was
this lack of empirical evidence and adequate resources for measuring
language teacher effectiveness that served as the motivation for the
creation of this context-specific localized teacher resource.
Finally, teacher efficacy has also been associated with subject-matter
knowledge and the belief that teachers must have both declarative and
procedural knowledge to successfully navigate today’s classroom
Developing a Tool for Assessing English Language Teacher Readiness 179
(Pasternak & Bailey, 2004). Declarative knowledge refers to knowledge

about something, for example, knowledge of grammatical rules.
Procedural knowledge refers to “the knowledge of how”, “the ability to do
things” (Pasternak & Bailey, 2004, p. 157), for example, how to use the
grammatical rules. The resource described in this chapter aimed to
increase the effectiveness of EAL teachers in the Gulf Region by
addressing the content knowledge, the pedagogical knowledge and the
professional dispositions they need in order to improve student learning
outcomes.
Teacher Effectiveness in Relation to Student Learning

Research suggests that teacher effectiveness affects student learning.
Previous studies on teacher effectiveness tend to focus on teacher
certification, student achievement as measured by standardized test scores,
pedagogical practice as measured by classroom observation, and the
relationship of teacher content-knowledge to student achievement. Many
of the measures of teacher effectiveness have focused on a combination of
the aforementioned criteria (Cavalluzzo, Barrow, Henderson et al., 2015;
Cowan & Goldhaber, 2015; Kane & Staiger, 2008; Rockoff, 2004).
One important finding from the available research shows that teachers
who are professionally certified are more effective than non-certified
teachers. Moreover, students who are taught by professionally certified
teachers are more likely to achieve better learning outcomes (Cavalluzzo
et al., 2015; Cowan & Goldhaber, 2015). In one recent study, Cavalluzzo
et al. (2015) analyzed scores of thousands of secondary school students
between 2000 and 2012 in Chicago and Kentucky and correlated their
performance on an American College Testing suite of assessments to their
teachers’ assessment scores. The results showed that students taught by
board-certified teachers did better than students who were taught by non-
board-certified teachers (Cavalluzzo et al., 2015). Similarly, Cowan and
Goldhaber (2015) investigated the effectiveness of board-certified teachers
on elementary and middle school students’ scores in Washington State in
the United States. The results showed that the National Board for
Professional Teaching Standards (NBPTS) certified teachers were about
“0.01-0.05 student standard deviations more effective” than non-NBPTS
certified teachers with similar levels of experience (Cowan & Goldhaber,
2015, p. 2). In addition, the teachers’ primary NBPTS assessment results
predicted student achievement. Furthermore, research by Hawk, Coble, and
Swanson (1985) revealed a strong positive effect of teacher certification on
student achievement. The researchers used controlled matched comparison
180 Chapter Nine
groups. They then matched teachers by level of experience, school, and used
pre- and post-tests of student achievement at the beginning and end of the
academic year in the specific curriculum taught. The study results showed
that students of certified teachers in mathematics performed significantly
better than those who were taught by uncertified teachers in mathematics in
both general mathematics and algebra.
With regards to early career teachers, Tchoshanov (2011) maintains that
there are significant correlations between new teachers’ content knowledge,
knowledge of student learning variables, and the quality of their lessons.
Content knowledge isolated from other teacher knowledge, such as
pedagogical content knowledge, curriculum knowledge, epistemological
knowledge, and knowledge of the learners may not give a complete picture
of the relationship between teacher, content knowledge, and student
achievement.
The studies reviewed here show that students of teachers with
professional knowledge perform better than those taught by teachers without
professional pedagogical and content knowledge. Ensuring that English
language teachers, in the Gulf Region and beyond, have the required
professional knowledge is crucial and of the utmost importance to help
increase student achievement.
Independent Learning and Self-Assessment

Given that the instrument resulting from the present study is intended
for self-assessment and independent learning, it is important to explore the
benefits of such an approach. Self-assessment is defined as assessment that
involves learners in making judgments about their performance, learning,
and/or attitudes as they reflect on and judge how their work relates to the
established standards so that they can determine the subsequent steps in
their learning process (Midraj, in press).
With reference to the social cognitive theory of self-regulation
(Bandura, 1991), independent learning necessitates learners’ self-
regulation of their learning as they take responsibility for their own
learning. Underpinning this is motivation towards achieving their own
professional goals and aspirations, developing metacognitive strategies to
achieve their goals and monitoring their progress towards achieving them
through self-evaluation (Bandura, 1991, 1983; Eggen & Kauchak, 2007).
Research studies show that self-assessment can enhance the use of
strategic planning and self-monitoring of one’s learning (Punhagui & De
Souza, 2013). Tamjid and Birjandi (2011) conducted a quasi-experimental
study using self- and peer-assessment activities. The results showed that
incorporating such activities in a course of study can promote learner

independence and enhance learner autonomy. These findings also indicate
that self- and peer-assessments ultimately improve learners’ metacognition
skills and strategies and ultimately lead to improvements in their
achievement (Blue, 1994; Tamjid & Birjandi, 2011).
In well-structured independent learning and self-assessment, the
expectations are high and transparent, while success and satisfaction in
meeting these expectations and achieving the goals develop high levels of
positive academic and general self-esteem, and self-efficacy. Experiencing
this learning process enables teachers to develop the skills necessary for
their own ongoing professional development as well as providing
opportunities for, and supporting, self-regulation in their students as
lifelong learners.
The UAE Context

English language teaching and learning plays a central role in
education in the United Arab Emirates (UAE). The UAE is a multilingual
and multicultural country where 85% of the population comprises expats
from other Arab, Western, and Asian countries (Central Intelligence
Agency, n.d.). With over 150 different languages in use across the country
(International Business Publications, 2013), English has gained a
significant place as the language of communication among expats, while
Arabic is the official language of the country and the language of
instruction in public schools, where English is taught as a foreign
language. Given the increasingly important role of English in the lives of
Emiratis, recently a number of bilingual (English/Arabic) education
curricula have been implemented in public schools (e.g., New Model
School programme, Madares Al Ghad (Schools of the Future)
programme) in an effort to raise Emirati students’ English language
proficiency in preparation for attending undergraduate programmes in
higher education institutions where the language of instruction is English
(for a review see Layman, 2011).
Efforts in implementing bilingual education programmes have also
been followed by a complete overhaul of the national English language
curriculum, which is the guiding document for English language teaching
in public schools. The new curriculum aims to graduate students at a high
English language proficiency level, equivalent to Band 7 to 8 on the
International English Language Testing System (IELTS) (Ministry of
Education, 2014). Recognizing the key role that teacher effectiveness
plays in the success of these education reforms, Marwan Al-Sawaleh,
182 Chapter Nine
Assistant Undersecretary of Support Services at the Ministry of Education

(MOE), UAE, in an interview with Haneen Dajani and Roberta
Pennington on June 3, 2014, maintained that the MOE is also in the
process of implementing a licensure system for about 60,000 teachers in
the UAE public schools.
English language teachers in public schools are predominantly expats
who are educated and trained in other Arab countries (e.g., Egypt, Syria,
Jordan) or Western countries (e.g., Britain, Australia, USA). Their training
and background can differ significantly as well as their subject knowledge
and professional dispositions (Gitsaki & Bourini, 2012). Given this
diversity, being evaluated for licensing purposes presents a number of
challenges for EAL teachers. How can they prepare for the licensure test,
which will be specific to the UAE context? How can they make sure their
knowledge and skills are sufficient for the UAE context? Using one of the
well-established tools for measuring EAL teacher professional knowledge
and effectiveness, such as the TESOL Standards for P-12 Teacher
Education Programs (Teaching English to Speakers of Other Languages,
2010) or the Praxis Series tests (Educational Testing Services, 2011), is
not going to be particularly effective as such tools were designed for EAL
teachers in North-America and while for the most part they apply to
international contexts, there are elements- in both the indicators of
knowledge and professionalism and the test items themselves- that would
not contribute positively to in-service or pre-service teacher preparation
for the licensure evaluation.
In the midst of this change and evolution in the English language
education programme in the UAE and the challenges it creates for in-
service and pre-service teachers, the current study was conceived as a
remedy to teacher preparation for the EAL teachers’ licensure. It aimed at
providing EAL teachers in the UAE and the greater Gulf Region with a
contextually relevant resource for independent learning and self-assessment
in an effort to strengthen content knowledge, pedagogical knowledge, and
professional dispositions.
The Study
The study reported in this chapter is part of a larger project that aimed
to design and implement a self-evaluation instrument for assessing EAL
teacher readiness in the context of the UAE and the greater Gulf Region. It
was envisioned that the creation of the instrument would involve five
stages (see Figure 9-1). In the first stage (Stage 1), an initial review of the
existing tools for assessing EAL teacher readiness would be performed by
Develooping a Tool forr Assessing Eng
glish Language Teacher Readin
ness 183
the researchh team. In Stagge 2, EAL inddicators wouldd be compiled based on

the tools revviewed and allso new indicaators would bbe described to o address
context speccific localizedd needs. In Stage 3, assesssment items would
w be
created to aaddress each of the perforrmance indic ators. In Stag ge 4, the
validity andd reliability off the items wo
ould be testedd. In the finall stage of
the project ((Stage 5), the resource wou uld be publishhed and made available
for use. This chapter provvides an overrview of the ffirst three stagges of the
project.
Stagge 1
Revview resources on EAL teacheer
standaards.
Stage 2
IIdentify contexxt-specific stand
dards and
inndicators.
Stage 3
Develop item
ms to test speciific indicators.
Stagee 4
Pilot thee resource and evaluate
e its valiidity
and reliab
bility.
Sttage 5
Makke resource avaailable to in-servvice and
prre-service EAL teachers in the UAE.
Figure 9-1: Prroject stages.
Review of
o the Tools for
f EAL Teeachers
One of tthe major staages in the prrocess of creaating our conttextually-
relevant andd localized teaacher developpment resourcee was examin ning what
other resourrces (i.e., stanndards docum
ments, tests, ettc.) were available for
teacher usee. In an initiial review off the literatuure, many so ources of
standards foor EAL teacheer performancce were foundd to exist. Wee decided
to base the ddevelopment of o our standard
ds and indicattors on two off the most
widely usedd in the public domain: the t TESOL P Professional Teaching
184 Chapter Nine
Standards (2010) and ETS’s Praxis exam specifications (2011). A brief

description of each of these resources follows.
The TESOL Standards
The TESOL Standards for P-12 Teacher Education Programs

(TESOL, 2010) address the professional expertise needed by EAL
educators to work with language minority students. The Council for the
Accreditation of Educator Preparation (CAEP) uses these performance-
based standards for national recognition of teacher education programmes.
Also known as the TESOL Professional Teaching Standards, these
standards can be used to assess programmes that prepare and license K-12
EAL educators, as well as other teacher educator programmes. In this
widely used standards document, there are five domains of knowledge that
are deemed essential for EAL teachers:
Domain 1: Language
Domain 2: Culture
Domain 3: Planning, Implementing and Managing Instruction
Domain 4: Assessment
Domain 5: Professionalism
More information about the TESOL Standards can be found at:

https://www.tesol.org/advance-the-field/standards/tesol-caep-standards-
for-p-12-teacher-education-programs
The review of the TESOL Standards revealed that they are not
culturally-relevant for all student groups as some indicators use
terminology specific to North American contexts and the laws and
regulations that apply in those contexts. Table 9-1 provides an example of
a rubric for assessing Standard 5.a. ESL Research and History, which
would be unsuitable for the Gulf context where different laws and
regulations to North America are in play:
Table 9-1: Example performance indicator and achievement levels

(Source: TESOL 2010, p. 69).
Suggested 5.a.2. Demonstrate knowledge of the evolution of laws

Performance and policy in the ESL profession.
Indicator
Candidates are aware of the laws, judicial decisions,
Approaches
policies, and guidelines that have shaped the field of
Standard
ESL.
Candidates use their knowledge of the laws, judicial
Meets decisions, policies, and guidelines that have influenced
Standard the ESL profession to provide appropriate instruction for
students.
Candidates use their knowledge of the laws, judicial
decisions, policies, and guidelines that have influenced
the ESL profession to design appropriate instruction for
Exceeds
students.
Standard
Candidates participate in discussions with colleagues
and the public concerning federal, state, and local
guidelines, laws, and policies that affect ELLs.
Such wording can be confusing to teachers who work in countries

outside of North America and can render the usefulness of such resources
questionable.
Praxis
The Praxis Series tests developed by Educational Testing Services

(ETS) measure teacher candidates’ knowledge and skills and are used for
licensing and certification processes. The suite of exams include:
x Praxis Core Academic Skills for Educators (Core): These tests

measure academic skills in reading, writing and mathematics. They
were designed to provide comprehensive assessments that measure
the skills and content knowledge of candidates entering teacher
preparation programmes.
x Praxis Subject Assessments (formerly the Praxis II tests): These
tests measure subject-specific content knowledge, as well as
general and subject-specific teaching skills, that teachers need for
beginning teaching.
186 Chapter Nine
The Praxis Core Academic Skills for Educators tests and the Praxis
Subject Assessments feature selected-response and essay questions that
measure the content and pedagogical knowledge necessary for a beginning
teacher (ETS, 2015). For more information about the Praxis test see:
https://www.ets.org/praxis/
Upon reviewing the Praxis standards, it became apparent that they too
had a North-American orientation as seen in the following example of an
indicator and associated questions for Cultural Understanding in Module 2
(Cultural and Professional Aspects of the Job) of Praxis:
4CA.10 Knows how to explain United States cultural norms to English-

language learners
1. What are some common U.S. cultural norms, values, and patterns of
behavior that should be made explicit to ELLs?
2. How can teachers help students build intercultural competencies?
(ETS, 2011, p. 25)
Here is an example of a test item outlining the process of referring a

student to special education which would be considered unsuitable for the
Gulf context:
Question 63. A middle school ESOL student who has been in the United
States for two years is being discussed in a team meeting. It is noted that
the student is still at the beginning ESOL level, has difficulty focusing on
assignments, has poor recall, and displays several inappropriate behaviors.
The teachers have checked the student’s educational history, which
indicates that the same problems were seen the year before. Which of the
following would be an appropriate next step?
(A) Wait at least six more months because the student has not been in the
United States long enough to be evaluated for special education services.
(B) Send a letter home to the student’s parents urging them to help stop the
inappropriate behaviors from occurring.
(C) Develop a pre-referral intervention plan to improve the student’s
classroom and study skills.
(D) Refer the student to the special education team and ask for testing and
a physiological evaluation.
Explanation:
Your awareness of appropriate channels for evaluating the special
education needs of ESOL students is tested here. Before ESOL students
are referred for special education evaluations, pre-referral interventions
should be attempted. Based on the response to the intervention, the student
might subsequently be referred for special education. The correct answer,

therefore, is (C). (ETS, 2011, p. 69)
There is no one standardized instrument for teacher competency that is

used across the globe or even in one country. Customized versions of the
Praxis II are tailored to meet the context and politics of several different
States in the US. In addition to Praxis, there are many other standardized
instruments, such as the New York State Teacher Certification
Examinations, New Mexico Assessment of Teacher Basic Skills, the
English Major Field Test, State Education Exams, in several states to
name a few. The goal of this project was to take into consideration the
standards developed by TESOL and other professional organisations and
create not a standardized test, but a resource guide for independent
learning and self-assessment for EAL teachers working in the Gulf
context.
Procedure
The project was initiated at the College of Education in one of the
federal higher education institutions in the UAE. The College of Education
had a special EAL teacher education programme which had received
recognition from the TESOL International Association (http://www.tesol.
org) as part of the National Council for Accreditation of Teacher
Education/Council for the Accreditation of Educator Preparation
(NCATE/CAEP) accreditation process. The lead faculty of the EAL
programme headed the project team. Fifteen academics from notable
higher education institutions in the region were selected to participate in
the project. All team members were working in countries within the Gulf
region, namely, the UAE, Oman, and Qatar.
The 15 team members were selected based on a number of factors,
including academic background and qualifications, research publications
in the field, and work experience in EAL teacher training. The project
coordinators also sought representation from the three local federal
institutions and other regional institutions and entities. As far as academic
credentials are concerned: two team members had master’s degrees and
the rest of the team (13) had terminal degrees in language education or
general education. Seven of the 15 members were Arabic-English
bilingual academics with many years of teaching experience in K-12 and
tertiary institutions. Three members were language assessment supervisors
at their respective institutions. One of the language assessment supervisors
was serving at the time as the President of TESOL International
188 Chapter Nine
Association. Six team members were applied linguists and one of the
applied linguists was serving as the Associate Dean for the largest English
as a foreign language programme in the UAE. One team member was a
psychologist and another one was a special needs expert. The Chair of the
project was the Dean of the College of Education that had undertaken the
project and the principal investigator was a language education academic
with over 25 years of teaching experience, 7 years in K-12 and 18 years in
EAL teacher education.
Once assembled, the team members were divided into five groups and
a leader was appointed for each group. The five groups worked on the
creation of the resource by first reviewing EAL teacher international
standards and associated indicators using a set of shared resources. These
resources comprised core course textbooks and supporting material as well
as teacher eligibility guidelines from different countries, such as the
following:
x The TESOL International standards and indicators;

x European and Australasian standards on teaching English;
x Teachers of English eligibility instruments;
x EAL programme curricula and course syllabi; and
x Curriculum crosswalk that includes the programme standards,
learning outcomes, topics covered and teaching/learning materials.
All resources were assembled and placed in an electronic folder which

was shared with each project member. Given the geographical distance of
the members, working remotely and conducting meetings virtually was a
necessity.
At first, each group reviewed the set of materials and the related
indicators. The resource was divided into five domains or modules and
each group worked on a specific module: Group 1 worked on language
foundations; Group 2 worked on culture and language; Group 3 worked
on instruction and pedagogy; Group 4 worked on language assessment;
and Group 5 worked on professionalism and research.
Upon review of the different standards and indicators in the various
resources, The TESOL Standards for P-12 Teacher Education Programs
(TESOL, 2010) were deemed to be the most useful resource for the project
even though there were examples of content that was not relevant to our
context and therefore not suitable for inclusion in the new assessment
instrument. Upon discussion with the research team, it was decided that
we would adapt TESOL Standards (TESOL, 2010) by rewording
problematic indicators and adding those that we felt were missing. The
indicators for each domain were then contextualized and adapted for the
Gulf Region and the greater Middle East and North Africa (MENA)
Region.
Once agreement among the members of each group had been reached
as to the set of indicators to be used for each domain, the groups
proceeded with the production of contextually relevant multiple-choice
items that covered the standards and indicators for each domain
respectively. Prior to item production, training was held for all item
writers on how to generate objective closed-response multiple-choice
items (MCIs) using internationally accepted guidelines (see Coombe,
Folse, & Hubley, 2007; Rodriguez, 1997; Statman, 1988). Moreover, item
writers were to link their questions to Bloom’s Taxonomy (Anderson &
Krathwohl, 2001; Bloom, Englehart, Furst, Hill, & Krathwohl, 1956).
Writing balanced MCIs for higher-order cognitive skills proved to be more
challenging than writing items for basic or factual knowledge.
In addition to each MCI, the members of each team had to also provide
an explanation as to why a particular answer was the right answer for each
MCI and/or why the distractors were the wrong answer options. This
feature of the assessment instrument was there to ensure that the EAL
teachers, who would use it, would be able to learn from it and not simply
find out which of their answers were wrong and which were right by using
a simple answer key. Once the MCIs and their explanations were written,
each group reviewed the items they created for their assigned domain
before sending them for internal review.
For the internal review process, each group was assigned the MCIs
from another group to review, make recommendations for improvements,
and provide feedback. Once the internal review of the items was
completed, each group revised their own MCIs and then the items were
sent to a group of external reviewers. The external reviewers chosen for
the project comprised professionals in the field of language assessment
who were responsible for compiling and administering large-scale
assessment instruments. Their task was to review the indicators and the
MCIs per indicator using a set of specific criteria (see Table 9-2).
190 Chapter Nine
Table 9-2: Criteria for the external review of the indicators and the
MCIs.
A Criteria for the Indicators

1. The indicators are closely related to the 2010 TESOL
International Standards.
2. The indicators are contextually relevant to the MENA region.
3. The indicators are politically and culturally sensitive to the
MENA region.
4. The indicators are free of grammar, punctuation, and spelling
errors.
B Criteria for the Individual Items
5. The item is not culturally biased (inclusive of all MENA region
cultures).
6. The item generally reads well.
7. The item includes relevant and correct information.
8. The item is accurately aligned to the indicator it purports to
assess.
9. The item is accurately aligned with Bloom’s Taxonomy.
(1 = Recall, 2 = Comprehension, 3 = Application, 4 = Analysis, 5
= Evaluation, 6 = Creation)
10. The maximum number of words in the item does not exceed 225
words (up to 110 words maximum for the stem and 110 for the
answer options).
11. The item is free of grammar, punctuation, and spelling errors.
C Criteria for the Stems
12. There is no usage of the word ‘EXCEPT’ in the stem unless it is
required for item clarity.
13. The stem is worded in the positive form unless a significant
learning outcome requires the negative form.
14. The stem is a question or a statement that clearly identifies a
problem.
D Criteria for the Answer Options
15. There is only one unambiguously correct key.
16. The alternatives are plausible and concise.
17. The alternatives are mutually exclusive.
18. The alternatives are homogenous in length, grammar and content
to avoid giving extraneous clues.
19. The alternatives do not include ‘All’, ‘Never’, ‘none of the
above’ or ‘all of the above.’
20. The alternatives/key do not include different combinations of

options such as A & C, D & B.
21. The alternatives are presented in a logical order such as
alphabetically or at random.
22. The alternatives do not contain words like ‘all’, ‘none’, or
‘always’.
E Other Criteria
23. The item is of appropriate level of difficulty for ESL teachers
with a Bachelor’s degree.
24. The item is in the active voice unless a significant learning
outcome requires the passive form.
25. The explanation for the correct answer includes ‘Therefore, the
correct answer is . . . .’
26. The item explanation does not need to include in-text citations.
However, footnotes and the bibliography that accompanies each
module may include references.
The external reviewers reviewed each and every item from each
group. A total of 450 MCIs were reviewed. The review revealed that
writing MCIs with plausible distracters is a challenging task. About 33%
of the total items were discarded after the review process was complete.
The results of the external reviews were then collated and each of the five
groups had to revise their MCIs accordingly.
Sample Test Items

A total of 300 out of 450 items met the standards and criteria set by the
research team. The following are sample MCIs from the resource that are
contextually and culturally relevant to the EAL teachers in the Gulf
Region.
Item 1
Paul’s family moved to Oman. The family adopted several of the values
and traditions of the Omanis like celebrating Eid, but held on to some
characteristics of their own culture such as traditional American dishes.
This is an example of cultural __________.
(A) segregation
(B) dehumanization
(C) integration
(D) discrimination
192 Chapter Nine
Explanation:
Cultural integration occurs when one cultural group preserves some
distinctive aspects of its own culture, while adopting many of the values,
attitudes, and traditions of the dominant culture.
Therefore, the correct answer is (C).
Item 2
Who are polychronic time-oriented people?
(A) People who do many things simultaneously.

(B) People who do one thing at a time.
(C) People who like to treat other people the same way.
(D) People who like to make exceptions for certain people.
Explanation:
Because Western cultures are mostly monochronic and Middle Eastern
cultures are polychronic, it is important for Arab students to know the
difference. Polychronic time-oriented people are people who adjust their
time to suit their needs and may have to do many things simultaneously.
Therefore, the correct answer is (A).
Item 3
Fatima’s family moved from the United Arab Emirates to the United
Kingdom. While she appreciates many aspects of the new culture, she
prefers to wear Emirati traditional clothing and eat Emirati food, and she
makes a special effort to continue learning about Emirati heritage. Fatima
exhibits a high level of __________.
(A) assimilation
(B) cultural identity
(C) linguistic diversity
(D) bigotry
Explanation:
Cultural identity is part of people’s self-perception, as they prefer to follow
the traditions, nationality, language, ethnicity and social class of their
distinct culture. Assimilation is when people adapt to the prevailing culture
of the majority. Linguistic diversity is having a variety of languages, and
bigotry is an act of prejudice and racism. Therefore, the correct answer is
(B).
Item 4
A student ended an academic note to his teacher with this:
‘Wish peace be with you.’ This is an example of:
(A) Code-switching
(B) L1 interference
(C) Avoidance
(D) Displacement
Explanation:
Code-switching is when a student that speaks English and Arabic uses
words from both languages in the same sentence. Avoidance is a
communication strategy used by a learner when he avoids talking about a
topic because he does not have the necessary language resources to talk
about it. Displacement is a linguistic term indicating the capability of
language to communicate about things that are not immediately present.
The student example in this question is a direct translation from his first
language (Arabic) also known as L1 interference. Therefore, the correct
answer is (B).
Item 5
Children in the UAE attend bilingual schools where all subjects are taught
in both English and Arabic. At the end of school these children will be:
(A) Natural bilinguals

(B) Coordinate bilinguals
(C) Minimal bilinguals
(D) Compound bilinguals
Explanation:
According to the definitions of Bilingualism, a natural bilingual is
someone who has not undergone any specific training in a second
language; a coordinate bilingual is someone whose two languages are
learned in distinctly separate contexts; a minimal bilingual is someone with
only a few words and phrases in a second language; and a compound
bilingual is someone whose two languages are learned at the same time,
often in the same context such as the New School Model bilingual
programme in the UAE. Therefore, the correct answer is (D).
Item 6
Children from an Arab background will typically be diglossic as they use a
dialectal variety of some sort at home, but will be taught in Standard
Arabic and English at school. The pattern of errors in English that these
children make will be more influenced by:
194 Chapter Nine
(A) Dialectal Arabic, the older the child.

(B) Dialectal Arabic, the younger the child.
(C) Dialectal and Standard Arabic, the older the child.
(D) Dialectal and Standard Arabic, the younger the child.
Explanation:
An Arab child becomes a relatively proficient user of Standard Arabic
after 5 to 6 years of formal education, that is, by the age of 12-13. The
dialectal variety, as a real mother tongue, is mastered at a much earlier
age. So the younger the child, the more likely it is that his/her English
errors will be influenced by his/her dialectal Arabic. Conversely, the older
the child the more likely that his/her Standard Arabic is well established
and affects his/her English performance. Therefore, the correct answer is
(B).
Item 7
When a language learner says: “Inshallah, I will see you next week”, it is
an example of:
(A) L1 interference
(B) Fossilization
(C) Code switching
(D) Pidginization
Explanation:
This question tests your understanding of common theoretical terms in the
field of language acquisition. When a student uses a direct translation from
his first language (Arabic) into English, it is called L1 interference.
Fossilization refers to the loss of progress in the acquisition of a second
language despite continuous exposure to the second language.
Pidginization is when a group of people, who do not have a common
language, use a simplified language for communication. When words from
L1 and L2 end up in the same sentence, it is referred to as code-switching.
In this case the student used “Inshallah”, which is an Arabic word, in the
same sentence with English words. Therefore, the correct answer is (C).
Item 8
The magazine TESOL Arabia Perspectives is most likely to contain which
of the following?
(A) Articles about the teaching and learning of English with a focus on the
United States and abroad.
(B) Articles about the teaching and learning of English with a focus on the
Middle East and North Africa.
(C) Articles about the teaching and learning of Arabic with a focus on the
Middle East and North Africa.
(D) Helpful tips for teachers of Arabic and English in the Middle East.
Explanation:
TESOL Arabia Perspectives is the quarterly publication of the TESOL
Arabia association (tesolarabia.org) and discusses the teaching and
learning of English with a focus on the Middle East and North Africa.
Therefore, the correct response is (B).
Item 9
In order to study at federal tertiary institutions in the United Arab
Emirates, prospective Emirati students must take the _____.
(A) Common Educational Proficiency Assessment-English (CEPA-

English)
(B) Scholastic Assessment Test (SAT)
(C) National Admissions and Placement Office (NAPO)
(D) International English Language Testing System (IELTS)
Explanation:
The Common Educational Proficiency Assessment-English (CEPA-
English), administered through the National Admissions and Placement
Office, is a requirement for admission to federal tertiary institutions in the
UAE. Therefore, the correct response is (A).
Item 10
A teacher is designing a new grammar assessment for her English class.
She wants to design an assessment that reflects a situation the student is
likely to encounter in the “real world”. Therefore, the selected theme and
context is visiting Yas Mall. The teacher is considering the _____ of the
assessment.
(A) transparency
(B) practicality
(C) washback
(D) authenticity
Explanation:
Authenticity in assessment involves designing “real-life” tasks in which
students use and apply their knowledge and skills. Yas Mall is a major
shopping center in Abu Dhabi, and people from all over the region travel
to shop there. Therefore, the correct response is (D).
196 Chapter Nine
The TESOL Teacher Readiness Inventory (T-TRI)

The resource that was produced after the process outlined above,
namely, TESOL Teacher Readiness Inventory (T-TRI), was then prepared
for piloting with in-service and pre-service teachers in the UAE. The
piloting of the resource is an important stage in the development of the
instrument. It is important to ensure that the items included in the resource
are valid and reliable. To this end, three types of validity are deemed to be
crucial to the success of the project: content validity, criterion validity and
face validity. In terms of content validity, the project team ensured that the
resource items were based on the agreed upon standards and the associated
indicators. Resource items that were not adequately matched to the
indicators were either amended or discarded. During the piloting of the
resource, pre-service and in-service teachers will review the resource items
for both face validity and contextual relevance. In terms of criterion
validity, during the rollout phase of the resource, the investigators will
measure the relationship between the test-takers’ performance on the items
and their actual status as novice, developing, or proficient achievers.
During the pilot stage of the instrument, item analysis statistics will be
generated on all items and will be used to analyze the effectiveness of
individual items. Performing these statistical analyses will help the
research team and item writers improve items and eliminate ambiguous or
non-discriminating items. These statistics will also provide the item level
of difficulty, item discrimination, frequencies and distribution, and
reliability coefficients.
Using items that pretested well, a final resource (to be published in
hard copy and in an online format or even possibly an APP) will be
created ensuring that the required instrument specifications are honored
such as item difficulty and content coverage. The final resource will allow
users to calculate their level of proficiency on a given scale based on the
number of correct answers in the MCIs per domain. In the online version,
the teacher proficiency profile will be automatically generated with
specific information on which domain(s) require more attention and
references and resources to help the test-taker acquire specific knowledge
in domains where weaknesses have been detected.
The project is on-going and the research team plans to periodically
write revised editions of the T-TRI to include new items that reflect the
flux of changes and improvements in new learning strategies and respond
to the new requirements of the educational authorities in the Gulf Region
and also changes the TESOL International Association makes to the
standards. Currently, the T-TRI includes items that address cultural
elements from different Gulf countries. A future edition of the instrument

should include MCI writers and reviewers from the wider MENA Region
in order to make the resource more responsive to the linguistic and cultural
differences between the MENA countries and the needs of EAL teachers
working in these countries.
Conclusion
The project outlined in this chapter will contribute to the body of
knowledge of EAL teacher education by developing a contextually
relevant independent study and self-assessment resource for EAL pre-
service and in-service teachers in educational institutions in the Gulf
Region. The independent learning and self-assessment resource will
support teacher learning and assessment of critical knowledge and
understanding that go beyond basic factual knowledge. The resource will
provide opportunities for teachers to become not only direct beneficiaries,
but also stakeholders in determining the nature of support required in a
format and time that is most beneficial to them, through self-assessment
and self-regulation of their learning.
Teacher education instructors may use the resource as part of the
formative assessment of the EAL teacher candidates’ learning in the core
curriculum courses. The use of the instrument may generate quantitative
data that teacher education units may use to improve the curricula. The
data may show the strengths and the challenges of teacher education
programmes.
One of the long-term goals for this project is to turn the resource from
a hard-copy format into an APP that teachers can use on mobile devices.
In addition, the resource will be designed to provide differentiated
learning opportunities along with synchronous (real-time) reporting to
test-takers that can help them identify areas of strengths, areas to improve,
and plan strategies to support their learning.
The resource will be designed to reinforce and evaluate teachers’ ability
to solve problems and analyze teaching-learning situations, understand
relationships that contribute to effective teaching and successful learning,
and to predict and interpret test-takers’ progress and achievement.
Therefore, by using the resource, EAL teachers’ professional content
knowledge and pedagogical capabilities may improve, thus, increasing their
effectiveness in the classroom, increasing their employability in schools,
and meeting the competitive labor market requirements as well as the
licensure standards adopted by educational authorities.
198 Chapter Nine
References
Anderson, L. W., & Krathwohl, D. (2001). A taxonomy for learning,
teaching, and assessing: A revision of Bloom's taxonomy of
educational objectives. New York: Longman.
Ashton, P., Webb, R. B., & Doda, N. (1983). A study of teachers’ sense of
self-efficacy (Final Report, National Institute of Education Contract No
400-79-0075). Gainesville, FL.: University of Florida (ERIC document
number ED 231 834).
Bandura, A. (1993). Perceived self-efficacy in cognitive development and
functioning. Educational Psychologist, 28(2), 117-148.
—. (1991). Social cognitive theory of self-regulation. Organizational
Behavior and Human Decision Processes, 50(2), 248-287.
—. (1983). Self-efficacy determinants of anticipated fears and calamities.
Journal of Personality and Social Psychology, 45(2), 464-469.
Bender, W. N., Vail, C. O., & Scott, K. (1995). Teachers’ attitudes toward
increased mainstreaming: Implementing effective instruction for
students with learning disabilities. Journal of Learning Disabilities,
28(2), 87-94.
Bloom, B., Englehart, M. Furst, E., Hill, W., & Krathwohl, D. (1956).
Taxonomy of educational objectives: The classification of educational
goals. Handbook I: Cognitive domain. New York, Toronto: Longmans,
Green.
Blue, G. (1994). Self-assessment of foreign language skills: Does it
work? CLE Working Paper, 3, 18-35.
Cavalluzzo, L., Barrow, L., Henderson, S. et al. (2015). From large urban
to small rural schools: An empirical study of National Board
Certification and teaching effectiveness. CNA Analysis and Solutions.
Retrieved from: http://www.cna.org/sites/default/files/research/IRM-
2015-U-010313.pdf
Central Intelligence Agency. (n.d.). The World Factbook. Middle East:
United Arab Emirates. Retrieved from:
https://www.cia.gov/library/publications/the-world-
factbook/geos/print/country/countrypdf_ae.pdf
Coombe, C., Folse, K., & Hubley, N. (2007). A practical guide to assessing
English language learners. Ann Arbor, MI: University of Michigan
Press.
Cowan, J., & Goldhaber, D. (2015). National Board Certification:
Evidence from Washington State. CEDR Working Paper 2015-3.
Seattle, WA: University of Washington.
Dajani, H., & Pennington, R. (2014, June 4). New licensing system for
teachers in the UAE. The National. Retrieved from:
http://www.thenational.ae/uae/education/new-licensing-system-for-
teachers-in-the-uae
Deemer, S., & Minke, K. (1999). An investigation of the factor structure
of the Teacher Efficacy Scale. The Journal of Educational Research,
93(1), 3-10.
Eggen, P., & Kauchak, D. (2007). Educational psychology: Windows on
classrooms, Pearson: Merrill Prentice Hall.
Educational Testing Services. (2015). Praxis. Retrieved from:
www.ets.org/praxis
—. (2011). The Praxis Series: The Official Study Guide: English to
Speakers of Other Languages. ETS.
Gitsaki, C., & Bourini, A. (2012). Innovative approaches to teaching: A
teacher professional development program for grades 6-9. In H. Emery
& F. Gardiner-Hyland (Eds.), Contextualizing EFL for young learners
(pp. 3-24). Dubai, UAE: TESOL Arabia.
Hawk, P., Coble, C. R., & Swanson, M. (1985). Certification: It does
matter. Journal of Teacher Education, 36(3), 13-15.
International Business Publications. (2013). United Arab Emirates
Country Study Guide. Volume 1, Strategic Information and
Developments. Washington D.C.: IBP.
Kane, T., & Staiger, D. (2008). Estimating teacher impacts on student
achievement: An experimental evaluation. NBER Working Paper No.
14607. Retrieved from: http://www.nber.org/papers/w14607
Layman, H. (2011). A contribution to Cummin’s Thresholds Theory: The
Madaras Al Ghad Program. Master’s dissertation, the British
University in Dubai, Dubai, UAE.
Midgley, C., Anderman, E., & Hicks, L. (1995). Differences between
elementary and middle school teachers and students: A goal theory
approach. Journal of Early Adolescence, 15, 90-113.
Midraj, J. (In Press). Self-Assessment. In J. Liontas & M. DelliCarpini
(Eds.), The TESOL Encyclopedia of English Language Teaching. New
York: Wiley.
Ministry of Education. (2014). English as an International Language
(EIL): National Unified K–12 Learning Standards Framework. Dubai:
Ministry of Education.
Pasternak, M., & Bailey, K. M. (2004). Preparing nonnative and native
English-speaking teachers: Issues of professionalism and proficiency.
In L. D. Kamhi-Stein (Ed.), Learning and teaching from experience:
200 Chapter Nine
Perspectives on nonnative English-speaking professionals (pp. 155-

175). Ann Arbor, MI: University of Michigan Press.
Punhagui, G. C., & De Souza, N. A. (2013). Self-regulation in the learning
process: Actions through self-assessment activities with Brazilian
students. International Education Studies, 6(10), 47-62.
Rockoff, J., (2004). The impact of individual teachers on student
achievement: Evidence from panel data. American Economic Review,
94(2), 247-252.
Rodriguez, M. C. (1997). The art and science of item writing: A meta-
analysis of multiple choice item format effects. Paper presented at the
Annual Meeting of the American Educational Research Association,
April, Chicago, IL.
Statman, S. (1988). Ask a clear question and get a clear answer: An
enquiry into the question/answer and the sentence completion formats
of multiple-choice items. System, 16(3), 367-376.
Tamjid, N. H., & Birjandi, P. (2011). Fostering learner autonomy through
self- and peer-assessment. International Journal of Academic Research,
3(5), 245-251.
Tchoshanov, M. (2011). Relationship between teacher knowledge of
concepts and connections, teaching practice, and student achievement
in middle grades mathematics. Educational Studies in Mathematics,
76(2), 141-164.
Teaching English to Speakers of Other Languages. (2010). Standards for
the Recognition of Initial TESOL Programs in P-12 ESL Teacher
Education. Alexandria, VA: TESOL Publications. Retrieved from:
http://www.tesol.org/docs/books/the-revised-tesol-ncate-standards-for-
the-recognition-of-initial-tesol-programs-in-p-12-esl-teacher-
education-(2010-pdf).pdf?sfvrsn=0
CHAPTER TEN
FOREIGN LANGUAGE TEACHERS’

PROFICIENCY: THE IMPLEMENTATION
OF THE EPPLE EXAMINATION IN BRAZIL
DOUGLAS ALTAMIRO CONSOLO

AND VERA LÚCIA TEIXEIRA DA SILVA
Abstract
The definition of the linguistic aspects and the domain within which to
operate in order to assess foreign language teachers’ proficiency has been
a challenge in language assessment. Foreign language teachers’
proficiency is understood as how and when linguistic knowledge and the
competence for communication lead to effective language use in teaching
contexts. In Brazil, the discussion about the characteristics and the quality
of teachers’ language is justified by the results of several studies that attest
the low proficiency level, mainly in oral skills, achieved by pre-service
language teacher trainees and in-service teachers in various teaching
contexts. Given the importance of investigating teachers’ language, and
the connections between the domain, the testing instruments and the
criteria on which to base a valid assessment of their language proficiency,
researchers in Brazil have been exploring the implementation of the
EPPLE, a language examination for foreign language teachers. This
chapter reports on some results from these investigations, focusing on
lexical frequency and variety, as well as accuracy and grammatical
complexity. Data were collected by means of instruments designed
especially to assess foreign language teachers’ proficiency–the TECOLI (a
test for listening comprehension in Italian), the TEPOLI (a test for oral
proficiency in English) and the EPPLE examination. The data and
discussion presented in this chapter can support revisions of the construct
202 Chapter Ten
for the EPPLE examination, and contribute in the areas of foreign

language teaching, language testing and teacher education.
Introduction
Language proficiency is a requirement for foreign language teachers
who are non-native speakers (NNS) of the language they teach and neither
higher nor lower levels of proficiency in teacher talk have been fully
established by means of comprehensive scales so as to assess, for example,
foreign language teachers in a large country such as Brazil and its variety
of schooling contexts. The definition of both the linguistic aspects and the
domain in which to operate to assess this proficiency has been a challenge
in language assessment.
Teachers’ proficiency is understood as teachers’ linguistic performance
on occasions where linguistic knowledge and communicative competence
lead to effective language use in teaching contexts. Our discussion about
the domain, the linguistic aspects and the quality of teachers’ language is
justified by the results from several studies that attest the low proficiency
level, mainly in oral skills, among students in a number of pre-service and
in-service teacher education courses, as well as in teaching contexts,
especially in ELT in regular schools.
In order to investigate the testing instruments and the criteria on which
to base a valid assessment of foreign language teachers’ proficiency,
researchers from four Brazilian public universities–State University of Sao
Paulo (UNESP), State University of Rio de Janeiro (UERJ), State
University of Maringa (UEM) and University of Brasilia (UnB)–and three
tertiary level institutions, Faculty of Technology (FATEC), Paulista
University (UNIP) and UNISEB/Estacio (a private university centre in the
city of Ribeirao Preto, in the state of Sao Paulo), have been investigating
foreign language teachers’ proficiency through the implementation of the
EPPLE (Exame de Proficiência para Professores de Língua Estrangeira)
a language examination for foreign language (FL) teachers (Consolo,
2008; Consolo, Lanzoni, Alvarenga, Concário, Martins, & Teixeira da
Silva, 2010, 2009). The researchers interested in the development of the
EPPLE examination are also members of the ENAPLE-CCC (Ensino e
Aprendizagem de Línguas: Crenças, construtos e competências), a
research group hosted by UNESP and recognized by the CNPq (Conselho
Nacional de Desenvolvimento Científico e Tecnológico), the national
council for research and technology.
This chapter reports on four studies within the EPPLE project, the
present main interest of the ENAPLE-CCC group, focusing on the testing
Foreign Language Teachers’ Proficiency 203
of lexical frequency and variety, and on the accuracy and grammatical

complexity in spoken language. The results of these investigations can
shed light on the washback effects from examinations such as the EPPLE
on foreign language teacher development. The studies contribute to the
validation process for the EPPLE, to the development of more adequate
and valid test items and, consequently, the establishment of assessment
criteria to produce better rating scales for the examination as well as
greater guidance in rater training.
The research results reported here were compiled from four studies in
FL teaching contexts in Brazil, namely Baffi-Bonvino (2010), Borges-
Almeida (2009a), Silva Neto (2014) and Veloso (2012), and data in each
study were collected by means of instruments designed especially to
assess FL teachers’ proficiency: the TECOLI, a test for listening
comprehension in Italian; the TEPOLI, a test for oral proficiency in
English; and the EPPLE examination.
The Study
Our research contexts comprise Letters courses in public universities in
Brazil, as well as in-service teacher courses, in which the data for the
studies we review were collected. The Letters course (Curso de Letras), is
a four-year, sometimes a five-year university course in Brazil to educate
language teachers, in Portuguese and in other languages. Letters students
can usually opt for certification in one language or in two languages,
which they should be able to teach after graduation. All the data and
information discussed in this chapter derive from studies conducted to
investigate the implementation and validation of the EPPLE examination.
This means that, rather than collecting new data for this discussion, we
draw on empirical data generated by our colleagues, available for public
consultation, and on information available for members of our research
group.
The participants, in the studies by Baffi-Bonvino (2010), by Borges-
Almeida (2009a) and by Silva Neto (2014), are graduating students from
Letters courses (English and Portuguese languages), some of which were
already working as teachers of English as a foreign language (EFL) in
private language schools. In Veloso’s (2012) study, participants were
students graduating from Letters courses (Italian and Portuguese), and
certified teachers of Italian as a FL.
The research instruments were two different tests of FL proficiency,
especially designed for teachers and teachers-to-be, and the oral test of the
EPPLE examination. Each of these instruments is described below.
204 Chapter Ten
The TECOLI
The TECOLI, fully described in Veloso (2012), is an abbreviation for
the Teste de Compreensão Oral em Língua Italiana (Listening Comprehension
Test in Italian), a product of Veloso’s doctoral investigation, based on six
versions of a listening comprehension test designed and applied to (future)
teachers of Italian as a FL, in Brazil and in Italy, and the final and revised
version of the TECOLI is based on data from five contexts of pre-service
teacher education courses (Letters courses) with focus on Italian as a FL in
Brazil.
The TECOLI can be a reference for tasks to test listening skills by
means of language samples and test items that were reviewed by teachers
of Italian as a FL and also underwent a detailed statistical analysis. The
bulk of the detailed results from the investigation conducted by Veloso
(2012), thus, provides grounds for the testing of FL teachers’ listening
comprehension skills and can support the assessment of oral skills in the
EPPLE examination.
The TEPOLI
The TEPOLI, short for Teste de Proficiência Oral em Língua Inglesa,
(Consolo, 2004), is a test of oral proficiency in English and consists of an
interview based on a set of pictures, some of which are accompanied by
short texts, taken from EFL course books and magazines. The pictures
work as visual prompts in the first testing task, on which the topics for the
oral interaction between examiner and examinee(s) are based. Examinees
can take the TEPOLI individually or in pairs, and when two examinees are
tested, this task is geared towards encouraging the examinees to interact
not only with the examiner but also with each other.
As of 2003, the TEPOLI includes a second test task that consists of a
role-play that aims at assessing the production of oral language that
encompasses the metalanguage EFL teachers are expected to use in
teaching contexts, for example, when they explain or talk about the
English language with their students. This task has been incorporated in
the EPPLE oral test, as described in the following section.
The data on EFL student-teachers’ proficiency in English discussed by
Baffi-Bonvino (2010) and Borges-Almeida (2009a) are largely based on
results of the TEPOLI and the levels of oral proficiency in English
achieved by students graduating from Letters courses in Brazil. All the
student-teachers who participated in Baffi-Bovino’s and in Borges-
Almeida’s studies, as well as in the study by Silva Neto (2014), reported
below, were in the last year of an undergraduate teacher development

course in Brazil.
The EPPLE
The EPPLE examination stands for Exame de Proficiência para
Professores de Línguas Estrangeiras, and it is a proficiency examination
for FL teachers to evaluate and classify their linguistic proficiency-
henceforth LPFLT (language proficiency of foreign language teachers), a
type of language proficiency that is both general and specialized. The
examination aims at teachers-to-be, that is, undergraduates about to
conclude an initial teacher education programme at a tertiary level, usually
in a Letters course (in the case of Brazil), and FL teachers already engaged
in the profession and responsible for foreign language lessons to young
children, in primary and secondary education, at university and in private
language schools. The examination may also be taken by teachers engaged
in further education at a postgraduate level.
LPFLT includes the abilities of comprehension and production of the
foreign language focused on a given version of the EPPLE, in both verbal
and written modes. The EPPLE has so far been designed and piloted only
in English. However, a detailed study of items to test listening
comprehension in Italian has already been carried out and presented by
Veloso (2012), as reported earlier in this chapter, and our proposal
includes plans to produce the EPPLE in other foreign languages taught in
Brazil such as French, German and Spanish.
General language proficiency, as seen as part of the LPFLT, is
characterized by the quality of performance in a given language, as it is
used by the majority of its speakers in a variety of everyday situations,
from informal to formal conversations, when reading informative texts and
usual documentation, to understand oral language in verbal messages and
short videos, and in the production of e-mail messages and written texts
aiming at social networking, for example. With regard to the specialized
proficiency of FL teachers, the main part of LPFLT that mostly determines
language proficiency for professional demands, it encompasses the use of
a given FL for educational purposes, for example, to manage classroom
discourse and communication in language teaching contexts (Consolo,
2007). In this sense, teachers’ language includes providing information
and giving instructions, pedagogical explanations, evaluating students’
performance, reading academic texts and teaching materials, the
understanding of audio and videos for pedagogical purposes, and the
production of materials and instruments to evaluate students. More
206 Chapter Ten
detailed reviews and studies about teachers’ language have been the focus
of studies by members of the EPPLE research team and their supervisees,
such as Andrelino (2014), Colombo (2014), Ducatti (2010) and Fernandes
(2011).
The EPPLE examination comprises two tests: a paper for reading
comprehension and written production, and a test of listening comprehension
and speaking skills. The written test contains comprehension questions
about texts of general interest for language teachers, and items in which
the candidate must deal with writing tasks likely to be carried out by FL
teachers such as writing questions for a reading comprehension exercise
and correcting mistakes in texts produced by language students. Tasks that
require the production of argumentative texts, short messages sent by
electronic mail, or summaries of academic texts, can also be in this paper.
A sample of the written test is available at www.epplebrasil.org.
The speaking test, if given in its face-to-face format, is conducted in
pairs of candidates and in the presence of an examiner who manages the
test and of an examiner who acts more like an observer and a rater.
A fully electronic version of the EPPLE in English was produced and
has been applied to student-teachers since 2011. The ‘electronic EPPLE’ is
a computer-based examination that includes the tasks for the oral test, to
be done in around 25 minutes, and the tasks for the written test, to be done
in the second part. The whole electronic examination lasts around two and
a half hours. The electronic EPPLE includes recorded instructions given to
the candidates at the beginning of the examination, and a task to test the
camera and the microphone connected to the computer before the oral test
starts. Candidates have the possibility to report on faulty equipment, if that
is the case, and the examiner(s) and/or technician(s) in charge of the
computer laboratory can help solve technical problems that may occur.
The answers produced by the candidates are recorded in the computers and
in a data bank to be rated on a later date.
The speaking test, in both the face-to-face and the electronic versions,
has four parts. In the first part the candidates are expected to speak about
themselves: about personal and professional information, previous
experiences as FL students, and professional expectations for the future.
The second part of the oral test is based on a brief video segment, or on
two short video extracts, that firstly must be understood so that a
discussion about the content in the video(s) can be carried out by the
candidates, with each other, and with the examiner conducting the test. In
the computer-based version of the EPPLE, candidates answer questions
about the video, and the screen design for Part 1 of the video tasks is
shown in Figure 10-1.
Foreignn Language Teaachers’ Proficieency 207
Figure 10-1: T
The EPPLE exaamination, Orall Test, Part 1.
In the thhird part of thee test the cand

didates must show their prroficiency
in using mettalanguage, thhat is, specific language for pedagogical purposes.
p
Situations of problems ussually faced by FL studentss are presented and the
examinees aare expected to offer solutio ons for linguisstic doubts lik
kely to be
raised in lannguage lessonns. Examineess are expecteed to explain linguistic
rules, for eexample, as well as prod duce pedagoggical explanaations. A
reproductionn of the screenn for this part is shown in FFigure 10-2 beelow.
In the laast part of thee oral test candidates are assked for their opinions
about the oral test they justj finished, to provide ddata about thee EPPLE
from the exxaminees’ perrspective, wh hich may alsoo provide data from a
perspective of self-assesssment. The examinees are eexpected to feeel free to
express theirr opinions aboout the oral test and about thheir own perfformance.
In the faace-to-face oraal test the two
o examiners m meet after the test so as
to evaluate the candidatees’ performan nce and classiify the quality y of their
speaking acccording to a proficiency scale of a hoolistic nature. For the
answers prooduced and reecorded in thee electronic vversion of thee test, the
recordings aare rated by twwo examiners later on usingg the same scaale.
The dataa analysed andd discussed by y Silva Neto (22014), reporteed below,
are based onn results from the EPPLE oral o test in its face-to-face format,
f as
208 Chapterr Ten
well on a ppreliminary seemi-electronicc version of the test deveeloped in

Power Pointt and administtered in a com
mputer laboratoory.
Figure 10- 2: The EPPLE exxamination, Oraal Test, Part 3.
Reesults and Discussion

D
Vocaabulary, Lexxical Compeetence and tthe Validatio
on
off the EPPLE
E Oral Test
Oral dataa collected byy means of thee TEPOLI andd two mock veersions of
the Cambriddge FCE (Firsst Certificate in English) ooral test were analysed
by Baffi-Boonvino (2010)) and her study recommennded a revisio on of the
proficiency rating scale for the EPP PLE oral tesst, and also indicated
reliability inn the results of
o a test that was designedd and piloted d so as to
provide grouunds for the design
d of the EPPLE exam mination. In thiis section
we present ddata concerninng the oral prroduction of uundergraduatee students
in test situattions, and repport on the comparison of tthe language produced
and the criteeria of the bannd scales of th
he two tests ussed in Baffi-B
Bonvino’s
study.
The extensive analysis of the language produced in test situations, as

fully reported in Baffi-Bonvino (2010), was conducted by means of the
RANGE software and the lists of vocabulary presented according to the
British National Corpus (BNC). In the RANGE software there are 16 lists
of words, and from List 1 to List 14 there are 1,000 family words in each
of them, in bottom-up order, that is, from the most frequent words in
English (List 1) to the least frequent ones (List 14). List 15 is about names,
List 16 has exclamations, hesitations, etc; and the List named Not in the
lists presents the words which were said incorrectly. The analysis focuses
on the results from the categories Types and Families because these are the
categories which represent the greatest variation in the use of lexical
items, that is, rich vocabulary (Scaramucci, 1997). Tokens were not
considered because they represent the number of words produced, not the
quality of those words. Further details of this lexical analysis were
reported in Consolo and Teixeira da Silva (2011).
Below are some data from the TEPOLI, produced by AL, one of the
participants in Baffi-Bonvino’s study who had an excellent performance in
the test, to illustrate our discussion. The data and the table shown below,
discussed in Baffi-Bonvino’s doctoral thesis, were first published in Baffi-
Bonvino’s MA dissertation (Baffi-Bonvino, 2007), a study in which initial
comparisons between TEPOLI and two FCE mock oral tests were
analysed.
Extract 1
01 AL: uh (+) well MC (+) I (+) I was listening to (+) to you and
(+) and your colleague
02 in the class during the (+) the roleplay (+) and I (+) and I
noticed that (+) you know
03 (+) you’ve got some problems during your speaking (+) uh (+)
and (+) and in your
04 grammar too (+) but (+) I (+) you know (+) one thing that is
really (+) worrying (+) is
05 (+) uh (+) the USE of (+) auxiliary words (+) like when you
said (+) uh (+) I not worry
06 about the environment (+) what’s missing here? There’s
something missing right (+)
07 because as you know in Portuguese uh (+) you know (+) in
Portuguese structure (+) in
08 English structure (INCOMP) so (+) you were speaking at the
present in the moment (+)
09 you know (+) you want to tell your colleague (+) that you are
not WORRIED (+) if you
210 Chapter Ten
10 use the verb to be (+) you are not worried and you need (+)
you can use an adjective
11 (+) but here {ASC} you use WORRY (+) the verb (+) so you
can not just put NOT there
12 and that’s it (+) it’s missing {ASC} something and (+) this
something is (+) the
13 auxiliary (+) that you know is (+) the auxiliary (+) do (+) so
you can tell I do not worry
14 (+) about environment (+)maybe you wanna emphasize (+) or
you can just (+) contract
15 that (+) you know (+) like I don’t worry about (+) environment
(+) you know (+)
16 because (+) so try to pay attention to (+) to (+) you know (+)
when we make negative
17 using the simple present (+) right (+) you (+) you need to use
(+) uh you know (+)
18 either don’t or (+) doesn’t for third person (+) right (+) so (+)
that’s it
(TEPOLI, 28 Nov 2005. Source: Baffi-Bonvino, 2007, pp. 269-270)
AL’s performances in the two FCE mock tests and in the TEPOLI
were equivalent, if we compare the marks given to this candidate.
Similarly, equivalence was found for three other participants in the study,
LR, MR and MC (see Table 10-1).
Table 10-1: Marks in the FCE and in the TEPOLI (Source: Baffi-
Bonvino, 2007, pp. 276-277).
FCE Mock 1 FCE Mock 2 TEPOLI

Student (0 – 5) (0 – 5) (0 – 10)
LR 3.5 4.0 8.0 / Band C
MR 3.0 3.5 7.0 / Band D
MC 3.0 4.0 6.0 / Band E
AL 4.5 4.5 9.5 / Band A
In TEPOLI examinees are rated in an A–E scale and given a numerical

mark ranging from 10.0 to 0.0. The results presented in Table 10-1 above
are based on a rating scale ranging from A (the highest band) to E (the
lowest band). An equivalence between the grades for the FCE mock tests
and for the TEPOLI is observed in Table 10-1, if we consider that the
highest grade for the FCE Speaking paper is 5.0. The participants’ oral
proficiency levels, as assessed in the three test situations, are similar and
the criteria used for each of the three tests were based on qualitative
descriptors. Both tests, FCE and TEPOLI, are based on holistic scales that
consider the final product, and the test-takers’ performance is described
through many specific linguistic and communicative criteria.
Silva Neto (2014) analysed the lexical competence of pre-service
teachers who were graduating from a Letters course in a public university
in the state of Sao Paulo. The data were obtained by means of a trial
version of the EPPLE oral test, in the two formats aforementioned: a face-
to-face test conducted with pairs of students, and a preliminary semi-
electronic version of the oral test administered to the same students in a
computer language laboratory. The lexical characteristics and quality of
English in the speech of test-takers were analysed, such as the relevance
and type of vocabulary used in the target language when interacting, the
suitability of the lexical items to the test tasks, negotiations of meaning
that might have arisen from the difference between the lexical competence
of the candidates, the appropriateness of lexical items to the content of the
speech and the coefficient of frequency of the item according to the
subject matter. The results obtained by comparing the data from the face-
to-face test and its ‘semi-electronic’ version show that the students’
performances do not vary significantly in the two versions of the test.
Silva Neto (2014) claims that his results point to the need for revision of
the descriptors for the vocabulary produced in the EPPLE oral test and the
introduction of an analytical scale to rate speech that considers the
differences between proficiency bands based not only on the frequency
factor of lexical items but also on their appropriateness to the speaking
context.
Based on the results presented by Baffi-Bonvino (2010) and by Silva
Neto (2014), we recommend a combination of holistic scales with
analytical ones for the EPPLE oral test, which would be more adequate to
assess candidates’ oral proficiency. The existing scales assess the oral
production as a combination of the oral skills involved in the tasks and,
although the descriptors within each band focus on separate linguistic
aspects, linguistic features are seen to operate interdependently so as to
contribute to or impede communication. Analytical scales, on the other
hand, would make it possible to assess each of the criteria involved in oral
production in a separate way.
212 Chapter Ten
Accuracy and Grammatical Complexity across Levels

of Proficiency in TEPOLI: The Search for Valid Scales
in the EPPLE Examination
In this section we report on part of a doctoral research study (Borges
Almeida, 2009a) conducted with graduating students from a foreign
language (English) teacher education course (a Letters course) in a state
university in Brazil. The study was guided by the research question, “how
is grammar characterized across the TEPOLI’s proficiency levels in the
participants’ performance in terms of (a) accuracy and (b) complexity in
the oral test and in a seminar”? (Borges Almeida, 2009a, p. 27).
According to Consolo et al. (2010), the EPPLE projects conducted
over a decade ago, aimed at utilizing a general theoretical background on
language testing, the literature on existing tests of proficiency both in the
Brazilian context and in the international scene, and data generated by
results in the TEPOLI and in the preliminary versions of the EPPLE oral
and written tests, in order to guide the implementation of the examination.
The EPPLE projects (Consolo, 2011; Consolo, 2008) also aimed at
establishing, amongst the language aspects and the different factors that
constitute and influence the “linguistic-communicative ability” (Bachman,
1990; Llurda, 2000), or “language ability” (see Bachman & Palmer, 1996),
criteria for the characterization of the FL proficiency of teachers, mainly
in the different contexts of language education in Brazil (e.g., regular
schools, namely Ensino Fundamental and Ensino Médio, university
courses and private language schools).
In the doctoral study conducted by Borges-Almeida (2009a), accuracy
in speech production was investigated by means of two quantitative
indexes: the number of deviations per unit or its percentage and the
percentage of deviation in free-clauses (D’Ely, 2006; Guará-Tavares,
2008). Complexity was investigated in quantitative terms through the
mean length of units (given in words), the mean length of clauses, and the
frequency of clauses per unit and of dependent clauses per independent
unit (Wolfe-Quintero, Inagaki, & Kim, 1998).
It must be noted that the sentence, a language unit that pertains to and
represents written language, should not be used as a unit of analysis for
spoken language. Several studies on spoken language have used Hunt’s
(1965) T-unit, which consists of one dependent clause plus all its
subordinate clauses. Since there has been criticism regarding the definition
and operationalization of the T-unit in interaction contexts, Borges-
Almeida (2009a) adopted Foster, Tonkyn, and Wigglesworth’s (2000) AS-
unit (analysis of speech unit), which more clearly presents how to deal
with the disfluent mechanisms of speech. An AS-unit is defined as:
“a single speaker’s utterance consisting of an independent clause, or sub-

clausal unit, together with any subordinate clauses(s) associated with
either” and “allows for the inclusion of independent sub-clausal units,
which are common in speech.” (Foster et al., 2000, p. 365)
Borges-Almeida’s study (2009a) is therefore an example of the efforts

to bring theoretical background on language description and analysis into
the discussions towards the improvement of the EPPLE examination, as
pointed out in Consolo et al. (2010). Her review of key aspects that
characterize spoken language support the analysis of data from TEPOLI
reported in the next section.
Grammatical Accuracy and Complexity in TEPOLI

As previously reported in Consolo and Teixeira da Silva (2011), the
results of the indexes of deviation free units and deviations per unit (see
Table 10-2 below) indicate the internal consistency of the TEPOLI
proficiency scale regarding grammatical accuracy, since such indexes
reveal that the higher the band, the less deviation can be found in a
candidate’s speech. The same can be observed in recordings of class
seminars presented by the same students who did TEPOLI, and this
indicates that the test has predictive power over a communicative situation
that is not an interview in itself–in the case of the seminars, a situation
somehow similar to an authentic class. This way, the higher the TEPOLI
band, the better the student’s grammatical performance in both, a non-
testing setting and in TEPOLI.
A comparison of deviations from the normative standards for
grammatical accuracy in English, found in free units (first column), in
deviations per unit (second column) and in occurrences of self-corrections
is presented in Table 10-2.
214 Chapter Ten
Table 10-2: Grammatical accuracy as measured in TEPOLI (Source:

Borges-Almeida & Consolo, 2010).
Index Index Index

Band Deviation-Free Units Deviations per Unit Self-Correction
Oral Test Seminar Oral Test Seminar Oral Test Seminar
B 0.88 0.84 0.14 0.19 0.08 0.12
C 0.79 0.67 0.28 0.51 0.09 0.12
D 0.65 0.62 0.46 0.60 0.10 0.06
E 0.61 0.52 0.53 0.97 0.05 0.05
Conversely, the indexes of self-corrections, for example, do not reveal

that their occurrence is determined by the proficiency bands for TEPOLI,
and this raises questions about how self-correction is currently described
in the TEPOLI scale and calls for a revision of this specific item in the
descriptors of grammatical competence and performance in spoken
language.
The indexes of complexity suggest that the differences between bands
are small and not very consistent when quantitatively observed, as
illustrated in Table 10-3 below. One of the interpretations for the weak
consistency of the index of words per unit is that such an index can also be
employed as a measure of fluency. The index of clauses per unit shows
little variation due to its relation to subordination, which is not a strong
characteristic of spoken language. Still the differences found suggest that
candidates placed in higher bands tend to produce more complex clauses
than candidates placed in lower bands.
Table 10-3: Grammatical complexity as measured in TEPOLI

(Source: Borges-Almeida & Consolo, 2010).
Index Index
Band
Clauses per Unit Words per Unit
Oral Test Seminar Oral Test Seminar
B 1.45 1.46 6.62 9.07
C 1.36 1.35 6.79 8.69
D 1.35 0.93 7.32 6.90
E 1.26 1.30 6.32 7.98
Band D can be seen as the level in which the candidates’ performance

is the most varied. In a paper in which she reports an analysis of data from
the TEPOLI from a phonological perspective, Borges-Almeida (2009b)
mentions that data from candidates placed in band D for fluency, as
analysed according to the criterion of filled pauses, also shows a pattern
considerably different from those observed in the other bands. One of the
hypotheses for that is that in this band candidates may go through more
linguistic restructuring stages, and so band D would be characterized by “a
period of variable latency” (Cruz, 2001, p. 51).
The results achieved by Borges-Almeida give evidence of important
differences between the proficiency bands for the linguistic elements
investigated in the broad scope of the bases for the EPPLE examination
and contribute to a revision and the improvement of the scale descriptors,
in order to maximize its validity. Both grammatical complexity and
accuracy increase towards each higher band. On the other hand, the data
do not support the description of how the phenomenon of self-correction
appears along the scale. Based on such results, the EPPLE scale can be
improved taking into account the changes already made to the original
scale for TEPOLI (Consolo, 2004).
Conclusion
The process of designing the EPPLE and its consolidation as a
language examination in the context of educating FL teachers in Brazil is
on the way towards a revised construct for the examination and the
definition of assessment criteria informed and supported by past and
present research studies.
Even though the results so far achieved by our research team are
mainly about English and Italian, we encourage the inclusion of other
languages in future projects and studies that can support further revision of
the constructs of the EPPLE examination and, as a consequence,
contribute in the areas of foreign language teaching and language testing,
as well as foreign language teacher education.
Once the EPPLE is widely used, as pointed out by Consolo and
Teixeira da Silva (2013), it is expected to motivate a revision of the course
contents and aims in pre-service and in-service teacher education,
especially in the Letters courses in Brazil. The standards established by
such an examination may be considered a reference for LPFLT, and for
the quality of language teaching and learning in the Brazilian educational
contexts.
216 Chapter Ten
Acknowledgement
This project was supported by FAPESP (Fundação de Amparo à
Pesquisa do Estado de São Paulo, process 2014/10544-0).
References
Andrelino, P. J. (2014). Análise da estrutura genérica das instruções na
fala do professor de Inglês: Contribuições para o teste oral do EPPLE
(An analysis of the generic structure in English teachers’ talk:
Contributions to the EPPLE oral test). Doctoral thesis, UNESP, Sao
Jose do Rio Preto, Brazil.
Bachman, L. F. (1990). Fundamental considerations in language testing.
Oxford: Oxford University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice.
Oxford: Oxford University Press.
Baffi-Bonvino, M. A. (2010). Avaliação da proficiência oral em Inglês
como língua estrangeira de formandos em Letras: Uma proposta para
validar o descritor ‘vocabulário’ de um teste de professores de língua
Inglesa (The assessment of oral proficiency in English as a foreign
language of graduating students in a Letters course: A proposal to
validate the descriptor for vocabulary in a test for English language
teachers). Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.
—. (2007). Avaliação do componente lexical em inglês como língua
estrangeira: Foco na produção oral (Assessment of the lexical
component in English as a foreign language: Focus on oral
production). Master’s dissertation, UNESP, Sao Jose do Rio Preto,
Brazil.
Borges-Almeida, V. (2009a). Precisão e complexidade gramatical na
avaliação de proficiência oral em Inglês do formando em Letras:
Implicações para a validação de um teste (Grammatical precision and
complexity in the assessment of oral proficiency of Letters graduating
students: Implications for a test validity). Doctoral thesis, UNESP, Sao
Jose do Rio Preto, Brazil.
—. (2009b). Pausas preenchidas e domínios prosódicos: Evidências para a
validação do descritor fluência em um teste de proficiência oral em
língua estrangeira (Filled pauses and prosodic domains: Evidence for
the validation of the descriptor for fluency in a foreign language oral
proficiency test). ALFA: Revista de Linguística, 53(1), 167-193.
Borges-Almeida, V., & Consolo, D. A. (2010). Investigating accuracy and
complexity across levels: The search for a valid scale for the Language
Proficiency Examination for Foreign Language Teachers (EPPLE) in

Brazil. Poster presented at the Language Testing Research Colloquium
(LTRC), 12th-16th April, Cambridge, UK.
Colombo, C. S. (2014). O insumo linguístico oral em aulas de Inglês como
língua estrangeira para crianças: A fala do professor em foco (The
oral linguistic input in EFL lessons for children: Focus on teacher talk).
Master’s dissertation, UNESP, Sao Jose do Rio Preto, Brazil.
Consolo, D. A. (2011). Avaliação da proficiência linguístico-
comunicativa-pedagógica do professor de línguas: operacionalização
de construto no Exame de Proficiência para Professores de Língua
Estrangeira (EPPLE) (Assessment of language teachers’ linguistic-
communicative pedagogic proficiency: Operating the construct in the
EPPLE examination). Research Project–PHASE I. Sao Jose do Rio
Preto, Brazil: UNESP.
—. (2008). Exame de Proficiência para Professores de Língua
Estrangeira (EPPLE): Definição de construto, tarefas e parâmetros
para avaliação em contextos brasileiros (Proficiency examination for
foreign language teachers (EPPLE): Defining construct, tasks and
parameters for assessment in Brazilian contexts). Research Project. Sao
Jose do Rio Preto, Brazil: UNESP.
—. (2007). A competência oral de professores de língua estrangeira: A
relação teoria-prática no contexto brasileiro (The oral competence of
foreign language teachers: The theory-practice relationship). In D. A.
Consolo & V. L. Teixeira da Silva (Eds.), Olhares sobre competências
do professor de língua estrangeira: Da formação ao desempenho
profissional (pp 165-178). Sao Jose do Rio Preto, Brazil: Editora HN.
—. (2004). A construção de um instrumento de avaliação da proficiência
oral do professor de língua estrangeira. Trabalhos em Linguística
Aplicada, 43(2), 265-286.
Consolo, D. A., Lanzoni, H. P., Alvarenga, M. B., Concário, M., Martins,
T. H. B., & Teixeira da Silva, V. L. (2010). Exame de Proficiência
para Professores de Língua Estrangeira (EPPLE): Proposta inicial e
implicações para o contexto brasileiro (Proficiency Examination for
Foreign Language Teachers: Initial proposal and implications for the
Brazilian context). In Proceedings of the II Congresso Latino-
Americano de Formação de Professores de Línguas (CLAFPL) (pp.
1035-1050). Rio de Janeiro, Brazil: PUC-Rio.
Consolo, D. A., Lanzoni, H. P., Alvarenga, M. B., Concário, M., Martins,
T. H. B., & Teixeira da Silva, V. L. (2009). An examination of foreign
language proficiency for teachers (EPPLE): The initial proposal and
implications for the Brazilian context). In Proceedings of the II
218 Chapter Ten
Congresso Internacional da ABRAPUI (ABRAPUI International

Conference) (pp. 1-15). Sao Jose do Rio Preto, Brazil: ABRAPUI.
Retrieved from: http://www.abrapui.org/congresso.
Consolo, D. A., & Teixeira da Silva, V. L. (2013). A proficiência
linguístico-comunicativa-pedagógica do professor de língua
estrangeira: Alinhavando pesquisas em avaliação para novas políticas
linguísticas no Brasil (The foreign language teacher’s linguistic-
communicative-pedagogic proficiency: Research on language
assessment towards new language policies in Brazil). Paper presented
at the X Congresso Brasileiro de Linguística Aplicada, September 9th-
12th, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil.
Consolo, D. A., & Teixeira da Silva, V. L. (2011). A discussion about
tasks and criteria to assess EFL teachers’ oral proficiency in the
EPPLE examination. Paper presented at the AILA World Congress,
August 23rd-28th, Beijing University of Foreign Language Studies,
Beijing, China.
Cruz, M. L. O. B. (2001). Estágios de interlíngua: Estudo longitudinal
centrado na oralidade de sujeitos brasileiros aprendizes de espanhol
[Stages of interlanguage: A longitudinal study focusing the oral
production of Brazilian learners of Spanish]. Doctoral thesis,
UNICAMP, Campinas, Brazil.
D’Ely, R. C. S. F. (2006). A focus on learners’ metacognitive processes:
The impact of strategic planning, repetition, strategic planning plus
repetition, and strategic planning for repetition on L2 oral
performance. Doctoral thesis, UFSC, Florianopolis, Brazil.
Ducatti, A. L. F. (2010). A interação verbal e a proficiência oral na
língua-alvo na prática de sala de aula: (Re)definindo o perfil do
professor de uma professora de língua inglesa da escola pública
(Verbal interaction and oral proficiency in the target language in the
classroom context: (Re)defining the professional profile of na English
teacher in a public school). Master’s dissertation, UNESP, Sao Jose do
Rio Preto, Brazil.
Fernandes, A. M. (2011). A metalinguagem e a precisão gramatical na
proficiência oral de duas professoras de Inglês como língua
estrangeira (Metalanguage and grammatical accuracy in the oral
proficiency of two English as a foreign language teachers). Master’s
dissertation, UNESP, Sao Jose do Rio Preto, Brazil.
Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken
language: A unit for all reasons. Applied Linguistics, 21(3), 354-375.
Guará-Tavares, M. G. (2008). Pre-task planning, working memory capacity,

and L2 speech performance. Doctoral thesis, UFSC, Florianopolis,
Brazil.
Hunt, K. (1965). Grammatical structures written at three grade levels.
NCTE Research Report 3. Champaign, IL, USA: NCTE.
Llurda, E. (2000). On competence, proficiency and communicative
language ability. International Journal of Applied Linguistics, 10(1),
85-96.
Scaramucci, M. V. R. (1997). The lexical competence of university
students to read in EFL. DELTA, 13(2), 215-246. Retrieved from:
http://www.scielo.br
Silva-Neto, T. M. (2014). Competência lexical na proficiência do
professor de Inglês como língua estrangeira: Uma análise do teste
oral do EPPLE (Lexical competence in EFL the EFL teacher’s
proficiency: An analysis of the EPPLE oral test). Master’s dissertation,
UNESP, Sao Jose do Rio Preto, Brazil.
Veloso, F. S. (2012). Percurso para a elaboração de um teste de co em
língua Italiana de (futuros) professores (The process of designing a
listening comprehension test in Italian for (future) language teachers).
Doctoral thesis, UNESP, Sao Jose do Rio Preto, Brazil.
Wolfe-Quintero, K., Inagaki, S., & Kim, H. Y. (1998). Second language
development in writing: Measures of fluency, accuracy and complexity.
Technical report 17. Manoa, Hawai‘i, US: University of Hawai‘i Press.
ISSUES IN LANGUAGE ASSESSMENT
AND EVALUATION
CHAPTER ELEVEN
VOCABULARY SIZE ASSESSMENT

AS A PREDICTOR OF PLAGIARISM
MARINA DODIGOVIC, JACOB MLYNARSKI,

AND RINING WEI
Abstract
Culture, education, attitude and language proficiency have been
viewed as the major causes of second language writer plagiarism
(Amsberry, 2009; Erkaya, 2009). However, research data that would
sufficiently substantiate these claims are scarce. The study described in
this chapter investigates the relationship between plagiarism and
vocabulary knowledge in the writing of over 200 students of English as a
second language. It uses both lexical error and vocabulary size assessment
as measures of vocabulary command. The study relies on an instructional
software tool called Grammarly, which identifies both textual borrowing
and language errors, as well as on the Vocabulary Size Test (VST) to
measure students’ vocabulary knowledge. The results indicate that there is
some correlation between the error count and plagiarism, and a strong
negative correlation between vocabulary size and plagiarism rate.
Therefore, the findings seem to suggest that poor vocabulary command
could be a major cause of plagiarism in second language writers. Based on
these findings, the importance of systematic vocabulary teaching and
learning as a strategy to avoid plagiarism emerges.
Introduction
Plagiarism, or using someone else’s words or ideas without
acknowledgment, appears in both native speaker (NS) and non-native
speaker (NNS) university student writing (Jaeger & Brown, 2010). Text
borrowing or appropriation, as plagiarism is sometimes called, has caused
Vocabulary Size Assessment as a Predictor of Plagiarism 223
librarians (e.g., Amsberry, 2009; Lund, 2004), literacy instructors (Evans

& Youmans, 2000) and second language specialists (e.g., Lankamp, 2009;
Liu, 2005; Shi, 2006) to give it increasing research consideration, with the
focus on possible reasons and remedies. Three factors seem to have
contributed to this development: the advent of computational plagiarism
detection tools, the emancipation of second language writing research and
the increased research in vocabulary learning and assessment. A quick
library database search revealed that approximately one third of plagiarism
related journal articles talked about antiplagiarism software, while another
third dealt with second language writing. Only a handful of articles
mentioned lexical issues. However, there seemed to be little overlap
among the three groups of articles. To bridge this gap, the current study
combined the application of electronic plagiarism detection tools with the
vocabulary knowledge assessment of second language writers. The study
reported in this chapter investigated the relationship between plagiarism
and lexical error in the writing of over 200 ESL students. In addition, it
used vocabulary size assessment to further investigate the link between
lexical knowledge and plagiarism.
Culture and language proficiency are often viewed as the major causes
of second language writer plagiarism, but the lack of adequate instruction
or consistently implemented policy have also been considered (Bacha &
Bahous, 2010). In her fairly comprehensive review, Amsberry (2009)
named cultural, educational and linguistic influences as the main causes of
textual borrowing among second language students. In addition to these,
there seem to be emotional or attitudinal causes as discussed by Erkaya
(2009) in a study of textual appropriation in Turkey. Both Erkaya (2009)
and Shi (2006) identified the availability of sources on the Internet as a
precipitating factor in the increase of student plagiarism.
In terms of culture as a major cause of plagiarism, Confucian societies
are sometimes believed not to have cultivated as strong notions of text
ownership as Western societies (Lund, 2004; Shi, 2006), seeing textual
borrowing as a sign of respect toward the original on the one hand and
blurring the boundaries between common knowledge and original ideas on
the other. Moreover, culture seems to influence educational practices such
as rote learning (Ballard & Clanchy, 1984) and copying as learning along
with a lack of writing instruction (Amsberry, 2009). To educational
influences, Hyland (2001) additionally counted the misplaced cultural
sensitivity of second language instructors, which prevents them from
giving clear and meaningful feedback on textual borrowing. On the other
hand, writing in a new language and trying to accommodate a new
rhetorical practice are, according to Amsberry (2009), linguistic influences
224 Chapter Eleven
that push second language students toward patch-writing, i.e., seamless

and unacknowledged incorporation of fragments from an original into
one’s own text. Finally, students may find themselves unmotivated to
write or discouraged by their instructor’s negative attitude toward writing
(Erkaya, 2009).
Surprisingly, little research has been done to investigate the impact of
linguistic insufficiencies of second language writers on their tendency to
plagiarise. It is, therefore, important to assess to what extent poor
command of the target language might be the cause of plagiarism in
second language (L2) writers. A good point of departure for such an
investigation would be vocabulary. In the words of David Wilkins (1972,
p. 111), “without grammar very little can be conveyed; without vocabulary
nothing can be conveyed”. Hence, L2 student writers struggling to find the
right words might be indeed enticed to borrow from the writing of more
proficient authors.
This study, therefore, explores the relationship between vocabulary
knowledge and textual borrowing or plagiarism. It does so by correlating
the language error rate and vocabulary size with the textual borrowing
rate. The notions of language error and vocabulary size are discussed in
the following section.
Background
There are two types of L2 vocabulary knowledge: receptive and
productive (Nation, 2006). Receptive vocabulary is usually larger than the
productive and enables the learner to comprehend things they read and
listen to. Productive vocabulary, on the other hand, facilitates the
productive skills of speaking and writing. In addition to vocabulary size,
which is expressed in the number of words a learner knows, vocabulary is
also measured in terms of depth (Beglar & Nation, 2007). Depth concerns
everything a learner knows about a word, including ways of spelling and
pronouncing it, the sentence structure it requires, its part of speech, the
functions it can have in connected discourse, the contexts in which it can
possibly occur, other words that may accompany it, the idiomatic
expressions it is known to build and the connotations it can have (Folse,
2004). It is expected that in productive skills, such as speaking and
writing, a larger vocabulary size would have the effect of a greater lexical
range used, while a greater depth of vocabulary knowledge would result in
a more accurate use of vocabulary.
Lexical range is one of the measures of language proficiency. The
underlying vocabulary size has been found to greatly affect reading
success in second language learning (Nation, 2006; Nergis, 2013; Ward,

2009). However, it is interesting that the perceptions of the vocabulary
size deemed sufficient for successful reading of expository discipline-
specific texts continues to change: from 5,000 to 3,000 (Nation & Waring,
1997) to 2,570 (Coxhead, 2000; Nation, 2006), reaching the size of only
2,000 for engineering (Mudraya, 2006; Ward, 2009). According to Biber
(2012), registers and perhaps disciplines differ in their use of linguistic
repertoire, potentially placing different language demands on L2 readers
across disciplines. However, except for the studies of cohesion (e.g.,
Dodigovic, 2005), not much previous research has focused on the effect of
vocabulary command (as measured by vocabulary size) on writing quality.
Cohesion as an aspect of vocabulary related to L2 writing proficiency
has been sufficiently explored (Dodigovic, 2013). The lack of ability to
write cohesively has also been identified as one of the factors contributing
to plagiarism (Matalene, 1985). In their seminal work on cohesion,
Halliday and Hasan (1976) examine the use of cohesive devices,
demonstrating how much depends on the mastery of this linguistic subset.
Cohesion requires writers to skilfully use the elements of lexico-grammar,
such as conjunctions, lexical substitution or omission and grammatical
indexing, successfully in order to achieve cohesively composed prose
(Dodigovic, 2005; Halliday & Hasan, 1976). Halliday and Hasan (1976)
claimed that it is possible to measure one’s ability to write cohesively by
identifying and measuring the use of certain discourse markers (Halliday
& Hasan, 1976; Lieber, 1981; Meisuo, 2000; Witte & Faigley,1981; Yang,
1989). In line with this, Leo (2012) investigated Chinese students’
language at an English-medium University in Canada in comparison with
Canadian-born Chinese. The study revealed that the Chinese language
incorporates the use of entirely different cohesive devices from those used
in English. Therefore, the NNS Chinese students cannot base their English
language development on a positive transfer from their mother tongue
(L1) (Leo, 2012). She also identified significant lexical problems of
Chinese learners of English as a second language (ESL) related to the use
of synonymy and content words.
Similar observations were made by Shi (2006) who investigated the
causes of plagiarism based on students’ survey responses. Shi (2006)
found that one of the most common problems was deemed to be
insufficient lexical knowledge, which was also flagged as the core reason
responsible for ‘cross-textual borrowings’ or plagiarism. A similar study
conducted by Yu (2013) yielded comparable results, as students frequently
blamed the inadequate vocabulary command for their poor paraphrasing
skills which eventually resulted in unintentional plagiarism. Finally,
226 Chapter Eleven
Dodigovic (2013) also found that poor paraphrasing skills were closely
associated with plagiarism.
Other aspects of lexis have not been commonly associated with
plagiarism. In particular lexical error, which should be an indicator of
lexical accuracy or depth of lexical knowledge (Folse, 2004; Nation,
2006), has barely been examined in the context of L2 writing. According
to Augustin Llach (2011), despite the fact that lexical errors emerge as the
most numerous in the available studies, research in this area is still scarce.
The lack of accuracy, otherwise known as language error, is significant in
three respects: it informs the teacher about what should be taught; it
informs the researcher about the course of learning; and it is an outcome of
the learner’s target language hypothesis testing (James, 1998).
Vocabulary size is another aspect of vocabulary knowledge that might
be associated with plagiarism. While research focus in this area has
predominantly been on the receptive command, which enables learners to
read and listen with comprehension, not much is known about the
productive vocabulary command, which enables them to speak and write
proficiently, and its relationship with plagiarism. According to Nation
(2006), the size of productive vocabulary required for successful speaking
or writing is much smaller that the receptive vocabulary size required for
successful reading or listening. However, there might be some indications
that in L2 contexts there is little difference between the productive and
receptive vocabulary knowledge (Schmitt, 2001), which suggests that any
measure of receptive vocabulary knowledge could be helpful as a
productive vocabulary knowledge indicator.
Another important parameter in the context of ESL plagiarism might
be the academic vocabulary (Coxhead, 2000) or Academic Word List
(AWL) and the ESL student writer’s ability to use this vocabulary in
writing. Studies by Augustin Llach (2011), Storch and Tapper (2009), and
Deng, Lee, Varaprasad, and Leng (2010) tracked the development of
academic vocabulary in the writing of ESL students over the duration of
an academic English course and found evidence of significant
improvement. However, the impact of this improvement on the amount of
plagiarism has largely remained unexplored. Similarly, Dodigovic, Li,
Chen and Guo (2014) examined a range of academic vocabulary errors
committed by Chinese learners of English. However, they did not conduct
their investigation in the context of textual borrowing or plagiarism.
Therefore, examining lexical insufficiency as a possible cause of
plagiarism emerges as a worthwhile research goal. To this end, the study
reported here focused on Chinese learners of English at an English-
medium University in China and investigated the relationship between the
rate of lexical error and vocabulary command on the one hand and the
amount of plagiarism in the students’ writing on the other.
The Study
The research question that guided this investigation was: To what
extent is plagiarism related to Chinese students’ English vocabulary
command? The participants in this study were 221 Chinese students at an
English Medium Instruction (EMI) University. All of the students were in
their first year, aged between 18 and 20, speakers of Chinese as a first
language and majoring in English. All of the participants had completed
their secondary education in China.
The writing task used for the purpose of this study required expressing
opinions and a critical review of literary sources. Taking into account the
extensive need for quoting and referencing in that particular genre, this
task required the student writers to apply advanced paraphrasing
techniques in order for the writing to maintain its originality. The length of
the writing samples ranged from 800-1,200 words and represented a
typical Anglo-American academic genre commonly found at tertiary
educational level in English-speaking countries (Dodigovic, 2005).
Instruments
Grammarly
Grammarly is a plagiarism and error detection software package. Its

plagiarism detection engine operates using an ‘extrinsic’ type of analysis,
i.e., by identifying similarities between one’s writing and the sources
commonly available on the Internet. Conversely, ‘intrinsic’ analysis
focuses on the stylistic inconsistencies within the text itself, which are
being used to determine the plagiarised content (Sousa-Silva, 2014).
Worth mentioning is the fact that apart from identifying copy-pasted parts
of the text, Grammarly is capable of pinpointing weak paraphrases with
only minor text modifications.
Unlike Turnitin or other plagiarism detection tools, Grammarly has a
feature crucial for this research, viz. the error detection or writing
enhancement engine. According to the developer, Grammarly is capable
of identifying up to 250 error types. Although the accuracy of the
formative feedback offered by the software can be debated, the precision
of the enhancement engine was estimated to be fairly high. For the
purpose of this estimate, 20 Grammarly reports were randomly selected
228 Chapter Eleven
and manually reassessed revealing that approximately only 5% of the

suggestions made by Grammarly turned out to be falsely labelled as
errors. A similar test was performed using the plagiarism detection tool,
with a similar outcome.
Vocabulary Size Test
The Vocabulary Size Test (VST) was used to measure the size of the
participants’ vocabulary (Beglar & Nation, 2007). This test has been
specifically developed to “provide a reliable, accurate, and comprehensive
measure” (Beglar, 2010, p. 103) of NNS English learners’ receptive
vocabulary in its written form, including the 14,000 most frequent word
families in English. This test, available in both electronic and hard-copy
format, was used in its paper-based format and it was corrected manually.
Procedure
Student writing was analysed using the Grammarly web-based engine
from the perspective of plagiarised content and lexical error. The
identified errors were entered into a database following a manual
identification accuracy check. VST was administered in class, two weeks
after the writing samples were collected, in hard-copy format, which was
filled out manually by the participants. It was also manually marked and
moderated by two independent markers using the answer key. All of the
activities were carried out with adherence to the ethical standards called
for in the Belmont Report, Declaration of Helsinki and Nuremberg Code.
Data Analysis
The Pearson product-moment correlation coefficient (r), commonly
used to reveal a possible linear association between two variables and for
calculations on larger samples, in which normal distribution can be
expected (Stoynoff & Chapelle, 2005), was used to calculate the
correlation between the lexis related variables and plagiarism rate. This
correlation coefficient is one of the commonly used measures of effect
size, although many who use it may not be aware that it is an effect size
index (Ellis, 2010). In the discussion of r values below, reference to ‘effect
size’ is made; this serves as a response to the recent calls, from researchers
in China (e.g., Wei, 2012) and abroad (e.g., Ellis, 2010; Larson-Hall,
2012), for paying more attention to effect size, which according to the
Publication Manual of the American Psychological Association (APA) is
as important as the significance level (namely the p value) in inferential

statistical procedures. To describe the strength of the correlation, Cohen’s
(1988, cited in Ellis, 2010) frequently-used guidelines were employed (e.g.,
for r, 0.5, 0.3 and 0.1 respectively representing the cutting points of the so-
called ‘large’, ‘medium’ and ‘small’ effects).
Results
Lexical Errors
Based on the Grammarly output, all errors were divided into three
categories: lexical, grammar and punctuation. For the purpose of this
paper, only lexical errors are of interest. Using Grammarly’s
categorisation, lexical errors were divided into four categories: confused
words, spelling mistakes, wordiness and colloquial speech. To arrive at a
more comprehensive picture, the correlation results for two types of
scenarios are presented here: results with the outliers and without them
(see Table 11-1).
Table 11-1: Correlation coefficient (effect size) values for lexical errors.
Wordiness
Colloquial
Combined
Confused
Mistakes
Spelling
Speech
Words
Cut-off
Point N
2% 157 -0.0827 -0.0912 0.1623 -0.1242 -0.0792

2% no
154 -0.0495 -0.0968 0.0232 -0.1763 -0.1539
outliers
3% 108 -0.0722 -0.1082 0.1804 -0.1753 -0.0872
3% no
105 -0.0289 -0.1029 -0.0187 -0.2556 -0.2023
outliers
5% 66 -0.0694 -0.0418 0.2307 -0.115 0.0103
5% no
63 -0.0018 -0.0085 -0.0382 -0.2304 -0.1682
outliers
According to Table 11-1, the results for the lexical errors with the cut-
off point for plagiarism rate set at 5% revealed that Pearson’s correlation
coefficient values were the lowest for the confused words, spelling
mistakes and wordiness: r = -0.0018 (p < .05), r = -0.0085 (p < .05), r = -
0.0382 (p < .05) respectively. However, the statistical analysis of the use
230 Chapter Eleven
of colloquial speech and its relation to plagiarism revealed the existence of

a weak but statistically not significant negative correlation r = -0.2304 (p
= .069). After adding up all the values for lexical issues and analysing
them with respect to citation audit using correlation analysis, the results
demonstrated the existence of a weak correlation at r = -0.1682 (see Table
11-1), albeit this result was not statistically significant (p = .187).
Establishing the cut-off point for plagiarism rate at 3% did not
significantly affect the correlation coefficient for lexical errors.
Specifically, the value remained relatively stable at r = -0.2023, only
slightly increasing from r = -0.1682 (see Table 11-1). However, these two
effect size values still fell within the small-to-medium category according
to Cohen’s (1988, cited in Ellis, 2010) criteria for strength of association.
As Table 11-1 indicates, the highest value of Pearson’s correlation
coefficient for added up lexical errors was achieved in a group of 105
participants with the lowest (below 3%) and the highest (31%, 39%, 49%)
outstanding values for the citation audit excluded from the analysis. It was
observed that the inclusion of the outliers at 31%, 39% and 49% threshold
always had a positive effect on the correlation coefficient value for the
lexical errors by significantly increasing it. It was also observed that the
exclusion of papers with a high percentage of textual borrowing (31%,
39% and 49%) drastically increased the value of the correlation coefficient
using the sum of all lexical error types.
In summary, based on the data examined in the study, the overall
lexical error rate does not seem to correlate strongly with the plagiarism
rate. However, a weak and statistically not significant relationship was
observed for lexical errors due to wordiness (r = 0.2307, cut-off point 5%,
outliers included) and colloquial speech (3% and 5%, no outliers).
Vocabulary Size
Vocabulary Size Test (VST) results were obtained for 107 out the 221
research participants. Hence, the 107 pairs of VST and plagiarism rate
results were correlated, without excluding any data sets. The value of the
correlation between the plagiarism rate and vocabulary size was high (r = -
0.7791). This is a strong negative correlation, representing a ‘large’ effect
and meaning that high vocabulary scores indicate low plagiarism levels
(and vice versa). Based on this outcome, it is safe to assume that the larger
the vocabulary size of second language writers, the less the chance they
will resort to plagiarism when engaging in academic writing.
Discussion
Unlike the reviewed literature (Shi, 2006; Yu, 2013) which purports
that lexical errors might be a cause of plagiarism in in higher education,
the results of the current study revealed a weak and not significant
correlation between plagiarism and lexical error rate, suggesting that this
might not be the case. While one might argue that the result might be
statistically significant once the sample size is increased, the afore-
reported findings concerning the strength of correlation (effect size)
remain stable across studies (including those with much larger samples)
because effect size measures, unlike the p values, are unaffected by sample
size (Meline & Wang, 2004).
Since the writing that was analysed in this study was academic essays
in which the students were using the lexically correct verbatim text from a
variety of academic sources which was lexically correct, it is also possible
that the results of the study were distorted by the presence of verbatim
borrowings which could have significantly reduced the proportion of the
students’ original writing and in turn might have masked the real error
rate.
Nevertheless, based on the results of this study, it might be safe to
assume that lexical error or the absence of lexical accuracy is not a major
cause of plagiarism among Chinese students at EMI universities in China.
Furthermore, it seems that vocabulary depth, as the construct underlying
lexical accuracy, might not be directly related to plagiarism. On the other
hand, vocabulary size due to its strong negative correlation with the
textual borrowing rate, suggests that a large vocabulary might be
negatively related to the level of plagiarism. In other words, NNS writers
with a large vocabulary might be less likely to plagiarise, regardless of
how well they know the words they are familiar with. This is consistent
with Dodigovic (2013), in which the plagiarism rate was reduced by
focusing on paraphrasing skills. This skill requires both receptive
vocabulary knowledge and a large vocabulary size, both tested using VST.
Similarly, the present study indicates that in the case of a limited
vocabulary, having a good command of the entire depth of the vocabulary
known is less likely to result in plagiarism in free writing.
Conclusion
The objective of this study was to explore the relationship between
Chinese English learners’ lexical errors and vocabulary size on the one
hand and the amount of borrowed content in their written prose on the
232 Chapter Eleven
other. The study was carried out with a group of 221 Chinese students
majoring in English at an EMI university located in China. Learners’
lexical errors and unoriginal content were identified using Grammarly’s
enhancement and plagiarism detection engine, while the vocabulary size
was determined using the VST. The data was then statistically analysed
using Pearson’s correlation coefficient.
The results of the study suggest that vocabulary size might be a factor
requiring more attention in the context of fighting plagiarism in the higher
education sector, but the depth of vocabulary, which has a bearing on the
lexical accuracy, might not. The outcome indicates that pedagogical effort
should be invested in the vocabulary growth of the target learner
population. This can be done either by stimulating deliberate vocabulary
learning through the use of vocabulary cards, games and fun activities or
through extensive reading programs which rely on a combination of
graded readers and authentic texts. Vocabulary size testing as well as other
methods of vocabulary assessment should become a more common
practice (Schmitt, 2001), so that through washback they might positively
impact educational practice.
Acknowledgement
This paper is a part of the output from the Jiangsu Higher Education
Learning and Teaching Reform project #2015JSJG253 entitled
Computational methods of lexical transfer detection in the English writing
of Chinese-English bilinguals, funded by the Jiangsu Department of
Education, China.
References
Amsberry, D. (2009). Deconstructing plagiarism: International students
and textual borrowing practices. The Reference Librarian, 51(1), 31-44.
Augustin Llach, M. P. (2011). Lexical errors and accuracy in foreign
language writing. Bristol: Multilingual Matters.
Bacha, N. N., & Bahous, R. (2010). Student and teacher perceptions of
plagiarism in academic writing. Writing and Pedagogy, 2(2), 251-280.
Ballard, B., & Clanchy, J. (1984). Study abroad: A manual for Asian
students. Kuala Lumpur: Longman
Beglar, D. (2010). A Rasch-based validation of the vocabulary size test.
Language Testing, 27(1), 101-118.
Beglar, D., & Nation, P. (2007). A vocabulary size test. The Language
Teacher, 31(7), 9-13.
Biber, D. (2012). Register as a predictor of linguistic variation. Corpus
Linguistics and Linguistic Theory, 8(1), 9-37.
Chuo, T.-W.I., & Wenzao, U. (2007). The effects of WebQuest writing
instruction program on EFL learners’ writing performance, writing
apprehension and perception. TESL-EJ, 11(3), 1-27.
Coxhead, A. (2000). The academic word list. TESOL Quarterly, 34(2),
213-238.
de Jaeger, K., & Brown, C. (2010). The tangled web: Investigating
academics’ views of plagiarism at the University of Cape Town.
Studies in Higher Education, 35(5), 513-528.
Deng, X., Lee, K. C., Varaprasad, C., & Leng, M. L. (2010). Academic
writing development of ESL/EFL graduate students in NUS.
Reflections on English Language Teaching, 9(2), 119-138.
Dodigovic, M. (2005). Artificial intelligence in second language learning:
Raising error awareness. Clevedon: Multilingual Matters.
—. (2013). The role of anti-plagiarism software in learning to paraphrase
effectively. CALL-EJ, 14(2), 23-37.
Dodigovic, M., Li, H., Chen, Y., & Guo, D. (2014). The use of academic
English vocabulary in the writing of Chinese students. English
Teaching in China, 5, 13-20.
Ellis, P. D. (2010). The essential guide to effect sizes: Statistical power,
meta-analysis, and the interpretation of research results. Cambridge:
Erkaya, O. R. (2009). Plagiarism by Turkish students: Causes and
solutions. Asian EFL Journal, 11(2), 86-103.
Evans, F. B., & Youmans, M. (2000). ESL writers discuss plagiarism: The
social construction of ideologies. Boston University Journal of
Education, 182(3), 49-65.
Folse, K. (2004). Vocabulary myths. Ann Arbor: University of Michigan
Press.
Halliday, M. A. K., & Hasan, R. (1976). Cohesion in English. London:
Longman.
Hyland, K. (2001). Bringing in the reader: Addressee features in academic
articles. Written Communication, 18(4), 549-574.
James, C. (1998). Errors in language learning and use: Exploring error
analysis. London, England: Longman.
Lankamp, R. (2009). ESL student plagiarism: Ignorance of the rules or
authorial identity problem? Journal of Education and Human
Development, 3(1), 1-8.
234 Chapter Eleven
Larson-Hall, J. (2012). Our statistical intuitions may be misleading us:

Why we need robust statistics. Language Teaching, 45(4), 460-474.
Leo, K. (2012). Investigating cohesion and coherence discourse strategies
of Chinese students with varied lengths of residence in Canada. TESL
Canada, 29(6), 157 -179.
Lieber, P. E. (1981). Cohesion in ESL students’ expository writing: A
descriptive study. Doctoral thesis, New York University, USA.
Liu, D. (2005). Plagiarism in ESOL students: Is cultural conditioning truly
the major culprit? RLT Journal, 59(3), 234-241.
Lund, J. R. (2004). Plagiarism: A cultural perspective. Journal of
Religious & Theological Information, 6(3-4), 93-101.
Matalene, C. (1985). Contrastive rhetoric: An American writing teacher in
China. College English, 47(8), 789-808.
Meisuo, Z. (2000). Cohesive features in the expository writing of
undergraduates in two Chinese universities. RELC Journal, 31(1), 61-
95.
Meline, T., & Wang, B. (2004). Effect-size reporting practices in AJSLP
and other ASHA journals, 1999-2003. American Journal of Speech-
Language Pathology, 13(3), 202-207.
Mudraya, O. (2006). Engineering English: A lexical frequency
instructional model. English for Specific Purposes, 25(2), 235-256.
Nation, I. S. P. (2006). Language education-Vocabulary. In K. Brown
(Ed.), Encyclopaedia of language and linguistics (Vol. 6, 2nd ed., pp.
494-499). Oxford: Elsevier.
Nation, I. S. P., & Waring, R. (1997). Vocabulary size, text coverage, and
word lists. In N. Schmitt & M. McCarthy (Eds.), Vocabulary:
Description, acquisition, pedagogy (pp. 6-19). New York: Cambridge
University Press.
Nergis, A. (2013). Exploring the factors that affect reading comprehension
of EAP learners. Journal of English for Academic Purposes, 12(1), 1-9.
Shi, L. (2006). Cultural backgrounds and textual appropriation. Language
Awareness, 15(4), 264-282.
Schmitt, N. (2001). Vocabulary in language teaching. Cambridge:
Sousa-Silva, R. (2014). Investigating academic plagiarism: A forensic
linguistics approach to plagiarism detection. International Journal for
Educational Integrity, 10(1), 31-41.
Storch, N., & Tapper, J. (2009). The impact of an EAP course on
postgraduate writing. Journal of English for Academic Purposes, 8(3),
207-223.
Stoynoff, S., & Chapelle, C. (2005). ESOL tests and testing. Alexandria:
Teachers of English to Speakers of Other Languages.
Ward, J. (2009). EAP reading and lexis for Thai engineering undergraduates.
Journal of English for Academic Purposes, 8(4), 294-301.
Wei, R. (2012). Zaitan waiyu dingliang yanjiu zhong de xiaoying fudu
[Effect size in L2 quantitative research revisited]. Xiandai Waiyu
[Modern Foreign Languages], 35(4), 416-422.
Wessa, P. (2014). Free Statistics Software. Office for Research Development
and Education, version 1.1.23-r7. Retrieved from:
http://www.wessa.net/
Wilkins, D. A. (1972). Linguistics in language teaching. London: Edward
Arnold.
Witte, S. P., & Faigley, L. (1981). Coherence, cohesion, and writing
quality. College Composition and Communication, 32(2), 189-204.
Yang, W. (1989). Cohesive chains and writing quality. Word, 40(1-2),
235-254.
Yu, T. (2013). Relationship between the EAP classroom approach and
plagiarism. Unpublished manuscript, Final Year Project, Xi’an
Jiaotong-Liverpool University, Jiangsu, China.
CHAPTER TWELVE
WHAT IS THE IMPACT OF LANGUAGE

LEARNING STRATEGIES ON TERTIARY
STUDENTS’ ACADEMIC WRITING SKILLS?
A CASE STUDY IN FIJI
ZAKIA ALI CHAND
Abstract
This study has assessed the language learning strategies used by a
group of undergraduate students at a tertiary institute in Fiji to find out if
there are any correlations with their academic writing proficiency. Data for
language learning strategy use were collected through a standard
questionnaire, using Oxford’s (1990) Strategy Inventory for Language
Learning (SILL). In-depth interviews were also conducted to further
explore the students’ language learning strategies (LLS) in early
childhood. An error analysis of students’ written texts was undertaken to
determine proficiency in academic language. The Statistical Package for
the Social Sciences (SPSS) was used for quantitative data analysis. The
results of this study showed that students used language learning strategies
with a medium frequency. Metacognitive strategies were used most
frequently followed by cognitive ones while affective and memory
strategies were used the least frequently. There was no significant
difference in the number and type of errors made in students’ written texts
both before and after writing strategies were taught. In the final analysis,
using Pearson’s correlation coefficient, there was no significant correlation
found between strategy use and the academic language proficiency of the
participants. Both successful and unsuccessful English language learners
used the same strategies with almost the same frequency. This study
concludes that proficiency in the academic writing of Fiji students is not
influenced by their use of language learning strategies.
What is the Impact of Language Learning Strategies 237
Introduction
In the 21st century, English has become the dominant global language
and it can be established that today English is used as a medium of
communication by more non-native than native speakers (Crystal, 1997;
Graddol, 1997). Globalization, the current advances in technology and
social media all have fueled a demand for English. In Fiji and the Pacific,
as elsewhere in the world, English has taken on an increasingly important
role and the individual reasons for this vary widely: from personal growth
and enhancement to higher education and better employment opportunities.
This is evident by the increasing number of enrolments in primary,
secondary and tertiary institutes where English has become a mandatory
subject. People with good communication skills and qualifications in
English are sought after in most work places and educational institutes. In
Fiji, as well as in most Pacific Islands, people use English as a lingua
franca (ELF) to communicate amongst themselves. For the majority of the
urban dwellers, English is the language of business, education,
entertainment, politics, and everyday living. However, in the rural areas,
the use of English as the language of daily living is not as high.
In Fiji, students go through learning English as a compulsory subject
for thirteen years of their primary and secondary school life. However,
when they enroll in tertiary institutions, it becomes apparent that
proficiency in their academic writing skills has not developed much over
the years. Educators in Fiji tertiary institutes find that in spite of eight
years of primary and four to five years of secondary school education with
English as the medium of instruction (EMI), students who enroll in the
local universities have weak academic writing skills. Though no
comprehensive research data is currently available on the exact areas of
weaknesses in academic writing of Fiji students, the following errors are
most commonly found in their written texts: tense, subject-verb
agreement, weak sentence structures, mechanics (in particular punctuation
and spelling), usage of articles, vocabulary, connectives, participles, word
forms, word choice, and direct and reported speech. Apart from
weaknesses in grammar and punctuation, students lack appropriate skills
and knowledge of the structure of formal letters, essays and reports. This
study investigates to what extent the use of language learning strategies
can enhance the academic writing proficiency of Fijian undergraduate
students who may not be aware of language learning strategies and
therefore do not use appropriate strategies to enhance their language
learning.
238 Chapter Twelve
Background
Language Learning Strategies
Researchers in second language learning and acquisition have long
recognized the role of the learner in the learning process and this subject is
the object of enquiry in much research. According to Ehrman and Oxford
(1995), the role of the learner is complex and determined by certain
variables which might correlate with successful language learning. Since
Rubin’s (1975) article on the good language learner, there has been much
interest and discussion on what makes some people successful at learning
a second or foreign language (Ellis, 1994; Grenfell & Harris, 1999;
Naimen, Frolich, Stern, & Todesco, 1978; Nakatani, 2006; O’Malley &
Chamot, 1990).
Much research has been done over the last three decades on the
characteristics and traits of successful language learners which can be
taught to less successful learners in ways that would benefit them. It now
appears that there is a multitude of factors that can affect language
learning and these include: personality type, learning style, aptitude,
motivation, and, the focus of this research, language learning strategies
(Ehrman & Oxford, 1995; Rubin, 1975).
The field of language learning strategies, which can be defined as the
methods learners use to aid their learning of a second or foreign language,
is complex. Focused research on this subject began in the 1970s (Naimen
et al., 1978; Rubin, 1975) with identifying and classifying good language
learning strategies. However, there is still much discussion going on about
the classification of these strategies and their relevance to language
learning and acquisition (Hsiao & Oxford, 2002). What has become clear
is that there can be effective methods or techniques that learners can use to
learn a second or foreign language successfully (Lessard-Clouston, 1997;
Oxford, 1990). But the challenge still remains for many teachers and
researchers on how to isolate the language strategies and teach them to
learners in a way that can improve their ability to use the second or foreign
language and put them on a path to a more self-directed and independent
learning (Chamot, 2005).
According to Scollon and Scollon (2004), language is “a multiple,
complex and kaleidoscopic phenomenon” (p. 272). When one thinks about
the intricacies of a language, its design, structure, grammatical systems,
phonology, and how it is used according to audience, purpose and context,
the challenges of learning a second language become overwhelming. At
this point, it is important to distinguish between English as a second
language (ESL) and English as a foreign language (EFL) as there are
important ramifications in terms of teaching, learning and researching in

the two contexts. ESL is acquired and learnt in an environment, such as
Fiji, where English is the main language of wider communication and the
learners are immersed in it (Ellis, 1994). On the other hand, EFL is a term
used when students learn a foreign language in an environment where
“they do not have ready-made contexts for communication beyond their
classroom” (Brown, 2007, p. 134). Students learning English in countries
like China, Japan and Iran are classified as EFL learners.
Oxford (1990) came up with six groups of learning strategies, namely:
memory, cognitive, compensation, metacognitive, affective, and social
strategies. The first three are direct strategies and the latter three are
indirect. Direct strategies are used in formal and informal settings where
learners use the target language in order to improve their language skills,
while indirect strategies help learners to manage and support their learning
without involving the target language directly (Wu, 2008). These six
strategies are measured using a questionnaire designed by Oxford (1990),
namely the Strategy Inventory for Language Learning (SILL). The SILL
employs a 5-point Likert scale to rate an individual’s use of the different
strategies. These strategies are interrelated and interact with one another
(Rahimi, Riazi, & Saif, 2008). According to Ellis (1994), Oxford’s
classification of language learning strategies was the most comprehensive
at the time. Twenty years may have lapsed since Ellis made that comment,
however, even more recent studies such as Hsiao and Oxford (2002) and
Chamot (2004) have testified on the superiority of the SILL as an
instrument for measuring language strategy use.
Although strategy use is often an unobservable event as it relates more
to the mental processes of a learner, the vast literature on factors
influencing it points to gender, ethnicity, motivation, proficiency, learning
styles and learners’ level of language proficiency. Of these, motivation,
proficiency, learning style and gender seem to have a strong correlation
with strategy use (Rahimi et al., 2008). Learning style differs from
learning strategies as it deals with the personality of the learners and their
naturalistic and habitual ways of acquiring a second language, while
learning strategies are conscious steps used by learners to develop their
linguistic competence (Lessard-Clouston, 1997).
Research done on strategy use over the last twenty years suggests that
language performance is related to language learning strategy (Boyce,
2009; Dreyer & Oxford, 1996; Roehr, 2004; Tsan, 2008) and that
strategies can be taught. In a study by Wu (2008) at the National Chin-Yi
University of Technology in Taiwan, it was found that students who had a
higher proficiency in English used these strategies more frequently than
240 Chapter Twelve
those with lower proficiency. Peacock and Ho (2003) also found a positive
correlation between twenty-seven strategies and learner proficiency. The
most frequent strategies used were compensation followed by cognitive,
metacognitive, social, memory and affective strategies. Higher proficiency
learners used cognitive and metacognitive strategies more frequently than
those with lower proficiency. Similar results on the correlation between
high levels of proficiency and an increased use of both direct and indirect
strategies were found in earlier research by Green and Oxford (1995).
Early research, from the 1970’s and 1980’s, found that successful
language learners had “a strong desire to communicate, were willing to
guess when unsure, and were not afraid of being wrong or appearing
foolish” (Rubin, 1975, p. 43). However, these learners were mindful of
correctness, form and meaning and monitored their own language as well
as that of those surrounding them. These strategies were not employed
universally by all successful language learners. It depended on the
learners’ target language proficiency, age, situation and cultural differences.
Fillmore (1982) reported similar findings in her research on individual
differences. She found that successful learners also used social strategies
as they “spent more time...socializing” (p. 285). By and large, research has
shown that a number of variables, such as gender, ethnicity, proficiency
level, socio-economic background, and level of motivation affect the type
and frequency of strategy use by second/foreign language learners
(Ehrman & Oxford, 1990; O’Malley, Chamot, Stewner-Manzanares,
Russo, & Kupper, 1985; Oxford & Nyikos, 1989).
Academic Language Proficiency and Error Analysis

Academic language is an essential and important component of tertiary
education (Scarcella, 2008). It is associated with academic success and
being proficient in the use of academic language motivates and empowers
students and gives them credibility in their chosen professions. At the
tertiary level, academic writing is probably the most important aspect of
teaching and learning. Good reading skills are also synonymous with good
writing skills. In fact, reading is one of the most important tasks required
for academic success. According to Grabe (1991), “literacy in academic
settings exists within the context of a massive amount of print
information” (p. 389). Research has found a strong correlation between
reading and proficiency in academic language (Bharuthram, 2012;
Lukhele, 2009; Oberholzer, 2005; Pretorius, 2002).
Much of the research on correlations between strategy use and
academic proficiency employed final exam scores, language proficiency
test results, and written and spoken tasks done in the classroom (Bremner,
1999; Ketabi & Mohammadi, 2012; Tam, 2013). There is little literature
on error analysis and its correlation with academic language proficiency.
However, researchers have mentioned the importance of error analysis and
its links to academic English proficiency (Michaelides, 1990; Richards,
Plott, & Platt, 1996).
Research done by Cohen (1998), Ehrman and Oxford (1989) and
Oxford (1990, 1993) showed that more frequent use of language learning
strategies is often related to higher levels of academic language
proficiency. However, according to Green and Oxford (1995), the picture
is not crystal clear as “it shows prominent features of the landscape but
only gives hints as to what the trees and buildings in the picture would
look like up close” (p. 261). In their study of university students studying
at different course levels in Puerto Rico, Green and Oxford (1995) found
that there was a positive correlation between strategy use and academic
proficiency. Bremner (1999), working with students from the City
University of Hong Kong, investigated strategy use and its correlation
with language proficiency. The results showed that the participants were
medium users of the learning strategies. The most frequently used strategy
was compensation, followed by metacognitive, cognitive, social, memory,
and affective strategies. The correlations between proficiency and strategy
use showed positive relations with cognitive and compensation strategies,
while there was a negative correlation with affective strategies. Goh and
Kwah (1997) had similar results in their study of Singaporean learners,
while Green and Oxford (1995) found that, in addition to these two
strategies, metacognitive and social strategies also showed positive
variation. As for the negative correlation between proficiency and
affective strategies, it could be that as learners become more proficient in
their language skills, they have less use of such strategies because their
confidence, knowledge and motivation have all increased.
The Study
The aim of this research study was to identify the second language
learning strategies used by tertiary students from the Republic of Fiji, and
investigate the impact these strategies have on their academic writing
skills. The research subjects were 95 first year undergraduate students and
10 final year students in a Bachelor of Arts in Literature/Language
program. Even though it was planned to have a balance of gender and
ethnicity in the sample, on the day of the data collection the females (67%)
outnumbered the males (33%), and the Fijian Indian students (64%)
242 Chapter Twelve
outnumbered the indigenous Fijians or I-Taukei students (32%). The

remaining 4% were made up of students from non-Fijian Indian and non-I-
Taukei backgrounds.
Three methods were used for data collection: firstly, the SILL, version
7.0, designed by Oxford (1990), was distributed to the participants during
class time. Secondly, interviews were conducted with students on their
language learning experience from childhood. Thirdly, data for
proficiency in academic language were collected from three sources: a
diagnostic test given at the beginning of the semester before any writing
strategies were taught, assignments, and final exam answer scripts.
The SILL was distributed to 105 students. Of these, 95 students were
first year undergraduate students enrolled in the English for Academic
Purposes course, and 10 were final year undergraduates majoring in
linguistics and literature. Data were analyzed using the Statistical Package
for the Social Sciences (SPSS), version 19. The qualitative data were
analyzed using coding techniques. Assignments were marked using a
software package called Markin 4, and final examination scripts were
marked manually using the same rubrics.

Use of Language Learning Strategies
The study reported an average frequency of 2.76 (out of 5) hence

students in this study were medium users of language learning strategies.
Metacognitive strategies were the most frequently used strategies,
followed by cognitive, social, compensation, memory, and affective
strategies. Similar results have been found in other studies (Boyce, 2009;
Deneme, 2008; Green & Oxford, 1995; Griffiths, 2004; Hong, 2006; Tsan,
2008).
Four strategies were found to have the highest use among the research
subjects. Three were cognitive or direct strategies and one was a
metacognitive or an indirect strategy:
Cognitive Strategies:
• Item 6: I watch English language TV shows or go to English
movies (Mean = 3.6).
• Item 8: I write notes, messages, letters, or reports in English (Mean
= 3.6).
• Item 9: I first skim an English passage (read over the passage
quickly) then go back and read carefully (Mean = 3.6).
Metacognitive Strategy:
• Item 2: I notice my English mistakes and use that information to
help me do better (Mean = 3.5).
Ten strategies were found to be ‘unpopular’ with means below 2.5.

There were five direct and five indirect strategies in this category.
Interestingly, none of the metacognitive strategies are in this category
while only one strategy from the cognitive group has made it to this list.
The rest of the ‘unpopular’ strategies belonged in the memory and
affective categories. Overall, memory and affective strategies recorded
significantly lower usage than the other four categories, and the majority
of these strategies were unpopular with the study participants. The
significantly low usage of memory strategies is surprising as quite often
teachers are under the notion that Fijian students rely heavily on rote
learning, especially since there is great importance given to success in the
national examinations. These results contradict such stereotypes.
Even though metacognitive strategies had the highest overall usage
with a mean of 3, the results for cognitive strategies showed items 6, 8 and
9 had the highest usage with a mean of 3.6. This is an important result. It
shows that Fijian students are high users of certain cognitive strategies,
which involve watching English TV programs and movies as a means to
learn English. Writing notes, letters and reports in English and skim
reading are also used with a high frequency. It is also interesting to note
that the overall means for cognitive strategies (Mean = 2.9) and
metacognitive strategies (Mean = 3) were very close.
Results also showed that social strategies were equally popular (Mean
= 2.8). Social strategies are used when interacting with other learners and
opportunities to improve and practice the language are available. Items 1
and 5 achieved a mean of 3.2. These results show that students socialize
using the English language in their daily lives and use opportunities to
learn from each other; however, they are reluctant to be corrected by their
peers or do not correct each other. One of the reasons for this may have to
do with cultural protocols and rules of politeness in their culture. In Fiji,
among both the major ethnic groups, indigenous Fijians and Fijians of
Indian descent, certain customary rules prevent people from correcting
each other’s mistakes in spoken discourse. Nevertheless, social strategies
had the third highest usage among the six strategy categories, only
preceded by cognitive and metacognitive strategies.
244 Chapter Twelve
Relationship between Strategy Use and Gender/Ethnicity

Results showed that there is no relationship between gender, ethnicity
and strategy use. In Table 12-1 it can be seen that the correlation
coefficient (r) between memory strategy use and gender had a value of
0.169 while p > .05. The coefficient was not statistically significant and
the weak correlation shows a negligible relationship between gender and
memory strategy use. Likewise, the correlation coefficients between
gender and the rest of the strategies were close to 0 and declining to a
negative figure in metacognitive, affective and social strategies, with p >
.05. Ethnicity showed overall a negative correlation with all the strategy
categories. Both gender and ethnicity did not show any significant
correlation with strategy use.
Table 12-1: Correlations between strategy use, gender and ethnicity.
Strategies
Communication
Metacognitive
Cognitive
Ethnicity
Affective
Memory
Gender
Social
r 1 .075 .169 .187 .050 .137 -.020 -.085
Gender
Sig. .448 .084 .056 .616 .165 .839 .386
r .075 1 -.054 -.187 -.165 -.032 -.099 -.045

Ethnicity
Sig. .448 .582 .055 .092 .746 .317 .649
Note: N = 105; r = Pearson correlation coefficient; Sig. = Significance value.
Interview Analysis
The interview data confirmed the results from the analysis of the SILL
questionnaire. Most of the participants were not high users of language
learning strategies. Social strategies were seldom used by the participants
to learn English, both within the family and with relatives and friends.
Often social activities were conducted using the participants’ first
language (L1). Hence, social, memory and affective strategies once again
are at the bottom of strategy use.
Error Analysis of Academic Writing

Essay samples were collected from a diagnostic test, a course
assignment and the final examination. Essays were examined for
adherence to grammatical rules of Standard English as well as structure
and formatting. Every error was identified and labeled using the rubrics
from Markin 4. Table 12-2 below provides a breakdown of all the errors
from highest to lowest that occurred in all the texts that were analyzed.
When all the texts were combined, a total of 10,313 errors were found in
the diagnostic tests, assignments and final exam scripts of the research
subjects.
Table 12-2: Results of the error analysis.
Error % Error %
Punctuation 14.3 Article 2.7
Word Choice 10.9 Verb Form 2.6
Cut (Unnecessary text) 7.9 Conjunction 2.3
Repetition 7.7 Vague Reference 1.9
Agreement (subject/verb) 7.1 Sentence Fragment 1.9
Plural (singular/plural) 6.3 Capitalization 1.3
Preposition 6.1 Word Order 0.7
Incomprehensible Text 5.3 Inaccurate Quotation 0.7
Missing Word/s 4.8 Parallel Construction 0.3
Word Form 4.1 Missing Space 0.3
Modifier (misplaced) 3.9 Count/Non-Count 0.2
Spelling 3.3 Paragraphing 0.2
Verb Tense 3.1 Formatting 0.01
The analysis shows that the greatest number of errors occurred in

punctuation followed by word choice, unnecessary text (cut), repetition (of
information or text), agreement, singular/plural, preposition, and
incomprehensible text. More than half of all the errors were made in the
first six categories. The most common errors made were in punctuation,
word choice, relevancy of information, cohesion and coherence, subject-
verb agreement, use of singular and plural, and use of prepositions.
246 Chapter Twelve
Academic Language Proficiency and Strategy Use
It was hypothesized in this study that the higher the use of language
strategies, the fewer the errors in students’ academic writing. The Pearson
correlation coefficient was used for this analysis because the data were
parametric. In parametric correlations, the correlation coefficient (r) shows
the strength of the relationship between two variables. Table 12-3 below
shows the results of the Pearson correlation between language learning
strategies and overall errors from all the written texts analyzed. According
to Cohen and Cohen (1983), a correlation coefficient of 0.22 indicates a
small positive linear correlation. The significance was 0.04, which is p <
.05. Therefore, the results are statistically significant. There was a small
positive linear correlation between errors and learning strategies. As more
writing strategies were used by students over the semester, the number of
errors in their written work also increased. Therefore, the results show that
the language learning strategies used by the students did not have any
significant impact on their academic language proficiency.
Table 12-3: Correlation between strategy use and language

proficiency.
Errors Total LLS Total

Errors Total Pearson Correlation 1 .220*
Sig. (2-tailed) .040
LLS Total Pearson Correlation .220* 1
Sig. (2-tailed) .040
Note: N = 88; LLS = Language Learning Strategies; * = Significant at the .05
level.
Conclusion
The research study reported in this chapter found that university
students in Fiji used language learning strategies at a medium level. The
most frequent strategies used were metacognitive followed by cognitive
and social strategies. Affective strategies were used the least.
Undergraduate students in Fiji are aware of the strategies they use to learn
English and are taking control of their learning, albeit at a medium, level.
The study also found that ethnicity did not have a significant influence on
strategy use. Students’ ethnic background, i.e., whether they were
indigenous Fijian or Fijians of Indian origin, was not significantly
correlated with strategy use. Both major ethnic groups displayed
similarities in their use of language learning strategies. Overall, gender

and ethnicity had no significant correlation with strategy use.
The study also found that the strategies students used were not
significantly correlated with their academic language. This is contrary to
previous findings as many studies, including Green and Oxford (1995),
Bremner (1999) and Al-Hebaishi (2012), have found that cognitive and
metacognitive strategies had a positive impact on academic language
proficiency while affective strategies had an inverse relationship. This
study has shown that no single strategy had a direct impact on the
participants’ proficiency in academic writing.
The research study also found that a high percentage of Fiji students at
the tertiary level do not have the required proficiency for academic
writing. Students make a large number of errors in grammar, punctuation,
sense and style. In the students’ essays, there were several types of
recurrent errors in word choice and form, wrong use of prepositions,
confusion with verb tenses and forms, and subject-verb agreement. The
most common types of errors occurred in punctuation, followed by
vocabulary, use of singular and plural, redundancies, incomprehensible
text, and subject-verb agreement. The analysis has shown students’
difficulties in understanding correct grammatical structures resulting in
weak expressions, poor choice of vocabulary, repetition and redundancies.
Both successful and unsuccessful learners used the language strategies
with equal frequency. No evidence was found that more successful
students used more language learning strategies than the less successful
ones. The results showed that the language strategies were used at a
medium level, and the number of errors in academic writing increased in
spite of strategy use. Memory strategies were one of the least frequently
used strategies. Furthermore, the skills of application, analysis, synthesis
and evaluation were not mastered by university undergraduates who
participated in the study; hence, the strategies which involve more
analytical activities were not used frequently by students.
Many factors influence the success of language learners. However, a
focus on cognitive and language skills is required for academic success.
Research indicates that successful language learners are aware of the
strategies they use and why they use them (Green and Oxford, 1995;
O’Malley & Chamot, 1990), and they generally tailor their learning
strategies to the language task and to their own personal needs as learners
(Wenden, 1986). Ellis and Sinclair (1989) also suggested that learners can
achieve their goals by focusing their attention on the process so that they
can become more effective learners and take on more responsibility for
their own learning.
248 Chapter Twelve
This study has shown that second language learners in Fiji are not quite
aware of their learning strategies. There is a need for further research into
the language learning strategies of Fijian students with a larger sample size
and from institutions at all levels: primary, secondary and tertiary. Other
factors must be explored to determine what is impacting academic
language proficiency (or lack of it) among undergraduate students of Fiji.
There is a need to consciously teach language learning strategies in the
teacher training courses so teachers can then integrate strategy use and
training in their lessons. All teachers, irrespective of the subjects they
teach, should be able to identify strategies by name, describe them and
model them. Strategy training should be integrated within the curriculum
rather than taught as a separate entity and it should start with beginner
students even if this means providing the training in the students’ first
language. Students need to have experience with a variety of strategies to
be able to choose the one that works well with them. In case of failure in
language learning, students need to be assured that their failure may not be
due to lack of intelligence, but to the inability to choose appropriate
strategies.
References
Al-Hebaishi, S. M. (2012). Investigating the relationships between
learning styles, strategies and the academic performance of Saudi
English majors. International Interdisciplinary Journal of Education,
1(8), 510-520.
Bharuthram, S. (2012). Making a case for the teaching of reading across
the curriculum in higher education. South African Journal of
Education, 32, 205-214. Retrieved from:
http://www.ajol.info/index.php/saje/article/viewFile/76602/67051
Boyce, A. (2009). The effectiveness of increasing language learning
strategy awareness for students studying English as a second
language. Master’s dissertation, Auckland University of Technology,
New Zealand.
Bremner, S. (1999). Language learning strategies and language
proficiency: Investigating the relationship in Hong Kong. Asia Pacific
Journal of Language in Education, 1(2), 490-514. Retrieved from:
http://utpjournals.metapress.com/content/d27w088833436k7x/
Brown, H. D. (2007). Principles of language learning and teaching (5th
Edition). White Plains, NY: Person Education.
Chamot, A. U. (2004). Issues in language learning strategy research and

teaching. Electronic Journal of Foreign Language Teaching, 1(1), 14-
26. Retrieved from: http://e-flt.nus.edu.sg
—. (2005). Language learning strategy instruction: Current issues and
research. Annual Review of Applied Linguistics, 25, 112-130.
Cohen, A. D. (1998). Strategies in learning and using a second language.
Essex, UK: Longman.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation
analysis for the behavioral sciences (2nd edition). Hillsdale, NJ:
Erlbaum.
Crystal, D. (1997). The Cambridge encyclopedia of language (2nd edition).
Cambridge: Cambridge University Press.
Deneme, S. (2008). Language learning preferences of Turkish students.
Journal of Language and Linguistic Studies, 4(2), 83-93.
Dreyer, C., & Oxford, R. (1996). Learning strategies and other predictors
of ESL proficiency among Afrikaans speakers in South Africa. In R.
Oxford (Ed.), Language learning strategies around the world: Cross-
cultural perspectives (pp. 61-74). Honolulu: University of Hawai‘i at
Manoa.
Ehrman, M., & Oxford, R. (1989). Effects of sex differences, career
choice, and psychological type on adult language learning strategies.
The Modern Language Journal, 73(1), 1-13.
Ehrman, M., & Oxford, R. (1990). Adult language learning styles and
strategies in an intensive training setting. The Modern Language
Journal, 74(3), 311-327.
Ehrman, M., & Oxford, R. (1995). Cognition plus: Correlates of language
learning success. The Modern Language Journal, 79(1), 67-89.
Ellis, R. (1994). The study of second language acquisition. Oxford:
Oxford University Press.
Ellis, G., & Sinclair, B. (1989). Learning to learn English. Cambridge:
Fillmore, L. W. (1982). Instructional language as linguistic input: Second
language learning in classrooms. In L. C. Wilkinson (Ed.),
Communication in the classroom (pp. 283-296). New York: Academic
Press.
Goh, C. C. M., & Kwah, P. F. (1997). Chinese ESL students’ learning
strategies: A look at frequency, proficiency and gender. Hong Kong
Journal of Applied Linguistics, 2(1), 39-54.
Grabe, W. (1991). Current developments in second language reading
research. TESOL Quarterly, 25(3), 375-406.
250 Chapter Twelve
Graddol, D. (1997). The future of English? Retrieved from:

http://www.britishcouncil.org/learning-elt-future.pdf
Green, J. M., & Oxford, R. (1995). A closer look at learning strategies, L2
proficiency, and gender. TESOL Quarterly, 29(2), 261-297.
Grenfell, M., & Harris, V. (1999). Modern languages and learning
strategies: In theory and practice. London: Routledge.
Griffiths, C. (2004). Language learning strategy use and proficiency: The
relationship between patterns of reported language learning strategy
(LLS) use by speakers of other languages (SOL) and proficiency with
implications for the teaching/learning situation. Doctoral thesis,
University of Auckland, New Zealand. Retrieved from:
https://researchspace.auckland.ac.nz/bitstream/handle/2292/9/02whole.
pdf?sequence=6
Hong, K. (2006). Beliefs about language learning and language learning
strategy use in EFL context: A comparison study of monolingual
Korean and bi-lingual Korean-Chinese university students. Doctoral
thesis, University of North Texas, USA.
Hsiao, T., & Oxford, R. (2002). Comparing theories of language learning
strategies: A confirmatory factor analysis. Modern Language Journal,
86(3), 368-383.
Ketabi, S., & Mohammadi, A. M. (2012). Can learning strategies predict
language proficiency? A case in Iranian EFL context. International
Journal of Linguistics, 4(4), 407-418. Retrieved from:
http://www.macrothink.org/journal/index.php/ijl/article/view/2914
Lessard-Clouston, M. (1997). Language learning strategies: An overview
for L2 teachers. The Internet TESL Journal, 3(12). Retrieved from:
http://iteslj.org/Articles/Lessard-Clouston-Strategy.html
Lukhele, B. S. B. (2009). Exploring relationships between reading
attitudes, reading ability and academic performance among teacher
trainees in Swaziland. Doctoral thesis, University of South Africa,
South Africa. Retrieved from:
http://uir.unisa.ac.za/bitstream/handle/10500/3435/dissertation_lukhele
_bs.pdf
Michaelides, N. N. (1990). Error analysis: An aid to teaching. English
Teaching Forum, 28(4), 28-30.
Naiman, N., Frohlich, M., Stern, H., & Todesco, A. (1978). The good
language learner. Research in Education Series No. 7. Toronto: The
Ontario Institute for Studies in Education.
Nakatani, Y. (2006). Developing an oral communication strategy
inventory. Modern Language Journal, 90(2), 151-168.
Oberholzer, B. (2005). The relationship between reading difficulties and

academic performance among a group of foundation phase learners
who have been: (1) identified as experiencing difficulty with reading
and (2) referred for reading remediation. Master’s dissertation,
University of Zululand, South Africa. Retrieved from:
http://uzspace.uzulu.ac.za/bitstream/handle/10530/238/
O’Malley, J. M., Chamot, A. U., Stewner-Manzanares, G., Russo, R. P., &
Kupper, L. (1985). Learning strategy applications with students of
English as a second language. TESOL Quarterly, 19(3), 557-584.
O’Malley, J. M., & Chamot, A. U. (1990). Learning strategies in second
language acquisition. Cambridge: Cambridge University Press.
should know. Boston: Heinle and Heinle Publishers.
—. (1993). Instructional implications of gender differences in
second/foreign language learning styles and strategies. Applied
Language Learning, 4(1-2), 65-94.
Oxford, R., & Nyikos, M. (1989). Variables affecting choice of language
learning strategies by university students. The Modern Language
Journal, 73(3), 291-300.
Peacock, M., & Ho, B. (2003). Student language learning strategies across
eight disciplines. International Journal of Applied Linguistics, 13(2),
179-200.
Pretorius, E. J. (2002). Reading ability and academic performance in
South Africa: Are we fiddling while Rome is burning? Language
Matters: Studies in the Languages of Africa, 33(1), 169-196.
Rahimi, M., Riazi, A., & Saif, S. (2008). An investigation into the factors
affecting the use of language learning strategies by Persian EFL
learners. Canadian Journal of Applied Linguistics, 11(2), 31-60
Retrieved from:
http://www.aclacaal.org/Revue/vol-11-no2-art-rahimi-riazi-saif.pdf
Richards, J. C., Plott, J., & Platt H. (1996). Dictionary of language
teaching and applied linguistics. London: Longman.
Roehr, K. (2004). Exploring the role of explicit language in adult second
language learning: Language proficiency, pedagogical grammar and
language learning strategies. Working paper no. 59, Centre for
Research in Language Education, Lancaster University.
Rubin, J. (1975). What the “good language learner” can teach us. TESOL
Quarterly, 9(1), 41-51.
Scarcella, R. (2008). Defining academic language. Paper presented at the
National Clearinghouse for English Language Acquisition Web
252 Chapter Twelve
Conference, August 21st, University of California, Irvine, USA.

Retrieved from:
http://www.ncela.us/files/webinars/1/scarcella_8-21-08.pdf
Scollon, R., & Scollon, S. W. (2004). Nexus analysis: Discourse and the
emerging Internet. London: Routledge.
Tam, K. C. (2013). A study on language learning strategies (LLSs) of
university students in Hong Kong. Taiwan Journal of Linguistics,
11(2), 1-42.
Tsan, S.-C. (2008). Analysis of English learning strategies of Taiwanese
students at National Taiwan Normal University. Educational Journal
of Thailand, 2(1), 84-94. Retrieved from:
http://www.edu.buu.ac.th/journal/journalinter/Second%20EJT_08/9suz
an.pdf
Wenden, A. L. (1986). Incorporating learner training in the classroom.
System, 14(3), 315-325.
Wu, Y.-L. (2008). Language learning strategies used by students at
different proficiency levels. Asian EFL Journal, 10(4), 75-95.
CHAPTER THIRTEEN
SPEAKING PRACTICE IN PRIVATE CLASSES

FOR THE TOEFL IBT TEST:
STUDENT PERCEPTIONS
RENATA MENDES SIMÕES
Abstract
This chapter presents a research study conducted at an English for
Specific Purposes (ESP) one-to-one course focusing on speaking skills, in
order to find out if the course met the students’ learning needs and
prepared them to take the Test of English as a Foreign Language–Internet-
based Test (TOEFL iBT). The study was grounded on the theoretical
principles of teaching ESP, needs analysis, task-based teaching, and
language assessment. The instruments for the data collection were: initial
and final questionnaires; an audio recording of two speaking tasks on the
first and last day of class; and the teacher-researcher’s diaries at the end of
every class containing the students’ perceptions of their performance in
class. The results revealed the students’ satisfaction regarding the course
methodology and material, as well as the students’ perception of
improvement in their speaking and writing skills. The students’ narratives
also indicated the importance of teacher-student interaction and praised the
attention given by the teacher to their emotional aspects. The study
contributes to the field of ESP and language assessment, and fills the
research gap that exists in the teaching of speaking skills in private classes.
Introduction
The increasing number of students seeking to study graduate courses in
English-speaking countries has led to an unprecedented demand for the
Test of English as a Foreign Language–Internet-based Test (TOEFL iBT).
Many universities in English-speaking countries require international
254 Chapter Thirteen
candidates to show proof of language proficiency through a minimum

TOEFL iBT score.
The current literature on teaching and learning in private lessons in
preparation for standardized language proficiency tests is greatly lacking.
Although there is a wide range of books, academic papers, and courses in
language schools aimed at preparing students for the TOEFL iBT, there is
little research on the preparation of students in private lessons.
According to Dudley-Evans and St. John (1998), private classes can be
considered the purest form of teaching English for Specific Purposes
(ESP), because the need of each student determines their learning.
Moreover, Hutchinson and Waters (1987) suggest that ESP should be
considered an approach for language teaching, and they emphasize that the
learning should be based on the students’ needs. In other words, all the
decisions as to the content and methods of teaching in an ESP course
should be based on the students’ needs.
This chapter presents the findings of a research study carried out in Sao
Paulo, Brazil, with 17 students attending private preparatory classes for the
TOEFL iBT proficiency test. The course was developed by the researcher
and it was adapted and modified for each student.
Background
English for Specific Purposes (ESP)
According to Hutchinson and Waters (1987), ESP courses aim at

helping students perform adequately in the target-situation, i.e., the
situation in which they are going to use the language being learned. The
main characteristics of this teaching approach are: the student’s awareness
of why he/she is learning the language and satisfying the student’s needs
for the language use in the new context. Similarly, Dudley-Evans and St.
John (1998) consider ESP an approach. They claim that ESP should reflect
teaching with a different methodology from those used in general English
teaching. The greatest concern regarding ESP teaching should be the needs
analysis and the preparation of students to communicate efficiently in
tasks related to their studies or work (Dudley-Evans & St. John, 1998).
This view is also corroborated by Basturkmen (2010) who states that ESP
teaching involves the discussion of the students’ needs and the role these
needs will play in their working and studying environments. Basturkmen
(2010) also emphasizes that ESP courses aim at teaching the language and
communication abilities that specific groups of students will need to make
themselves understood in working, studying and social contexts.
Speaking Practice in Private Classes for the TOEFL iBT Test 255
Therefore, ESP courses need to focus on teaching language and

communicative abilities.
Needs Analysis
According to Hutchinson and Waters (1987), a needs analysis should
gauge the students’ learning needs and not the teachers’ teaching needs.
For them, the difference between an ESP course and a general English
course is not so much the nature of the need, but the awareness of such
need. This is one of the most important aspects for ESP course design,
which should be divided between the target-situation needs (what the
student must do in the target-situation), that can be further subdivided into
necessities, lacks and wants, and the student learning needs (what the
student should do to learn).
Dudley-Evans and St. John (1998) also consider needs analysis
extremely important for ESP courses, as it allows for a much more focused
course. They claim that needs analysis is the process to determine “what to
do” and “how to do” a course (Dudley-Evans & St. John, 1998, p. 121).
The data collection for the needs analysis can be carried out through
questionnaires, interviews, surveys, assessments and discussions. Long
(2005, p. 19) states that “in changing times, educators are increasingly
relying on their needs analysis results in order to develop new courses.”
But he also warns that the respondents are usually the very same students
who are not always aware of what they will need in the target language
(L2). One example is the international students who are preparing to attend
graduate courses in English-speaking countries.
Tasks in ESP
For Willis (1996), task-based teaching should: stimulate students to use
the target language collaboratively and meaningfully; allow students to
participate in a complete interaction and to use different communication
strategies; and help students develop self-confidence to reach their
communicative goals. Based on the consensus of several researchers and
educators, Skehan (1998) suggested four criteria to define a task: (i) the
meaning is essential; (ii) the focus is on the objective; (iii) the task product
must be assessed, and (iv) there must be a relation to the real world. A
similar concept was also proposed by Willis (1996), as for her, tasks are
activities in which the target language is used by the learners with a
communicative objective in order to reach a result. Willis (1996)
highlights that task-based teaching should give the learners the freedom to
choose their linguistic form in order to reach an objective, that is, to

convey their ideas. This way, language is used as a vehicle to reach the
objectives of the task with emphasis on meaning and communication, not
on correct production of linguistic forms.
Timing is an important aspect because one of the main characteristics
of the TOEFL iBT is the limited time (few seconds) test-takers have to
prepare the answers before carrying out the tasks. Willis (1996) mentions
that learners must know how to start a task and how long they have to
prepare and carry it out. Ellis (2003, p. 127) discusses the effects of
planning, which he calls “strategic planning”, in the learners’ oral
production during communicative tasks. He defines strategic planning as:
“The process by which learners plan what they are going to say or write
before commencing a task. Pre-task planning can attend to prepositional
content, to the organization of information or to the choice of language.
Strategic planning is also referred to as pre-task planning.” (Ellis, 2005a, p.
50)
Based on Skehan (1998), Ellis (2003) recommends a set of criteria to

determine the level of fluency, accuracy and complexity of foreign
language production in communicative tasks (Table 13-1).
Table 13-1: Classification of production variables in oral

communication tasks (Adapted from Ellis, 2003, p. 117).
Aspects Measures
Number of words and syllables per minute
Number of pauses (one/two seconds or longer)
Fluency
Number of repetitions and reformulations
Number of words per turn
Number of self-corrections
Percentage of error-free clauses
Accuracy
Use of verb tenses/articles/vocabulary/plurals/negatives
Ratio of definite and indefinite articles
Number of turns per minute
Lexical richness
Complexity
Amount of subordinate clauses
Frequency of use of prepositions and conjunctions
Ellis (2005a) recommends that, when the learner has the opportunity to
make use of the strategic planning before the task, his language production
will be more fluent and show more complexity. Although there are many
studies about strategic planning in language laboratories and classrooms,

Ellis highlights the need for more studies to verify the benefit of such
planning in exam contexts. The effects of strategic planning in exam
contexts can be a little different from those in classroom settings. Ellis
(2005b) mentions that one reason might be that learners feel pressure in
assessment contexts and the results might not be the expected ones. The
preparation and adjustment to the restricted timing for answering the test
tasks receives special attention in a preparatory ESP course.
Tests and Assessments

Language assessments play an essential role in Applied Linguistics,
operationalizing its theories and supplying researchers with data for their
analysis of language use (Clapham, 2000). McNamara (2000) also
discusses this issue, as language tests are of great importance to many
people, working as gateways in key educational and employment
moments. This is the case of the TOEFL iBT test, through which
international students may or may not be accepted in English-speaking
universities. Therefore, this is a high-stakes test as it is a pre-requisite for
student admission in graduate studies. High-stakes exams require a long
preparation because the result will be critical for the academic future of
these learners.
Proficiency tests such as the TOEFL iBT or the International English
Language Testing System (IELTS) are not based on a course content.
Instead, they assess the adequacy of the students’ language for future use
in the target situation and their ability to attend academic courses delivered
through the medium of English (Jordan, 1997).
The concept of washback (or backwash) is another very important
aspect related to preparatory courses discussed by Scaramucci (2011),
among many other scholars. She explains that the concept refers to how
outside exams–especially high-stakes ones, such as college entrance
exams and some language proficiency tests–can potentially have an impact
on the teaching and learning process, the curriculum, materials
development and on the attitudes of students, teachers and the school
involved in the exams. Washback in teaching and learning is undeniable
and it is out of the control of test developers (Scaramucci, 2009).
This chapter looks at a preparatory course for the TOEFL iBT which
was influenced by the test as it was developed taking into account the test
constructs, how the test is applied, and above all, the students’ needs. The
aim of the research study was to investigate how this ESP course met the
learning needs of the students and prepared them to take the TOEFL iBT
in a one-to-one class environment.
The Study
This is a qualitative exploratory research study. The main research
questions that guided the study were:
1. What are the students’ needs with regards to taking the TOEFL
iBT?
2. How do students perceive their language development during the
TOEFL iBT preparatory course?
3. How do students perceive the TOEFL iBT preparatory course?
The investigation strategy underlying the research questions was the

case study, as proposed by Stake (1988) and Johnson (1992), as it is a
research approach which allows the investigation of a specific situation
within a specific context. Stake (1988) considers the case study not only a
methodological choice, but mainly the choice of the object to be studied.
He adds that the main feature of case studies is the presence of the
researcher in the context, the contact and the direct involvement of the
researcher with the activities of the case, always reviewing and reflecting
about the events.
The research context was the preparatory course for the TOEFL iBT,
developed by the researcher. The course comprised private lessons
focusing on the speaking tasks of the test. The data collection was carried
out over a period of 24 months.
Participants
The participants of this study were 17 students attending the
preparatory course for the TOEFL iBT. They were 7 female and 10 male
young adults, mostly (71%) between 21 and 30 years old. All but one were
graduate students, most of them had advanced levels of English, and only
three were at an intermediate proficiency level. Their language proficiency
level (basic, intermediate, intermediate/advanced or advanced) was
classified informally on the first day of class, taking into consideration
their grammar level, their capacity to express themselves without
hesitations and their vocabulary mastery. In order to protect the students’
identity, their names have been omitted and each student is identified by
the letter ‘A’ followed by a number, from 1 to 17.
Data Collection
The study data collection instruments and the procedures for the data
collection helped to answer the three research questions. The data
collection was divided into three consecutive phases. Table 13-2 provides
a summary of the data collection procedures in each phase of the project,
and the procedures in each of these phases is explained in detail in the
following sections.
Table 13-2: Data collection procedures in each phase of the study.
Phase Procedure
1st x initial questionnaire
x recording of student’s initial speaking task
x assessment of student’s initial speaking task
2nd x interview at the end of each class
3rd x recording of student’s final speaking task

x assessment of student’s final speaking task
x final questionnaire
x student’s score on the TOEFL iBT test
Phase 1
The initial questionnaire was used to find out and analyze the students’
learning needs. From the tabulation of the data, it was possible to
understand the target audience profile, their needs, and tailor the course
content for each student.
The initial oral production recording of each student responding to
sample Task 1 and 2 of the TOEFL iBT Speaking Section, aimed to
provide some data for each student, such as fluency, lexical richness,
accuracy, time used to formulate answers, attitude and reaction to the
limited time for responses. Task 1 of the TOEFL iBT Speaking Section
asks the test-taker to give a personal opinion about a topic, and Task 2
asks the test-taker to make a personal choice between two options and give
reasons and examples.
The recordings were assessed according to similar criteria used by the
Educational Testing Service (ETS) who are responsible for the TOEFL
iBT, and the theoretical framework by Ellis (2003) (see Table 13-1 above)
regarding fluency, complexity and accuracy of oral production. Based on
these criteria, an evaluation form was developed for the speaking task
output. As in the TOEFL iBT, the scores ranged from 1 to 4 (1 = weak, 2

= fair, 3 = good, and 4 = excellent).
Phase 2
The second phase of data collection occurred at the end of each class.
An interview consisting of three open-ended questions was carried out
with each participant, in order to understand and evaluate the perceptions
of students with regards to their oral production difficulties and the
activities they performed during that lesson. The three interview questions
were:
1. What did you think of today’s lesson activities?

2. What difficulties did they present for you?
3. How was your performance today compared to the last class?
With the transcription of all the answers, the data were classified into
three categories elaborated a priori; i.e., activities, difficulties and
performance (see Bardin, 2011). The themes emerged after the analysis of
all the responses which were initially grouped by similarity of content. The
topics that were most often mentioned and later on guided the analysis
were: cognition, affection and methodology.
Phase 3
In the third and final phase of data collection, the final questionnaire
was used in order to find out if the course had met the students’ specific
needs raised in the beginning of the course, and the level of support the
course offered them to take the TOEFL iBT.
Also, the students’ performance on sample Task 1 and 2 of the TOEFL
iBT Speaking Section was recorded. The content of this recording was
compared with the initial speaking tasks recording and provided
information for analysis of the development of students’ speech
production. As with the initial oral production, the evaluation of these final
tasks was performed using the same evaluation form.
The analysis of the participation and performance of the 17 students
aimed at evaluating the adequacy of the course from the students’
perspective and assessing their linguistic ability.

Students’ Needs
The data from the initial questionnaire showed that the majority of the
participants indicated that the reason for choosing the ESP course was due
to its specific focus (44%), followed by the fact that they considered one-
to-one tutoring more efficient than classroom teaching and learning (26%).
In terms of their course expectations, most students reported that they
wanted to obtain the minimum score required for entering their desired
study programs (71%), and enhance their speech production (41%). These
data confirm the importance of the ESP course for preparing candidates
for a language proficiency test.
With regards to the test constructs, the study investigated how students
felt when speaking English under time pressure. This information was
taken into account during the lessons as it was necessary to ensure that
students were able to adapt to the time limitations of the test. Out of the 17
students, 11 confirmed in the initial questionnaire that they could not
express themselves well under pressure. Students reported that they felt
nervousness, discomfort, anxiety, and panic.
From the answers to the initial questionnaire, it was possible to detect
that fluency to express themselves in the English language and the
development of complex ideas were two items with which most students
had difficulty (29%).
Students were also asked to report how they evaluated each of their
English language skills using the TOEFL iBT scale. The data from this
question helped to better tailor the course to the needs of the students.
Table 13-3 presents the results of the students’ answers.
Table 13-3: Student perceptions of their language skills before the

course.
Language skills before the course

Excellent Good Fair Weak Total
Reading 9 6 2 0 17
Listening 7 8 2 0 17
Speaking 0 11 6 0 17
Writing 3 7 4 3 17
Note: N = 17
The results showed that all students considered their oral production
either fair (6 students) or good (11 students) before the course started. That
was a good indication that they needed to improve this skill during the
course. It should be noted that as all the students needed to reach a
minimum score of 85% in the test, considering their speaking ability was
just ‘good’ was not enough to reach the desired score. This need was also
highlighted in the initial questionnaire as their main reason for attending
the course was the enhancement of their speaking ability. Writing skills
were also worked on extensively throughout the ESP course, as this was
the only skill rated as weak (3 students). Interestingly, students reported
having greater difficulty in language production (speaking and/or writing)
and less difficulty in language comprehension (listening and/or reading).
Students’ Performance
With regards to student language development and performance, the
final questionnaire data at the end of the course showed that fluency and
vocabulary were still a problem for students. Although 35% of students
mentioned fluency as the greatest difficulty in speaking English even after
they attended the course, and 29% of them signaled a lack of vocabulary
as something that would still hinder their oral production, the vast majority
(71%) felt more confident at the end of the course (see Figure 13-1).
The perceptions of students regarding factors that are more measurable,
such as lack of vocabulary or grammar, cause them less concern or are
even minimized when compared to the more subjective factors, such as
fluency and objectivity when describing details and reasons in the strictly
timed answers. Interestingly, all these factors are interrelated, because the
grammar and the vocabulary level will influence the fluency and the
objectivity of the answers within the 45 seconds allowed in the TOEFL
iBT.
At the end of the course 47% of students claimed to feel more
confident and fluent in English (see Figure 13-2). The high number of
answers related to greater confidence (71%) shows that one of the main
initial difficulties of the students was overcome by the end of the course.
Sppeaking Practicee in Private Claasses for the TO
OEFL iBT Test 263
7
6
Number of answers
5
4
3
2
1
0
Figure 13-1: D he course (N = 17).

Difficulties witth language at th
Student percepttions of their language ability aat the end of th

Figure 13-2: S he course
(N = 17).
The incrrease in studennts’ self-esteeem was alwayys one of the objectives

o
and concernns during thhe course, ass the emotioonal factor in nfluences
students’ pperformance. Furthermoree, helping students gain more
confidence m meets the assuumptions of task-based
t leaarning as Willlis (1996)
claims it is essential to develop
d studeents’ confidennce so that th
hey reach
their commuunicative goalls.
T
In the finnal questionnnaire, students were asked tto rate (on a scale
s of 0
to 5) their level of learnning as a ressult of the acctivities in thhe regular
classes. Figuure 13-3 showws the results.
10
8
Number of answers
0
5 4 3 2 1 0
5 = highest 0 = lowest
Figure 13-3: S
Student self-perrceived learning
g (N = 17).
Results showed that 7 students (4 41%) rated thheir learning with the
highest scorre (5) and 9 students
s (47%%) each gave a 4 for their language
ability at thhe end of thee course. Acccording to thhese results, it i can be
inferred thatt the ESP couurse met the needsn mentionned by the sttudents in
the initial quuestionnaire att the beginnin
ng of the coursse.
In both tthe initial andd final questio
onnaires, studeents were askeed to rate
their four laanguage skillls. This questtion offered ffour responsee options:
excellent, goood, fair and poor.
p The folllowing two figgures (Figure 13-4 and
Figure 13-5)) show the evoolution of thiss perception frrom the studen nts’ point
of view, whhile Table 133-4 comparess the initial w with the finaal student
perceptions..
12
10
8 Reading
6 Listening
4 Speaking
2 Writing
0
Weak Fair Good Excellent
Figure 13-4: Students’ perceptions of their language skills before the course (N =
17).
12
10
Reading
8
Listening
6
Speaking
4
Writing
2
0
Weak Fair Good Excellent
Figure 13-5: Students’ perceptions of their language skills after the course (N =
17).
Table 13-4: Change in students’ perceptions (N = 17).
Perception
Skill Same Better Worse
Reading 12 4 1
Listening 12 4 1
Speaking 9 7 1
Writing 5 11 1
Looking at the results above, it can be noted that students perceived a

general improvement in all four skills. The number of students (11) who
felt an improvement in their written production was 63% higher than the
number of students (7) who noticed an improvement in their oral
production. With regards to speaking skills, which was the focus of the
ESP course, among the 17 students, eight of them perceived their speaking
skills as “good” in the beginning and at the end of the course, while two of
them considered their speaking skills excellent at the end of the course.
Both the performance on Task 1 and the performance on Task 2 were
assessed according to criteria set a priori (general description, delivery,
language, topic development, and organization), on a scale from 1 (weak)
to 4 (excellent). Thus, with regards to their outcome on both Tasks at the
start and at the end of the course, Figure 13-6 shows the average
improvement for each student (A1 to A17).
Overall Progress
A17
A16
A15
A14
A13
A12
A11
A10
A9
A8
A7
A6
A5
A4
A3
A2
A1
0% 10% 20% 30% 40% 50% 60% 70%

Figure 13-6: Student improvement in Task 1 and Task 2 (N = 17).
In terms of the extent to which the course had met students’

expectations, almost all students (94%) had their expectations exceeded or
totally met. To examine to what extent the course contributed to the
students’ language development according to their expectations, students’
initial expectations and needs were contrasted with their perceptions at the
end of the course. Initially, 41% of students needed and wished to improve
their speaking skills and at the end of the course 33% of the students stated
that their oral production had improved significantly, and three of the
students were positively surprised with how much they had improved.
At the end of the study, and from a careful analysis of the data (see
Figure 13-4 and 13-5), it was possible to quantify the improvement
perceived by students in all four language skills. All findings were
compiled into a single spreadsheet; the data were then compared and
analyzed with the goal of finding the most usual pattern of improvement
among the different students, as well as the most common correlations
among the skills studied. Table 13-5 summarizes the results of this
analysis.
Surprisingly, the improvement in the perception of written production
was much higher (38.7%) when compared to the improvement of oral
production (17.6%), although the latter was the main focus of the
preparatory course for the TOEFL iBT.
Table 13-5: Average perceived overall improvement in language skills.
Before After Increase

Reading 1.65 1.47 12.0%
Listening 1.71 1.53 11.5%
Speaking 2.35 2.00 17.6%
Writing 2.53 1.82 38.7%
Note: N = 17
The average perception of all students for each one of their skills was
taken into account, both at the beginning of the course and at the end of it.
From these values, the median variation was calculated. As the figures
show in the table above, in the comprehension skills, i.e., reading and
listening, there was only an increase of 12.0% and 11.5% respectively in
the way students viewed their improvement.
The course in question, having a narrower focus, was able to help
students improve more than one communicative ability, because even
though it focused on the practice of the speaking skills, the wide variety of
materials and extra writing content to support this focus may have also
helped students develop their other language skills such as writing.
TOEFL iBT Test Scores

The total score for the TOEFL iBT Test is 120 points, composed of a
maximum of 30 points for each skill (Reading, Listening, Speaking and
Writing). The Table 13-6 below shows the final scores obtained by each of
the students who attended the ESP course.
Table 13-6: Students’ final TOEFL iBT score.
Students' Final Scores

Score Score
Student Reading Listening Speaking Writing
aimed obtained
A1 90 94 24 25 23 22
A2 90 106 26 26 26 28
A3 90 102 24 29 27 22
A4 80 85 22 22 21 21
A5 100 104 24 26 26 28
A6 100 102 25 27 24 26
A7 100 106 29 29 25 23
A8 90 89 23 24 20 22
A9 80 84 20 21 21 22
A10 100 100 25 27 26 22
A11 100 94 26 21 23 24
A12 100 99 26 24 26 22
A13 109 114 30 30 27 27
A14 100 85 20 18 20 20
A15 100 111 29 28 26 28
A16 80 92 24 28 20 20
A17 80 91 19 25 23 24
According to the students’ scores above, it can be said that there was
indeed much learning along the course as the majority of the students (13
of them, or 76%) obtained a higher score than what they needed to be
accepted into their graduate courses. Among the four students who did not
reach the desired score, two of them had an intermediate level of English
with grammar difficulties, and the two others reported serious anxiety
issues during the test.
Every year since 2006, ETS publishes a report with the average scores
of students around the world. The figures reflect each skill and the total
score. The average score in the period when the present study took place
was 80 points (see Table 13-7).
Table 13-7: Comparative table of the TOEFL iBT scores.
Total Reading Listening Speaking Writing

Score
Overall
Average 80 20 20 20 21
(2006-2013)
Brazil
Average 85 21 22 21 21
(2006-2013)
ESP
Course 98 24 25 24 24
Average
As can be seen in Table 13-7 the Brazilian test-takers average was 85

points, 6% above the world average. When comparing the Brazilian test-
takers’ scores between 2006 and 2013 with the scores of the 17 students of
the ESP course, it is apparent that the average for the participants in this
study (98 points) is 15% higher than the average for the Brazilian test-
takers (85 points). Thus, it can be concluded that the course met its goal,
i.e., to offer high-quality teaching with a focus on the specific needs of the
students.
Conclusion
Through the data obtained in the initial needs analysis, I had the
opportunity to investigate students’ needs and tailor the ESP course to
achieve certain goals, aligned with the needs and weaknesses of each
student. Based on the initial questionnaire, it was possible to detect that for
most students (53%) oral fluency was the greatest difficulty. After the
course, it was found that this problem was indeed overcome, not only in
terms of students’ perceptions, but also in terms of the evaluation of their
speaking skills in the initial and final speaking tasks. The study also
showed that students experienced an increase in their self-esteem and self-
confidence to express themselves in English. The feedback and the
encouragement given to students during the lessons proved to be effective

as it generated confidence, comfort and lowered students’ stress levels.
The data on general development as perceived by the students were
essential in determining the appropriateness of the course. The skills of
written and oral comprehension showed a large equivalence (12.0% and
11.5%, respectively). Surprisingly, the results also showed an increase of
38.7% in the students’ writing skills which was not the primary focus of
the ESP course.
Research in the ESP approach in the form of individual one-to-one
lessons is scarce, especially those that combine one-to-one tutoring and
preparation for standardized proficiency tests, which is why this research
study makes a significant contribution to the field of language teaching
and assessment. It is hoped that this work can be an important tool for all
those involved in the teaching and learning of ESP courses; not only those
responsible for course design, but also for teachers who are in direct
contact with students, as it provides data on course design focusing on the
development of speaking skills and methodological procedures in the
context of teaching and learning languages.
Acknowledgement
This research study received support from CAPES, the Brazilian
Agency for the Development of Graduate Studies.
References
Bardin, L. (2011). Análise de Conteúdo. São Paulo: Edições 70.
Basturkmen, H. (2010). Developing courses in English for specific
purposes. Great Britain: Palgrave Macmillan.
Clapham, C. (2000). Assessment and testing. Annual Review of Applied
Linguistics, 20, 147-161.
Dudley-Evans, T., & St. John, M. (1998). Developments in English for
specific purposes: A multi-disciplinary approach. Cambridge:
Ellis, R. (2003). Task-based language learning and teaching. Oxford:
—. (2005a). Instructed second language acquisition - A literature review.
Auckland: The University of Auckland.
—. (2005b). Planning and task performance in a second language.
Language Learning & Language Teaching Series. Amsterdam: John
Benjamins.
English Testing Service. (2010). Linking TOEFL iBT scores to IELTS

scores – A research report. Retrieved from:
http://www.ets.org/s/toefl/pdf/linking_toefl_ibt_scores_to_ielts_scores.
pdf
ETS. (2009). Official guide to the TOEFL iBT test. USA: McGraw-Hill.
—. (2011). TOEFL iBT research insight. Series 1, Volume 6. Retrieved
from: http://www.ets.org/s/toefl/pdf/toefl_ibt_insight_s1v6.pdf
Hutchinson, T., & Waters, A. (1987). English for specific purposes – A
learning-centered approach. Cambridge: Cambridge University Press.
Johnson, D. M. (1992). Approaches to research in second language
learning. New York: Longman.
Jordan, R. R. (1997). English for academic purposes: A guide and
resource book for teachers. Cambridge: Cambridge University Press.
Long, M. H. (2005). Second language needs analysis. Cambridge:
McNamara, T. (2000). Language testing. Oxford: Oxford University
Press.
Scaramucci, M. V. R. (2009). Avaliação da leitura em inglês como língua
estrangeira e validade de construto. Caleidoscópio, 7(1), 30-48.
—. (2011). Validade e consequências sociais das avaliações em contextos
de ensino de línguas. Linguarum Arena, 2(1), 103-120.
Skehan, P. (1998). A cognitive approach to language learning. Oxford:
Stake, R. E. (1988). Case studies. In N. K. Denzin & Y. S. Lincoln (Eds.),
Strategies of qualitative inquiry (pp. 86-108). Thousand Oaks: Sage
Publications.
Willis, J. (1996). A framework for task-based learning. Malaysia:
Longman.
CHAPTER FOURTEEN
ASSESSING THE LEVEL OF GRAMMAR

PROFICIENCY OF EFL AND ESL
FRESHMAN STUDENTS:
A CASE STUDY IN THE PHILIPPINES
SELWYN A. CRUZ
AND ROMULO P. VILLANUEVA JR.
Abstract
The Far Eastern University (FEU) has one of the highest numbers of
international undergraduate students among Philippine universities. A
considerable number of these students are Korean students of English as a
foreign language (EFL) who are enrolled in general education courses,
such as English language classes which are also attended by Filipino
learners for whom English is a second language (ESL). Recognizing the
constantly increasing population of international students in the university,
this research study intended to compare the English grammar proficiency
of learners from two Asian English variants, namely Philippine English
and Korean English. In fulfilling the objectives of the study, 30 Korean
and 30 Filipino students were randomly selected to answer a 130-item
grammar test based on the syllabus of their course, namely, Introduction to
Language Arts English (Eng AN). Recommendations and implications for
English language teaching and learning are also discussed in the study.
Introduction
The term ESL (English as a Second Language) is attributed to the use
of English in countries like the Philippines and India where English is
used in daily activities but not as the main language. On the other hand,
Level of Grammar Proficiency of EFL and ESL Freshman Students 273
English is said to function as a foreign language (EFL) in countries where

it is not used in everyday situations. China, Japan and Korea are said to
speak EFL (Kirkpatrick, 2007).
Kachru (2005) redefined the concept of EFL and ESL through three
concentric circles. The Inner Circle refers to the sociolinguistic origins of
English where it is used as the native language. The United States, Canada
and the United Kingdom are examples of countries in the Inner Circle. The
Outer Circle represents countries like the Philippines and Singapore,
which were previously colonized by native English speaking countries and
are now using English as their second language. Lastly, the Expanding
Circle pertains to the users of English as a foreign language. Countries in
the Expanding Circle include those in eastern Asia such as Korea and
China.
In the Philippines, 93.5% of its population is said to speak and
understand English well (Magno, 2010), and Philippine English is
considered a legitimate variant of World Englishes with distinct features
significant enough to be an area of research. Llamzon’s (1969) monograph
is believed to be the pioneer study to conceptualize what he termed as
Standard Filipino English. It highlighted how the speech forms of the
speakers of Filipino English are considered a sub-variety of English. More
importantly, it stressed that Filipino English is the type of English used by
the educated Filipino. Subsequently, Gonzalez (1985) and Bautista (2000)
have greatly contributed to the rich literature on the existence of Philippine
English. At present, it seems unquestionable that Philippine English exists
not just as a sub-variety of American English but a legitimate World
English variety (Borlongan, 2011; Martin, 2012).
Korean English, like Philippine English, has also received overwhelming
attention (Kachru, 2006). English in South Korea developed in the 1990s
when it was made a mandatory subject at schools starting with elementary
education as a result of the English Education Policies of Korea and the
arrival of native English speakers as educators in the 1990s (Chang, 1990).
At present, there are prominent studies concerning the existence of
Konglish, though criticised by some as merely codified Korean English
(Shim, 1999), and these include the corpus based works of Jung and Min
(1993; 1999). Overall, it appears that language contact resulting in
apparent changes in its language system could be a major contributor in
the prominence of the identity of Korean English (Young, 2008).
274 Chapter Fourteen
Background
Looking at the concept of grammatical proficiency, Chomsky posited
that grammatical competency involves one’s knowledge of grammar.
Hymes (1972), however, thought that this concept was inadequate; thus,
the elaboration that the grammatical proficiency that Chomsky was
referring to, alongside the meaning or value of one’s utterance, is part of
what is termed as communicative competence. Canale and Swain (1980)
supported this idea and added that:
“a synthesis of knowledge of basic grammatical principles, knowledge of

how language is used in social settings to perform communicative
functions, and knowledge of how utterances and communicative functions
can be combined according to the principles of discourse.” (p. 20)
In view of the recognition of grammatical competency as important in

achieving communicative competency, the current study focused on
examining ESL and EFL learners’ grammar knowledge.
Haegman (1991) defines grammar as the rules and principles of a
language that provide a systematic description of sentence formation.
Fromkin, Rodman and Hyams (2003) state that knowledge of grammar
describes a learner’s linguistic knowledge involving the rules and the
unconscious linguistic knowledge. Meanwhile, Taylor (1988) defines
proficiency as one’s ability to use competence. Overall, it could be said
that grammatical proficiency as a term refers to the explicit awareness of
grammatical rules. As Shanklin (1995) proposed, grammatical proficiency
is the “ability to make judgments about the acceptability and appropriateness
of an utterance with specific reference to grammatical notions” (p. 149).
Several studies were conducted to investigate the grammatical
proficiency of learners. For instance, Diab (1998) used error analysis in an
attempt to reveal what kind of grammatical, lexical semantic and syntactic
errors were caused by the negative transfer of the learners’ mother tongue
(Arabic) into the learners’ second language (English). With 73 Lebanese
native speakers of Arabic who were taking an intermediate level of
English course as participants, the study found that the use of prepositions
posed the most difficulty among Arabic-speaking students in Beirut,
Lebanon (247 errors out of 558 grammar errors made by the participants).
In addition, he argued that those above-mentioned errors might have been
caused by negative transfer from the students’ mother tongue.
Bautista (2000) identified certain deviated grammatical features of
Philippine English in comparison to American Standard English. For
instance, it was noted that Filipinos tend to choose results to, consist in
and specialized on instead of results in, consist of and specialized in. It

was stressed in the study that the variations could be broadly understood
by the fact that Filipino language does not possess distinctive items to
represent different prepositions such as on, in, at, to, and toward, as there
is basically one generic preposition, sa, used to equivalently signify
English prepositional concepts. The five grammatical aspects, which
Bautista found to be prevalent in the mistakes committed by Filipino
speakers of English in their speech or writing, were included in the current
study.
Bauman (2010), conducting a study in Korean English, collated the
prominent pronunciation, grammatical and syntactic errors of Korean
students of English by means of classroom observation. By analysing the
speech utterances and written production of Korean students, Bauman
found that Koreans have difficulty with their use of tenses, and definite
and indefinite articles.
Filipino and Korean students’ explicit and implicit knowledge were
studied (Cruz, 2011) to confirm Bautista’s (2000) findings on the
grammatical features of Philippine English. After a study with sixty
randomly selected first year college students, it was concluded that
Filipino and Korean learners of English appeared to have identical features
in their grammatical errors. Further, it was found that the students had the
ability to use their explicit and implicit knowledge when a certain type of
grammar test was given to them.
In Nezami and Najafi’s (2012) study of common error types of Iranian
learners of English, it was revealed that the group with low scores in the
Test of English as a Foreign Language (TOEFL) also got low scores in
terms of grammar in their essays where difficulties in the use of articles,
verb formation, and singular and plural noun formation were found.
Meanwhile, in Li’s (2005) study of EFL and ESL college learners’
grammatical and lexical collocational errors in essays, it was found that
the students had almost identical grammatical errors that were mostly
concerned with verbs followed by certain prepositions and prepositions
followed by a noun, and these errors outnumbered the lexical collocational
errors.
The present study aims to contribute to the field of English language
teaching (ELT) and to the study of Asian Englishes. Specifically, the
current study aims to compare the grammar proficiency of Korean and
Filipino students through an objective grammar test in order to gain
understanding of the areas of grammar that are easy and those that are
difficult for students. Furthermore, the current research could provide
empirical evidence for revising the English language syllabus intended for
first year students.
The Study
A total of 60 first year students, 30 Filipino and 30 Korean, from
various disciplines, enrolled in Eng AN or Comm Arts 1 at the Far Eastern
University (FEU), were the participants of this study. Only those taking
the course for the first time were chosen to take part in the study in order
to ensure the reliability of results since a prior exposure to the course
materials could result in apparent differences in performance. The first
year students were chosen because the syllabus for first year students
mainly concentrates on grammar. Convenience sampling was used in the
study because filtering the entire population of freshman students in terms
of class standing and proficiency level was not plausible at the time that
the study was conducted.
The Filipino participants consisted of 9 males and 21 females from the
Institute of Arts and Sciences enrolled in the Bachelor of Science Major in
Medical Technology. All of the participants were students of one of the
researchers. The Korean participants were all part of a Filipino for
Foreigners class, which was handled by a colleague. The Korean
participants comprised 19 males and 11 females who were taking different
courses. The researchers were not able to gather a homogenous group of
Korean participants in terms of the courses they were enrolled in because
of the relatively small number of foreigners in each class compared to
Filipinos. The Filipino participants’ age ranged from 15 to 18 years old
while the Koreans were from 17 to 20 years old. There were no specific
requirements for students to be part of the study apart from being enrolled
in the Eng AN class. At the time the data were gathered, the students were
about to have their midterm examination; hence, a good portion of the
syllabus was expected to have been taught in the class already.
Instruments
A 130-item grammar test covering the parts of speech and other
aspects of the English language that require the use of rules was used to
collect data. All grammatical aspects in the test were mostly based on the
syllabus of the Basic English course (Eng AN or Comm Arts 1) which is a
course for all first year students in FEU. There was an average allocation
of five test items per grammatical aspect as suggested by Brown (2005) on
language testing. The researchers took the questions from the grammar
book of Folse, Ivone and Pollgreen (2005) and modifications were made
for contextualization. There were numerous targeted grammatical aspects
to be examined in the current study; hence, an objective type test was
needed for convenience in marking and analysis. The test contained
multiple choice questions, cloze tests, gap fill exercises, and fill in the
blank type of questions. A pilot test was administered with 2nd year
English Language students (21 Filipinos, 7 Koreans and 2 Chinese). The
Filipino students obtained a general mean of 101.64 while the Koreans had
a 96.08 mean score. Minor modifications were done in the test to make the
questions more suitable for the first year students.
Procedures
The test was administered separately to each group. The Filipinos were
given the test during their Eng AN class. The class was composed of 45
students, so 30 students were randomly selected to take the test, while the
remaining 15 students were asked to perform a classroom task. The
Korean learners, on the other hand, were given the test during their
Filipino for Foreign Students class. At that time, there were exactly 30
students enrolled in the class.
The students were given forty minutes to answer the test since the
participants of the pilot test were able to accomplish it between 25 to 35
minutes. After the test, the researchers marked all of the test papers in two
separate days. Two colleagues helped in verifying the reliability of the
marks. The marks were re-checked for possible errors and a recount for
the test scores was also conducted.
Data Analysis
The researchers made use of descriptive statistics to analyze the data. In
addition, the researchers devised a scale in order to measure the level of
grammar proficiency of the participants (see Table 14-1). The researchers
also analyzed the participants’ prominent mistakes. The mistakes
committed were collated and tabulated.
Table 14-1: A scale for measuring grammar proficiency.
Score 5 Items 10 Items 15 Items Interpretation

105-130 5 9-10 14-15 Highly proficient
79-104 4 7-8 11-13 Proficient
53-78 3 5-6 7-10 Average
27-52 2 3-4 3-6 Less proficient
1-26 1 1-2 0-2 Deficient

Table 14-2 shows that there was a difference between the mean scores of
the participants with the Filipino students scoring higher than the Korean
students. A t-test was performed to examine the mean difference and it
was found to be statistically significant (p = .26).
Table 14-2: Mean and standard deviation of participants’ total score

based on nationality.
Standard Standard
Nationality Mean Deviation Error t-test p-
Mean value
Korean 92.90 11.174 2.066 2.28 .026
Filipino 100.67 14.615 2.672
Tables 14-3 and 14-4 show the means and standard deviations per
grammar topic for each group of participants (ESL and EFL) with the
corresponding interpretations from the devised scale. Based on the total
mean scores of each group in each area, it can be seen that the ESL
students achieved a proficient level in 12 areas of grammar (Table 14-3),
while the EFL students achieved a proficient level in six areas of grammar
(Table 14-4). On the other hand, there are 10 and 11 areas in grammar
where the ESL and EFL students achieved an average level respectively.
It is also interesting to note that the EFL participants seem to be less
proficient in the present progressive tense, modals and articles, while the
ESL participants appear to be less proficient in adverbs of frequency.
Sentence errors seem to be the most difficult area of grammar for EFL
learners based on their total mean score.
Table 14-3: Means and standard deviations for the ESL participants.
ESL (Filipino Students)

# of items Mean Standard Interpretation
Deviation
Nouns 15 9.93 2.638 Average
Verbs 5 4.57 0.728 Proficient
Adjectives 5 4.07 1.202 Proficient
Identification of 10 6.47 1.737 Average
Parts of Speech
Present tense of be 5 4.97 0.182 Proficient
Present tense of 10 7.9 2.006 Proficient
regular verbs
Past tense of be 5 3.47 2.08 Average
Past tense of 5 4.13 1.252 Proficient
irregular verbs
Present 5 3.63 1.866 Average
progressive tense
Past progressive 5 3.9 1.029 Average
tense
Perfect tenses 5 3.27 1.413 Average
Troublesome 5 4.5 0.682 Proficient
verbs
SVA 5 4.3 0.794 Proficient
Prepositions 5 3.13 1.252 Average
Correlative 5 4.63 0.556 Proficient
Conjunctions
Articles 5 3.77 1.223 Average
Adverbs of 5 2.87 0.73 Less proficient
frequency
Modals 5 3.67 1.061 Average
Demonstratives 5 3.67 1.322 Average
Possessive 5 4.53 1.042 Proficient
Adjectives
Sentence Errors 5 4.13 1.042 Proficient
Subordinate 5 4.07 1.337 Proficient
Conjunctions
Total Score 130 100.67 14.615 Proficient
Table 14-4: Means and standard deviations for the EFL participants.
EFL (Korean Students)

# of items Mean Standard Interpretation
Deviation
Nouns 15 11.67 2.869 Proficient
Verbs 5 4.63 0.765 Proficient
Adjectives 5 4.4 1.276 Proficient
Identification of 10 6 1.287 Average
Parts of Speech
Present tense of be 5 4.77 0.43 Proficient
Present tense of 10 8.7 2.322 Proficient
regular verbs
Past tense of be 5 3.5 1.757 Average
Past tense of 5 3.77 1.357 Average
irregular verbs
Present progressive 5 2.43 2.269 Less proficient
tense
Past progressive 5 3.73 1.202 Average
tense
Perfect tenses 5 3.07 1.337 Average
Troublesome verbs 5 4.13 0.629 Proficient
SVA 5 3.07 1.048 Average
Prepositions 5 3.57 1.331 Average
Correlative 5 3.83 1.367 Average
Conjunctions
Articles 5 2.03 1.189 Less proficient
Adverbs of 5 3.3 1.149 Average
frequency
Modals 5 2.8 0.61 Less proficient
Demonstratives 5 3.87 1.383 Average
Possessive 5 4.73 0.521 Proficient
Adjectives
Sentence Errors 5 1.83 1.315 Deficient
Subordinate 5 3.07 1.202 Average
Conjunctions
Total Score 130 92.9 11.174 Proficient
Based on the findings, both EFL and ESL learners achieved a

proficient level in the following areas of grammar: present tense of be,
possessive adjectives, identification of verbs, adjectives, and troublesome
verb pairs (i.e., confusing verbs such as ‘sit’ and ‘set’, ‘rise’ and ‘raise’).
However, they seemed to have difficulty with the use of articles,
identifying sentence errors, subject-verb agreement, use of present
progressive and perfect tenses, and prepositions.
The results of the current study confirmed the findings in the studies of
Najafi (2012), Bauman (2010), Li (2005), Baustista (2000), and Diab
(1998) that articles and the use of prepositions seem to be the most
common mistakes committed by learners of English.
Analysis of EFL and ESL Learners’ Mistakes

Articles
The majority of the participants in the current study had difficulty in
using definite and indefinite articles. This could be supported by the
studies of Bauman (2010), Nezami and Najafi (2012) and Baustista
(2000). Nezami and Najafi (2012) found that EFL learners (Iranian) in
their study had difficulty with the use of articles, while Bautista found that
one of the prevalent mistakes committed by Filipino speakers of English
was the incorrect use of articles. These were some of the most frequent
mistakes in the use of definite and indefinite articles:
1. When is the football game? Oh! It’s on the/a Tuesday.

2. Where’s an apple? I put it on the table.
3. I put it in a refrigerator.
The learners seem to interchange the with a and it also appears that
they tend to use an article when it is not needed.
Present Progressive Tense

Both EFL and ESL learners had difficulty in terms of using the present
progressive tense. Here is a list of the mistakes committed by the
participants based on their answers in the items on present progressive
tense.
1. He can’t repair your car now. He eats lunch.

2. Sheng writes a letter to her parents now.
3. Joan uses the computer now.
4. The examinee takes the test now.
Based on the participants’ answers, it could be gleaned that the learners

tend to mistake an on-going action for a habitual action. It is probable that
they may not be able to recognize the time marker of the present progressive
tense, which is obvious in the sentences. The participants seem to have
applied the rule that a subject must agree with its verb; however, they
seem to have forgotten the role of tense in sentence construction.
Adverbs of Frequency
The test items on adverbs of frequency represented the highest number

of mistakes committed by both the ESL (57.4%) and the EFL (66.0%)
participants. Students were asked to match the adverbs of frequency with the
phrase that has the same meaning. Most of the ESL participants understood
the meaning of ‘rarely’ as some of the time while EFL learners understood it
as much of the time. The adverb of frequency ‘often’, which means most of
the time, was understood by both EFL and ESL learners as usually.
Identification of Parts of Speech
Participants were given a set of sentences and were asked to identify

the part of speech for the underlined words. Based on the results of the
test, both EFL and ESL participants seemed to confuse nouns with verbs
and EFL students tended to interchange adjectives with adverbs.
Sentence Participants’ answers

1. Dina wants to water her plants. Noun
2. I hate to sit in the dark. Adj(EFL)/adv(ESL)
3. He runs fast. Adj
4. She works hard for her family. Adj
Modals
Based on the results, modals could be considered as an area of

grammar which both EFL and ESL learners had difficulty with. Here are
some of the mistakes committed by the participants with regard to the use
of modals:
1. Excuse me. Shall you pass the salt? (EFL)

2. This cake is too sweet. It couldn’t/wouldn’t have so much sugar in it.
(EFL)
3. This cake is too sweet. It didn’t have so much sugar in it. (ESL)
4. When can you like to go shopping? Anytime is OK with me. (EFL)
EFL learners used shall instead of will and can instead of would. Both
EFL and ESL learners used couldn’t/wouldn’t or didn’t have instead of
shouldn’t ignoring the previous sentence that gave them the clue.
Prepositions
The findings of the current study confirmed the study by Bautista

(2000) that prepositions are one aspect of grammar ESL learners have
difficulty with. Based on the results of the test, the Filipino students scored
lower (62.6%) than the Korean learners (71.4%) in the use of prepositions.
The following examples illustrate some of the errors students made in the
use of prepositions:
1. Marilyn was born at 1990. (EFL)

2. Marilyn was born on 1990. (ESL)
3. We lived at Green Street when we were children. (EFL and ESL)
4. …but never on the same time. (ESL)
Subject-Verb Agreement
Subject-verb agreement was problematic for EFL learners with the

majority of them committing errors in the items that tested this
grammatical rule as shown in the examples:
1. One of the students in the class are a scholar.

2. Either the teacher or the adviser are going to attend the workshop.
3. The poor is blessed.
4. Down the street lives the poor flower vendors.
Students had difficulty locating the real subject of the sentence. For
example, in sentence (4) they mistook ‘street’ as the subject rather than
‘vendors’.
Overall, the Korean EFL learners achieved a less proficient level for a
greater number of test items than the Filipino ESL learners. This could be
attributed to the fact that Korean EFL learners studying in the Philippines
have a lesser need to speak the English language after their classes finish
because they would interact almost exclusively with their fellow Korean
students using their mother tongue.
Conclusion
This small-scale study supports Kachru’s (2005) model in which
learners of English differ in various aspects. The study highlights how
Korean students studying in the Philippines need to be closely monitored
in their English language learning progress since they are in an
environment which may not be too conducive for learning English due to
the fact that the majority of the learners in the class they attend come from
the Outer Circle. A special English class for the Korean students is needed
in order to effectively address their English language needs. However, this
might pose problems since the Korean students are learning two foreign
languages simultaneously (i.e., English and Filipino).
Despite its limitations, this study may offer some insights to English
teachers to modify their course to further address the grammar deficiencies
of both EFL and ESL learners. Knowing the grammar areas where both
EFL and ESL learners are less proficient would give English teachers a
better idea as to how to design their lessons and grammar activities in
order to address these issues. Finally, it is recommended that teachers
incorporate in their teaching strategies activities that would highlight the
use of the target grammar in communicative situations.
Acknowledgement
This study received financial support from the University Research
Center of the Far Eastern University.
References
Bauman, N. (2010). A catalogue of errors made by Korean learners of
English. Paper presented at the Annual Conference of the Korea
Teachers of English to Speakers of Other Languages (KOTESOL),
October 26th-28th, Seoul, South Korea.
Bautista, M. L. S. (2000). Defining standard Philippine English: Its status
and grammatical features. Manila: De La Salle University Press.
Borlongan, A. M. (2010). On the management of innovations in English
language teaching in the Philippines [Editorial commentary]. TESOL
Journal, 2(2), 1-3.
Canale, M., & Swain, M. (1980). Theoretical bases of communicative
approaches to second language teaching and testing. Applied
Linguistics , 1(1), 1-47.
Chang, B. (2011). The origin and development of Asian Englishes.

Retrieved from:
http://paaljapan.org/conference2011/mate/PAAL2011Program.pdf
Cruz, S. (2011). Implicit and explicit knowledge of College of Education
Majors. Master’s dissertation, De La Salle University, Philippines.
Diab, N. (1998). The transfer of Arabic in the English writings of
Lebanese students. The ESP, Sao Paulo, 18(1), 71-83.
Folse, K., Ivone, J., & Pollgreen, S. (2005). 101 clear grammar tests:
Reproducible grammar tests for ESL/EFL classes. Michigan:
University of Michigan Press.
Fromkin, V., Rodman, R., & Hyams, N. (2003). An Introduction to
language. Boston: Wadsworth.
Gonzalez, A. (1985). Studies on Philippine English. Singapore: SEAMEO
Regional Language Centre.
Haegman, L. (1991). Introduction to government and binding theory.
Oxford: Blackwell.
Jung, K., & Min, S.J. (1999). Some lexico-grammatical features of Korean
English newspapers. World Englishes, 18(1), 23-37.
Hymes, D. H. (1972). On communicative competence. In J. B. Pride, & J.
Holmes. (Eds.), Sociolinguistics (pp. 269-293). Baltimore, USA:
Penguin Education, Penguin Books Ltd.
Kachru, B. (2005). Asian Englishes. Hong Kong: Hong Kong University
Press.
—. (2006). Asian Englishes today: World Englishes in Asian context.
Hong Kong: Hong Kong University Press.
Kirkpatrick, A. (2007). World Englishes: Implications for international
communication and English language teaching. Cambridge: Cambridge
University Press.
Llamzon, T. (1969). Standard Filipino English. Quezon City: Ateneo de
Manila University Press.
Li, C. (2005). A study of collocational error types in ESL/EFL college
learners’ writing. Master’s dissertation, Ming Chuan University,
Taiwan.
Magno, C. (2010). Korean students’ language learning strategies and years
of studying English as predictors of proficiency in English. TESOL
Journal, 39(2), 39-61.
Martin, I. (2012). Identity and communication: A framework for teaching
English in the Philippines. Retrieved from:
http://www.dlsu.edu.ph/library/newsette/201009_10.pdf
Nezami, A., & Najafi, M. S. (2012). Common error types of Iranian

learners of English. English Language Teaching, 5(3), 160-170.
Retrieved from:
http://www.ccsenet.org/journal/index.php/elt/article/view/15276
Shanklin, M. (1995). The communication of grammatical proficiency. In
L. Varga (Ed.), The even yearbook 1994 (pp. 147-173). Dept. of
Linguistics, SEAS: ELTE.
Shim, R. J. (1999). Codified Korean English: Process, characteristics and
consequence. World Englishes, 18(2), 247-258.
Taylor, D. (1988). The meaning and use of the term ‘competence’ in
linguistics and applied linguistics. Applied Linguistics, 9(2), 148-168.
Young, R. F. (2008). English and identity in Asia. Asiatic, 2(2), 1-13.
CHAPTER FIFTEEN
METHODOLOGY IN WASHBACK STUDIES
GLADYS QUEVEDO-CAMARGO
AND MATILDE VIRGINIA RICARDI SCARAMUCCI
Abstract
This chapter reviews studies on the washback of language assessment
from 2004, when Cheng, Watanabe and Curtis published the first book on
washback methodology, to 2012. Taking into account Alderson and Wall’s
(1993) admonition for the search for empirical evidence about the
phenomenon and the use of a more ethnographic approach in the
investigations, this review aimed at investigating the researchers’
methodological options during this period. By means of an electronic
search, 78 studies from 31 countries were identified. The analyses showed
that Alderson and Wall’s words were heard, as the identified works
significantly diversified the ways to investigate washback by involving
different stakeholders, using a variety of data collection instruments such
as document analysis, questionnaires, interviews and classroom
observation, as well as by adopting quantitative and mixed approaches of
investigation.
Introduction
Washback in language learning, that is, the impact or influence that
external exams, particularly high-stakes exams, as well as achievement
tests may potentially have on language teaching and learning processes,
the curriculum, material design and stakeholders’ attitudes (Scaramucci,
2004), is a relatively new concept (Cheng, 2008). Studies carried out
mainly after the 1990s consolidated the idea that washback is a frequent,
complex and highly important phenomenon that involves several
288 Chapter Fifteen
stakeholders that are affected differently by the same assessment

instrument.
In 1993, Alderson and Wall called attention to the fact that the studies
on washback conducted so far were methodologically limited, particularly
due to the absence of classroom evidence. Their article, Does washback
exist?, was the first to question washback more deeply. The authors
considered it a highly complex phenomenon treated in a naïve and
simplistic way with little empirical evidence. This was the first article to
discuss methodological alternatives for washback investigations and put
forth a research agenda and fifteen hypotheses to guide future research and
provide the basis for the development of a theory about the concept.
In 2004, Alderson acknowledged a substantial accumulation of empirical
evidence produced from 1993 to the early 2000s which confirmed the
existence of washback and of a theory about it. Although the findings
confirmed the complexity of the phenomenon, they also raised new
questions. According to Alderson, “the question today is not ‘does
washback exist?’ but rather what does washback look like? What brings
washback about? Why does washback exist?” (Alderson, 2004, p. ix). We
might also add ‘How does washback work in the different contexts?’.
Bearing the authors’ words in mind, a literature review was conducted
to investigate whether their words had or had not been taken into account
by researchers of the washback phenomenon. Aiming at identifying the
researchers’ methodological options to investigate language assessment
washback, this review builds on the state of the art washback studies from
2004, when the first book on the topic was published (Cheng, Watanabe,
& Curtis, 2004), to 2012. The next section briefly explains the methodology
used in this review, followed by data presentation and discussion. This
chapter culminates with some thoughts on the contributions for a better
understanding of the concept brought about by the methodological options
used in the studies and for future research.
Literature Review Methodology

This review was carried out by means of search engines such as
Google Scholar (http://scholar.google.com.br), websites such as Language
Testing (http://ltj.sagepub.com), among others, and a Brazilian Academic
Journal website called CAPES (http://www.periodicos.capes.gov.br). The
following key words were used, both in English and Portuguese: impact,
washback and backwash, combined with assessment, foreign languages,
testing, language, test and exam. Overall, 78 studies published from 2004
to 2012 were identified.
Methodology in Washback Studies 289
Data Presentation and Discussion

In order to provide an overview of the most used instruments to tackle
washback, the studies reviewed will be presented as follows: (1)
geographical distribution; (2) year of publication and number of research
studies per year; (3) most investigated exams; and (4) methodological
issues.
Geographical Distribution
Table 15-1 presents the distribution of the studies by continent. It

should be noted that some of them collected data in more than one
country. In this case, the country where the researchers worked was
considered.
Table 15-1: Studies distributed by continent.
Continent Studies %
Asia 12 38.7
Europe 11 35.5
South America 3 9.7
North America 2 6.4
Oceania 2 6.4
Africa 1 3.3
Total 31 100
Note: Mean = 5.16; Standard Deviation = 4.95.
As observed, the concern with language assessment washback is

present in all the continents, mainly in Asia and Europe, where several
countries have official compulsory language exams. In Asia, the places
where the studies were conducted were China and Hong Kong, Israel,
Iran, Japan, Jordan, Korea, Palestine, Pakistan, Taiwan, Turkey and the
United Arab Emirates. In Europe, works from Cyprus, England, Germany,
Greece, Hungary, Ireland, Poland, Romania, Slovenia, Spain and Sweden
were identified. In South America, the studies were carried out in Brazil,
Colombia and Uruguay, whereas in North America, research from both
countries, Canada and the United States, was present. In Oceania, the
countries where research was conducted were Australia and New Zealand,
and in Africa, only one work, produced in Egypt, was found.
290 Chapter Fifteen
Year of Publication and Studies per Year
As Figure 15-1 shows, the year with the highest number of

publications on washback was 2009.
14
12
10
8
6 13
11 10 11
4 7 8 7
5 6
2
0
2004 2005 2006 2007 2008 2009 2010 2011 2012
Figure 15-1: Studies per year of publication.
We could speculate that the growing number of research from 2005 to

2009 was due to the publication of the book by Cheng, Watanabe and
Curtis in 2004, which may be considered the first book on washback in
language assessment. Till then, only journal articles had been published.
Most Investigated Exams
The 78 studies investigated 63 high- and low-stakes exams. As for the

high-stakes exams, the most frequent were the International English
Language Testing System (IELTS), with eight studies, and the Test of
English as a Foreign Language (TOEFL), with five studies. The other
studies focused mainly on the official compulsory exams in their
respective countries.
Methodological Issues
In 1993, Alderson and Wall stated:
“There has been a tendency to date to rely upon participants’ reported

perceptions of events through interview and questionnaire responses, or to
examine the results and relationships of test performances. […] In

addition, we believe it important […] to triangulate the researcher’s
perceptions of the events with some account of participants’ of how they
perceived and reacted to events in class, as well as outside class—this
amounts to an advocacy of a more ethnographic approach to the topic than
has been common to date.” (Alderson & Wall, 1993, p. 127)
It was, therefore, necessary to expand the methodological options in

washback studies in relation to both participants and instruments so as to
obtain more information about the characteristics of the washback
phenomenon and possible factors for its existence. The 78 studies that
were reviewed revealed that Alderson and Wall’s words were heard
because the works demonstrated that the researchers went much beyond
measuring the participants’ perceptions. Some of them are classified as
qualitative, like Hung (2012), and others as quantitative, such as Green
(2006). However, several of them used a mixed-methods research
methodology, i.e., a mixture of qualitative and quantitative methods (e.g.,
Cresswell, 2009; Dörnyei, 2007). This may have been a consequence of
the growing importance, in the 2000s, of quantitative methods in Applied
Linguistics (Lazaraton, 2005) as well as of the understanding that
qualitative and quantitative approaches are not antagonistic, but
complementary (Scaramucci, 1995).
Alderson and Wall (1993) suggested a more ethnographic approach. It
is possible to state that almost all the studies identified are ethnographic.
The only exceptions were the ones that used document analysis
exclusively. Thus, ethnography is relevant for washback research as it
allows the investigation of phenomenological data that represent the
participants’ perspectives and the use of such perspectives to describe and
reflect upon the phenomenon (Fetterman, 2008; Watanabe, 2004). In
addition, the majority of the studies triangulate researchers’ and
participants’ perceptions, i.e., by using multiple methods to obtain more
precise conclusions from a variety of angles (Lodico, Spaulding, &
Voegtle, 2006). The triangulation involved stakeholders, sources of
information (or documents) and data-collection instruments. As far as the
stakeholders are concerned, ten different ones were identified: teachers,
students or examination candidates, teacher supervisors, school principals,
coordinators, school supervisors, educational authorities, exam designers,
parents and material writers. Retorta (2007), for example, is a study that
involved eight different stakeholders. This variety of perspectives surely
provides a much more comprehensive picture of the complexity of
washback and gives voice to stakeholders who had not been heard in
previous research.
292 Chapter Fifteen
As for sources of information, the following were identified: exams or

previous exam guidelines and other publications about the exams,
documents about previous applications of the exams, candidates’ statistical
data and teaching materials.
With respect to triangulation by instruments, the studies included from
one to five instruments, as shown in Table 15-2.
Table 15-2: Number of instruments used in the studies.
Year of Study
Nr. of
Total
instru-
2004
2005
2006
2007
2008
2009
2010
2011
2012
%
ments
1 0 0 0 1 1 3 0 0 0 5 6.4
2 6 2 4 4 9 3 1 1 2 32 41.0
3 3 4 3 0 1 5 2 4 2 24 30.8
4 1 1 1 3 0 2 2 0 2 12 15.4
5 1 0 0 2 0 0 0 1 1 5 6.4
Total 78 100
In total, the researchers used ten instruments, as shown in Figure 15-2,

along with their frequency of use. These instruments will be discussed
below.
Figure 15-2: Data-collection instruments in the reviewed studies.

Document Analysis
Documents are the basis for the majority of qualitative research studies
(Schensul, 2008). Document analysis can be a main source of data or a
complementary instrument, depending on the study object and aims
(Lüdke & André, 1986). All 78 studies reviewed used information
obtained from documents, even when the researcher does not explicitly
mention this. This is the case with studies like Caine (2005), for instance.
Shawcross (2007) and four other studies reported document analysis as the
sole source of data collection. Others such as Barletta and May (2006)
mentioned it as a secondary source. In the studies that reported document
analysis as one of the instruments used, the way the analyses were
conducted varied considerably, including quantitative approaches using
data codification, and qualitative approaches, with the predominance of
selection of relevant information made by the researcher, followed by
possible interpretations of the data.
Therefore, document analysis is an instrument inherent to all research
that aims at investigating exam washback, since a thorough understanding
of the exam as well as of the educational context where the study is carried
out are essential conditions for any conclusions about this phenomenon.
Questionnaire
According to Brown (2001, p. 6), “questionnaires are any written

instruments that present respondents with a series of questions or
statements to which they are to react either by writing out their answers or
selecting from among existing answers.” In Dörnyei’s words (2003, p. 1),
“the popularity of questionnaires is due to the fact that they are easy to
construct, extremely versatile, and uniquely capable of gathering a large
amount of information quickly in a form that is readily processable.”
However, questionnaires are easy to process mainly when the format
chosen allows the use of statistical procedures such as multiple-item
scales, generally known as Likert scales. According to Fawcett (2008, p.
836), “a questionnaire adopting a quantitative orientation is concerned
with systematically collecting quantifiable data relating to a number of
variables. The purpose is to statistically examine the data to discover
associations and possible patterns or trends.” Open- or semi-open-ended
questions are also possible. In these cases, the answers can be analyzed
qualitatively, by using content analysis for example, or quantitatively, by
codifying the information to be processed by computer software (Dörnyei,
2003).
294 Chapter Fifteen
The review of the selected studies confirmed the popularity of

questionnaires, as they were used by 64.1% of the works (50 out of 78).
The questionnaires in these studies were administered to a very high
number of participants in different locations, which constitutes one of the
advantages of questionnaires (Cresswell, 2009; Dörnyei, 2007). Thus, they
allowed the investigation of the washback phenomenon from the
perspectives of different stakeholders and in many aspects of the teaching-
learning process in contexts such as public secondary schools, universities
and language schools. Besides, studies like Muñoz and Alvarez (2010),
among others, expanded the concept of washback, which originally
referred to the consequences of high-stakes exams (Alderson & Wall,
1993), to issues related, for example, with how much students understand
their school assessment systems and their perception on the connections
between educational goals and assessment.
The importance of questionnaires for the development of such studies
is undeniable. However, it is relevant to mention that, like any instrument,
they also have disadvantages. For example, the data obtained may not be
totally reliable, as the respondents might answer what they think may
please those in charge, particularly when respondents are identified or
when the questionnaire is administered officially, i.e., by request of the
school principal or by a researcher known to the respondents. When there
are questions related to classroom teaching, for example, the teacher
respondents may not want to expose themselves and they end up
describing what does not correspond to what really happens in class.
One cannot deny that social relations between the researcher and the
participants are always asymmetrical (Bourdieu, 1998) and in the search
for a neutral approach, it is necessary
“to consider, in detail, the situation in which the texts resulting from such
procedures (interviews and questionnaires) are produced; as well as the
(illocutionary and perlocutionary) values of the act of asking and the ways
of asking which favour the diffusion of assumptions of the researcher
about the required information.” (Machado & Brito, 2009, pp. 140-141 –
our own translation)
Thus, it is crucial that researchers look for means to minimize such

risks by making the participants aware of the important role they play in
the study and the relevance of the information they provide. Participants
need to be assured of confidentiality by avoiding, for instance,
handwriting recognition or identification of places where the questionnaire
was administered. In this sense, the use of electronic tools has helped
researchers as they allow participants to answer without being identified.
In addition, studies should also consider the participants’ views about

how their answers will be used, who is conducting the study and the genre
used to collect data, which may or may not allow a certain degree of
negotiation between researcher and participant (Machado & Brito, 2009).
This leads us to what seems to be one of the main problems in the
conception of questionnaires in a research context: the (highly
disseminated) idea that researchers are neutral elements, exempt of bias or
influences by the participants or the research context of production. This
notion exists clearly in relation to language, as it is wrongly believed that
language is a transparent vehicle, a semiotic costume that would allow the
direct achievement of the social actors’ views (Machado & Brito, 2009, p.
157).
Interview
Out of 78 studies reviewed, 41 (52.6%) used interviews which,

according to Dörnyei (2007), provide versatility in data-collection and
reproduce a communicative routine that is familiar to people. Many were
conducted via internet, by means of synchronous or asynchronous
communication, such as in Wall and Horák (2006; 2008), although the
majority of the interviews happened face to face, like in Erfani (2012).
Informal conversations with the participants as shown in Li (2009)
were considered as interviews, as suggested by Brinkmann (2008). The
author also states that the majority of the interviews in qualitative research
are of the semi-structured variety as this format allows the researcher to
guide the conversations according to his/her interests and leaves room for
a more spontaneous interaction and the researcher’s intervention in the
event that a topic needs further discussion.
The advantages of interviews include the possibility of clarification of
specific issues, a more personal approach towards the interviewee and the
possibility of obtaining non-verbal clues relevant to the study. The
disadvantages are related to difficulties in making appointments with the
participants, the need to travel to the place where the interviewee is when
online interviews are not possible, time consuming data collection and
analysis (in the case of manual transcription of recorded interviews), and
limitations in the number of interviewees to be used in the research. The
comments made above with respect to the risks and precautions related to
questionnaires apply to interviews as well.
296 Chapter Fifteen
Classroom Observation
Observation entails the gathering of impressions about particular

aspects of the world around us in a systematic way, aiming at learning
about a specific phenomenon. This instrument can be the main source of
data, but is often used together with interviews and questionnaires. It is
used in qualitative, quantitative or mixed-methods approaches
(McKechnie, 2008). The use of classroom observation to collect
information about washback was strongly advocated by Alderson and
Wall (1993), who stated that “it is increasingly obvious that we need to
look closely at classroom events in particular, in order to see whether what
teachers and learners say they do is reflected in their behaviour” (p. 127).
Messick (1996) reinforces the importance of classroom observation
when he says that such an instrument could be used to record changes in
teachers’ and students’ behaviours associated with the introduction of a
test, thus contributing for collecting evidence of its consequential validity.
However, observing lessons is not an easy task. Watanabe (2004) states
that:
“the observation task is divided into several subtasks, typically involving

construction of observation instruments, pre-observation interviews,
recording classroom events, and post-observation interviews. […] Before
entering the classroom, a variety of information needs to be gathered about
the school (e.g., educational policy, academic level, etc.) and the teacher
whose lesson is to be observed (e.g., education, age/experience, major field
of study, etc.).” (p. 30)
As shown in Figure 15-2, 30 (38.5%) out of the 78 studies reviewed

used classroom observation. Nikoopour and Farsani (2012) assert this
instrument became highly important to investigate washback as its biggest
advantage was to provide access to rich data on what really happened in
the research context. The disadvantages are similar to those mentioned for
interviews. Other issues to be considered include the need to obtain the
educational authorities’ consent to enter the classroom and limitations
related to observation time and the researcher’s own availability.
It is also important to take into account the possibility of occurrence of
the so-called Hawthorne effect, which is a change in a group’s behaviour
due to the awareness they are being observed (Coombs & Smith, 2003).
This effect can be minimized when the observation period is longer so that
the group gets more and more familiar with the observer.
Pre- and Post-Tests
Pre- and post-tests are typical of quantitative research and are

consequently very common in education when the objective is to
investigate the implementation of educational innovations or to compare
students’ performance after some kind of intervention (Dugard & Todman,
1995; Lodico, Spaulding, & Voegtle, 2006). In the studies reviewed,
10.2% (8 studies) mentioned pre- and post-tests, such as Saif (2006). This
study is a good example of how productive the use of these instruments
can be to collect both qualitative and quantitative data about students’
performance before and after instruction, which is a great advantage,
particularly when high-stakes exams are used. It allows one to ensure that
the level of difficulty of the questions is appropriate for the students’ level
and the purpose of the research study, while avoiding the difficulty of
having to design tests that should be equivalent so as to be used in
different administrations.
Focus Group
Focus groups consist of collective interviews, generally in groups of 6

to 12 people, conducted by a moderator (usually the researcher). This
format produces a lot of data, as the participants are led to think together
and share their thoughts and impressions, which results in people’s
reactions and raises rich discussions about the topic (Dörnyei, 2007). This
is a powerful means of data collection as it allows the participants to
engage in deeper conversations about issues relevant to their reality and
express different views, thus giving important information to the
researcher (Morgan, 2008). This is one of its biggest advantages, besides
others similar to those found with interviews. The disadvantages include
more difficulties with transcriptions and data analysis, problems with
finding a mutually convenient time slot, an imbalance in participants’
contributions, and the intimidation of some participants in front of others.
In this review, 7.7% (6) of the studies used focus groups. Fox and
Cheng (2007) reported using only focus groups, whereas others like El-
Ebyary (2009) combined focus groups with other instruments. Such
combinations are very productive because different instruments will not
necessarily provide identical data about the same participants (Morgan,
2008).
298 Chapter Fifteen
Diary
Diaries consist of participants’ reports of their daily activities and

objective experiences (Smith-Sullivan, 2008). They were used in 5.1% (4)
of the studies. The studies show that diaries have the merit of capturing
data that are normally inaccessible or unobservable by other means and
can be used together with other instruments (Cresswell, 2009; Dörnyei,
2007), as can be seen in Tsagari (2006). In terms of the disadvantages, one
should consider that the responsibility of producing data relies heavily on
the participants, which requires their awareness and preparation to write
frequently and carefully. In addition, data accuracy may be at risk, and this
requires the researcher to carefully plan both the choice of participants and
the guidelines that the participants must adhere to.
Researcher’s Writing
This instrument refers to the daily record of impressions and

reflections made by the researcher based on his/her observations (Smith-
Sullivan, 2008). Only one study, Li (2009), collected data with
researcher’s writing. Researcher’s writing is analogous to field notes
normally produced in classroom observations. These notes are generally
not considered a separate instrument, but a kind of by-product of the
observation.
Student Follow-up
Used in Li’s (2009) study only, student follow-up corresponds to the

close observation of classroom extension activities developed by students.
The use of this instrument is not very common as it demands closer
contact between the researcher and the participants. This is probably its
main limitation, along with the fact that it is only possible with a small
number of participants or with a large research team. This data collection
method can be considered a parallel technique to classroom observation as
it focuses on what is visible in terms of students’ study habits, which is
highly relevant when it comes to investigating washback effects in
students’ learning. Student follow-up seems to be appropriate for
ethnographic research. The data it generates are predominantly qualitative
from criteria that may or may not be pre-established, which can
complement information stemming from other data sources.
Conversational Analysis
Conversational analysis (Eggins & Slade, 1997; Ervin-Tripp, 1979) has

been used in studies on oral production or the assessment of oral
production (Andrews, Fullilove, & Wong, 2002; Lazaraton, 2002) as it
provides, among other data, statistical analyses of turn-taking and pauses.
Among the studies reviewed, the only one to use this instrument was
Harwood (2007).
Conclusion
Based on Alderson and Wall’s (1993) article, in which the authors call
the attention of language assessment researchers’ to the need to adopt
ethnography in the search for empirical evidence on washback, this study
aimed at investigating whether the authors’ admonition had any effect on
subsequent works. By means of an electronic literature review, 78 studies
conducted in 31 countries were identified.
The analysis of the studies revealed that Alderson and Wall’s (1993)
words had an impact on the research community, as the number of
methodological options to investigate language assessment washback had
significantly increased. Evidence of this is the use of mixed-methods
research in which qualitative and quantitative perspectives merged to
produce a more complete picture of the studied phenomenon. Furthermore,
the use of ethnography and triangulation by means of a variety of
stakeholders, sources of information (or documents) and data-collection
instruments was found in the great majority of the studies reviewed.
Taking into consideration Alderson and Wall’s (1993) suggestion for
triangulating the researcher’s perceptions with those of the participants,
inside and outside classroom, in an attempt to capture the complexity of
the phenomenon, the reviewed studies gave voice to different
stakeholders: teachers, students or examination candidates, teacher
supervisors, school directors, coordinators, school supervisors, educational
authorities, exam designers, students’ parents, and material writers.
As far as the sources of information are concerned, researchers used
different kinds of documents such as previous exams, guidelines and other
exam publications, statistical data on test takers’ performance, and
teaching materials in order to understand more deeply the construct and
the history of the exam they were working with as well as to characterize
the research context.
In relation to triangulation, ten instruments were identified and are
listed here from the most to the least frequently used: document analysis,
300 Chapter Fifteen
questionnaire, interview, classroom observation, pre- and post-tests, focus

group, diary, researcher’s writing, student follow-up and conversational
analysis. The number of instruments used in the studies varied from one to
five. All of the studies used document analysis as at least one means of
data collection. Only five studies reported the sole use of document
analysis and they were the only ones to be classified as using one
instrument. The other 73 works presented different combinations of the
instruments such as the use of document analysis and interviews or pre-
and post-tests and interviews. Therefore, document analysis functions as
an important background that provides the researcher with information
about the research context and potential washback. Future research needs
to consider that in order to have a better view of what is happening with
washback, more data collection instruments and stakeholders need to be
involved in the washback studies.
The increased use of methodological options in the reviewed studies
surely represents a considerable advance in washback research. By
conducting this literature review and focusing on the methodological
options of the studies, this work also aimed at contributing to future
research design and more conscious and informed research development
based on the experience of investigations carried out by researchers
worldwide.
References
Alderson, J. C. (2004). Foreword. In L. Cheng, Y. Watanabe, & A. Curtis
(Eds.), Washback in language testing: Research contexts and methods
(pp. ix-xii). New Jersey: Lawrence Erlbaum Associates.
Alderson, J. C., & Wall, D. (1993). Does washback exist? Applied
Linguistics, 14(2), 115-129.
Andrews, S., Fullilove, J., & Wong, Y. (2002). Targeting washback-a case
study. System, 30(2), 207-223.
Barletta, N., & May, O. (2006). Washback of the ICFES Exam: A case
study of two schools in the Departamento del Atlántico. Íkala revista
de lenguaje y cultura, 11(17), 235-261.
Brinkmann, S. (2008) Interviewing. In L. M. Given (Ed.). The Sage
encyclopedia of qualitative research methods (pp. 470-472). Thousand
Oaks, CA: Sage Publications, Inc.
Bourdieu, P. (1998). Compreender. In P. Bourdieu (Ed.), A miséria do
mundo (2a ed., pp. 693-732). Petrópolis: Vozes.
Brown, J. D. (2001). Using surveys in language programs. Cambridge:
Caine, N. A. (2005). EFL examination washback in Japan: Investigating

the effects of oral assessment on teaching and learning. Doctoral
thesis, University of Manchester, Manchester, UK. Retrieved from:
http://www.asian-efl-journal.com/Thesis_Washback_in_Japan_
Caine.pdf
Cheng, L. (2008). Washback, impact and consequences. In E. Shohamy, &
N. H. Hornberger (Ed.), Encyclopedia of language and education.
(Vol.7: Language Testing and Assessment, 2nd ed., pp. 349-364). NY:
Springer.
Cheng, L., Watanabe, Y., & Curtis, A. (2004). Washback in language
testing: Research contexts and methods. New Jersey: Lawrence
Erlbaum Associates.
Coombs, S. J., & Smith, I. D. (2003). The Hawthorne effect: Is it a help or
a hindrance in social science research? Change: Transformations in
Education, 6(1), 97-111.
Cresswell, J. W. (2009). Research design: Qualitative, quantitative, and
mixed methods approaches (3rd ed.). Thousand Oaks, California: Sage
Publications, Inc.
Dörnyei, Z. (2003). Questionnaires in second language research:
Construction, administration and processing. Mahwah, NJ: Lawrence
Erlbaum Associates Inc.
—. (2007). Research methods in applied linguistics: Quantitative,
qualitative and mixed methodologies. Oxford: Oxford University
Press.
Dugard, P., & Todman, J. (1995). Analysis of preǦtest postǦtest control
group designs in educational research. Educational Psychology, 15(2),
181-198.
Eggins, S., & Slade, D. (1997). Analysing casual conversation. London:
Cassell.
El-Ebyary, K. (2009). Deconstructing the complexity of washback in
relation to formative assessment in Egypt. Research Notes, 35, 2-5.
Erfani, S. S. (2012). A comparative washback study of IELTS and TOEFL
iBT on teaching and learning activities in preparation courses in the
Iranian context. English Language Teaching, 5(8), 185-195.
Ervin-Tripp, S. M. (1979). Children’s verbal turn-taking. In E. Ochs, & B.
Schieffelin (Eds.), Developmental pragmatics (pp. 391-414). New
York: Academic Press.
Fawcett, B. (2008). Structuralism. In L. M. Given (Ed.), The Sage
302 Chapter Fifteen
Fetterman, D. M. (2008). Ethnography. In L. M. Given (Ed.), The Sage

Fox, J., & Cheng, L. (2007). Did we take the same test? Differing accounts
of the Ontario secondary school literacy test by first and second
language testǦtakers. Assessment in Education, 14(1), 9-26.
Green, A. (2006). Washback to the learner: Learner and teacher
perspectives on IELTS preparation course expectations and outcomes.
Assessing Writing, 11(2), 113-134.
Harwood, C. (2007). Washback and the Cambridge ESOL Key English
Test speaking component: A study from Japan. Master’s dissertation,
University of Leicester, Leicester, UK.
Hung, S. T. A. (2012). A washback study on e-portfolio assessment in an
English as a foreign language teacher preparation program. Computer
Assisted Language Learning, 25(1), 21-36.
Lazaraton, A. (2002). A qualitative approach to the validation of oral
language test. (Studies in Language Testing 14). Cambridge:
—. (2005). Quantitative research methods. In E. Hinkel (Ed.), Handbook
of research in second language teaching and learning (pp. 209-224).
Mahwah, NJ: Lawrence Erlbaum.
Li, Y. (2009). A preparação de candidatos chineses para o exame Celpe-
Bras: Aprendendo o que significa “uso da linguagem”. Master’s
dissertation, Universidade Federal do Rio Grande do Sul, Porto Alegre,
Brazil.
Lodico, M. G., Spaulding, D. T., & Voegtle, K. H. (2006). Methods in
educational research: From theory to practice. San Francisco, CA:
Jossey-Bass.
Lüdke, M., & André, M. E. D. A. (1986). A pesquisa em educação:
Abordagens qualitativas. São Paulo: EPU.
Machado, A. R., & Brito, C. (2009). O agir linguageiro em questionário de
pesquisa. In A. R. Machado, E. Cols, L. S. Abreu-Tardelli, V. L. L.
Cristovão (Eds.), Linguagem e educação: O trabalho do professor em
uma nova perspectiva (pp. 137-160). Campinas, SP, Mercado de
Letras.
McKechnie, L. E. F. (2008). Observational research. In L. M. Given (Ed.),
The Sage encyclopedia of qualitative research methods (pp. 573-575).
Thousand Oaks, CA: Sage Publications, Inc.
Messick, S. (1996). Validity and washback in language testing. Language
Testing, 13(3), 241-256.
Morgan, D. L. (2008). Focus groups. In L. M. Given (Ed.), The Sage

Muñoz; A. P., & Alvarez, M. E. (2010). Washback of an oral assessment
system in the EFL classroom. Language Testing, 27(1), 33-49.
Nikoopour, J., & Farsani, M.A. (2012). Depicting washback in Iranian
high school classrooms: A descriptive study of EFL teachers’
instructional behaviors as they relate to University entrance exam. The
Iranian EFL Journal, 8(1), 9-34.
Retorta, M. S. (2007). Efeito retroativo do vestibular da Universidade
Federal do Paraná no ensino de língua inglesa em nível médio no
Paraná: Uma investigação em escolas públicas,particulares e cursos
pré-vestibulares. Doctoral thesis, Universidade Estadual de Campinas,
Campinas, SP, Brazil.
Saif, S. (2006). Aiming for positive washback: A case study of
international teaching assistants. Language Testing, 23(1), 1-34.
Scaramucci, M. V. R. (1995). O papel do léxico na compreensão em
leitura em língua estrangeira: foco no produto e no processo. Doctoral
thesis, Universidade Estadual de Campinas, Campinas, SP, Brazil.
—. (2004). Efeito retroativo da avaliação no ensino/aprendizagem de
línguas: o estado da arte. Trab. Ling. Aplic., 43(2), 203-226.
Schensul, J. J. (2008). Documents. In L. M. Given (Ed.), The Sage
encyclopedia of qualitative research methods (p. 232). Thousand Oaks,
CA: Sage Publications, Inc.
Shawcross, P. (2007). What do we mean by the ‘washback effect’ of
testing? Paper presented at the 2nd ICAO Aviation Language
Symposium, May 7th-9th, Montreal, Canada. Retrieved from:
http://legacy.icao.int/icao/en/anb/meetings/ials2/Docs/15.Shawcross.pdf
Smith-Sullivan, K. (2008). Diaries and journals. In L. M. Given (Ed.), The
Sage encyclopedia of qualitative research methods (pp. 213-215).
Thousand Oaks, CA: Sage Publications, Inc.
Tsagari, K. (2006). Investigating the washback effect of a high-stakes EFL
exam in the Greek context: Participants’ perceptions, material design
and classroom applications. Doctoral thesis, Lancaster University,
Lancaster, UK.
Wall, D., & Horák, T. (2006). The impact of changes in the TOEFL
examination on teaching and learning in Central and Eastern Europe:
Phase I, the baseline study. TOEFL Monograph Series, Report Number
RR.06.18. Princeton, NJ: Educational Testing Service.
Wall, D., & Horák, T. (2008). The impact of changes in the TOEFL
examination on teaching and learning in Central and Eastern Europe:
304 Chapter Fifteen
Phase 2, coping with change. TOEFL Monograph Series, Report

Number RR.08.37. Princeton, NJ: Educational Testing Service.
Watanabe, Y. (2004). Methodology in washback studies. In L. Cheng, Y.
Watanabe, & A. Curtis (Eds.), Washback in language testing:
Research contexts and methods (pp. 19-36). Mahwah, NJ: Lawrence
Erlbaum Associates.

CONTRIBUTORS
Zakia Ali-Chand is currently the Head of School of Communication,

Language & Literature at the Fiji National University in Suva, Fiji. She
has a PhD in Linguistics from the University of the South Pacific. Her two
publications in 2015 include Using SQ3R Method to Improve Reading
Comprehension Abilities and The Significance of Intercultural Communication
Studies for Second Language Teaching. Her main research interests
include language learning strategies and academic language skills of
second language learners. She is also the current President of the Fiji
Association of Women Graduates, an affiliate of Graduate Women
International based in Geneva, Switzerland.
Kazuo Amma, PhD, Professor of applied linguistics and TEFL at Dokkyo

University (Saitama, Japan), has worked in statistical analysis of language
test data. His particular interest in the qualitative difference in language
processing among learners of different proficiency levels resulted in a
paper which obtained Best Research Paper Award at the SAS Forum 2004
(Tokyo). He also developed a computer programme for the partial scoring
of sequential items (2010). His current engagements include test projects
for secondary school students at the Ministry of Education and
accreditation of airline pilots’ English proficiency at the Ministry of Land
and Transportation in Japan.
James Dean Brown (“JD”) is currently Professor of Second Language

Studies at the University of Hawai‘i at MƗnoa. He has spoken and taught
in many places ranging from Brazil to Venezuela. He has published
numerous articles and books on language testing, curriculum design,
research methods, and connected speech. His most recent books are:
Mixed methods research for TESOL (2014 from Edinburgh University
Press); Cambridge guide to research in language teaching and learning
(2015 with C. Coombe from Cambridge University Press); and Teaching
and assessing EIL in local contexts around the world (2015 with S. L.
McKay from Routledge).

306 Contributors
Sarah Chabal (PhD from Northwestern University) is a Research

Psychologist at the Naval Submarine Medical Research Lab in Groton,
Connecticut. As a former member of the Bilingualism and Psycholinguistics
Research Group at Northwestern University, she has conducted research
on bilingualism, language processing, and interactions between language
and perceptual systems. Her work has been published in peer reviewed
scientific journals and has been presented at both national and international
conferences.
Douglas Altamiro Consolo is an Associate Professor of Applied

Linguistics and of English as a Foreign Language at the State University
of Sao Paulo (UNESP), Brazil. He teaches undergraduate and postgraduate
courses, and supervises students’ projects in initial teacher education, as
well as MA dissertations, PhD theses and postdoctoral studies. He is the
co-ordinator of the ENAPLE-CCC research group and one of the senior
researchers in the development of EPPLE, a Brazilian proficiency
examination for foreign language teachers. He has published extensively
in the areas of EFL and language assessment. He has co-authored and co-
edited four books. He has recently joined a project on Portuguese as
foreign language (PFL) at UNESP, in which he is responsible for online
assessment in PFL. His main research interests include assessment, foreign
language learning and teaching, language testing and teacher development.
Christine Coombe has a PhD in Foreign/Second Language Education

from The Ohio State University. She is currently on the English/General
Studies faculty of Dubai Men’s College, UAE. Christine is co-
author/editor of numerous professional volumes including A Practical
Guide to Assessing English Language Learners (2007, University of
Michigan Press); The Cambridge Guide to Research in Language
Teaching and Learning (2015, Cambridge University Press) and
Reigniting, Retooling and Retiring in English Language Teaching (2012,
University of Michigan Press). Most recently, Christine served as
President of the TESOL International Association (2010-2013) and
received the British Council’s International Assessment Award (2013).
Selwyn C. Cruz is a full-time Lecturer at the Enderun Colleges. He

finished his Master of Arts in English Language Education at the De La
Salle University where he is currently taking his PhD in Applied
Linguistics. He has presented papers in conferences, published articles in
journals, delivered lectures, and provided trainings on language learning

Current Issues in Language Evaluation, Assessment and Testing: 307
related courses. His research interests are in the areas of Discourse

Analysis, Sociolinguistics and Educational Technology.
Marina Dodigovic is the Director of both the MA TESOL program and

the Research Centre for Language Technology at XJTLU. She is a
member of the editorial boards of two refereed academic journals, TESOL
International and Voices in Asia. In the past, she served the TESOL
community as a Ruth Crymes Fellowship Academy Award reader. Marina
is the author of Artificial Intelligence in Second Language Learning, the
first English-language monograph on Intelligent CALL, and the editor of
Attitudes to Technology in ESL/EFL Pedagogy. She has received multiple
extramural and intramural research grants and has published a number of
articles in refereed academic journals. Her recent research interests have
gravitated toward vocabulary teaching, learning and assessment.
Carol J. Everhard was formerly a Teaching Fellow in the Department of

Linguistics of the School of English, Aristotle University of Thessaloniki
in Greece. She both taught and coordinated undergraduate language
courses, was organizer of the School’s Resource Centre and taught a
specialist course in Self-access and Foreign Language Learning. Her
doctoral studies investigated the relationship between learner-centred
assessment and autonomy in language learning. She was Coordinator of
the IATEFL Learner Autonomy Special Interest Group between 2006-
2008 and she co-edited Autonomy in Language Learning: Opening a Can
of Worms (2011) and Assessment and Autonomy in Language Learning
(2015).
Christina Gitsaki is an Associate Professor and Research Coordinator at

the Center for Educational Innovation, Zayed University, UAE. Dr Gitsaki
has previously served as the UNESCO Chair in Applied Research in
Education designing professional development programs for teachers in
the K-12 and higher education sectors. Dr. Gitsaki has served on the
executive boards of professional associations such as the Applied
Linguistics Association of Australia, the Gulf Comparative Education
Society, TESOL Arabia, and she is currently the Secretary General of the
International Association of Applied Linguistics. Her research interests are
in second language acquisition and pedagogy, the use of educational
technology in the second language classroom, teacher professional
development, and the scholarship of teaching and learning.
308 Contributors
Gerriet Janssen is currently finishing his doctorate at University of

Hawai‘i, MƗnoa, under the supervision of Dr. James Dean Brown. He has
simultaneously maintained his position as Assistant Professor at
Universidad de los Andes–Colombia, a position he has held since 2006.
There, and within a team of colleagues, he headed the development of the
language program Inglés para Doctorados, a program that he currently
teaches within and coordinates. His main research interests include:
language assessment, academic writing, curriculum development, and
program evaluation. His research papers have been published in refereed
journals and books both within the Colombian and international context.
Liudmila Kozhevnikova received her PhD from Moscow State

University and is currently an Assistant Professor at Samara State
University, Russia. She is Vice President of SETA and Board Member of
the National Association of Teachers of English. She is on the Board of
Experts for the Foreign Languages Council on Methodology and Research
at the Russian Ministry of Education and Science, the Volga Region
Affiliate which she chaired for seven years (1998–2005). She co-authored
Exam Success (2013). Her main research interests include testing and
assessment, TESOL and CLIL.
Caroline Larson (M.Ed. from University of Virginia) is a practicing

speech-language pathologist at The Communication Clubhouse in
Chicago, Illinois. She is certified by the American Speech-Language-
Hearing association and Illinois Early Intervention, and state licensed in
Illinois. She is a translational researcher for the Bilingualism and
Psycholinguistics Research Group at Northwestern University, and has
presented research on topics related to bilingualism and child language at
state and national conferences.
Persephone Mamoukari is a PhD researcher in the field of Applied

Linguistics in Democritus University of Thrace. She holds a degree in
English Language and Literature from the Aristotle University of
Thessaloniki and an MA in Black Sea studies from the Democritus
University of Thrace. She has attended a number of congresses and has
written papers in refereed proceedings. She speaks English fluently and is
a proficient speaker of French. Her interests focus on language teaching,
communication strategies, learning strategies, school psychology, and
psycholinguistics.
Viorica Marian (Ph.D. from Cornell University) is the Ralph and Jean
Sundin Professor of Communication Sciences and Disorders and Professor
of Psychology and Cognitive Science at Northwestern University in the
United States. Since 2000, she has directed the Bilingualism and
Psycholinguistics Research Group, with funding from the National
Institutes of Health and the National Science Foundation. Her research
centers on bilingualism and its consequences for linguistic, cognitive, and
neural function, with a focus on language processing, learning, and
memory. Her research has been disseminated in over 100 publications,
over 200 conference and invited presentations, and receives extensive
press coverage (http://www.bilingualism.northwestern.edu/).
Jessica Midraj holds a PhD in Curriculum and Instruction (English

Education) from Indiana State University, USA. Currently, she is a faculty
member in the College of Arts and Sciences at the Petroleum Institute in
Abu Dhabi, UAE. Dr. Midraj has over 18 years of experience in the field
of English language education as a teacher, mentor, researcher, and
manager. She has published in the areas of self-assessment, reading,
parental involvement, educational standards, grammar teaching pedagogy,
and motivation and has authored numerous language tests and assessment
manuals. Her research interests include curriculum and assessment design,
academic language learning instruction, and assessment management.
Sadiq Midraj is an Associate Professor of language education and the

Quality Assurance Coordinator in the College of Education at Zayed
University. He has provided consultations to the Ministry of Education on
K-12 English standards, Bidaya Media on research, the UNESCO Office
in Beirut on quality assurance standards for education units and Khalifa
Award for Education as an arbitrator. Dr. Midraj also worked as the
Director for the Center for Professional Development. He has published on
assessing Arabic-English bilingual literacy, outcomes-based education,
and language-learner variables. His research interests include English
language assessment, bilingual education, and quality assurance.
Jacob Mlynarski, an EAP Lecturer at the Northern Consortium United

Kingdom (NCUK), has taught Academic English in China for the past 6
years. He holds a B.A. in English Philology and M.A. in TESOL. His main
interests include TESOL, more specifically plagiarism, the use of
technology in writing instruction and contrastive rhetoric. He is
particularly interested in differences between rhetorical styles placed
310 Contributors
within various cultural contexts. He is currently developing a Graduate

Foundation Programme in China in association with the NCUK.
Gladys Quevedo-Camargo is an Adjunct Professor in the Department of

Foreign Languages and Translation, University of Brasília, Brasília/DF,
Brazil. She worked for many years as an English teacher in private schools
and, since 2006, she has worked with teacher education in government-
funded higher education institutions. She has also been an oral examiner
for the Cambridge main suite exams and the IELTS, and coordinated the
testing development center in one of the universities where she worked.
Her main research interests include language teaching, learning and
assessment, national and international exams and teacher education.
Matilde V. R. Scaramucci, full Professor in the Department of Applied

Linguistics, University of Campinas, SP, Brazil, is past Dean of the
Institute of Language Studies, University of Campinas (2011-2014). She
is one of the developers of the Certificate of Proficiency in Portuguese as
a Foreign Language (Celpe-Bras, Brazilian Ministry of Education) and
editor-in-chief of the refereed journal Trabalhos em Linguística Aplicada
(2006-2014). She has published extensively in the areas of SL/FL
teaching, testing and assessment. She has co-edited Português para
Falantes de Espanhol (2008) and Pesquisas sobre Vocabulário em Língua
Estrangeira (2008). Her main research interests include the assessment of
integrated tasks, validity and washback. Her research papers have been
published in numerous refereed journals and books mostly in Brazil.
Vera Lucia Teixeira da Silva is an Associate Professor of English as a

Foreign Language at the State University of Rio de Janeiro where she
works as an educator and researcher. She is currently a member of a group
that is developing a Brazilian proficiency exam for teachers of foreign
languages. She has published several articles in the area of assessment and
EFL. She has co-authored Olhares sobre competências do professor de
língua estrangeira: da formação ao desempenho profissional (2007). Her
main research interests include assessment, teacher development and
second language acquisition.
Renata Mendes Simões holds a PhD in Applied Linguistics from Catholic

University of Sao Paulo (PUC-SP), Brazil, a Master’s in Education and
Arts from Mackenzie University, with specialization in Teaching in
Higher Education. She is an autonomous English teacher and English-
Portuguese translator-interpreter, a research member of the GEALIN
group (English for Specific Purposes, ESP, teaching and learning research
group - PUC-SP), and editorial assistant in the Academic Journal ‘the
ESPecialist’. She is responsible for designing and teaching ESP courses
mainly for language proficiency tests such as TOEFL iBT and IELTS. Her
main research interests include ESP, one-to-one classes, and language
proficiency assessments.
Maria Giovanna Tassinari is Director of the Centre for Independent

Language Learning at the Freie Universität Berlin, Germany. She is
currently committee member of LASIG, Learner Autonomy Special
Interest Group of IATEFL, and in the scientific committee of Mélanges
Pédagogiques (Université de Lorraine, France). Her PhD has been
published as Autonomes Fremdsprachenlernen: Komponenten, Kompetenzen,
Strategien (Peter Lang, 2010) and awarded with the Bremer
Forschungspreis 2011. Her research papers have been published in
numerous refereed journals and books in German, English and French. Her
research interests are in learner autonomy, language advising, affect in
language learning, and formal and informal learning.
Jonathan Trace is a PhD candidate at the University of Hawai‘i at

MƗnoa. He has co-authored several articles and book chapters on large-
scale assessment validation, rubric design, classroom assessment, and rater
negotiation in performance assessment. Currently, he is working towards
the completion of his doctoral dissertation on the construction of a
validation argument for cloze test function for second language assessment
purposes. His main research interests include second language assessment,
curriculum design, program evaluation, ecological approaches to learning,
multivariate quantitative methods, and listening and speaking pedagogy.
Romulo P. Villanueva Jr. finished his Master of Arts in Teaching major

in English Language at De La Salle University-Manila as Commission on
Higher Education (CHED) Scholar and has finished the academic
requirements for his PhD in English Language Studies at the University of
Santo Tomas. He also has a Diploma in Teaching English to Speakers of
Other Languages (TESOL) from the London Teacher Training College-
UK. Currently, he is the Assistant Program Head of the Department of
English and teaches English courses. He has conducted research on
language teaching techniques, grammar proficiency, language attitudes,
sociolinguistics and presented papers in International Conferences.
312 Contributors
Penelope Kambakis Vougiouklis is a Professor of Linguistics, Department

of Greek, Democritus University, Greece. She holds a degree from the
Aristotle University of Thessaloniki and an MA and a PhD from the
University of Wales. She has attended more than 80 congresses and she is
the author of more than 100 papers in journals and refereed proceedings,
as well as four monographs. She is fluent in English and can read and
communicate in French and Italian. Her interests focus on mathematical
models in language teaching, Greek as a second/foreign language,
communication strategies with emphasis on guessing as a processing
and/or a learning strategy, confidence as a factor of success in
communication, learning strategies, dictionary use as a communication
strategy, psycholinguistics and dialectology.
Beilei Wang, Associate Professor in the School of Foreign Languages,

Tongji University, is currently the Director of the China English for
Academic Purposes Association. She has been hosting and participating in
various ELT projects at primary, secondary and tertiary levels sponsored
by Tongji University, Shanghai International Studies University, Shanghai
Municipal Education Commission and the Ministry of Education. She has
published extensively in such journals as Foreign Language World,
Journal of Foreign Languages, contributed chapters to books on EFL
theory and practice like Teaching English the Other Way in China and
coauthored a serial textbook Reading Faster. Her main research interests
include learner autonomy, formative assessment, ESP and classroom
teaching.
Lee-Yen Wang has been as an Associate Professor for the English

department of Xiamen University Tan Kah Kee College since 2013. He
has done extensive research in CALL and ESL/EFL. He was a
professional computer consultant for high technology companies in the
US, such as HP, Alcatel-Lucent (DSC), Samsung Telecommunications
America, Fujitsu Network Communication, Cisco, and other medium to
small innovative technology companies all over the United States. He
started to tap his software prowess a few years ago to build a corpus and
write software to analyze language features. He has also created remedial
English programs for elementary school children and junior high school in
Taiwan. He is currently developing a medical translation program for
English major students.
Rining Wei, PhD, is a Lecturer at the Department of English, Culture and

Communication, Xi’an Jiaotong-Liverpool University (XJTLU), Suzhou,
China. He teaches courses related to bilingualism, sociolinguistics, and
research methods at undergraduate and postgraduate levels. He is an active
member of the Research Centre for Language Technology, and serves as a
core member respectively in the Centre’s advisory group and XJTLU’s
Language Policy Working Group. He specialises in EAP, Content and
Language Integrated Learning (CLIL), language policy, and quantitative
methodology. He has published in journals including Asian EFL Journal,
English Today, Journal of Multilingual and Multicultural Development,
and World Englishes.
INDEX
AARP (Assessment for Autonomy process 124

Research Project) 160-173 qualitative 125
academic questionnaire 162, 170-171
English proficiency 241 scales 161
language proficiency 240-242 summative 158-159, 171
Academic Vocabulary List 98-108, sustainable 158, 173
110-113 tool 128, 133, 160
academic writing skills 237, 240 triangulated 161-162, 164
accuracy 201, 203, 213 writing 161, 164
advanced proficiency students 85, attrition, language 161
92 autonomy
advising continuum 123
service 126 degrees of 171
session 124 fostering/promoting 159-160,
counselling sessions 160, 172 171, 173
analytical scale 211 learners’ understanding 127
AS-unit 212, 213 multidimensionality 120, 132
assessment 64-65, 75-77, 119, 201- awareness 119, 123, 128-129, 133
202, 204, 222-223, 232 raising 159-160
assessment reaching 172
alignment 164-169 backwash 288
as learning 158 bar (fuzzy) 80, 85-87, 89-93
criteria 125, 203, 215 beginners/low proficiency students
degrees of 171 85, 92
dynamic 122-123, 133 benefits of self-assessment 125
for autonomy 120 bilingual 64-65, 71, 75-77
for learning 123, 158 Brown Corpus 5
formative 125, 158, 159 capacity 120
iterative 125, 133 CEFR (Common European
learner-centred 158 Framework of Reference) 161
maturity 170 Centre for Independent Language
of language competencies 119 Learning (CILL) 126
of learner autonomy 119, 123 cheating 166
of learning 119, 158 checklists, criterial 162
of learning competencies 119 chi-square (fixed) 15
oral 161, 166 classical test theory (CTT) 7
peer- and self- 161 classroom
power 158, 173 observation 296
practices 159 setting 124
College Entrance and Examination dynamic model of learner autonomy

Center 99, 113 121-122
communicative competence 202, Early Intervention 64-66, 75-77
274 education, distance 159
comparing perspectives 124, 130, empirical, explorative-interpretative
133 approach 121, 125-127
competencies English
language 172 as a foreign language 98-99,
learning 172 272
confidence 80-82, 84, 87-93 as an additional language 178,
confidence interval 54 182-184, 187-188, 197
consciousness, raising 159 learning portfolios (ELP) 137
content English Reference Word List 99,
knowledge 179-180 101, 107-108, 110-111, 113
words 21 error 222-223, 226-227
conversational analysis 299 ESL 273
Corpus of Contemporary American ESP 253-255, 261
English 98-101, 111-112 estimation 41
correlations 239-241 ethnographic (research) 291
counselling meetings, small-group ethnography 291
172 evaluation process 124
criteria Expanding Circle 273
assessment 158, 161 explorative-interpretative approach
assimilation of 159 121
learner-created 162 extreme items 13
pre-determined/ready-made 158, FACETS 6
161 feedback 158
criterion-referenced 60 fitting items 14
CTT item analysis 11 focus group 297
data foreign language 201-203
collection instruments 291-292 formative assessment 137
frozen 163 frequency of use 80-82, 86-93
gathering 126 function words 21
qualitative 163, 170-171 general language proficiency 205
quantitative 163 General Scholastic English Ability
decision-making 125, 128, 172-173 Test 102
declarative knowledge 179 General Service List 99, 100
descriptive statistics 7 Germanic words 23
descriptors goal-setting 173
learner autonomy 121-122, 125 grade, oral 161
perspective 129 grammar 274
dialogue, pedagogical 173 grammatical
diary 298 competency 274
distancing from the task 161 complexity 212-213, 215
document analysis 293 proficiency 274
domain 201-202 Greek-speaking students 84, 85
316 Index
heteronomy 171 learning

high-stakes exam 287, 290 liberatory 158
higher education 158 lifelong 158
holistic scales 211 sustainable 172-173
impact 288 transformatory 172
independent learning 180-182, 197 Letters course 203-205
Inner Circle 273 lexical
interview 126, 295 competence 208, 211
item frequency 201, 203
analysis 11 variety 203
discrimination 12 Likert scales 79, 81, 85-86
Facility (IF) 11 linguistic knowledge 202
reliability 15 logistic regression 46
separation index 14 low-stakes exam 290
Item Response Theory 3 Many Faceted Rasch Measurement
Item Root Mean Square Standard (MFRM) 13
Error (RMSE) 14 measurement of learner autonomy
language 132
assessment 202, 253, 257 metacapacity 120
development 258, 260, 262, metacognition 120, 133, 159, 172-
266, 270 173
learning strategies 80-81, 238- metalanguage 204, 207
239, 241 Ministry of Education 99, 182
Latinate words 23 misfitting items 14
learner mixed methods 132, 291, 296
agency 173 mock tests 209-210
attitudes 121 model of autonomy
autonomy (LA) 138 Everhard’s 171
learner autonomy Tassinari’s dynamic 172
components 120 multiple-choice items 189
definition 120 natural cloze
dimensions 120-121, 131 needs analysis 253, 255, 260-261,
operationalization 119 269
learner objectivity 161
behaviours 121 OPT (Oxford Placement Test) 161
choice of components 124, 127 oral protocols 84
contracts 162 Outer Circle, 273
empowerment 119 parts of speech 21
feedback on self-assessment Pearson correlation coefficients
128-129 164-166
profile 126, 132 pedagogical
profile cards 162 dialogue 124-125, 129-130
resistance 130, 133 knowledge 179
strategies 121 peer-assessment 158-172, 181
strengths 159, 172 of mock samples 163
weaknesses 159, 172 training for 166
Philippine English 273 self-evaluation 180

placement 41 self-regulation 180-181, 197
plagiarism 222-226 sense of ‘being’ as a learner 159
Post-Study intervention 163 sense of ‘self’ 159
Post-Study, AARP 163, 166 small-scale 41
post-tests 297 social cognitive theory 180
Praxis 182, 184-187 SOE (School of English) 160
Pre-Study, AARP 162, 166 specialized proficiency 205
pre-tests 297 stakeholder 287
presentations, oral 161 Standard Filipino English 273
private classes 253-254, 258 statistical analysis 164
procedural knowledge 179 Strategy Inventory for Language
professional disposition 179 Learning (SILL) 80-82, 84, 87
proficiency 58, 83, 85, 93-94, 201- student follow-up 298
202, 215, 274 support for learner autonomy 119,
scale 207-208, 213 123, 126, 129
proficient level 278, 281 supporting/enhancing reflection
qualitative 131-132, 291, 296 128, 130, 133
quantitative 131-132, 291, 296 t-analysis 169-170
questionnaire 126, 293 T-unit 212
Rasch tailored cloze 3
analysis 3 teacher
vertical rulers 15 certification 179, 185, 187
rating scale 203, 208, 210, development 203, 205
rational deletion cloze 3 education 202, 204, 215
reflection 130, 133, 172 effectiveness 178, 179
reflexive methods 132 efficacy 178, 181
reliability 7 language 202, 205-206
research proficiency 201, 202-203
instruments 131 readiness 182-183
empirical 158 talk 202
researcher’s writing 298 TESOL
responsibility, assuming/taking 160, Standards 182, 184-185, 188
172 Teacher Readiness Inventory
rhetoric 159 (T-TRI) 196
rogue group 166 testing 201-202, 215
Rossetti 64-65, 67-68 diagnostic 161
self-assessment 158-173, 180-182, textual borrowing 222-226
197 theoretical-analytical approach 121
deviations 162, 167-170 thinking
contexts 129 criterial 158-159
steps 123-125 critical 173
self-directed language learning 123, TOEFL iBT Proficiency Test 253-
126 254, 256-257, 262, 269
self-direction 160 transactional modes of teaching 158
self-efficacy 178
318 Index
transformative modes of teaching vocabulary size 222, 223-226

158 Vocabulary Size Test (VST) 222,
transmissional modes of teaching 228
158 washback 287
triangulation 291 WINSTEPS 6
Tukey-Kramer tests 164-167, 169 word frequency 23
Turkish-Greek bilingual students 84 word origin 23
underfitting items 13 writing 222-226
United Arab Emirates 181 Yes/No Test 101-104, 106, 110
validation methods 121

Current Issues in Language Evaluation, Assessment and Testing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Current Issues in Language Evaluation, Assessment and Testing

Uploaded by

Copyright:

Available Formats

Current Issues

Research and Practice

Christina Gitsaki and Christine Coombe

Edited by Christina Gitsaki and Christine Coombe

This book first published 2016

Cambridge Scholars Publishing

Lady Stephenson Library, Newcastle upon Tyne, NE6 2PA, UK

British Library Cataloguing in Publication Data

Copyright © 2016 by Christina Gitsaki, Christine Coombe

ISBN (10): 1-4438-8590-8

List of Tables ............................................................................................ viii

List of Figures............................................................................................ xii

List of Appendices .................................................................................... xiv

List of Abbreviations ................................................................................. xv

Preface ...................................................................................................... xix

Issues in the Analysis and Modification of Assessment Tools and Tests

Chapter One ................................................................................................. 2

Chapter Two .............................................................................................. 40

Chapter Three ............................................................................................ 63

Chapter Four .............................................................................................. 80

Chapter Five .............................................................................................. 98

Issues in the Creation of Assessment and Evaluation Tools

Chapter Six .............................................................................................. 118

Chapter Seven.......................................................................................... 137

Chapter Eight ........................................................................................... 158

Chapter Nine............................................................................................ 177

Chapter Ten ............................................................................................. 201

Issues in Language Assessment and Evaluation

Chapter Eleven ........................................................................................ 222

Chapter Twelve ....................................................................................... 236

Chapter Thirteen ...................................................................................... 253

Chapter Fourteen ..................................................................................... 272

Chapter Fifteen ........................................................................................ 287

Contributors ............................................................................................. 305

Index ........................................................................................................ 314

Table 1-1: Descriptive statistics for 50 cloze passages and

Table 2-4: Summary report of a candidate (Adapted from Linacre

Table 7-5: Planning and reflecting.

Table 14-2: Mean and standard deviation of participants’ total score

Figure 2-1: Sample representation of logistic regression analysis of a

Figure 13-4: Students’ perceptions of their language skills before the

Appendix 1-A: List of participating universities.

A Self-Assessor (in Tukey-Kramer formulae)

ELF English as a Lingua Franca

N Number (of participants)

UNISEB/Estacio Centro Universitário do Sistema Educacional

Language assessment, whether formative or summative, plays an

with predefined item difficulty levels in order to properly assess a

in Chapter Eight. Greek EFL students in a university context participated

order to provide a deeper understanding of the effectiveness of English for

Thomaï Alexiou Aristotle University, Greece

The volume presents research studies conducted in a variety of

Christina Gitsaki and Christine Coombe

HOW WELL DO CLOZE ITEMS WORK

JAMES DEAN BROWN, JONATHAN TRACE,

involved in 50 cloze tests that were administered to university-level

1. How do the CTT descriptive statistics, reliability, and item analyses

countries, the proportions of young people who go to university, and so

Then, too, we found that the logarithmically transformed word

Descriptive Statistics. As most previous item analyses of cloze tests have

Table 1-1: Descriptive statistics for 50 cloze passages and reliability –

Table 1-2: Descriptive statistics for 50 cloze passages and reliability –

Table 1-3: Frequencies and percentages of tests that functioned well